This patch series implements support for Device Assignment in the ARM CCA
architecture. The code changes are based on the Alp12 specification published
here [1].

The code builds on the TSM framework patches posted at [2]. We add extensions
to that framework so that TSM is now used in both the host and the guest.

A DA workflow can be summarized as follows:

Host:
step 1.
echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/${DEVICE}/driver_override
echo ${DEVICE} > /sys/bus/pci/drivers_probe

step 2.
echo 1 > /sys/bus/pci/devices/$DEVICE/tsm/connect

Now in the guest we follow the steps below:

step 1:
echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind

step 2: Move the device to TDISP LOCK state
echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/lock

step 3: Move the device to TDISP RUN state
echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/accept

step 4: Load the driver again.
echo ${DEVICE} > /sys/bus/pci/drivers_probe

I'm currently working against TSM v3, as TSM v4 lacks the bind, unbind,
and guest_req callbacks required for guest interactions.

The implementation also makes use of RHI interfaces that fall outside the
current RHI specification [5]. Once the spec is finalized, the code will be
aligned accordingly.

For now, I've retained validate_mmio and vdev_req exit handling within KVM.
This will transition to a guest_req-based mechanism once the specification
is updated.

At that point, all device assignment (DA)-specific VM exits will exit
directly to the VMM, and will use the guest_req ioctl to handle exit
reasons. As part of this change, the handlers realm_exit_vdev_req_handler,
realm_exit_vdev_comm_handler, and realm_exit_dev_mem_map_handler will be
removed.

The full patchset for the kernel and kvmtool can be found at [3] and [4].

[1] https://developer.arm.com/-/cdn-downloads/permalink/Architectures/Armv9/DEN0137_1.1-alp12.zip
[2] https://lore.kernel.org/all/20250516054732.2055093-1-dan.j.williams@intel.com
[3] https://git.gitlab.arm.com/linux-arm/linux-cca.git cca/tdisp-upstream-post-v1
[4] https://git.gitlab.arm.com/linux-arm/kvmtool-cca.git cca/tdisp-upstream-post-v1
[5] https://developer.arm.com/documentation/den0148/latest/

Aneesh Kumar K.V (Arm) (35):
  tsm: Add tsm_bind/unbind helpers
  tsm: Move tsm core outside the host directory
  tsm: Move dsm_dev from pci_tdi to pci_tsm
  tsm: Support DMA Allocation from private memory
  tsm: Don't overload connect
  iommufd: Add an option to request for bar mapping with IORESOURCE_EXCLUSIVE
  iommufd/viommu: Add support to associate viommu with kvm instance
  iommufd/tsm: Add tsm_op iommufd ioctls
  iommufd/vdevice: Add TSM Guest request uAPI
  iommufd/vdevice: Add TSM map ioctl
  KVM: arm64: CCA: register host tsm platform device
  coco: host: arm64: CCA host platform device driver
  coco: host: arm64: Create a PDEV with rmm
  coco: host: arm64: Device communication support
  coco: host: arm64: Stop and destroy the physical device
  coco: host: arm64: set_pubkey support
  coco: host: arm64: Add support for creating a virtual device
  coco: host: arm64: Add support for virtual device communication
  coco: host: arm64: Stop and destroy virtual device
  coco: guest: arm64: Update arm CCA guest driver
  arm64: CCA: Register guest tsm callback
  cca: guest: arm64: Realm device lock support
  KVM: arm64: Add exit handler related to device assignment
  coco: host: arm64: add RSI_RDEV_GET_INSTANCE_ID related exit handler
  coco: host: arm64: Add support for device communication exit handler
  coco: guest: arm64: Add support for collecting interface reports
  coco: host: arm64: Add support for realm host interface (RHI)
  coco: guest: arm64: Add support for fetching interface report and
    certificate chain from host
  coco: guest: arm64: Add support for guest initiated TDI bind/unbind
  KVM: arm64: CCA: handle dev mem map/unmap
  coco: guest: arm64: Validate mmio range found in the interface report
  coco: guest: arm64: Add Realm device start and stop support
  KVM: arm64: CCA: enable DA in realm create parameters
  coco: guest: arm64: Add support for fetching device measurements
  coco: guest: arm64: Add support for fetching device info

Lukas Wunner (3):
  X.509: Make certificate parser public
  X.509: Parse Subject Alternative Name in certificates
  X.509: Move certificate length retrieval into new helper

 arch/arm64/include/asm/kvm_rme.h              |   3 +
 arch/arm64/include/asm/mem_encrypt.h          |   6 +-
 arch/arm64/include/asm/rhi.h                  |  39 +
 arch/arm64/include/asm/rmi_cmds.h             | 173 ++++
 arch/arm64/include/asm/rmi_smc.h              | 210 ++++-
 arch/arm64/include/asm/rsi.h                  |   5 +-
 arch/arm64/include/asm/rsi_cmds.h             | 129 +++
 arch/arm64/include/asm/rsi_smc.h              |  60 ++
 arch/arm64/kernel/Makefile                    |   2 +-
 arch/arm64/kernel/rhi.c                       |  35 +
 arch/arm64/kernel/rsi.c                       |  26 +-
 arch/arm64/kvm/mmu.c                          |  45 +
 arch/arm64/kvm/rme-exit.c                     |  87 ++
 arch/arm64/kvm/rme.c                          | 208 ++++-
 arch/arm64/mm/mem_encrypt.c                   |  10 +
 crypto/asymmetric_keys/x509_cert_parser.c     |   9 +
 crypto/asymmetric_keys/x509_loader.c          |  38 +-
 crypto/asymmetric_keys/x509_parser.h          |  40 +-
 drivers/iommu/iommufd/device.c                |  54 ++
 drivers/iommu/iommufd/iommufd_private.h       |   7 +
 drivers/iommu/iommufd/main.c                  |  13 +
 drivers/iommu/iommufd/viommu.c                | 178 +++-
 drivers/pci/tsm.c                             | 229 ++++-
 drivers/vfio/pci/vfio_pci_core.c              |  20 +-
 drivers/virt/coco/Kconfig                     |   5 +-
 drivers/virt/coco/Makefile                    |   7 +-
 drivers/virt/coco/arm-cca-guest/Kconfig       |  10 +-
 drivers/virt/coco/arm-cca-guest/Makefile      |   3 +
 .../{arm-cca-guest.c => arm-cca.c}            | 175 +++-
 drivers/virt/coco/arm-cca-guest/rsi-da.c      | 576 ++++++++++++
 drivers/virt/coco/arm-cca-guest/rsi-da.h      |  73 ++
 drivers/virt/coco/arm-cca-host/Kconfig        |  17 +
 drivers/virt/coco/arm-cca-host/Makefile       |   5 +
 drivers/virt/coco/arm-cca-host/arm-cca.c      | 384 ++++++++
 drivers/virt/coco/arm-cca-host/rmm-da.c       | 857 ++++++++++++++++++
 drivers/virt/coco/arm-cca-host/rmm-da.h       | 108 +++
 drivers/virt/coco/host/Kconfig                |   6 -
 drivers/virt/coco/host/Makefile               |   6 -
 drivers/virt/coco/{host => }/tsm-core.c       |  27 +
 include/keys/asymmetric-type.h                |   2 +
 include/keys/x509-parser.h                    |  55 ++
 include/linux/device.h                        |   1 +
 include/linux/iommufd.h                       |   4 +
 include/linux/kvm_host.h                      |   1 +
 include/linux/pci-tsm.h                       |  37 +-
 include/linux/swiotlb.h                       |   4 +
 include/linux/tsm.h                           |  29 +
 include/uapi/linux/iommufd.h                  |  69 ++
 48 files changed, 3887 insertions(+), 200 deletions(-)
 create mode 100644 arch/arm64/include/asm/rhi.h
 create mode 100644 arch/arm64/kernel/rhi.c
 rename drivers/virt/coco/arm-cca-guest/{arm-cca-guest.c => arm-cca.c} (62%)
 create mode 100644 drivers/virt/coco/arm-cca-guest/rsi-da.c
 create mode 100644 drivers/virt/coco/arm-cca-guest/rsi-da.h
 create mode 100644 drivers/virt/coco/arm-cca-host/Kconfig
 create mode 100644 drivers/virt/coco/arm-cca-host/Makefile
 create mode 100644 drivers/virt/coco/arm-cca-host/arm-cca.c
 create mode 100644 drivers/virt/coco/arm-cca-host/rmm-da.c
 create mode 100644 drivers/virt/coco/arm-cca-host/rmm-da.h
 delete mode 100644 drivers/virt/coco/host/Kconfig
 delete mode 100644 drivers/virt/coco/host/Makefile
 rename drivers/virt/coco/{host => }/tsm-core.c (85%)
 create mode 100644 include/keys/x509-parser.h

-- 
2.43.0
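[As a rough orientation for the workflow above: the tsm/lock, tsm/accept,
and tsm/connect attributes dispatch into per-platform TSM callbacks. A
minimal sketch of the guest-side hook set, assuming the TSM v3 shape this
cover letter targets; the names and signatures below are approximations,
not the posted API:]

#include <linux/pci.h>

struct pci_tdi;		/* per-TDI state, opaque here */

/*
 * Illustrative only: approximate guest-side TSM hooks. TSM v3 carried
 * bind/unbind/guest_req; lock/accept back the sysfs attributes above.
 */
struct pci_tsm_guest_ops {
	/* tsm/lock: drive the TDI from UNLOCKED to LOCKED */
	int (*lock)(struct pci_dev *pdev);
	/* tsm/accept: verify the interface report, then LOCKED -> RUN */
	int (*accept)(struct pci_dev *pdev);
	/* associate/disassociate the TDI with this confidential guest */
	struct pci_tdi *(*bind)(struct pci_dev *pdev);
	void (*unbind)(struct pci_tdi *tdi);
	/* tunnel TSM-specific requests (reports, measurements, ...) */
	int (*guest_req)(struct pci_dev *pdev, void *req, size_t len);
};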
On Mon, Jul 28, 2025 at 07:21:37PM +0530, Aneesh Kumar K.V (Arm) wrote:
> This patch series implements support for Device Assignment in the ARM CCA
> architecture. The code changes are based on the Alp12 specification
> published here [1].

Robin and I were talking about CCA and DMA here:

https://lore.kernel.org/r/6c5fb9f0-c608-4e19-8c60-5d8cef3efbdf@arm.com

What do you think about pulling some of this out and trying to independently
push a series getting the DMA API layers ready for device assignment?

I think there will be some discussion on these points, and it would be good
to get started.

Jason
Aneesh Kumar K.V (Arm) wrote:
> This patch series implements support for Device Assignment in the ARM CCA
> architecture. The code changes are based on the Alp12 specification
> published here [1].
>
> The code builds on the TSM framework patches posted at [2]. We add
> extensions to that framework so that TSM is now used in both the host
> and the guest.
>
> A DA workflow can be summarized as follows:
>
> Host:
> step 1.
> echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
> echo vfio-pci > /sys/bus/pci/devices/${DEVICE}/driver_override
> echo ${DEVICE} > /sys/bus/pci/drivers_probe
>
> step 2.
> echo 1 > /sys/bus/pci/devices/$DEVICE/tsm/connect

Just for my own understanding... presumably there is no ordering constraint
for ARM CCA between step1 and step2, right? I.e. the connect state is
independent of the bind state.

In the v4 PCI/TSM scheme the connect command is now:

echo $tsm_dev > /sys/bus/pci/devices/$DEVICE/tsm/connect

> Now in the guest we follow the steps below:

I assume a significant amount of kvmtool magic happens here to get the TDI
into a "bind capable" state, can you share that command? I had been assuming
that everyone was prototyping with QEMU. Not a problem per se, but the
memory management for shared device assignment / bounce buffering has had
quite a bit of work on the QEMU side, so just curious about the difference
in approach here. Like, does kvmtool support operating the device in shared
mode with bounce buffering and page conversion (shared <=> private) support?

In any event, happy to see multiple simultaneous consumers of this new
kernel infrastructure.

> step 1:
> echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
>
> step 2: Move the device to TDISP LOCK state
> echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/lock

Ok, so my stance has recently picked up some nuance here. As Jason mentions
here:

http://lore.kernel.org/20250410235008.GC63245@ziepe.ca

"However it works, it should be done before the driver is probed and remain
stable for the duration of the driver attachment. From the iommu side the
correct iommu domain, on the correct IOMMU instance to handle the expected
traffic should be setup as the DMA API's iommu domain."

I agree with that up until the point where the implication is userspace
control of the UNLOCKED->LOCKED transition. That transition requires
enabling bus-mastering (BME), configuring the device into an expected state,
and *then* locking the device. That means userspace is blindly hoping that
the device is in a state where it will remain quiet on the bus between BME
and LOCKED, and that the previous unbind left the device in a state where
it is prepared to be locked again.

The BME concern may be overblown given that major PCI drivers blindly set
BME without validating that the device is in a quiesced state, but the
"device is prepped for locking" problem seems harder.

2 potential ways to solve this, but open to other ideas:

- Userspace only picks the iommu domain context for the device, not the
  lock state. Something like:

  private > /sys/bus/pci/devices/${DEVICE}/tsm/domain

  ...where the default is "shared" and from that point the device can not
  issue DMA until a driver attaches. Driver controls UNLOCKED->LOCKED->RUN.

- Userspace is not involved in this transition and the dma mapping API is
  updated to allow a driver to switch the iommu domain at runtime, but only
  if the device has no outstanding mappings and the transition can only
  happen from ->probe() context. Driver controls joining secure-world-DMA
  and UNLOCKED->LOCKED->RUN.
Clearly the first option is less work in the kernel, but in both options the
driver is in control of when BME is set relative to being ready for the
LOCKED transition.

> step 3: Move the device to TDISP RUN state
> echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/accept

This has the same concern from me about userspace being in control of BME.
It feels like a departure from typical expectations. At least in the case
of a driver setting BME, the driver's probe routine is going to get the
device in order shortly and otherwise have error handlers at the ready to
effect any needed recovery. Userspace just leaves the device enabled
indefinitely and hopes.

Now, the nice thing about the scheme as proposed in this set is that
userspace has all the time in the world between "lock" and "accept" to talk
to a verifier. With the driver in control there would need to be something
like a usermodehelper to notify userspace that the device is in the locked
state and to go ahead and run the attestation while the driver waits*.

* or the driver could decide not to wait, especially useful for debug and
  development

> step 4: Load the driver again.
> echo ${DEVICE} > /sys/bus/pci/drivers_probe

TIL drivers_probe

Maybe want to recommend:

echo ${DEVICE} > /sys/bus/pci/drivers/${DRIVER}/bind

...to users just in case there are multiple drivers loaded for the device
for the "shared" vs "private" case?

> I'm currently working against TSM v3, as TSM v4 lacks the bind, unbind,
> and guest_req callbacks required for guest interactions.

For staging purposes I wanted to put the "connect" flow to bed before moving
on to the guest side.

> The implementation also makes use of RHI interfaces that fall outside the
> current RHI specification [5]. Once the spec is finalized, the code will
> be aligned accordingly.
>
> For now, I've retained validate_mmio and vdev_req exit handling within
> KVM. This will transition to a guest_req-based mechanism once the
> specification is updated.
>
> At that point, all device assignment (DA)-specific VM exits will exit
> directly to the VMM, and will use the guest_req ioctl to handle exit
> reasons. As part of this change, the handlers realm_exit_vdev_req_handler,
> realm_exit_vdev_comm_handler, and realm_exit_dev_mem_map_handler will be
> removed.
>
> The full patchset for the kernel and kvmtool can be found at [3] and [4].
>
> [1] https://developer.arm.com/-/cdn-downloads/permalink/Architectures/Armv9/DEN0137_1.1-alp12.zip
>
> [2] https://lore.kernel.org/all/20250516054732.2055093-1-dan.j.williams@intel.com
>
> [3] https://git.gitlab.arm.com/linux-arm/linux-cca.git cca/tdisp-upstream-post-v1
> [4] https://git.gitlab.arm.com/linux-arm/kvmtool-cca.git cca/tdisp-upstream-post-v1
> [5] https://developer.arm.com/documentation/den0148/latest/

Thanks for this and the help reviewing PCI/TSM so far! I want to get this
into tsm.git#staging so we can start to make hard claims ("look at the
shared tree!") of hardware vendor consensus.
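[For concreteness, the first option above could look something like the
store handler below on the kernel side. This is only a sketch of the
proposal, not code from any posted series; pci_tsm_set_domain() is a
hypothetical helper:]

#include <linux/device.h>
#include <linux/pci.h>

/*
 * Hypothetical helper: switch the device between the "shared" and
 * "private" (secure world) iommu domain / DMA API state.
 */
int pci_tsm_set_domain(struct pci_dev *pdev, bool private);

static ssize_t domain_store(struct device *dev, struct device_attribute *attr,
			    const char *buf, size_t len)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	bool private;
	int rc;

	if (sysfs_streq(buf, "private"))
		private = true;
	else if (sysfs_streq(buf, "shared"))
		private = false;
	else
		return -EINVAL;

	/* per the discussion: only legal while no driver is bound */
	device_lock(dev);
	if (dev->driver)
		rc = -EBUSY;
	else
		rc = pci_tsm_set_domain(pdev, private);
	device_unlock(dev);

	return rc ?: len;
}
static DEVICE_ATTR_WO(domain);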
On Thu, Jul 31, 2025 at 07:07:17PM -0700, dan.j.williams@intel.com wrote:
> Aneesh Kumar K.V (Arm) wrote:
> > Host:
> > step 1.
> > echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
> > echo vfio-pci > /sys/bus/pci/devices/${DEVICE}/driver_override
> > echo ${DEVICE} > /sys/bus/pci/drivers_probe
> >
> > step 2.
> > echo 1 > /sys/bus/pci/devices/$DEVICE/tsm/connect
>
> Just for my own understanding... presumably there is no ordering
> constraint for ARM CCA between step1 and step2, right? I.e. the connect
> state is independent of the bind state.
>
> In the v4 PCI/TSM scheme the connect command is now:
>
> echo $tsm_dev > /sys/bus/pci/devices/$DEVICE/tsm/connect

What does this do on the host? It seems to somehow prep it for VM
assignment? Seems pretty strange that this is here in sysfs and not part of
creating the vPCI function in the VM through VFIO and iommufd?

Frankly, I'm nervous about making any uAPI whatsoever for the hypervisor
side at this point. I don't think we have enough of the solution even in
draft format. I'd really like your first merged TSM series to only have
uAPI for the guest side, where things are hopefully closer to complete..

> > step 1:
> > echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
> >
> > step 2: Move the device to TDISP LOCK state
> > echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/lock
>
> Ok, so my stance has recently picked up some nuance here. As Jason
> mentions here:
>
> http://lore.kernel.org/20250410235008.GC63245@ziepe.ca
>
> "However it works, it should be done before the driver is probed and
> remain stable for the duration of the driver attachment. From the
> iommu side the correct iommu domain, on the correct IOMMU instance to
> handle the expected traffic should be setup as the DMA API's iommu
> domain."

I think it is not just the dma api, but also the MMIO registers may move
location (from shared to protected IPA space, for example). Meaning any
attached driver is completely wrecked.

> I agree with that up until the point where the implication is userspace
> control of the UNLOCKED->LOCKED transition. That transition requires
> enabling bus-mastering (BME),

Why? That's sad. BME should be controlled by the VM driver, not the TSM, and
it should be set only when a VM driver is probed to the RUN state device?

> and *then* locking the device. That means userspace is blindly hoping
> that the device is in a state where it will remain quiet on the bus
> between BME and LOCKED, and that the previous unbind left the device in
> a state where it is prepared to be locked again.

Yes, but we broadly assume this already in Linux. Drivers assume their
devices are quiet when they are bound the first time, and we expect that on
unbinding a driver quiets the device before removing.

So broadly I think you can assume that a device with no driver is quiet
regardless of BME.

> 2 potential ways to solve this, but open to other ideas:
>
> - Userspace only picks the iommu domain context for the device, not the
>   lock state. Something like:
>
>   private > /sys/bus/pci/devices/${DEVICE}/tsm/domain
>
>   ...where the default is "shared" and from that point the device can
>   not issue DMA until a driver attaches. Driver controls
>   UNLOCKED->LOCKED->RUN.

What? Gross, no way can we let userspace control such intimate details of
the kernel. The kernel must auto set based on what T=x mode the device
driver binds into.
> - Userspace is not involved in this transition and the dma mapping API
>   is updated to allow a driver to switch the iommu domain at runtime,
>   but only if the device has no outstanding mappings and the transition
>   can only happen from ->probe() context. Driver controls joining
>   secure-world-DMA and UNLOCKED->LOCKED->RUN.

I don't see why it is so complicated. The driver is unbound before it
reaches T=1, so we expect the device to be quiet (bigger problems if not).
When the PCI core reaches T=1 it tells the DMA API to reconfigure things
for the unbound struct device. Then we bind a driver as normal.

Driver controls nothing. All existing T=0 drivers "just work" with no
source changes in T=1 mode. DMA API magically hides the bounce buffering.
Surely this should be the baseline target functionality from a Linux
perspective?

So we should not have "driver controls" statements at all. Userspace
prepares the PCI device, the driver probes onto a T=1 environment and just
works.

> > step 3: Move the device to TDISP RUN state
> > echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/accept
>
> This has the same concern from me about userspace being in control of
> BME. It feels like a departure from typical expectations.

It is, it is architecturally broken for BME to be controlled by the TSM.
BME is controlled by the guest OS driver only.

IMHO if this is a real worry (and I don't think it is) then the right
answer is for physical BME to be set on during locking, but VIRTUAL BME is
left off. Virtual BME is created by the hypervisor/tsm by telling the IOMMU
to block DMA.

The Guest OS should not participate in this broken design; the hypervisor
can set pBME automatically when the lock request comes in, and the quality
of vBME emulation is left up to the implementation, but the implementation
must provide at least a NOP vBME once locked.

> Now, the nice thing about the scheme as proposed in this set is that
> userspace has all the time in the world between "lock" and "accept" to
> talk to a verifier.

Seems right to me. There should be NO trusted kernel driver bound until the
verifier accepts the attestation. Anything else allows unaccepted devices
to attack the kernel drivers. Few kernel drivers today distrust their HW
interfaces as hostile actors and security-defend against them. Therefore we
should be very reluctant to bind drivers to anything..

Arguably a CC secure kernel should have an allow list of audited secure
drivers that can autoprobe, and all other drivers must be approved by
userspace in some way, either through T=1 and attestation or some
customer-aware risk assumption.

From that principle the kernel should NOT auto probe drivers to T=0 devices
that can be made T=1. Userspace should handle attaching HW to such devices,
and userspace can sequence whatever is required, including the attestation
and verifying.

Otherwise, if you, say, have a TDISP capable mlx5 device and boot up the
cVM in a compromised host, the host can probably completely hack your cVM
by exploiting the mlx5 driver's total trust in the HW interface while
running in T=0 mode.

You must attest it and switch to T=1 before binding any driver if you care
about mitigating this risk.

> With the driver in control there would need to be something like a
> usermodehelper to notify userspace that the device is in the locked
> state and to go ahead and run the attestation while the driver waits*.

It doesn't make sense to require modification to all existing drivers in
Linux!
The starting point must have the core code do this sequence for every
driver. Once that is working we can talk about whether other flows are
needed.

> > step 4: Load the driver again.
> > echo ${DEVICE} > /sys/bus/pci/drivers_probe
>
> TIL drivers_probe
>
> Maybe want to recommend:
>
> echo ${DEVICE} > /sys/bus/pci/drivers/${DRIVER}/bind
>
> ...to users just in case there are multiple drivers loaded for the
> device for the "shared" vs "private" case?

Generic userspace will have a hard time knowing what the driver names are..

The drivers_probe option looks good to me as the default.

I'm not sure how generic code can handle "multiple drivers".. Most devices
will be able to work just fine in T=0 mode with bounce buffers, so we
should generally not encourage people to make completely different drivers
for T=0/T=1 mode.

I think what is needed is some way for userspace to trigger the "locking
configuration" you mentioned; that may need a special driver, but ONLY if
userspace is sequencing the device to T=1 mode. Not sure how to make that
generic, but I think so long as userspace is explicitly controlling driver
binding we can punt on that solution to the userspace project :)

The real nastiness is RAS - what do you do when the device falls out of
RUN? The kernel driver should pretty much explode. But lots of people would
like the kernel driver to stay alive and somehow we FLR, re-attest and
"resume" the kernel driver without allowing any T=0 risks. For instance you
can keep your netdev and just see a lot of lost packets while the driver
thrashes.

But I think we can start with the idea that such RAS failures have to
reload the driver too, and work on improvements. Realistically few drivers
have the sort of RAS features to consume this anyhow, and maybe we
introduce some "enhanced" driver mode to opt into down the road.

Jason
Jason Gunthorpe wrote:
> On Thu, Jul 31, 2025 at 07:07:17PM -0700, dan.j.williams@intel.com wrote:
> > Aneesh Kumar K.V (Arm) wrote:
> > > Host:
> > > step 1.
> > > echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
> > > echo vfio-pci > /sys/bus/pci/devices/${DEVICE}/driver_override
> > > echo ${DEVICE} > /sys/bus/pci/drivers_probe
> > >
> > > step 2.
> > > echo 1 > /sys/bus/pci/devices/$DEVICE/tsm/connect
> >
> > Just for my own understanding... presumably there is no ordering
> > constraint for ARM CCA between step1 and step2, right? I.e. the connect
> > state is independent of the bind state.
> >
> > In the v4 PCI/TSM scheme the connect command is now:
> >
> > echo $tsm_dev > /sys/bus/pci/devices/$DEVICE/tsm/connect
>
> What does this do on the host? It seems to somehow prep it for VM
> assignment? Seems pretty strange that this is here in sysfs and not part
> of creating the vPCI function in the VM through VFIO and iommufd?

vPCI is out of the picture at this phase.

On the host this establishes an SPDM session and sets up link encryption
(IDE) with the physical device. Leave VMs out of the picture; this
capability in isolation is a useful property. It addresses a similar threat
model to the one Intel Total Memory Encryption (TME) or AMD Secure Memory
Encryption (SME) go after, i.e. an interposer on a physical link capturing
data in flight.

With that established, one can then go further to do the full TDISP dance.

> Frankly, I'm nervous about making any uAPI whatsoever for the hypervisor
> side at this point. I don't think we have enough of the solution even in
> draft format. I'd really like your first merged TSM series to only have
> uAPI for the guest side, where things are hopefully closer to complete.

Aligned. I am not comfortable merging any of this until we have that end to
end reliably stable for a kernel cycle or 2. The proposal is to soak all
the vendor solutions together in tsm.git#staging.

Now, if the guest side graduates out of that staging before the host side,
I am ok with that.

> > > step 1:
> > > echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
> > >
> > > step 2: Move the device to TDISP LOCK state
> > > echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/lock
> >
> > Ok, so my stance has recently picked up some nuance here. As Jason
> > mentions here:
> >
> > http://lore.kernel.org/20250410235008.GC63245@ziepe.ca
> >
> > "However it works, it should be done before the driver is probed and
> > remain stable for the duration of the driver attachment. From the
> > iommu side the correct iommu domain, on the correct IOMMU instance to
> > handle the expected traffic should be setup as the DMA API's iommu
> > domain."
>
> I think it is not just the dma api, but also the MMIO registers may move
> location (from shared to protected IPA space, for example). Meaning any
> attached driver is completely wrecked.

True.

> > I agree with that up until the point where the implication is userspace
> > control of the UNLOCKED->LOCKED transition. That transition requires
> > enabling bus-mastering (BME),
>
> Why? That's sad. BME should be controlled by the VM driver, not the TSM,
> and it should be set only when a VM driver is probed to the RUN state
> device?

To me it is an unfortunate PCI specification wrinkle that writing to the
command register drops the device from RUN to ERROR. So you can LOCK
without setting BME, but then no DMA.
> > and *then* locking the device. That means userspace is blindly hoping
> > that the device is in a state where it will remain quiet on the bus
> > between BME and LOCKED, and that the previous unbind left the device
> > in a state where it is prepared to be locked again.
>
> Yes, but we broadly assume this already in Linux. Drivers assume their
> devices are quiet when they are bound the first time, and we expect that
> on unbinding a driver quiets the device before removing.
>
> So broadly I think you can assume that a device with no driver is quiet
> regardless of BME.
>
> > 2 potential ways to solve this, but open to other ideas:
> >
> > - Userspace only picks the iommu domain context for the device, not the
> >   lock state. Something like:
> >
> >   private > /sys/bus/pci/devices/${DEVICE}/tsm/domain
> >
> >   ...where the default is "shared" and from that point the device can
> >   not issue DMA until a driver attaches. Driver controls
> >   UNLOCKED->LOCKED->RUN.
>
> What? Gross, no way can we let userspace control such intimate details of
> the kernel. The kernel must auto set based on what T=x mode the device
> driver binds into.

Flummoxed. Any way this gets sliced, userspace is asking for "private world
attach" because it alone knows whether this device is acceptable, and
devices need to arrive in "shared world attach" mode.

> > - Userspace is not involved in this transition and the dma mapping API
> >   is updated to allow a driver to switch the iommu domain at runtime,
> >   but only if the device has no outstanding mappings and the transition
> >   can only happen from ->probe() context. Driver controls joining
> >   secure-world-DMA and UNLOCKED->LOCKED->RUN.
>
> I don't see why it is so complicated. The driver is unbound before it
> reaches T=1, so we expect the device to be quiet (bigger problems if
> not). When the PCI core reaches T=1 it tells the DMA API to reconfigure
> things for the unbound struct device. Then we bind a driver as normal.
>
> Driver controls nothing. All existing T=0 drivers "just work" with no
> source changes in T=1 mode. DMA API magically hides the bounce buffering.
> Surely this should be the baseline target functionality from a Linux
> perspective?

I started this project with "all existing T=0 drivers 'just work'" as a
goal and a virtue. I have been begrudgingly pulled away from it by the slow
drip of complexity it appears to push into the PCI core.

Now, I suspect the number of devices that are willing to spend gates and
firmware on TDISP capabilities in the near term is small. The "just works"
case is saved for either an L1 VMM to hide all this from an L2 guest, or a
simplified TDISP specification that actually allows an OS PCI core to
handle these details in a standard way.

> So we should not have "driver controls" statements at all. Userspace
> prepares the PCI device, the driver probes onto a T=1 environment and
> just works.

The concern is that neither userspace nor the PCI core has everything it
needs to get the device to T=1. The PCI core knows that the device is T=1
capable, but does not know how to preconfigure the device-specific lock
state, and needs to wait for attestation. Userspace knows how to
attest/verify the device but really has no business running the device
outside of binding a driver, and can not rely on the PCI core to have
prepped the device's device-specific lock state.
Userspace might be able to bind a new driver that leaves the device in a
lockable state on unbind, but that is not "just works". That is, "introduce
a new concept of skinny TDISP setup drivers that leave devices in LOCKED
state on driver unbind, so that userspace can do the work to verify the
device and move it to RUN before loading the main driver that expects the
device to arrive already running. Also, that main driver needs to be
careful not to trigger typically benign actions, like touching the command
register, that would trip the device into ERROR state, or any
device-specific actions that trip ERROR state but would otherwise be benign
outside of TDISP."

If locking the device was just a toggle it would be possible. As far as I
can see it is a "prep+toggle" where "prep" needs a driver.

> > > step 3: Move the device to TDISP RUN state
> > > echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/accept
> >
> > This has the same concern from me about userspace being in control of
> > BME. It feels like a departure from typical expectations.
>
> It is, it is architecturally broken for BME to be controlled by the TSM.
> BME is controlled by the guest OS driver only.

Agree. That "accept" attribute does not belong with TSM. That is where
Aneesh has it in this RFC. "Accept" as an action is the combination of "the
device entered the LOCKED state in a configuration the verifier is willing
to accept" and the mechanics of triggering the LOCKED->RUN transition.

> IMHO if this is a real worry (and I don't think it is) then the right
> answer is for physical BME to be set on during locking, but VIRTUAL BME
> is left off. Virtual BME is created by the hypervisor/tsm by telling the
> IOMMU to block DMA.
>
> The Guest OS should not participate in this broken design; the hypervisor
> can set pBME automatically when the lock request comes in, and the
> quality of vBME emulation is left up to the implementation, but the
> implementation must provide at least a NOP vBME once locked.

I can let go of the "BME without driver" worry, but that does nothing to
solve the "device specific configuration required before lock" problem.

> > Now, the nice thing about the scheme as proposed in this set is that
> > userspace has all the time in the world between "lock" and "accept" to
> > talk to a verifier.
>
> Seems right to me. There should be NO trusted kernel driver bound until
> the verifier accepts the attestation. Anything else allows unaccepted
> devices to attack the kernel drivers. Few kernel drivers today distrust
> their HW interfaces as hostile actors and security-defend against them.
> Therefore we should be very reluctant to bind drivers to anything.
>
> Arguably a CC secure kernel should have an allow list of audited secure
> drivers that can autoprobe, and all other drivers must be approved by
> userspace in some way, either through T=1 and attestation or some
> customer-aware risk assumption.

Yes, today, where nothing is T=1 capable for an L1 guest*, the onus is 100%
on the distribution, not the kernel. I.e. trim the kernel config and set
modprobe policy to prevent unwanted drivers.

* For L2 there are proposals like this, where if you already trust your
  paravisor you also pre-trust all the devices it tells you to trust:

  [1]: http://lore.kernel.org/20250714221545.5615-1-romank@linux.microsoft.com
> From that principle the kernel should NOT auto probe drivers to T=0
> devices that can be made T=1. Userspace should handle attaching HW to
> such devices, and userspace can sequence whatever is required, including
> the attestation and verifying.

Agree, for PCI it would be simple to set a no-auto-probe policy for T=1
capable devices.

> Otherwise, if you, say, have a TDISP capable mlx5 device and boot up the
> cVM in a compromised host, the host can probably completely hack your cVM
> by exploiting the mlx5 driver's total trust in the HW interface while
> running in T=0 mode.
>
> You must attest it and switch to T=1 before binding any driver if you
> care about mitigating this risk.

Yes, userspace must have a chance to say "no" before a driver attempts to
launch DMA to private memory after secrets have been deployed to the TVM.

> > With the driver in control there would need to be something like a
> > usermodehelper to notify userspace that the device is in the locked
> > state and to go ahead and run the attestation while the driver waits*.
>
> It doesn't make sense to require modification to all existing drivers in
> Linux!

I do not want to burden the PCI core with TDISP compatibility hacks and
workarounds if it turns out only a small handful of devices ever deploy a
first generation TDISP Device Security Manager (DSM). L1 aiding L2, or
TDISP simplicity improvements that allow the PCI core to handle this in a
non-broken way, are what I expect if secure device assignment takes off.

> The starting point must have the core code do this sequence for every
> driver. Once that is working we can talk about whether other flows are
> needed.

Do you agree that "device-specific-prep+lock" is the problem to solve?

> > > step 4: Load the driver again.
> > > echo ${DEVICE} > /sys/bus/pci/drivers_probe
> >
> > TIL drivers_probe
> >
> > Maybe want to recommend:
> >
> > echo ${DEVICE} > /sys/bus/pci/drivers/${DRIVER}/bind
> >
> > ...to users just in case there are multiple drivers loaded for the
> > device for the "shared" vs "private" case?
>
> Generic userspace will have a hard time knowing what the driver names
> are..
>
> The drivers_probe option looks good to me as the default.
>
> I'm not sure how generic code can handle "multiple drivers".. Most
> devices will be able to work just fine in T=0 mode with bounce buffers,
> so we should generally not encourage people to make completely different
> drivers for T=0/T=1 mode.
>
> I think what is needed is some way for userspace to trigger the "locking
> configuration" you mentioned; that may need a special driver, but ONLY if
> userspace is sequencing the device to T=1 mode. Not sure how to make that
> generic, but I think so long as userspace is explicitly controlling
> driver binding we can punt on that solution to the userspace project :)
>
> The real nastiness is RAS - what do you do when the device falls out of
> RUN? The kernel driver should pretty much explode. But lots of people
> would like the kernel driver to stay alive and somehow we FLR, re-attest
> and "resume" the kernel driver without allowing any T=0 risks. For
> instance you can keep your netdev and just see a lot of lost packets
> while the driver thrashes.

Ideally the RUN->ERROR->UNLOCKED->LOCKED->RUN recovery can fit into the
existing 'struct pci_error_handlers' regime in some farther out future.

It was a "fun" discovery to see that virtual AER injection does not exist
in QEMU (at least last time I checked) and assigned devices that throw
physical AER events just kill the VM.
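[For reference, 'struct pci_error_handlers' is the existing AER recovery
hook set a driver registers via its 'struct pci_driver'. A sketch of how a
TDISP-aware driver might slot re-attestation into it; the error-handler
structure and pci_ers_result_t codes are the real kernel API, while
pci_tsm_reattest() is a hypothetical stand-in for "re-run attestation and
return the TDI to RUN":]

#include <linux/pci.h>

int pci_tsm_reattest(struct pci_dev *pdev);	/* hypothetical */

static pci_ers_result_t foo_error_detected(struct pci_dev *pdev,
					   pci_channel_state_t state)
{
	/* the TDI fell out of RUN; ask the core for a reset (FLR) */
	return PCI_ERS_RESULT_NEED_RESET;
}

static pci_ers_result_t foo_slot_reset(struct pci_dev *pdev)
{
	/* hypothetical: re-attest and re-lock before resuming traffic */
	if (pci_tsm_reattest(pdev))
		return PCI_ERS_RESULT_DISCONNECT;
	return PCI_ERS_RESULT_RECOVERED;
}

static void foo_resume(struct pci_dev *pdev)
{
	/* restart I/O now that the device is back in RUN */
}

static const struct pci_error_handlers foo_err_handlers = {
	.error_detected	= foo_error_detected,
	.slot_reset	= foo_slot_reset,
	.resume		= foo_resume,
};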
> But I think we can start with the idea that such RAS failures have to
> reload the driver too, and work on improvements. Realistically few
> drivers have the sort of RAS features to consume this anyhow, and maybe
> we introduce some "enhanced" driver mode to opt into down the road.

Hmm, having trouble not reading that back as supporting my argument above:

Realistically few devices support TDISP, so let's require enhanced drivers
to opt into TDISP for the time being.
On Fri, Aug 01, 2025 at 02:19:54PM -0700, dan.j.williams@intel.com wrote:

> On the host this establishes an SPDM session and sets up link encryption
> (IDE) with the physical device. Leave VMs out of the picture; this
> capability in isolation is a useful property. It addresses a similar
> threat model to the one Intel Total Memory Encryption (TME) or AMD Secure
> Memory Encryption (SME) go after, i.e. an interposer on a physical link
> capturing data in flight.

Okay, maybe connect is not an intuitive name for opening IDE sessions..

> I started this project with "all existing T=0 drivers 'just work'" as a
> goal and a virtue. I have been begrudgingly pulled away from it by the
> slow drip of complexity it appears to push into the PCI core.

Do you have some examples? I don't really see what complexity there is if
the solution is simply to not auto bind any drivers to TDISP capable
devices, and userspace is responsible for manually binding a driver once it
has reached T=1.

This seems like the minimum possible simplicity for the kernel, as simply
everything is managed by userspace, and there is really no special kernel
behavior beyond switching the DMA API of an unbound driver on the T=0/1
change.

> The concern is that neither userspace nor the PCI core has everything it
> needs to get the device to T=1.

Disagree, I think userspace can have everything. It may need some
per-device userspace support in difficult cases, but userspace can deal
with it..

> The PCI core knows that the device is T=1 capable, but does not know how
> to preconfigure the device-specific lock state,

Userspace can do this. Can we define exactly what is needed to do this
"pre-configure the device specific lock state"? At the very worst, for the
most poorly designed device, userspace would have to bind a T=0 driver and
then unbind it.

Again, I am trying to make something simple for the kernel that gets us to
a working solution before we jump ahead to far more complex in-kernel
models, like aware drivers that can toggle themselves between T=0/1.

> Userspace might be able to bind a new driver that leaves the device in a
> lockable state on unbind, but that is not "just works". That is,

I wouldn't have the kernel leave the device in the locked state. That
should always be userspace. The special driver may do whatever special
setup is needed, then unbind and leave a normal unlocked device "prepped"
for userspace locking without doing an FLR or something. Realistically I
expect this to be a very rare requirement; I think this coming up just
reflects the HW immaturity of some early TDISP devices.

Sensible mature devices should have no need of a pre-locking step. I think
we should design toward that goal as the stable future and only try to
enable a hacky workaround for the problematic early devices. I certainly am
not keen on seeing significant permanent kernel complexity to support this
device design defect.

> driver that expects the device to arrive already running. Also, that
> main driver needs to be careful not to trigger typically benign actions,
> like touching the command register, that would trip the device into
> ERROR state, or any device-specific actions that trip ERROR state but
> would otherwise be benign outside of TDISP."

As I said below, I disagree with this. You can't touch the *physical*
command register, but the cVM can certainly touch the *virtualized* command
register. It is up to the VMM to ensure this doesn't cause the device to
fall out of RUN as part of virtualization.
I'd also say that the VMM should be responsible for setting pBME=1 even if
vBME=0? Shouldn't it? That simplifies even more things for the guest.

> > From that principle the kernel should NOT auto probe drivers to T=0
> > devices that can be made T=1. Userspace should handle attaching HW to
> > such devices, and userspace can sequence whatever is required,
> > including the attestation and verifying.
>
> Agree, for PCI it would be simple to set a no-auto-probe policy for T=1
> capable devices.

So then it is just a question of what a userspace component needs to do.

> I do not want to burden the PCI core with TDISP compatibility hacks and
> workarounds if it turns out only a small handful of devices ever deploy
> a first generation TDISP Device Security Manager (DSM). L1 aiding L2, or
> TDISP simplicity improvements that allow the PCI core to handle this in
> a non-broken way, are what I expect if secure device assignment takes
> off.

Same feeling about pre-configuration :)

> > The starting point must have the core code do this sequence for every
> > driver. Once that is working we can talk about whether other flows are
> > needed.
>
> Do you agree that "device-specific-prep+lock" is the problem to solve?

Not "the" problem, but a design issue we need to accommodate but not
endorse.

> > But I think we can start with the idea that such RAS failures have to
> > reload the driver too, and work on improvements. Realistically few
> > drivers have the sort of RAS features to consume this anyhow, and
> > maybe we introduce some "enhanced" driver mode to opt into down the
> > road.
>
> Hmm, having trouble not reading that back as supporting my argument
> above:
>
> Realistically few devices support TDISP, so let's require enhanced
> drivers to opt into TDISP for the time being.

I would be comfortable if hitless RAS recovery for TDISP devices requires
some kernel opt-in. But also I'm not sure how this should work from a
security perspective. Should userspace also have to re-attest before
allowing the device back to RUN? Clearly this is complicated.

Also, I would be comfortable supporting this only for devices that do not
require pre-configuration.

Jason
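[To make the vBME idea concrete, a VMM-side sketch of "virtual BME backed
by the IOMMU, physical command register untouched". Everything here is
hypothetical illustration, not code from kvmtool or QEMU; only
PCI_COMMAND_MASTER is a real constant:]

#include <stdbool.h>
#include <stdint.h>
#include <linux/pci_regs.h>	/* PCI_COMMAND_MASTER */

/* hypothetical VMM-side types and IOMMU helpers */
struct vpci_dev {
	uint16_t vcmd;	/* emulated command register the guest sees */
};
void vpci_iommu_allow_dma(struct vpci_dev *vdev);
void vpci_iommu_block_dma(struct vpci_dev *vdev);

/*
 * Emulate a guest write of the command register. While the TDI is
 * LOCKED/RUN the physical register is never written (that would trip the
 * device to ERROR); pBME was set by the hypervisor at lock time and vBME
 * is realized purely through the IOMMU.
 */
static void vpci_cmd_write(struct vpci_dev *vdev, uint16_t val)
{
	bool vbme = val & PCI_COMMAND_MASTER;

	vdev->vcmd = val;
	if (vbme)
		vpci_iommu_allow_dma(vdev);
	else
		vpci_iommu_block_dma(vdev);
}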
Jason Gunthorpe wrote:
> On Fri, Aug 01, 2025 at 02:19:54PM -0700, dan.j.williams@intel.com wrote:
>
> > On the host this establishes an SPDM session and sets up link
> > encryption (IDE) with the physical device. Leave VMs out of the
> > picture; this capability in isolation is a useful property. It
> > addresses a similar threat model to the one Intel Total Memory
> > Encryption (TME) or AMD Secure Memory Encryption (SME) go after, i.e.
> > an interposer on a physical link capturing data in flight.
>
> Okay, maybe connect is not an intuitive name for opening IDE sessions..

Part of the rationale for a generic name is that the TSM is free to assert
that the link is secure without IDE. Think integrated devices where there
is no expectation that the link can be observed.

The host and guest side TSM operations are split into link/transport
security and device/state security (private MMIO/DMA) concerns,
respectively. So maybe "secure_link" would be a better name for this
host-side-only operation.

> > I started this project with "all existing T=0 drivers 'just work'" as
> > a goal and a virtue. I have been begrudgingly pulled away from it by
> > the slow drip of complexity it appears to push into the PCI core.
>
> Do you have some examples? I don't really see what complexity there is
> if the solution is simply to not auto bind any drivers to TDISP capable
> devices, and userspace is responsible for manually binding a driver once
> it has reached T=1.

The example I have front of mind (confirmed by 2 vendors) is deferring the
loading of guest-side device/state security capable firmware to the guest
driver when the full device is assigned. In that scenario the default
device power-on firmware is capable of link/transport security, enough to
get the device assigned. The guest needs to get the device/state security
firmware loaded before TDISP state transitions are possible.

I do think RAS recovery needs it too, but like you say below that should
come with conditions.

> This seems like the minimum possible simplicity for the kernel, as
> simply everything is managed by userspace, and there is really no
> special kernel behavior beyond switching the DMA API of an unbound
> driver on the T=0/1 change.
>
> > The concern is that neither userspace nor the PCI core has everything
> > it needs to get the device to T=1.
>
> Disagree, I think userspace can have everything. It may need some
> per-device userspace support in difficult cases, but userspace can deal
> with it..

I do think userspace can / must deal with it. Let me come back with actual
patches and a sample test case. I see a potential path to support the above
"prep" scenario without the mess of TDISP setup drivers, or the ugly
complexity of driver toggles or a usermodehelper.

> > The PCI core knows that the device is T=1 capable, but does not know
> > how to preconfigure the device-specific lock state,
>
> Userspace can do this. Can we define exactly what is needed to do this
> "pre-configure the device specific lock state"? At the very worst, for
> the most poorly designed device, userspace would have to bind a T=0
> driver and then unbind it.
>
> Again, I am trying to make something simple for the kernel that gets us
> to a working solution before we jump ahead to far more complex in-kernel
> models, like aware drivers that can toggle themselves between T=0/1.

Agree. When I talked about wishing for the simple TDISP case, that is:
userspace can always "just lock" and "driver bind" without needing to
worry about "prep", i.e. any "prep" is always implied by "lock".
That should be the baseline.

> > Userspace might be able to bind a new driver that leaves the device in
> > a lockable state on unbind, but that is not "just works". That is,
>
> I wouldn't have the kernel leave the device in the locked state. That
> should always be userspace. The special driver may do whatever special
> setup is needed, then unbind and leave a normal unlocked device
> "prepped" for userspace locking without doing an FLR or something.
> Realistically I expect this to be a very rare requirement; I think this
> coming up just reflects the HW immaturity of some early TDISP devices.
>
> Sensible mature devices should have no need of a pre-locking step. I
> think we should design toward that goal as the stable future and only
> try to enable a hacky workaround for the problematic early devices. I
> certainly am not keen on seeing significant permanent kernel complexity
> to support this device design defect.

Yeah, that is the nightmare I had last night. I completed the thought
exercise about the driver toggle and said, "whoops, nope, Jason is right,
we can't design for that without leaving a permanent mess to clean up".

The end goal needs to look like the straight-line, typical driver probe
path for TDISP capable devices.

> > driver that expects the device to arrive already running. Also, that
> > main driver needs to be careful not to trigger typically benign
> > actions, like touching the command register, that would trip the
> > device into ERROR state, or any device-specific actions that trip
> > ERROR state but would otherwise be benign outside of TDISP."
>
> As I said below, I disagree with this. You can't touch the *physical*
> command register, but the cVM can certainly touch the *virtualized*
> command register. It is up to the VMM to ensure this doesn't cause the
> device to fall out of RUN as part of virtualization.
>
> I'd also say that the VMM should be responsible for setting pBME=1 even
> if vBME=0? Shouldn't it? That simplifies even more things for the guest.

True. Although, now I am going back on my PCI core burden concern to
wonder if *it* should handle a vBME on behalf of the driver, if only
because it may want to force the device out of the RUN state on driver
unbind to meet typical pci_disable_device() expectations. Alexey had this;
I thought it was burdensome, but now I am coming around.

> > > From that principle the kernel should NOT auto probe drivers to T=0
> > > devices that can be made T=1. Userspace should handle attaching HW
> > > to such devices, and userspace can sequence whatever is required,
> > > including the attestation and verifying.
> >
> > Agree, for PCI it would be simple to set a no-auto-probe policy for
> > T=1 capable devices.
>
> So then it is just a question of what a userspace component needs to do.
>
> > I do not want to burden the PCI core with TDISP compatibility hacks
> > and workarounds if it turns out only a small handful of devices ever
> > deploy a first generation TDISP Device Security Manager (DSM). L1
> > aiding L2, or TDISP simplicity improvements that allow the PCI core to
> > handle this in a non-broken way, are what I expect if secure device
> > assignment takes off.
>
> Same feeling about pre-configuration :)
>
> > > The starting point must have the core code do this sequence for
> > > every driver. Once that is working we can talk about whether other
> > > flows are needed.
> >
> > Do you agree that "device-specific-prep+lock" is the problem to solve?
>
> Not "the" problem, but a design issue we need to accommodate but not
> endorse.
I hear you, let me walk back from the cliff with patches.

> > > But I think we can start with the idea that such RAS failures have
> > > to reload the driver too, and work on improvements. Realistically
> > > few drivers have the sort of RAS features to consume this anyhow,
> > > and maybe we introduce some "enhanced" driver mode to opt into down
> > > the road.
> >
> > Hmm, having trouble not reading that back as supporting my argument
> > above:
> >
> > Realistically few devices support TDISP, so let's require enhanced
> > drivers to opt into TDISP for the time being.
>
> I would be comfortable if hitless RAS recovery for TDISP devices
> requires some kernel opt-in. But also I'm not sure how this should work
> from a security perspective. Should userspace also have to re-attest
> before allowing the device back to RUN? Clearly this is complicated.
>
> Also, I would be comfortable supporting this only for devices that do
> not require pre-configuration.

That seems reasonable. You want hitless RAS? Give us hitless init.
On Sat, Aug 02, 2025 at 04:50:50PM -0700, dan.j.williams@intel.com wrote:
> > Do you have some examples? I don't really see what complexity there is
> > if the solution is simply to not auto bind any drivers to TDISP
> > capable devices, and userspace is responsible for manually binding a
> > driver once it has reached T=1.
>
> The example I have front of mind (confirmed by 2 vendors) is deferring
> the loading of guest-side device/state security capable firmware to the
> guest driver when the full device is assigned. In that scenario the
> default device power-on firmware is capable of link/transport security,
> enough to get the device assigned. The guest needs to get the
> device/state security firmware loaded before TDISP state transitions are
> possible.

Yeah, those are the only cases I know of too, and IMHO they are just early
devices.

Clearly the clean answer is to put enough boot FW on the device's flash to
get to T=1 mode, then have the trusted OS driver load the operating
firmware from the trusted OS filesystem through the trusted bootloader T=1
device. You effectively attest the bootloader, and then if you trust the
bootloader you know that when the device gets to T=1 it can be trusted to
properly run the FW the trusted driver provides.

Think about this more broadly: does the prep FW load idea make sense for
something like SRIOV? No, it really doesn't. The hypervisor-loaded FW that
is running the PF should definitely be strong enough to get to T=1 on the
VM/VF side as well.

The non-SRIOV cases are quite often whole machine assignment scenarios. But
I'm sensing a lot of that space is moving toward bare metal machines
instead of VMs. I wonder if you can use all the CC machinery to attest and
secure a bare metal host?

> I do think RAS recovery needs it too, but like you say below that should
> come with conditions.

Especially RAS becomes simple because it basically follows the normal
flows that existed prior to TDISP, with the exception of needing some
attestation step.

I don't know a lot about CC attestation, but maybe we can have userspace
provide the kernel with the accepted measurement, and then for RAS the
kernel can FLR, remeasure, and if the measurement is exactly the same go
back into T=1 automatically as part of the PCI core FLR logic.

> I do think userspace can / must deal with it. Let me come back with
> actual patches and a sample test case. I see a potential path to support
> the above "prep" scenario without the mess of TDISP setup drivers, or
> the ugly complexity of driver toggles or a usermodehelper.

I don't see how; something nasty has to be done in the kernel to allow an
attached driver to switch between T=1 and T=0 "views" of the device and
lockstep those changes with userspace. This is not so simple and is really
basically exactly the same as driver binding.

I don't think we should be afraid of T=0 prep drivers in these early days.
Something more complex could come later if it is really warranted and
people really insist on continuing this unclean device design strategy.

> Yeah, that is the nightmare I had last night. I completed the thought
> exercise about the driver toggle and said, "whoops, nope, Jason is
> right, we can't design for that without leaving a permanent mess to
> clean up".
>
> The end goal needs to look like the straight-line, typical driver probe
> path for TDISP capable devices.
Yeah, maybe it is worthwhile to someday try to figure out an alternative -
keep in mind that, critically, this requires someone to also come with an
in-tree driver that will use all these new APIs and capabilities!!! So
let's get walking first, and then someone can come with some proposal,
complete with a driver implementing it, and it can be judged.

This project is already so big, and I'm pretty sure that if you start to
also need entirely new operating modes for drivers, the basics will just
get bogged down in that discussion, and very likely killed anyhow due to a
lack of user. Even if we decide that is preferred, it is better to separate
it and discuss it after the basics are merged.

At least where I sit, getting basic guest support is a big priority, so I
strongly want to strip it down to as minimal as possible to make consistent
progress steps.

> True. Although, now I am going back on my PCI core burden concern to
> wonder if *it* should handle a vBME on behalf of the driver, if only
> because it may want to force the device out of the RUN state on driver
> unbind to meet typical pci_disable_device() expectations.

Hiding some vBME in the PCI core might make sense if we can't get the VMM
owners to agree to do it on the hypervisor side. It works better on the VMM
side because there is always an IOMMU and the VMM can emulate BME by
blocking DMA with the IOMMU.

But I would not allow/expect kernel device drivers to have anything to do
with the TDISP states. Getting into RUN is fully sequenced by userspace;
getting out of RUN should also be sequenced only by userspace. Removing a
driver does not change the trust state of the PCI device, so it shouldn't
drop out of RUN. If userspace wishes to FLR the device after userspace
asked to unbind, it can; there are already sysfs controls for this IIRC.

Basically, all this says that Linux drivers that want to be used with T=1
should be well behaved, fully quiet all their DMA on remove, and have no
*functional* need for BME to do anything. We pretty much already expect
this of drivers today, so I don't see an issue with strongly requiring it
for T=1.

Keep in mind the flip side: almost no drivers are structured properly to
forcibly quiet any DMA before pci_enable_device(). Some HW, like mlx5,
can't do this at all without either using DMA to send a reset command or
going through FLR.

> > I would be comfortable if hitless RAS recovery for TDISP devices
> > requires some kernel opt-in. But also I'm not sure how this should
> > work from a security perspective. Should userspace also have to
> > re-attest before allowing the device back to RUN? Clearly this is
> > complicated.
> >
> > Also, I would be comfortable supporting this only for devices that do
> > not require pre-configuration.
>
> That seems reasonable. You want hitless RAS? Give us hitless init.

Yeah.. Realistically there are few drivers that can even do this today;
mlx5 for example has such code (and it is hard!). There is a lot of
investment required in the driver's core subsystem to make this work.
netdev and RDMA can support a 'rebirth' sort of flow where the driver can
disconnect the SW APIs, FLR the device, then reconnect in some way.

However, for example, I recently had a discussion with DRM guys about RAS
and they are not even doing the basic locking/etc. to be able to do this. :\

Jason
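[Jason's earlier remeasure-on-FLR idea can be sketched as below. Only
pci_reset_function() is a real kernel API here; TSM_MEAS_MAX and the tsm_*
helpers are hypothetical stand-ins for plumbing that does not exist yet:]

#include <linux/pci.h>
#include <linux/string.h>

#define TSM_MEAS_MAX 64		/* hypothetical bound on measurement size */

/* hypothetical helpers standing in for real TSM plumbing */
int tsm_collect_measurement(struct pci_dev *pdev, u8 *buf, size_t len);
int tsm_lock_and_run(struct pci_dev *pdev);

static int tsm_ras_recover(struct pci_dev *pdev,
			   const u8 *accepted_meas, size_t len)
{
	u8 meas[TSM_MEAS_MAX];
	int rc;

	if (len > sizeof(meas))
		return -EINVAL;

	rc = pci_reset_function(pdev);		/* FLR the device */
	if (rc)
		return rc;

	rc = tsm_collect_measurement(pdev, meas, len);
	if (rc)
		return rc;

	/* unchanged measurement: safe to go back to T=1 automatically */
	if (memcmp(meas, accepted_meas, len))
		return -EPERM;			/* userspace must re-attest */

	return tsm_lock_and_run(pdev);		/* hypothetical: LOCK, RUN */
}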
<dan.j.williams@intel.com> writes:

> Jason Gunthorpe wrote:
>> On Thu, Jul 31, 2025 at 07:07:17PM -0700, dan.j.williams@intel.com wrote:
>> > Aneesh Kumar K.V (Arm) wrote:
>> > > Host:
>> > > step 1.
>> > > echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
>> > > echo vfio-pci > /sys/bus/pci/devices/${DEVICE}/driver_override
>> > > echo ${DEVICE} > /sys/bus/pci/drivers_probe
>> > >
>> > > step 2.
>> > > echo 1 > /sys/bus/pci/devices/$DEVICE/tsm/connect
>> >
>> > Just for my own understanding... presumably there is no ordering
>> > constraint for ARM CCA between step 1 and step 2, right? I.e. the
>> > connect state is independent of the bind state.
>> >
>> > In the v4 PCI/TSM scheme the connect command is now:
>> >
>> > echo $tsm_dev > /sys/bus/pci/devices/$DEVICE/tsm/connect
>>
>> What does this do on the host? It seems to somehow prep it for VM
>> assignment? Seems pretty strange this is here in sysfs and not part of
>> creating the vPCI function in the VM through VFIO and iommufd?
>
> vPCI is out of the picture at this phase.
>
> On the host this establishes an SPDM session and sets up link encryption
> (IDE) with the physical device. Leave VMs out of the picture; this
> capability in isolation is a useful property. It addresses a similar
> threat model to the one Intel Total Memory Encryption (TME) or AMD
> Secure Memory Encryption (SME) go after, i.e. an interposer on a
> physical link capturing data in flight.
>
> With that established, one can then go further to do the full TDISP
> dance.
>
>> Frankly, I'm nervous about making any uAPI whatsoever for the
>> hypervisor side at this point. I don't think we have enough of the
>> solution even in draft format. I'd really like your first merged TSM
>> series to only have uAPI for the guest side where things are hopefully
>> closer to complete.
>
> Aligned. I am not comfortable merging any of this until we have that end
> to end reliably stable for a kernel cycle or 2. The proposal is to soak
> all the vendor solutions together in tsm.git#staging.
>
> Now, if the guest side graduates out of that staging before the host
> side, I am ok with that.
>
>> > > step 1:
>> > > echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
>> > >
>> > > step 2: Move the device to TDISP LOCK state
>> > > echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/lock
>> >
>> > Ok, so my stance has recently picked up some nuance here. As Jason
>> > mentions here:
>> >
>> > http://lore.kernel.org/20250410235008.GC63245@ziepe.ca
>> >
>> > "However it works, it should be done before the driver is probed and
>> > remain stable for the duration of the driver attachment. From the
>> > iommu side the correct iommu domain, on the correct IOMMU instance to
>> > handle the expected traffic should be setup as the DMA API's iommu
>> > domain."
>>
>> I think it is not just the DMA API; the MMIO registers may also move
>> location (from shared to protected IPA space, for example), meaning any
>> attached driver is completely wrecked.
>
> True.
>
>> > I agree with that up until the point where the implication is userspace
>> > control of the UNLOCKED->LOCKED transition. That transition requires
>> > enabling bus-mastering (BME),
>>
>> Why? That's sad. BME should be controlled by the VM driver, not the
>> TSM, and it should be set only when a VM driver is probed against the
>> RUN-state device?
>
> To me it is an unfortunate PCI specification wrinkle that writing to the
> command register drops the device from RUN to ERROR. So you can LOCK
> without setting BME, but then no DMA.
This is only w.r.t. clearing BME, isn't it?

According to section 11.2.6 "DSM Tracking and Handling of Locked TDI
Configurations":

	Clearing any of the following bits causes the TDI hosted
	by the Function to transition to ERROR:

	• Memory Space Enable
	• Bus Master Enable

Which implies the flow described in the cover letter, where the driver
enables BME, works? However, clearing BME may be problematic. I did have
a FIXME!!/comment about this in [1]:

vfio_pci_core_close_device():

#if 0
	/*
	 * Destroy the vdevice, which involves TSM unbind, before we do the
	 * PCI disable. An MSE/BME clear will transition the device to the
	 * error state.
	 */
	if (core_vdev->iommufd_device)
		iommufd_device_tombstone_vdevice(core_vdev->iommufd_device);
#endif

	vfio_pci_core_disable(vdev);

Currently, we destroy (TSM unbind) the vdevice after calling
vfio_pci_core_disable(), which means BME is cleared before unbinding,
and the TDI transitions to the ERROR state.

[1] https://lore.kernel.org/all/20250728135216.48084-9-aneesh.kumar@kernel.org/

-aneesh
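A minimal sketch of the reordered teardown the FIXME points at; this
just follows the #if 0 fragment above rather than any merged code, and
assumes the iommufd_device_tombstone_vdevice() helper from [1]:

	#include <linux/vfio_pci_core.h>

	static void vfio_pci_core_close_device_sketch(struct vfio_device *core_vdev)
	{
		struct vfio_pci_core_device *vdev =
			container_of(core_vdev, struct vfio_pci_core_device, vdev);

		/*
		 * Destroy the vdevice (TSM unbind) first, while MSE/BME are
		 * still set, so the clear in vfio_pci_core_disable() hits an
		 * already-unbound TDI instead of kicking a bound one to ERROR.
		 */
		if (core_vdev->iommufd_device)
			iommufd_device_tombstone_vdevice(core_vdev->iommufd_device);

		vfio_pci_core_disable(vdev);
	}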
On Tue, Aug 05, 2025 at 10:37:01AM +0530, Aneesh Kumar K.V wrote:
> > To me it is an unfortunate PCI specification wrinkle that writing to the
> > command register drops the device from RUN to ERROR. So you can LOCK
> > without setting BME, but then no DMA.
>
> This is only w.r.t. clearing BME, isn't it?
>
> According to section 11.2.6 "DSM Tracking and Handling of Locked TDI
> Configurations":
>
> 	Clearing any of the following bits causes the TDI hosted
> 	by the Function to transition to ERROR:
>
> 	• Memory Space Enable
> 	• Bus Master Enable

Oh that's nice, yeah!

> Which implies the flow described in the cover letter, where the driver
> enables BME, works? However, clearing BME may be problematic. I did have
> a FIXME!!/comment about this in [1]:
>
> vfio_pci_core_close_device():
>
> #if 0
> 	/*
> 	 * Destroy the vdevice, which involves TSM unbind, before we do the
> 	 * PCI disable. An MSE/BME clear will transition the device to the
> 	 * error state.
> 	 */
> 	if (core_vdev->iommufd_device)
> 		iommufd_device_tombstone_vdevice(core_vdev->iommufd_device);
> #endif
>
> 	vfio_pci_core_disable(vdev);

Here is where I feel the VMM should be trapping this and NOPing it, or
failing that the guest PCI core should NOP it.

With the ideal version being that the TSM and VMM would be able to block
the IOMMU as a functional stand-in for BME.

> Currently, we destroy (TSM unbind) the vdevice after calling
> vfio_pci_core_disable(), which means BME is cleared before unbinding,
> and the TDI transitions to the ERROR state.

I don't think this ordering is deliberate; we can destroy the vdevice
much earlier??

Jason
Jason Gunthorpe wrote:
> On Tue, Aug 05, 2025 at 10:37:01AM +0530, Aneesh Kumar K.V wrote:
> > > To me it is an unfortunate PCI specification wrinkle that writing to the
> > > command register drops the device from RUN to ERROR. So you can LOCK
> > > without setting BME, but then no DMA.
> >
> > This is only w.r.t. clearing BME, isn't it?
> >
> > According to section 11.2.6 "DSM Tracking and Handling of Locked TDI
> > Configurations":
> >
> > 	Clearing any of the following bits causes the TDI hosted
> > 	by the Function to transition to ERROR:
> >
> > 	• Memory Space Enable
> > 	• Bus Master Enable
>
> Oh that's nice, yeah!

That is useful, but an unmodified PCI driver is going to make separate
calls to pci_set_master() and pci_enable_device(), so it should still be
the case that those need to be trapped, out of the concern that writing
back zero in a read-modify-write also trips the error state on some
device that fails the Robustness Principle.

I guess we could wait to solve that problem until we encounter the first
device that trips ERROR when writing zero to an already zeroed bit.

> > Which implies the flow described in the cover letter, where the driver
> > enables BME, works? However, clearing BME may be problematic. I did have
> > a FIXME!!/comment about this in [1]:
> >
> > vfio_pci_core_close_device():
> >
> > #if 0
> > 	/*
> > 	 * Destroy the vdevice, which involves TSM unbind, before we do the
> > 	 * PCI disable. An MSE/BME clear will transition the device to the
> > 	 * error state.
> > 	 */
> > 	if (core_vdev->iommufd_device)
> > 		iommufd_device_tombstone_vdevice(core_vdev->iommufd_device);
> > #endif
> >
> > 	vfio_pci_core_disable(vdev);
>
> Here is where I feel the VMM should be trapping this and NOPing it, or
> failing that the guest PCI core should NOP it.

At this point (vfio shutdown path) the VMM is committed to stopping
guest operations with the device. So it is ok not to NOP in this
specific path, right?

> With the ideal version being that the TSM and VMM would be able to block
> the IOMMU as a functional stand-in for BME.

The TSM block for BME is the LOCKED or ERROR state. That would be in
conflict with the proposal that the device stays in the RUN state on
guest driver unbind.

I feel like either the device stays in RUN state and BME leaks, or the
device is returned to LOCKED on driver unbind. Otherwise a functional
stand-in for BME that also keeps the device in RUN state feels like a
TSM feature request for a "RUN but BLOCKED" state.
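A sketch of what that trap could look like in a guest PCI core, assuming
a hypothetical pci_dev_is_tdisp_locked() predicate; the config accessors
and PCI_COMMAND bits are existing kernel APIs:

	#include <linux/pci.h>

	bool pci_dev_is_tdisp_locked(struct pci_dev *pdev);	/* assumed */

	/*
	 * Swallow command register writes that leave MSE/BME unchanged,
	 * so a read-modify-write of an already-zero bit cannot trip a
	 * fragile device into the ERROR state.
	 */
	static int pci_tdisp_filter_command_write(struct pci_dev *pdev, u16 new_cmd)
	{
		u16 cur;

		if (!pci_dev_is_tdisp_locked(pdev))
			return pci_write_config_word(pdev, PCI_COMMAND, new_cmd);

		pci_read_config_word(pdev, PCI_COMMAND, &cur);

		/* NOP: nothing in MSE/BME actually changes */
		if (!((cur ^ new_cmd) & (PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER)))
			return 0;

		/* A real MSE/BME change is the policy question discussed above */
		return pci_write_config_word(pdev, PCI_COMMAND, new_cmd);
	}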
On Tue, Aug 05, 2025 at 11:27:36AM -0700, dan.j.williams@intel.com wrote:
> > > Clearing any of the following bits causes the TDI hosted
> > > by the Function to transition to ERROR:
> > >
> > > • Memory Space Enable
> > > • Bus Master Enable
> >
> > Oh that's nice, yeah!
>
> That is useful, but an unmodified PCI driver is going to make separate
> calls to pci_set_master() and pci_enable_device(), so it should still be
> the case that those need to be trapped, out of the concern that writing
> back zero in a read-modify-write also trips the error state on some
> device that fails the Robustness Principle.

I hope we don't RMW BME and MSE in some weird way like that :(

> > Here is where I feel the VMM should be trapping this and NOPing it, or
> > failing that the guest PCI core should NOP it.
>
> At this point (vfio shutdown path) the VMM is committed to stopping
> guest operations with the device. So it is ok not to NOP in this
> specific path, right?

What I said in my other mail was that the T=1 state should have nothing
to do with driver binding. So unbinding vfio should leave the device in
the RUN state just fine.

> > With the ideal version being that the TSM and VMM would be able to block
> > the IOMMU as a functional stand-in for BME.
>
> The TSM block for BME is the LOCKED or ERROR state. That would be in
> conflict with the proposal that the device stays in the RUN state on
> guest driver unbind.

This is a different thing. Leaving RUN says the OS (especially
userspace) does not trust the device.

Disabling DMA, on explicit trusted request from the cVM, is entirely
fine to do inside the T=1 state. PCI made it so the only way to do this
is with the IOMMU; oh well, so be it.

> I feel like either the device stays in RUN state and BME leaks, or the
> device is returned to LOCKED on driver unbind.

Stay in RUN is my vote. I can't really defend the other choice from a
Linux driver model perspective.

> Otherwise a functional stand-in for BME that also keeps the device
> in RUN state feels like a TSM feature request for a "RUN but
> BLOCKED" state.

Yes, and probably not necessary; more of a defence-in-depth against bugs
kind of request. For Linux we would like it if the device can be in RUN
and have DMA blocked off during all times when no driver is attached.

Jason
Jason Gunthorpe wrote:
> On Tue, Aug 05, 2025 at 11:27:36AM -0700, dan.j.williams@intel.com wrote:
> > > > Clearing any of the following bits causes the TDI hosted
> > > > by the Function to transition to ERROR:
> > > >
> > > > • Memory Space Enable
> > > > • Bus Master Enable
> > >
> > > Oh that's nice, yeah!
> >
> > That is useful, but an unmodified PCI driver is going to make separate
> > calls to pci_set_master() and pci_enable_device(), so it should still be
> > the case that those need to be trapped, out of the concern that writing
> > back zero in a read-modify-write also trips the error state on some
> > device that fails the Robustness Principle.
>
> I hope we don't RMW BME and MSE in some weird way like that :(

Yeah, I would like to say, "device, you get to keep the pieces if you
transition to ERROR state on re-writing an already zeroed bit."

> > > Here is where I feel the VMM should be trapping this and NOPing it, or
> > > failing that the guest PCI core should NOP it.
> >
> > At this point (vfio shutdown path) the VMM is committed to stopping
> > guest operations with the device. So it is ok not to NOP in this
> > specific path, right?
>
> What I said in my other mail was that the T=1 state should have nothing
> to do with driver binding.

Guest driver unbind, agree.

> So unbinding vfio should leave the device in the RUN state just fine.

Perhaps my vfio inexperience is showing, but at the point where the VMM
is unbinding vfio it is committed to destroying the guest's assigned
device context, no? So should that not be the point where continuing to
maintain the RUN state ends?

> > > With the ideal version being that the TSM and VMM would be able to block
> > > the IOMMU as a functional stand-in for BME.
> >
> > The TSM block for BME is the LOCKED or ERROR state. That would be in
> > conflict with the proposal that the device stays in the RUN state on
> > guest driver unbind.
>
> This is a different thing. Leaving RUN says the OS (especially
> userspace) does not trust the device.
>
> Disabling DMA, on explicit trusted request from the cVM, is entirely
> fine to do inside the T=1 state. PCI made it so the only way to do this
> is with the IOMMU; oh well, so be it.
>
> > I feel like either the device stays in RUN state and BME leaks, or the
> > device is returned to LOCKED on driver unbind.
>
> Stay in RUN is my vote. I can't really defend the other choice from a
> Linux driver model perspective.
>
> > Otherwise a functional stand-in for BME that also keeps the device
> > in RUN state feels like a TSM feature request for a "RUN but
> > BLOCKED" state.
>
> Yes, and probably not necessary; more of a defence-in-depth against bugs
> kind of request. For Linux we would like it if the device can be in RUN
> and have DMA blocked off during all times when no driver is attached.

Ok, defense in depth, but in the meantime rely on unbound driver == DMA
unmapped and device should be quiescent. Combine that with the fact that
userspace PCI drivers should be disabled in cVMs, and the guest can
expect that an unbound TDI in the RUN state will remain quiet.
On Tue, Aug 05, 2025 at 12:06:11PM -0700, dan.j.williams@intel.com wrote:
> > So unbinding vfio should leave the device in the RUN state just fine.
>
> Perhaps my vfio inexperience is showing, but at the point where the VMM
> is unbinding vfio it is committed to destroying the guest's assigned
> device context, no? So should that not be the point where continuing to
> maintain the RUN state ends?

Oh, sorry, it gets so confusing..

VFIO *in the guest* should behave as above; like any other driver,
unbind leaves it in RUN.

VFIO *in the host* should leave the RUN state at the soonest of:

 - cVM's KVM is destroyed
 - iommufd vdevice is destroyed
 - vfio device is closed

And maybe more cases I didn't think of..

BME should happen strictly after all of the above and should not be the
trigger that drops it out of RUN.

> > Yes, and probably not necessary; more of a defence-in-depth against bugs
> > kind of request. For Linux we would like it if the device can be in RUN
> > and have DMA blocked off during all times when no driver is attached.
>
> Ok, defense in depth, but in the meantime rely on unbound driver == DMA
> unmapped and device should be quiescent. Combine that with the fact that
> userspace PCI drivers should be disabled in cVMs, and the guest can
> expect that an unbound TDI in the RUN state will remain quiet.

"userspace PCI drivers" is VFIO in the guest, which means you get FLRs
to fence the DMA.

If we end up where I suggested earlier for RAS, that a FLR can check the
attestation and, if it matches exactly, re-accept automatically, then it
would maintain the 'once accepted we stay in T=1 RUN state' idea.

Jason
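For concreteness, a minimal sketch of that "soonest of" rule on the host
side, assuming a tsm_unbind() entry point along the lines of the
tsm_bind/unbind helpers and an invented TDI_UNBOUND flag; the three
callers listed are the teardown events, not literal functions:

	#include <linux/bitops.h>

	#define TDI_UNBOUND 0			/* assumed flag bit */

	struct pci_tdi;				/* from the PCI/TSM series */
	void tsm_unbind(struct pci_tdi *tdi);	/* assumed prototype */

	struct tdi_state {
		struct pci_tdi *tdi;
		unsigned long flags;
	};

	/*
	 * Idempotent exit from RUN: whichever event fires first (cVM's KVM
	 * destroyed, iommufd vdevice destroyed, vfio device closed) does
	 * the unbind; later callers find the bit already set and return.
	 */
	static void tdi_leave_run(struct tdi_state *s)
	{
		if (test_and_set_bit(TDI_UNBOUND, &s->flags))
			return;

		/* Unbind strictly before any BME/MSE clear in PCI teardown */
		tsm_unbind(s->tdi);
	}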
<dan.j.williams@intel.com> writes:

> Aneesh Kumar K.V (Arm) wrote:
>> This patch series implements support for Device Assignment in the ARM CCA
>> architecture. The code changes are based on Alp12 specification published here
>> [1].
>>
>> The code builds on the TSM framework patches posted at [2]. We add extension to
>> that framework so that TSM is now used in both the host and the guest.
>>
>> A DA workflow can be summarized as below:
>>
>> Host:
>> step 1.
>> echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
>> echo vfio-pci > /sys/bus/pci/devices/${DEVICE}/driver_override
>> echo ${DEVICE} > /sys/bus/pci/drivers_probe
>>
>> step 2.
>> echo 1 > /sys/bus/pci/devices/$DEVICE/tsm/connect
>
> Just for my own understanding... presumably there is no ordering
> constraint for ARM CCA between step 1 and step 2, right? I.e. the
> connect state is independent of the bind state.
>
> In the v4 PCI/TSM scheme the connect command is now:
>
> echo $tsm_dev > /sys/bus/pci/devices/$DEVICE/tsm/connect
>
>> Now in the guest we follow the below steps
>
> I assume a significant amount of kvmtool magic happens here to get the
> TDI into a "bind capable" state, can you share that command?
>

lkvm run --realm -c 2 -m 256 -k /kselftest/Image -p "$KERNEL_PARAMS" \
	-d ./rootfs-guest.ext2 --iommufd-vdevice \
	--vfio-pci $DEVICE1 --vfio-pci $DEVICE2

> I had been assuming that everyone was prototyping with QEMU. Not a
> problem per se, but the memory management for shared device assignment /
> bounce buffering has had quite a bit of work on the QEMU side, so just
> curious about the difference in approach here. Like, does kvmtool
> support operating the device in shared mode with bounce buffering and
> page conversion (shared <=> private) support? In any event, happy to see
> multiple simultaneous consumers of this new kernel infrastructure.

-aneesh