This series implements mitigations for lack of DMA access control on
systems without an IOMMU, which could result in the DMA accessing the
system memory at unexpected times and/or unexpected addresses, possibly
leading to data leakage or corruption.
For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
not behind an IOMMU. As PCI-e, by design, gives the device full access to
system memory, a vulnerability in the Wi-Fi firmware could easily escalate
to a full system exploit (remote Wi-Fi exploits: [1a] and [1b] show a
full chain of exploits; see also [2] and [3]).
To mitigate the security concerns, we introduce restricted DMA. Restricted
DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
specially allocated region and does memory allocation from the same region.
The feature on its own provides a basic level of protection against the DMA
overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system needs
to provide a way to restrict the DMA to a predefined memory region (this is
usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]).
[1a] https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html
[1b] https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html
[2] https://blade.tencent.com/en/advisories/qualpwn/
[3] https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/
[4] https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
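To make the driver-side picture concrete, here is a minimal sketch (the
function and buffer names are illustrative only; the DMA API calls are the
standard kernel ones) of streaming DMA that is transparently bounced
through the restricted pool:

/*
 * Minimal sketch, not part of the series: a driver whose device sits
 * behind a restricted-dma-pool keeps using the ordinary DMA API; the
 * bounce through the per-device pool happens underneath
 * dma_map_single()/dma_unmap_single().
 */
#include <linux/dma-mapping.h>

static int example_send(struct device *dev, void *data, size_t len)
{
	dma_addr_t dma;

	/* With restricted DMA this returns an address inside the
	 * device's swiotlb pool; 'data' is bounced into that pool. */
	dma = dma_map_single(dev, data, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, dma))
		return -ENOMEM;

	/* ... program the hardware with 'dma' and wait for completion ... */

	dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);
	return 0;
}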
Claire Chang (14):
swiotlb: Remove external access to io_tlb_start
swiotlb: Move is_swiotlb_buffer() to swiotlb.c
swiotlb: Add struct swiotlb
swiotlb: Refactor swiotlb_late_init_with_tbl
swiotlb: Add DMA_RESTRICTED_POOL
swiotlb: Add restricted DMA pool
swiotlb: Update swiotlb API to gain a struct device argument
swiotlb: Use restricted DMA pool if available
swiotlb: Refactor swiotlb_tbl_{map,unmap}_single
dma-direct: Add a new wrapper __dma_direct_free_pages()
swiotlb: Add is_dev_swiotlb_force()
swiotlb: Add restricted DMA alloc/free support.
dt-bindings: of: Add restricted DMA pool
of: Add plumbing for restricted DMA pool
.../reserved-memory/reserved-memory.txt | 24 +
arch/powerpc/platforms/pseries/svm.c | 4 +-
drivers/iommu/dma-iommu.c | 12 +-
drivers/of/address.c | 25 +
drivers/of/device.c | 3 +
drivers/of/of_private.h | 5 +
drivers/xen/swiotlb-xen.c | 4 +-
include/linux/device.h | 4 +
include/linux/swiotlb.h | 32 +-
kernel/dma/Kconfig | 14 +
kernel/dma/direct.c | 51 +-
kernel/dma/direct.h | 8 +-
kernel/dma/swiotlb.c | 636 ++++++++++++------
13 files changed, 582 insertions(+), 240 deletions(-)
--
v4:
- Fix spinlock bad magic
- Use rmem->name for debugfs entry
- Address the comments in v3
v3:
Using only one reserved memory region for both streaming DMA and memory
allocation.
https://lore.kernel.org/patchwork/cover/1360992/
v2:
Building on top of swiotlb.
https://lore.kernel.org/patchwork/cover/1280705/
v1:
Using dma_map_ops.
https://lore.kernel.org/patchwork/cover/1271660/
2.30.0.478.g8a0d178c01-goog
v5 here: https://lore.kernel.org/patchwork/cover/1416899/ to rebase onto Christoph's swiotlb cleanups.
On Tue, 2021-02-09 at 14:21 +0800, Claire Chang wrote:
> This series implements mitigations for lack of DMA access control on
> systems without an IOMMU, which could result in the DMA accessing the
> system memory at unexpected times and/or unexpected addresses, possibly
> leading to data leakage or corruption.

Replying to an ancient (2021) thread which has already been merged...

I'd like to be able to use this facility for virtio devices.

Virtio already has a complicated relationship with the DMA API, because there were a bunch of early VMM bugs where the virtio devices were magically exempted from IOMMU protection, but the VMM lied to the guest and claimed they weren't.

With the advent of confidential computing, and the VMM (or whatever's emulating the virtio device) not being *allowed* to arbitrarily access all of the guest's memory, the DMA API becomes necessary again.

Either a virtual IOMMU needs to determine which guest memory the VMM may access, or the DMA API is a set of wrappers around operations which share/unshare (or unencrypt/encrypt) the memory in question.

All of which is complicated and slow, if we're looking at a minimal privileged hypervisor stub like pKVM which enforces the lack of guest memory access from the VMM.

I'm thinking of defining a new type of virtio-pci device which cannot do DMA to arbitrary system memory. Instead it has an additional memory BAR which is used as a SWIOTLB for bounce buffering.

The driver for it would look much like the existing virtio-pci device except that it would register the restricted-dma region first (and thus the swiotlb dma_ops), and then just go through the rest of the setup like any other virtio device.

That seems like it ought to be fairly simple, and seems like a reasonable way to allow an untrusted VMM to provide virtio devices with restricted DMA access.

While I start actually doing the typing... does anyone want to start yelling at me now? Christoph? mst? :)
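A minimal sketch of that probe ordering, purely for illustration; the
my_*() helpers below are hypothetical placeholders rather than existing
kernel or virtio APIs, and only the ordering is the point:

/*
 * Probe-order sketch only.  The my_*() helpers are hypothetical
 * placeholders; the point is the ordering: the bounce-buffer BAR must
 * become the device's restricted swiotlb pool before the normal
 * virtio-pci setup (and hence any DMA mapping) runs.
 */
#include <linux/pci.h>

int my_find_swiotlb_bar(struct pci_dev *pdev, phys_addr_t *base, resource_size_t *size);	/* hypothetical */
int my_register_restricted_pool(struct device *dev, phys_addr_t base, resource_size_t size);	/* hypothetical */
int my_virtio_pci_setup(struct pci_dev *pdev);	/* stands in for the existing probe logic */

static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	phys_addr_t base;
	resource_size_t size;
	int err;

	/* 1. Locate the extra memory BAR exposed for bounce buffering. */
	err = my_find_swiotlb_bar(pdev, &base, &size);
	if (err)
		return err;

	/* 2. Register it as this device's restricted DMA pool, so the
	 *    swiotlb-backed dma_ops are in place from the start. */
	err = my_register_restricted_pool(&pdev->dev, base, size);
	if (err)
		return err;

	/* 3. Carry on like any other virtio-pci device. */
	return my_virtio_pci_setup(pdev);
}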
On Fri, Mar 21, 2025 at 03:38:10PM +0000, David Woodhouse wrote:
> On Tue, 2021-02-09 at 14:21 +0800, Claire Chang wrote:
> > This series implements mitigations for lack of DMA access control on
> > systems without an IOMMU, which could result in the DMA accessing the
> > system memory at unexpected times and/or unexpected addresses, possibly
> > leading to data leakage or corruption.
>
> Replying to an ancient (2021) thread which has already been merged...
>
> I'd like to be able to use this facility for virtio devices.
>
> Virtio already has a complicated relationship with the DMA API, because there were a bunch of early VMM bugs where the virtio devices where magically exempted from IOMMU protection, but the VMM lied to the guest and claimed they weren't.
>
> With the advent of confidential computing, and the VMM (or whatever's emulating the virtio device) not being *allowed* to arbitrarily access all of the guest's memory, the DMA API becomes necessary again.
>
> Either a virtual IOMMU needs to determine which guest memory the VMM may access, or the DMA API is wrappers around operations which share/unshare (or unencrypt/encrypt) the memory in question.
>
> All of which is complicated and slow, if we're looking at a minimal privileged hypervisor stub like pKVM which enforces the lack of guest memory access from VMM.
>
> I'm thinking of defining a new type of virtio-pci device which cannot do DMA to arbitrary system memory. Instead it has an additional memory BAR which is used as a SWIOTLB for bounce buffering.
>
> The driver for it would look much like the existing virtio-pci device except that it would register the restricted-dma region first (and thus the swiotlb dma_ops), and then just go through the rest of the setup like any other virtio device.
>
> That seems like it ought to be fairly simple, and seems like a reasonable way to allow an untrusted VMM to provide virtio devices with restricted DMA access.
>
> While I start actually doing the typing... does anyone want to start yelling at me now? Christoph? mst? :)

I don't mind as such (though I don't understand completely), but since this is changing the device anyway, I am a bit confused why you can't just set the VIRTIO_F_ACCESS_PLATFORM feature bit? This forces DMA API which will DTRT for you, will it not?

--
MST
On Fri, 2025-03-21 at 14:32 -0400, Michael S. Tsirkin wrote:
> On Fri, Mar 21, 2025 at 03:38:10PM +0000, David Woodhouse wrote:
> > On Tue, 2021-02-09 at 14:21 +0800, Claire Chang wrote:
> > > This series implements mitigations for lack of DMA access control on
> > > systems without an IOMMU, which could result in the DMA accessing the
> > > system memory at unexpected times and/or unexpected addresses, possibly
> > > leading to data leakage or corruption.
> >
> > Replying to an ancient (2021) thread which has already been merged...
> >
> > I'd like to be able to use this facility for virtio devices.
> >
> > Virtio already has a complicated relationship with the DMA API, because there were a bunch of early VMM bugs where the virtio devices where magically exempted from IOMMU protection, but the VMM lied to the guest and claimed they weren't.
> >
> > With the advent of confidential computing, and the VMM (or whatever's emulating the virtio device) not being *allowed* to arbitrarily access all of the guest's memory, the DMA API becomes necessary again.
> >
> > Either a virtual IOMMU needs to determine which guest memory the VMM may access, or the DMA API is wrappers around operations which share/unshare (or unencrypt/encrypt) the memory in question.
> >
> > All of which is complicated and slow, if we're looking at a minimal privileged hypervisor stub like pKVM which enforces the lack of guest memory access from VMM.
> >
> > I'm thinking of defining a new type of virtio-pci device which cannot do DMA to arbitrary system memory. Instead it has an additional memory BAR which is used as a SWIOTLB for bounce buffering.
> >
> > The driver for it would look much like the existing virtio-pci device except that it would register the restricted-dma region first (and thus the swiotlb dma_ops), and then just go through the rest of the setup like any other virtio device.
> >
> > That seems like it ought to be fairly simple, and seems like a reasonable way to allow an untrusted VMM to provide virtio devices with restricted DMA access.
> >
> > While I start actually doing the typing... does anyone want to start yelling at me now? Christoph? mst? :)
>
> I don't mind as such (though I don't understand completely), but since this is changing the device anyway, I am a bit confused why you can't just set the VIRTIO_F_ACCESS_PLATFORM feature bit? This forces DMA API which will DTRT for you, will it not?

That would be necessary but not sufficient. The question is *what* does the DMA API do?

For a real passthrough PCI device, perhaps we'd have a vIOMMU exposed to the guest so that it can do real protection with two-stage page tables (IOVA→GPA under control of the guest, GPA→HPA under control of the hypervisor). For that to work in the pKVM model though, you'd need pKVM to be walking the guest's stage1 I/O page tables to see if a given access from the VMM ought to be permitted?

Or for confidential guests there could be DMA ops which are an 'enlightenment'; a hypercall into pKVM to share/unshare pages so that the VMM can actually access them, or SEV-SNP guests might mark pages unencrypted to have the same effect with hardware protection.

Doing any of those dynamically to allow the VMM to access buffers in arbitrary guest memory (when it wouldn't normally have access to arbitrary guest memory) is complex and doesn't perform very well. And exposes a full 4KiB page for any byte that needs to be made available.

Thus the idea of having a fixed range of memory to use for a SWIOTLB, which is fairly much what the restricted DMA setup is all about.

We're just proposing that we build it in to a virtio-pci device model, which automatically uses the extra memory BAR instead of the restricted-dma-pool DT node.

It's basically just allowing us to expose through PCI, what I believe we can already do for virtio in DT.
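For contrast, a sketch of the dynamic share/unshare approach dismissed
above; pkvm_share_page() is a hypothetical hypercall wrapper, and the
point is the per-page granularity and per-page cost:

/*
 * Sketch of the dynamic 'share/unshare' enlightenment, for contrast;
 * pkvm_share_page() is a hypothetical hypercall wrapper.  Note the
 * granularity: every 4KiB page touched by the buffer has to be
 * exposed to the VMM, however small the buffer is, and a matching
 * unshare is needed again on unmap.
 */
#include <linux/mm.h>
#include <linux/io.h>

void pkvm_share_page(phys_addr_t phys);	/* hypothetical hypercall wrapper */

static void example_share_for_dma(void *buf, size_t len)
{
	unsigned long addr = (unsigned long)buf & PAGE_MASK;
	unsigned long end = PAGE_ALIGN((unsigned long)buf + len);

	for (; addr < end; addr += PAGE_SIZE)
		pkvm_share_page(virt_to_phys((void *)addr));	/* one hypercall per page */
}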
On Fri, Mar 21, 2025 at 06:42:20PM +0000, David Woodhouse wrote:
> On Fri, 2025-03-21 at 14:32 -0400, Michael S. Tsirkin wrote:
> > On Fri, Mar 21, 2025 at 03:38:10PM +0000, David Woodhouse wrote:
> > > On Tue, 2021-02-09 at 14:21 +0800, Claire Chang wrote:
> > > > This series implements mitigations for lack of DMA access control on
> > > > systems without an IOMMU, which could result in the DMA accessing the
> > > > system memory at unexpected times and/or unexpected addresses, possibly
> > > > leading to data leakage or corruption.
> > >
> > > Replying to an ancient (2021) thread which has already been merged...
> > >
> > > I'd like to be able to use this facility for virtio devices.
> > >
> > > Virtio already has a complicated relationship with the DMA API, because there were a bunch of early VMM bugs where the virtio devices where magically exempted from IOMMU protection, but the VMM lied to the guest and claimed they weren't.
> > >
> > > With the advent of confidential computing, and the VMM (or whatever's emulating the virtio device) not being *allowed* to arbitrarily access all of the guest's memory, the DMA API becomes necessary again.
> > >
> > > Either a virtual IOMMU needs to determine which guest memory the VMM may access, or the DMA API is wrappers around operations which share/unshare (or unencrypt/encrypt) the memory in question.
> > >
> > > All of which is complicated and slow, if we're looking at a minimal privileged hypervisor stub like pKVM which enforces the lack of guest memory access from VMM.
> > >
> > > I'm thinking of defining a new type of virtio-pci device which cannot do DMA to arbitrary system memory. Instead it has an additional memory BAR which is used as a SWIOTLB for bounce buffering.
> > >
> > > The driver for it would look much like the existing virtio-pci device except that it would register the restricted-dma region first (and thus the swiotlb dma_ops), and then just go through the rest of the setup like any other virtio device.
> > >
> > > That seems like it ought to be fairly simple, and seems like a reasonable way to allow an untrusted VMM to provide virtio devices with restricted DMA access.
> > >
> > > While I start actually doing the typing... does anyone want to start yelling at me now? Christoph? mst? :)
> >
> > I don't mind as such (though I don't understand completely), but since this is changing the device anyway, I am a bit confused why you can't just set the VIRTIO_F_ACCESS_PLATFORM feature bit? This forces DMA API which will DTRT for you, will it not?
>
> That would be necessary but not sufficient. The question is *what* does the DMA API do?
>
> For a real passthrough PCI device, perhaps we'd have a vIOMMU exposed to the guest so that it can do real protection with two-stage page tables (IOVA→GPA under control of the guest, GPA→HPA under control of the hypervisor). For that to work in the pKVM model though, you'd need pKVM to be talking the guest's stage1 I/O page tables to see if a given access from the VMM ought to be permitted?
>
> Or for confidential guests there could be DMA ops which are an 'enlightenment'; a hypercall into pKVM to share/unshare pages so that the VMM can actually access them, or SEV-SNP guests might mark pages unencrypted to have the same effect with hardware protection.
>
> Doing any of those dynamically to allow the VMM to access buffers in arbitrary guest memory (when it wouldn't normally have access to arbitrary guest memory) is complex and doesn't perform very well. And exposes a full 4KiB page for any byte that needs to be made available.
>
> Thus the idea of having a fixed range of memory to use for a SWIOTLB, which is fairly much what the restricted DMA setup is all about.
>
> We're just proposing that we build it in to a virtio-pci device model, which automatically uses the extra memory BAR instead of the restricted-dma-pool DT node.
>
> It's basically just allowing us to expose through PCI, what I believe we can already do for virtio in DT.

I am not saying I am against this extension. The idea to restrict DMA has a lot of merit outside pkvm. For example, with a physical device, limiting its DMA to a fixed range can be good for security at a cost of an extra data copy.

So I am not saying we have to block this specific hack.

What worries me fundamentally is I am not sure it works well e.g. for physical virtio cards.

Attempts to pass data between devices will now also require extra data copies.

Did you think about adding an swiotlb mode to virtio-iommu at all? Much easier than parsing page tables.

--
MST
On 30 March 2025 18:06:47 BST, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> It's basically just allowing us to expose through PCI, what I believe
>> we can already do for virtio in DT.
>
>I am not saying I am against this extension.
>The idea to restrict DMA has a lot of merit outside pkvm.
>For example, with a physical devices, limiting its DMA
>to a fixed range can be good for security at a cost of
>an extra data copy.
>
>So I am not saying we have to block this specific hack.
>
>what worries me fundamentally is I am not sure it works well
>e.g. for physical virtio cards.

Not sure why it doesn't work for physical cards. They don't need to be bus-mastering; they just take data from a buffer in their own RAM.

>Attempts to pass data between devices will now also require
>extra data copies.

Yes. I think that's acceptable, but if we really cared we could perhaps extend the capability to refer to a range inside a given BAR on a specific *device*? Or maybe just *function*, and allow sharing of SWIOTLB buffer within a multi-function device?

I think it's overkill though.

>Did you think about adding an swiotlb mode to virtio-iommu at all?
>Much easier than parsing page tables.

Often the guests which need this will have a real IOMMU for the true pass-through devices. Adding a virtio-iommu into the mix (or any other system-wide way of doing something different for certain devices) is problematic.

The on-device buffer keeps it nice and simple, and even allows us to do device support for operating systems like Windows where it's a lot harder to do anything generic in the core OS.
On Sun, Mar 30, 2025 at 10:27:58PM +0100, David Woodhouse wrote:
> On 30 March 2025 18:06:47 BST, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> It's basically just allowing us to expose through PCI, what I believe
> >> we can already do for virtio in DT.
> >
> >I am not saying I am against this extension.
> >The idea to restrict DMA has a lot of merit outside pkvm.
> >For example, with a physical devices, limiting its DMA
> >to a fixed range can be good for security at a cost of
> >an extra data copy.
> >
> >So I am not saying we have to block this specific hack.
> >
> >what worries me fundamentally is I am not sure it works well
> >e.g. for physical virtio cards.
>
> Not sure why it doesn't work for physical cards. They don't need to be bus-mastering; they just take data from a buffer in their own RAM.

I mean, it kind of does, it is just that the CPU pulling data over the PCI bus stalls it, so it is very expensive. It is not by chance that people switched to DMA almost exclusively.

> >Attempts to pass data between devices will now also require
> >extra data copies.
>
> Yes. I think that's acceptable, but if we really cared we could perhaps extend the capability to refer to a range inside a given BAR on a specific *device*? Or maybe just *function*, and allow sharing of SWIOTLB buffer within a multi-function device?

Fundamentally, this is what dmabuf does.

> I think it's overkill though.
>
> >Did you think about adding an swiotlb mode to virtio-iommu at all?
> >Much easier than parsing page tables.
>
> Often the guests which need this will have a real IOMMU for the true pass-through devices.

Not sure I understand. You mean with things like stage 2 passthrough?

> Adding a virtio-iommu into the mix (or any other system-wide way of doing something different for certain devices) is problematic.

OK... but the issue isn't specific to no DMA devices, is it?

> The on-device buffer keeps it nice and simple,

I am not saying it is not. It's just a little boutique.

> and even allows us to do device support for operating systems like Windows where it's a lot harder to do anything generic in the core OS.

Well we do need virtio-iommu Windows support sooner or later, anyway.

--
MST
On Sun, 2025-03-30 at 17:48 -0400, Michael S. Tsirkin wrote:
> On Sun, Mar 30, 2025 at 10:27:58PM +0100, David Woodhouse wrote:
> > On 30 March 2025 18:06:47 BST, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > It's basically just allowing us to expose through PCI, what I believe
> > > > we can already do for virtio in DT.
> > >
> > > I am not saying I am against this extension.
> > > The idea to restrict DMA has a lot of merit outside pkvm.
> > > For example, with a physical devices, limiting its DMA
> > > to a fixed range can be good for security at a cost of
> > > an extra data copy.
> > >
> > > So I am not saying we have to block this specific hack.
> > >
> > > what worries me fundamentally is I am not sure it works well
> > > e.g. for physical virtio cards.
> >
> > Not sure why it doesn't work for physical cards. They don't need to be bus-mastering; they just take data from a buffer in their own RAM.
>
> I mean, it kind of does, it is just that CPU pulling data over the PCI bus stalls it so is very expensive. It is not by chance people switched to DMA almost exclusively.

Yes. For a physical implementation it would not be the most high-performance option... unless DMA is somehow blocked as it is in the pKVM+virt case.

In the case of a virtual implementation, however, the performance is not an issue because it'll be backed by host memory anyway. (It's just that because it's presented to the guest and the trusted part of the hypervisor as PCI BAR space instead of main memory, it's a whole lot more practical to deal with the fact that it's *shared* with the VMM.)

> > > Attempts to pass data between devices will now also require
> > > extra data copies.
> >
> > Yes. I think that's acceptable, but if we really cared we could perhaps extend the capability to refer to a range inside a given BAR on a specific *device*? Or maybe just *function*, and allow sharing of SWIOTLB buffer within a multi-function device?
>
> Fundamentally, this is what dmabuf does.

In software, yes. Extending it to hardware is a little harder.

In principle, it might be quite nice to offer a single SWIOTLB buffer region (in a BAR of one device) and have multiple virtio devices share it. Not just because of passing data between devices, as you mentioned, but also because it'll be a more efficient use of memory than each device having its own buffer and allocation pool.

So how would a device indicate that it can use a SWIOTLB buffer which is in a BAR of a *different* device? Not by physical address, because BARs get moved around. Not even by PCI bus/dev/fn/BAR# because *buses* get renumbered.

You could limit it to sharing within one PCI "bus", and use just dev/fn/BAR#? Or even within one PCI device and just fn/BAR#? The latter could theoretically be usable by multi-function physical devices.

The standard struct virtio_pci_cap (which I used for VIRTIO_PCI_CAP_SWIOTLB) just contains BAR and offset/length. We could extend it with device + function, using -1 for 'self', to allow for such sharing? Still not convinced it isn't overkill, but it's certainly easy enough to add on the *spec* side.

I haven't yet looked at how that sharing would work in Linux on the guest side; thus far what I'm proposing is intended to be almost identical to the per-device thing that should already work with a `restricted-dma-pool' node in device-tree.

> > I think it's overkill though.
> >
> > > Did you think about adding an swiotlb mode to virtio-iommu at all?
> > > Much easier than parsing page tables.
> >
> > Often the guests which need this will have a real IOMMU for the true pass-through devices.
>
> Not sure I understand. You mean with things like stage 2 passthrough?

Yes. AMD's latest IOMMU spec documents it, for example. Exposing a 'vIOMMU' to the guest which handles just stage 1 (IOVA→GPA) while the hypervisor controls the normal GPA→HPA translation in stage 2.

Then the guest gets an accelerated path *directly* to the hardware for its IOTLB flushes... which means the hypervisor doesn't get to *see* those IOTLB flushes so it's a PITA to do device emulation as if it's covered by that same IOMMU. (Actually I haven't checked the AMD one in detail for that flaw; most *other* 2-stage IOMMUs I've seen do have it, and I *bet* AMD does too).

> > Adding a virtio-iommu into the mix (or any other system-wide way of doing something different for certain devices) is problematic.
>
> OK... but the issue isn't specific to no DMA devices, is it?

Hm? Allowing virtio devices to operate as "no-DMA devices" is a *workaround* for the issue. The issue is that the VMM may not have full access to the guest's memory for emulating devices. These days, virtio covers a large proportion of emulated devices. So I do think the issue is fairly specific to virtio devices, and suspect that's what you meant to type above?

We pondered teaching the trusted part of the hypervisor (e.g. pKVM) to snoop on virtqueues enough to 'know' which memory the VMM was genuinely being *invited* to read/write... and we ran away screaming. (In order to have sufficient trust, you end up not just snooping but implementing quite a lot of the emulation on the trusted side. And then complex enlightenments in the VMM and the untrusted Linux/KVM which hosts it, to interact with that.)

Then we realised that for existing DT guests it's trivial just to add the `restricted-dma-pool` node. And wanted to do the same for the guests who are afflicted with UEFI/ACPI too. So here we are, trying to add the same capability to virtio-pci.

> > The on-device buffer keeps it nice and simple,
>
> I am not saying it is not.
> It's just a little boutique.

Fair. Although with the advent of confidential computing and restrictions on guest memory access, perhaps becoming less boutique over time?

And it should also be fairly low-friction; it's a whole lot cleaner in the spec than the awful VIRTIO_F_ACCESS_PLATFORM legacy, and even in the Linux guest driver it should work fairly simply given the existing restricted-dma support (although of course that shouldn't entirely be our guiding motivation).

> > and even allows us to do device support for operating systems like Windows where it's a lot harder to do anything generic in the core OS.
>
> Well we do need virtio iommu windows support sooner or later, anyway.

Heh, good luck with that :)

And actually, doesn't that only support *DMA* remapping? So you still wouldn't be able to boot a Windows guest with >255 vCPUs without some further enlightenment (like Windows guests finally supporting the 15-bit MSI extension that even Hyper-V supports on the host side...)
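For reference, the capability layout being discussed: first the generic
structure as defined in the virtio spec, then a sketch of the extension
floated above (the extended structure and its field names are not part of
any spec):

#include <linux/types.h>

/* Generic capability layout as defined in the virtio spec. */
struct virtio_pci_cap {
	__u8 cap_vndr;		/* Generic PCI field: PCI_CAP_ID_VNDR */
	__u8 cap_next;		/* Generic PCI field: next ptr */
	__u8 cap_len;		/* Generic PCI field: capability length */
	__u8 cfg_type;		/* Identifies the structure */
	__u8 bar;		/* Where to find it */
	__u8 id;		/* Multiple capabilities of the same type */
	__u8 padding[2];	/* Pad to a full dword */
	__le32 offset;		/* Offset within bar */
	__le32 length;		/* Length of the structure, in bytes */
};

/* Hypothetical extension sketched above, not part of any spec:
 * identify whose BAR holds the shared bounce buffer, with 0xff
 * (i.e. -1) in both fields meaning 'this device/function'. */
struct virtio_pci_swiotlb_cap {
	struct virtio_pci_cap cap;
	__u8 owner_device;	/* PCI device number of the BAR's owner */
	__u8 owner_function;	/* PCI function number of the BAR's owner */
	__u8 padding[2];
};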
On Fri, 2025-03-21 at 18:42 +0000, David Woodhouse wrote:
> On Fri, 2025-03-21 at 14:32 -0400, Michael S. Tsirkin wrote:
> > On Fri, Mar 21, 2025 at 03:38:10PM +0000, David Woodhouse wrote:
> > > On Tue, 2021-02-09 at 14:21 +0800, Claire Chang wrote:
> > > > This series implements mitigations for lack of DMA access control on
> > > > systems without an IOMMU, which could result in the DMA accessing the
> > > > system memory at unexpected times and/or unexpected addresses, possibly
> > > > leading to data leakage or corruption.
> > >
> > > Replying to an ancient (2021) thread which has already been merged...
> > >
> > > I'd like to be able to use this facility for virtio devices.
> > >
> > > Virtio already has a complicated relationship with the DMA API, because
> > > there were a bunch of early VMM bugs where the virtio devices where
> > > magically exempted from IOMMU protection, but the VMM lied to the guest
> > > and claimed they weren't.
> > >
> > > With the advent of confidential computing, and the VMM (or whatever's
> > > emulating the virtio device) not being *allowed* to arbitrarily access
> > > all of the guest's memory, the DMA API becomes necessary again.
> > >
> > > Either a virtual IOMMU needs to determine which guest memory the VMM
> > > may access, or the DMA API is wrappers around operations which
> > > share/unshare (or unencrypt/encrypt) the memory in question.
> > >
> > > All of which is complicated and slow, if we're looking at a minimal
> > > privileged hypervisor stub like pKVM which enforces the lack of guest
> > > memory access from VMM.
> > >
> > > I'm thinking of defining a new type of virtio-pci device which cannot
> > > do DMA to arbitrary system memory. Instead it has an additional memory
> > > BAR which is used as a SWIOTLB for bounce buffering.
> > >
> > > The driver for it would look much like the existing virtio-pci device
> > > except that it would register the restricted-dma region first (and thus
> > > the swiotlb dma_ops), and then just go through the rest of the setup
> > > like any other virtio device.
> > >
> > > That seems like it ought to be fairly simple, and seems like a
> > > reasonable way to allow an untrusted VMM to provide virtio devices with
> > > restricted DMA access.
> > >
> > > While I start actually doing the typing... does anyone want to start
> > > yelling at me now? Christoph? mst? :)
> >
> >
> > I don't mind as such (though I don't understand completely), but since
> > this is changing the device anyway, I am a bit confused why you can't
> > just set the VIRTIO_F_ACCESS_PLATFORM feature bit? This forces DMA API
> > which will DTRT for you, will it not?
>
> That would be necessary but not sufficient. ...
My first cut at a proposed spec change looks something like this. I'll
post it to the virtio-comment list once I've done some corporate
bureaucracy and when the list stops sending me python tracebacks in
response to my subscribe request.
In the meantime I'll hack up some QEMU and guest Linux driver support
to match.
diff --git a/content.tex b/content.tex
index c17ffa6..1e6e1d6 100644
--- a/content.tex
+++ b/content.tex
@@ -773,6 +773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
Currently these device-independent feature bits are defined:
\begin{description}
+ \item[VIRTIO_F_SWIOTLB (27)] This feature indicates that the device
+ provides a memory region which is to be used for bounce buffering,
+ rather than permitting direct memory access to system memory.
\item[VIRTIO_F_INDIRECT_DESC (28)] Negotiating this feature indicates
that the driver can use descriptors with the VIRTQ_DESC_F_INDIRECT
flag set, as described in \ref{sec:Basic Facilities of a Virtio
@@ -885,6 +888,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
VIRTIO_F_ACCESS_PLATFORM is not offered, then a driver MUST pass only physical
addresses to the device.
+A driver SHOULD accept VIRTIO_F_SWIOTLB if it is offered, and it MUST
+then pass only addresses within the Software IOTLB bounce buffer to the
+device.
+
A driver SHOULD accept VIRTIO_F_RING_PACKED if it is offered.
A driver SHOULD accept VIRTIO_F_ORDER_PLATFORM if it is offered.
@@ -921,6 +928,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not
accepted.
+A device MUST NOT offer VIRTIO_F_SWIOTLB if its transport does not
+provide a Software IOTLB bounce buffer.
+A device MAY fail to operate further if VIRTIO_F_SWIOTLB is not accepted.
+
If VIRTIO_F_IN_ORDER has been negotiated, a device MUST use
buffers in the same order in which they have been available.
diff --git a/transport-pci.tex b/transport-pci.tex
index a5c6719..23e0d57 100644
--- a/transport-pci.tex
+++ b/transport-pci.tex
@@ -129,6 +129,7 @@ \subsection{Virtio Structure PCI Capabilities}\label{sec:Virtio Transport Option
\item ISR Status
\item Device-specific configuration (optional)
\item PCI configuration access
+\item SWIOTLB bounce buffer
\end{itemize}
Each structure can be mapped by a Base Address register (BAR) belonging to
@@ -188,6 +189,8 @@ \subsection{Virtio Structure PCI Capabilities}\label{sec:Virtio Transport Option
#define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
/* Vendor-specific data */
#define VIRTIO_PCI_CAP_VENDOR_CFG 9
+/* Software IOTLB bounce buffer */
+#define VIRTIO_PCI_CAP_SWIOTLB 10
\end{lstlisting}
Any other value is reserved for future use.
@@ -744,6 +747,36 @@ \subsubsection{Vendor data capability}\label{sec:Virtio
The driver MUST qualify the \field{vendor_id} before
interpreting or writing into the Vendor data capability.
+\subsubsection{Software IOTLB bounce buffer capability}\label{sec:Virtio
+Transport Options / Virtio Over PCI Bus / PCI Device Layout /
+Software IOTLB bounce buffer capability}
+
+The optional Software IOTLB bounce buffer capability allows the
+device to provide a memory region which can be used by the driver
+for bounce buffering. This allows a device on the PCI
+transport to operate without DMA access to system memory addresses.
+
+The Software IOTLB region is referenced by the
+VIRTIO_PCI_CAP_SWIOTLB capability. Bus addresses within the referenced
+range are not subject to the requirements of the VIRTIO_F_ORDER_PLATFORM
+capability, if negotiated.
+
+\devicenormative{\paragraph}{Software IOTLB bounce buffer capability}{Virtio
+Transport Options / Virtio Over PCI Bus / PCI Device Layout /
+Software IOTLB bounce buffer capability}
+
+Devices which present the Software IOTLB bounce buffer capability
+SHOULD also offer the VIRTIO_F_SWIOTLB feature.
+
+\drivernormative{\paragraph}{Software IOTLB bounce buffer capability}{Virtio
+Transport Options / Virtio Over PCI Bus / PCI Device Layout /
+Software IOTLB bounce buffer capability}
+
+The driver SHOULD use the offered buffer in preference to passing system
+memory addresses to the device. If the driver accepts the VIRTIO_F_SWIOTLB
+feature, then the driver MUST use the offered buffer and never pass system
+memory addresses to the device.
+
\subsubsection{PCI configuration access capability}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / PCI configuration access capability}
The VIRTIO_PCI_CAP_PCI_CFG capability
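For illustration, a guest driver would presumably locate the proposed
capability with the usual vendor-capability walk; a sketch only, using the
draft's cfg_type value of 10 and a made-up function name:

/*
 * Sketch only: walk the PCI vendor capabilities looking for the
 * proposed cfg_type 10 (VIRTIO_PCI_CAP_SWIOTLB in the draft above),
 * in the same style as the existing virtio-pci modern driver.
 */
#include <linux/pci.h>
#include <linux/virtio_pci.h>

static int example_find_swiotlb_cap(struct pci_dev *dev, u8 *bar, u32 *offset, u32 *length)
{
	int pos;

	for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
	     pos > 0;
	     pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
		u8 cfg_type;

		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap, cfg_type), &cfg_type);
		if (cfg_type != 10 /* draft VIRTIO_PCI_CAP_SWIOTLB */)
			continue;

		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap, bar), bar);
		pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, offset), offset);
		pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, length), length);
		return pos;	/* config space offset of the capability */
	}

	return -ENODEV;
}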
On Fri, Mar 28, 2025 at 05:40:41PM +0000, David Woodhouse wrote:
> On Fri, 2025-03-21 at 18:42 +0000, David Woodhouse wrote:
> > On Fri, 2025-03-21 at 14:32 -0400, Michael S. Tsirkin wrote:
> > > On Fri, Mar 21, 2025 at 03:38:10PM +0000, David Woodhouse wrote:
> > > > On Tue, 2021-02-09 at 14:21 +0800, Claire Chang wrote:
> > > > > This series implements mitigations for lack of DMA access control on
> > > > > systems without an IOMMU, which could result in the DMA accessing the
> > > > > system memory at unexpected times and/or unexpected addresses, possibly
> > > > > leading to data leakage or corruption.
> > > >
> > > > Replying to an ancient (2021) thread which has already been merged...
> > > >
> > > > I'd like to be able to use this facility for virtio devices.
> > > >
> > > > Virtio already has a complicated relationship with the DMA API, because
> > > > there were a bunch of early VMM bugs where the virtio devices where
> > > > magically exempted from IOMMU protection, but the VMM lied to the guest
> > > > and claimed they weren't.
> > > >
> > > > With the advent of confidential computing, and the VMM (or whatever's
> > > > emulating the virtio device) not being *allowed* to arbitrarily access
> > > > all of the guest's memory, the DMA API becomes necessary again.
> > > >
> > > > Either a virtual IOMMU needs to determine which guest memory the VMM
> > > > may access, or the DMA API is wrappers around operations which
> > > > share/unshare (or unencrypt/encrypt) the memory in question.
> > > >
> > > > All of which is complicated and slow, if we're looking at a minimal
> > > > privileged hypervisor stub like pKVM which enforces the lack of guest
> > > > memory access from VMM.
> > > >
> > > > I'm thinking of defining a new type of virtio-pci device which cannot
> > > > do DMA to arbitrary system memory. Instead it has an additional memory
> > > > BAR which is used as a SWIOTLB for bounce buffering.
> > > >
> > > > The driver for it would look much like the existing virtio-pci device
> > > > except that it would register the restricted-dma region first (and thus
> > > > the swiotlb dma_ops), and then just go through the rest of the setup
> > > > like any other virtio device.
> > > >
> > > > That seems like it ought to be fairly simple, and seems like a
> > > > reasonable way to allow an untrusted VMM to provide virtio devices with
> > > > restricted DMA access.
> > > >
> > > > While I start actually doing the typing... does anyone want to start
> > > > yelling at me now? Christoph? mst? :)
> > >
> > >
> > > I don't mind as such (though I don't understand completely), but since
> > > this is changing the device anyway, I am a bit confused why you can't
> > > just set the VIRTIO_F_ACCESS_PLATFORM feature bit? This forces DMA API
> > > which will DTRT for you, will it not?
> >
> > That would be necessary but not sufficient. ...
could you explain pls?
> My first cut at a proposed spec change looks something like this. I'll
> post it to the virtio-comment list once I've done some corporate
> bureaucracy and when the list stops sending me python tracebacks in
> response to my subscribe request.
the linux foundation one does this? maybe poke at the admins.
> In the meantime I'll hack up some QEMU and guest Linux driver support
> to match.
>
> diff --git a/content.tex b/content.tex
> index c17ffa6..1e6e1d6 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -773,6 +773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> Currently these device-independent feature bits are defined:
>
> \begin{description}
> + \item[VIRTIO_F_SWIOTLB (27)] This feature indicates that the device
> + provides a memory region which is to be used for bounce buffering,
> + rather than permitting direct memory access to system memory.
> \item[VIRTIO_F_INDIRECT_DESC (28)] Negotiating this feature indicates
> that the driver can use descriptors with the VIRTQ_DESC_F_INDIRECT
> flag set, as described in \ref{sec:Basic Facilities of a Virtio
> @@ -885,6 +888,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> VIRTIO_F_ACCESS_PLATFORM is not offered, then a driver MUST pass only physical
> addresses to the device.
>
> +A driver SHOULD accept VIRTIO_F_SWIOTLB if it is offered, and it MUST
> +then pass only addresses within the Software IOTLB bounce buffer to the
> +device.
> +
> A driver SHOULD accept VIRTIO_F_RING_PACKED if it is offered.
>
> A driver SHOULD accept VIRTIO_F_ORDER_PLATFORM if it is offered.
> @@ -921,6 +928,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not
> accepted.
>
> +A device MUST NOT offer VIRTIO_F_SWIOTLB if its transport does not
> +provide a Software IOTLB bounce buffer.
> +A device MAY fail to operate further if VIRTIO_F_SWIOTLB is not accepted.
> +
> If VIRTIO_F_IN_ORDER has been negotiated, a device MUST use
> buffers in the same order in which they have been available.
>
> diff --git a/transport-pci.tex b/transport-pci.tex
> index a5c6719..23e0d57 100644
> --- a/transport-pci.tex
> +++ b/transport-pci.tex
> @@ -129,6 +129,7 @@ \subsection{Virtio Structure PCI Capabilities}\label{sec:Virtio Transport Option
> \item ISR Status
> \item Device-specific configuration (optional)
> \item PCI configuration access
> +\item SWIOTLB bounce buffer
> \end{itemize}
>
> Each structure can be mapped by a Base Address register (BAR) belonging to
> @@ -188,6 +189,8 @@ \subsection{Virtio Structure PCI Capabilities}\label{sec:Virtio Transport Option
> #define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
> /* Vendor-specific data */
> #define VIRTIO_PCI_CAP_VENDOR_CFG 9
> +/* Software IOTLB bounce buffer */
> +#define VIRTIO_PCI_CAP_SWIOTLB 10
> \end{lstlisting}
>
> Any other value is reserved for future use.
> @@ -744,6 +747,36 @@ \subsubsection{Vendor data capability}\label{sec:Virtio
> The driver MUST qualify the \field{vendor_id} before
> interpreting or writing into the Vendor data capability.
>
> +\subsubsection{Software IOTLB bounce buffer capability}\label{sec:Virtio
> +Transport Options / Virtio Over PCI Bus / PCI Device Layout /
> +Software IOTLB bounce buffer capability}
> +
> +The optional Software IOTLB bounce buffer capability allows the
> +device to provide a memory region which can be used by the driver
> +driver for bounce buffering. This allows a device on the PCI
> +transport to operate without DMA access to system memory addresses.
> +
> +The Software IOTLB region is referenced by the
> +VIRTIO_PCI_CAP_SWIOTLB capability. Bus addresses within the referenced
> +range are not subject to the requirements of the VIRTIO_F_ORDER_PLATFORM
> +capability, if negotiated.
why not? an optimization?
A mix of swiotlb and system memory might be very challenging from POV
of ordering.
> +
> +\devicenormative{\paragraph}{Software IOTLB bounce buffer capability}{Virtio
> +Transport Options / Virtio Over PCI Bus / PCI Device Layout /
> +Software IOTLB bounce buffer capability}
> +
> +Devices which present the Software IOTLB bounce buffer capability
> +SHOULD also offer the VIRTIO_F_SWIOTLB feature.
> +
> +\drivernormative{\paragraph}{Software IOTLB bounce buffer capability}{Virtio
> +Transport Options / Virtio Over PCI Bus / PCI Device Layout /
> +Software IOTLB bounce buffer capability}
> +
> +The driver SHOULD use the offered buffer in preference to passing system
> +memory addresses to the device.
Even if not using VIRTIO_F_SWIOTLB? Is that really necessary?
> If the driver accepts the VIRTIO_F_SWIOTLB
> +feature, then the driver MUST use the offered buffer and never pass system
> +memory addresses to the device.
> +
> \subsubsection{PCI configuration access capability}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / PCI configuration access capability}
>
> The VIRTIO_PCI_CAP_PCI_CFG capability
>
On Sun, 2025-03-30 at 09:42 -0400, Michael S. Tsirkin wrote:
> On Fri, Mar 28, 2025 at 05:40:41PM +0000, David Woodhouse wrote:
> > On Fri, 2025-03-21 at 18:42 +0000, David Woodhouse wrote:
> > > >
> > > > I don't mind as such (though I don't understand completely), but since
> > > > this is changing the device anyway, I am a bit confused why you can't
> > > > just set the VIRTIO_F_ACCESS_PLATFORM feature bit? This forces DMA API
> > > > which will DTRT for you, will it not?
> > >
> > > That would be necessary but not sufficient. ...
>
> could you explain pls?
There was more to that in the previous email which I elided for this
followup.
https://lore.kernel.org/all/d1382a6ee959f22dc5f6628d8648af77f4702418.camel@infradead.org/
> > My first cut at a proposed spec change looks something like this. I'll
> > post it to the virtio-comment list once I've done some corporate
> > bureaucracy and when the list stops sending me python tracebacks in
> > response to my subscribe request.
>
> the linux foundation one does this? maybe poke at the admins.
>
> > In the meantime I'll hack up some QEMU and guest Linux driver support
> > to match.
> >
> > diff --git a/content.tex b/content.tex
> > index c17ffa6..1e6e1d6 100644
> > --- a/content.tex
> > +++ b/content.tex
> > @@ -773,6 +773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > Currently these device-independent feature bits are defined:
> >
> > \begin{description}
> > + \item[VIRTIO_F_SWIOTLB (27)] This feature indicates that the device
> > + provides a memory region which is to be used for bounce buffering,
> > + rather than permitting direct memory access to system memory.
> > \item[VIRTIO_F_INDIRECT_DESC (28)] Negotiating this feature indicates
> > that the driver can use descriptors with the VIRTQ_DESC_F_INDIRECT
> > flag set, as described in \ref{sec:Basic Facilities of a Virtio
> > @@ -885,6 +888,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > VIRTIO_F_ACCESS_PLATFORM is not offered, then a driver MUST pass only physical
> > addresses to the device.
> >
> > +A driver SHOULD accept VIRTIO_F_SWIOTLB if it is offered, and it MUST
> > +then pass only addresses within the Software IOTLB bounce buffer to the
> > +device.
> > +
> > A driver SHOULD accept VIRTIO_F_RING_PACKED if it is offered.
> >
> > A driver SHOULD accept VIRTIO_F_ORDER_PLATFORM if it is offered.
> > @@ -921,6 +928,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not
> > accepted.
> >
> > +A device MUST NOT offer VIRTIO_F_SWIOTLB if its transport does not
> > +provide a Software IOTLB bounce buffer.
> > +A device MAY fail to operate further if VIRTIO_F_SWIOTLB is not accepted.
> > +
> > If VIRTIO_F_IN_ORDER has been negotiated, a device MUST use
> > buffers in the same order in which they have been available.
> >
> > diff --git a/transport-pci.tex b/transport-pci.tex
> > index a5c6719..23e0d57 100644
> > --- a/transport-pci.tex
> > +++ b/transport-pci.tex
> > @@ -129,6 +129,7 @@ \subsection{Virtio Structure PCI Capabilities}\label{sec:Virtio Transport Option
> > \item ISR Status
> > \item Device-specific configuration (optional)
> > \item PCI configuration access
> > +\item SWIOTLB bounce buffer
> > \end{itemize}
> >
> > Each structure can be mapped by a Base Address register (BAR) belonging to
> > @@ -188,6 +189,8 @@ \subsection{Virtio Structure PCI Capabilities}\label{sec:Virtio Transport Option
> > #define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
> > /* Vendor-specific data */
> > #define VIRTIO_PCI_CAP_VENDOR_CFG 9
> > +/* Software IOTLB bounce buffer */
> > +#define VIRTIO_PCI_CAP_SWIOTLB 10
> > \end{lstlisting}
> >
> > Any other value is reserved for future use.
> > @@ -744,6 +747,36 @@ \subsubsection{Vendor data capability}\label{sec:Virtio
> > The driver MUST qualify the \field{vendor_id} before
> > interpreting or writing into the Vendor data capability.
> >
> > +\subsubsection{Software IOTLB bounce buffer capability}\label{sec:Virtio
> > +Transport Options / Virtio Over PCI Bus / PCI Device Layout /
> > +Software IOTLB bounce buffer capability}
> > +
> > +The optional Software IOTLB bounce buffer capability allows the
> > +device to provide a memory region which can be used by the driver
> > +driver for bounce buffering. This allows a device on the PCI
> > +transport to operate without DMA access to system memory addresses.
> > +
> > +The Software IOTLB region is referenced by the
> > +VIRTIO_PCI_CAP_SWIOTLB capability. Bus addresses within the referenced
> > +range are not subject to the requirements of the VIRTIO_F_ORDER_PLATFORM
> > +capability, if negotiated.
>
>
> why not? an optimization?
> A mix of swiotlb and system memory might be very challenging from POV
> of ordering.
Conceptually, these addresses are *on* the PCI device. If the device is
accessing addresses which are local to it, they aren't subject to IOMMU
translation/filtering because they never even make it to the PCI bus as
memory transactions.
>
> > +
> > +\devicenormative{\paragraph}{Software IOTLB bounce buffer capability}{Virtio
> > +Transport Options / Virtio Over PCI Bus / PCI Device Layout /
> > +Software IOTLB bounce buffer capability}
> > +
> > +Devices which present the Software IOTLB bounce buffer capability
> > +SHOULD also offer the VIRTIO_F_SWIOTLB feature.
> > +
> > +\drivernormative{\paragraph}{Software IOTLB bounce buffer capability}{Virtio
> > +Transport Options / Virtio Over PCI Bus / PCI Device Layout /
> > +Software IOTLB bounce buffer capability}
> > +
> > +The driver SHOULD use the offered buffer in preference to passing system
> > +memory addresses to the device.
>
> Even if not using VIRTIO_F_SWIOTLB? Is that really necessary?
That part isn't strictly necessary, but I think it makes sense for
cases where the SWIOTLB support is an *optimisation* rather than a
hard requirement.
Why might it be an "optimisation"? Well... if we're thinking of a model
like pKVM where the VMM can't just arbitrarily access guest memory,
using the SWIOTLB is a simple way to avoid that (by using the on-board
memory instead, which *can* be shared with the VMM).
But if we want to go to extra lengths to support unenlightened guests,
an implementation might choose to just *disable* the memory protection
if the guest doesn't negotiate VIRTIO_F_SWIOTLB, instead of breaking
that guest.
Or it might have a complicated emulation/snooping of virtqueues in the
trusted part of the hypervisor so that it knows which addresses the
guest has truly *asked* the VMM to access. (And yes, of course that's
what an IOMMU is for, but when have you seen hardware companies design
a two-stage IOMMU which supports actual PCI passthrough *and* get it
right for the hypervisor to 'snoop' on the stage1 page tables to
support emulated devices too....)
Ultimately I think it was natural to advertise the location of the
buffer with the VIRTIO_PCI_CAP_SWIOTLB capability and then to have the
separate VIRTIO_F_SWIOTLB for negotiation... leaving the obvious
question of what a device should do if it sees one but *not* the other.
Obviously you can't have VIRTIO_F_SWIOTLB *without* there actually
being a buffer advertised with VIRTIO_PCI_CAP_SWIOTLB (or its
equivalent for other transports). But the converse seemed reasonable as
a *hint* even if the use of the SWIOTLB isn't mandatory.
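Condensed into code, the driver-side decision described above might look
like this sketch; VIRTIO_F_SWIOTLB is only the draft feature bit from this
thread, and the other names are placeholders:

/*
 * Sketch of the negotiation outcome described above.  VIRTIO_F_SWIOTLB
 * is only the draft feature bit (27) proposed in this thread, and the
 * function is a placeholder; virtio_has_feature() is the normal kernel
 * helper.
 */
#include <linux/virtio_config.h>

#define VIRTIO_F_SWIOTLB	27	/* draft value from the proposal above */

static bool example_use_bounce_buffer(struct virtio_device *vdev, bool has_swiotlb_cap)
{
	/* Feature negotiated: the buffer is mandatory; every address
	 * handed to the device MUST lie inside it. */
	if (virtio_has_feature(vdev, VIRTIO_F_SWIOTLB))
		return true;

	/* Capability advertised without the feature: per the draft the
	 * driver SHOULD still prefer the buffer, as a hint/optimisation. */
	return has_swiotlb_cap;
}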
On Sun, Mar 30, 2025 at 04:07:56PM +0100, David Woodhouse wrote:
> On Sun, 2025-03-30 at 09:42 -0400, Michael S. Tsirkin wrote:
> > On Fri, Mar 28, 2025 at 05:40:41PM +0000, David Woodhouse wrote:
> > > On Fri, 2025-03-21 at 18:42 +0000, David Woodhouse wrote:
> > > > >
> > > > > I don't mind as such (though I don't understand completely), but since
> > > > > this is changing the device anyway, I am a bit confused why you can't
> > > > > just set the VIRTIO_F_ACCESS_PLATFORM feature bit? This forces DMA API
> > > > > which will DTRT for you, will it not?
> > > >
> > > > That would be necessary but not sufficient. ...
> >
> > could you explain pls?
>
> There was more to that in the previous email which I elided for this
> followup.
>
> https://lore.kernel.org/all/d1382a6ee959f22dc5f6628d8648af77f4702418.camel@infradead.org/
>
> > > My first cut at a proposed spec change looks something like this. I'll
> > > post it to the virtio-comment list once I've done some corporate
> > > bureaucracy and when the list stops sending me python tracebacks in
> > > response to my subscribe request.
> >
> > the linux foundation one does this? maybe poke at the admins.
> >
> > > In the meantime I'll hack up some QEMU and guest Linux driver support
> > > to match.
> > >
> > > diff --git a/content.tex b/content.tex
> > > index c17ffa6..1e6e1d6 100644
> > > --- a/content.tex
> > > +++ b/content.tex
> > > @@ -773,6 +773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > > Currently these device-independent feature bits are defined:
> > >
> > > \begin{description}
> > > + \item[VIRTIO_F_SWIOTLB (27)] This feature indicates that the device
> > > + provides a memory region which is to be used for bounce buffering,
> > > + rather than permitting direct memory access to system memory.
> > > \item[VIRTIO_F_INDIRECT_DESC (28)] Negotiating this feature indicates
> > > that the driver can use descriptors with the VIRTQ_DESC_F_INDIRECT
> > > flag set, as described in \ref{sec:Basic Facilities of a Virtio
> > > @@ -885,6 +888,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > > VIRTIO_F_ACCESS_PLATFORM is not offered, then a driver MUST pass only physical
> > > addresses to the device.
> > >
> > > +A driver SHOULD accept VIRTIO_F_SWIOTLB if it is offered, and it MUST
> > > +then pass only addresses within the Software IOTLB bounce buffer to the
> > > +device.
> > > +
> > > A driver SHOULD accept VIRTIO_F_RING_PACKED if it is offered.
> > >
> > > A driver SHOULD accept VIRTIO_F_ORDER_PLATFORM if it is offered.
> > > @@ -921,6 +928,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > > A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not
> > > accepted.
> > >
> > > +A device MUST NOT offer VIRTIO_F_SWIOTLB if its transport does not
> > > +provide a Software IOTLB bounce buffer.
> > > +A device MAY fail to operate further if VIRTIO_F_SWIOTLB is not accepted.
> > > +
> > > If VIRTIO_F_IN_ORDER has been negotiated, a device MUST use
> > > buffers in the same order in which they have been available.
> > >
> > > diff --git a/transport-pci.tex b/transport-pci.tex
> > > index a5c6719..23e0d57 100644
> > > --- a/transport-pci.tex
> > > +++ b/transport-pci.tex
> > > @@ -129,6 +129,7 @@ \subsection{Virtio Structure PCI Capabilities}\label{sec:Virtio Transport Option
> > > \item ISR Status
> > > \item Device-specific configuration (optional)
> > > \item PCI configuration access
> > > +\item SWIOTLB bounce buffer
> > > \end{itemize}
> > >
> > > Each structure can be mapped by a Base Address register (BAR) belonging to
> > > @@ -188,6 +189,8 @@ \subsection{Virtio Structure PCI Capabilities}\label{sec:Virtio Transport Option
> > > #define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
> > > /* Vendor-specific data */
> > > #define VIRTIO_PCI_CAP_VENDOR_CFG 9
> > > +/* Software IOTLB bounce buffer */
> > > +#define VIRTIO_PCI_CAP_SWIOTLB 10
> > > \end{lstlisting}
> > >
> > > Any other value is reserved for future use.
> > > @@ -744,6 +747,36 @@ \subsubsection{Vendor data capability}\label{sec:Virtio
> > > The driver MUST qualify the \field{vendor_id} before
> > > interpreting or writing into the Vendor data capability.
> > >
> > > +\subsubsection{Software IOTLB bounce buffer capability}\label{sec:Virtio
> > > +Transport Options / Virtio Over PCI Bus / PCI Device Layout /
> > > +Software IOTLB bounce buffer capability}
> > > +
> > > +The optional Software IOTLB bounce buffer capability allows the
> > > +device to provide a memory region which can be used by the driver
> > > +driver for bounce buffering. This allows a device on the PCI
> > > +transport to operate without DMA access to system memory addresses.
> > > +
> > > +The Software IOTLB region is referenced by the
> > > +VIRTIO_PCI_CAP_SWIOTLB capability. Bus addresses within the referenced
> > > +range are not subject to the requirements of the VIRTIO_F_ORDER_PLATFORM
> > > +capability, if negotiated.
> >
> >
> > why not? an optimization?
> > A mix of swiotlb and system memory might be very challenging from POV
> > of ordering.
>
> Conceptually, these addresses are *on* the PCI device. If the device is
> accessing addresses which are local to it, they aren't subject to IOMMU
> translation/filtering because they never even make it to the PCI bus as
> memory transactions.
>
> >
> > > +
> > > +\devicenormative{\paragraph}{Software IOTLB bounce buffer capability}{Virtio
> > > +Transport Options / Virtio Over PCI Bus / PCI Device Layout /
> > > +Software IOTLB bounce buffer capability}
> > > +
> > > +Devices which present the Software IOTLB bounce buffer capability
> > > +SHOULD also offer the VIRTIO_F_SWIOTLB feature.
> > > +
> > > +\drivernormative{\paragraph}{Software IOTLB bounce buffer capability}{Virtio
> > > +Transport Options / Virtio Over PCI Bus / PCI Device Layout /
> > > +Software IOTLB bounce buffer capability}
> > > +
> > > +The driver SHOULD use the offered buffer in preference to passing system
> > > +memory addresses to the device.
> >
> > Even if not using VIRTIO_F_SWIOTLB? Is that really necessary?
>
> That part isn't strictly necessary, but I think it makes sense, for
> cases where the SWIOTLB support is an *optimisation* even if it isn't
> strictly necessary.
>
> Why might it be an "optimisation"? Well... if we're thinking of a model
> like pKVM where the VMM can't just arbitrarily access guest memory,
> using the SWIOTLB is a simple way to avoid that (by using the on-board
> memory instead, which *can* be shared with the VMM).
>
> But if we want to go to extra lengths to support unenlightened guests,
> an implementation might choose to just *disable* the memory protection
> if the guest doesn't negotiate VIRTIO_F_SWIOTLB, instead of breaking
> that guest.
>
> Or it might have a complicated emulation/snooping of virtqueues in the
> trusted part of the hypervisor so that it knows which addresses the
> guest has truly *asked* the VMM to access. (And yes, of course that's
> what an IOMMU is for, but when have you seen hardware companies design
> a two-stage IOMMU which supports actual PCI passthrough *and* get it
> right for the hypervisor to 'snoop' on the stage1 page tables to
> support emulated devices too....)
>
> Ultimately I think it was natural to advertise the location of the
> buffer with the VIRTIO_PCI_CAP_SWIOTLB capability and then to have the
> separate VIRTIO_F_SWIOTLB for negotiation... leaving the obvious
> question of what a device should do if it sees one but *not* the other.
>
> Obviously you can't have VIRTIO_F_SWIOTLB *without* there actually
> being a buffer advertised with VIRTIO_PCI_CAP_SWIOTLB (or its
> equivalent for other transports). But the converse seemed reasonable as
> a *hint* even if the use of the SWIOTLB isn't mandatory.
OK but I feel it's more work than you think, so we really need
a better reason than just "why not".
For example, it's not at all clear to me how the ordering is
going to work if buffers are in memory but the ring is swiotlb
or the reverse. Ordering will all be messed up.
--
MST
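
Purely as an illustration of the capability hunk quoted above (it is not
part of the proposed patch), a driver might locate the Software IOTLB
region roughly as follows. This is a minimal sketch which assumes the new
entry reuses the generic struct virtio_pci_cap field layout (cfg_type at
byte 3, then bar/offset/length), and that cfg_type 10 is merely the value
this draft proposes; find_swiotlb_cap() is a hypothetical helper name.

#include <linux/errno.h>
#include <linux/pci.h>

#define VIRTIO_PCI_CAP_SWIOTLB	10	/* value proposed in this draft only */

/* Walk the PCI vendor-specific capabilities and, if the (proposed)
 * SWIOTLB capability is present, return the BAR number plus the offset
 * and length of the bounce-buffer region within that BAR. */
static int find_swiotlb_cap(struct pci_dev *pdev, u8 *bar,
                            u32 *offset, u32 *length)
{
        u8 pos;

        for (pos = pci_find_capability(pdev, PCI_CAP_ID_VNDR); pos;
             pos = pci_find_next_capability(pdev, pos, PCI_CAP_ID_VNDR)) {
                u8 cfg_type;

                /* struct virtio_pci_cap: cfg_type at byte 3, bar at byte 4,
                 * offset at byte 8, length at byte 12. */
                pci_read_config_byte(pdev, pos + 3, &cfg_type);
                if (cfg_type != VIRTIO_PCI_CAP_SWIOTLB)
                        continue;

                pci_read_config_byte(pdev, pos + 4, bar);
                pci_read_config_dword(pdev, pos + 8, offset);
                pci_read_config_dword(pdev, pos + 12, length);
                return 0;
        }

        return -ENODEV;
}
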
On 30 March 2025 17:59:13 BST, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>On Sun, Mar 30, 2025 at 04:07:56PM +0100, David Woodhouse wrote:
>> On Sun, 2025-03-30 at 09:42 -0400, Michael S. Tsirkin wrote:
>> > On Fri, Mar 28, 2025 at 05:40:41PM +0000, David Woodhouse wrote:
>> > > On Fri, 2025-03-21 at 18:42 +0000, David Woodhouse wrote:
>> > > > >
>> > > > > I don't mind as such (though I don't understand completely), but since
>> > > > > this is changing the device anyway, I am a bit confused why you can't
>> > > > > just set the VIRTIO_F_ACCESS_PLATFORM feature bit? This forces DMA API
>> > > > > which will DTRT for you, will it not?
>> > > >
>> > > > That would be necessary but not sufficient. ...
>> >
>> > could you explain pls?
>>
>> There was more to that in the previous email which I elided for this
>> followup.
>>
>> https://lore.kernel.org/all/d1382a6ee959f22dc5f6628d8648af77f4702418.camel@infradead.org/
>>
>> > > My first cut at a proposed spec change looks something like this. I'll
>> > > post it to the virtio-comment list once I've done some corporate
>> > > bureaucracy and when the list stops sending me python tracebacks in
>> > > response to my subscribe request.
>> >
>> > the linux foundation one does this? maybe poke at the admins.
>> >
>> > > In the meantime I'll hack up some QEMU and guest Linux driver support
>> > > to match.
>> > >
>> > > diff --git a/content.tex b/content.tex
>> > > index c17ffa6..1e6e1d6 100644
>> > > --- a/content.tex
>> > > +++ b/content.tex
>> > > @@ -773,6 +773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>> > > Currently these device-independent feature bits are defined:
>> > >
>> > > \begin{description}
>> > > + \item[VIRTIO_F_SWIOTLB (27)] This feature indicates that the device
>> > > + provides a memory region which is to be used for bounce buffering,
>> > > + rather than permitting direct memory access to system memory.
>> > > \item[VIRTIO_F_INDIRECT_DESC (28)] Negotiating this feature indicates
>> > > that the driver can use descriptors with the VIRTQ_DESC_F_INDIRECT
>> > > flag set, as described in \ref{sec:Basic Facilities of a Virtio
>> > > @@ -885,6 +888,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>> > > VIRTIO_F_ACCESS_PLATFORM is not offered, then a driver MUST pass only physical
>> > > addresses to the device.
>> > >
>> > > +A driver SHOULD accept VIRTIO_F_SWIOTLB if it is offered, and it MUST
>> > > +then pass only addresses within the Software IOTLB bounce buffer to the
>> > > +device.
>> > > +
>> > > A driver SHOULD accept VIRTIO_F_RING_PACKED if it is offered.
>> > >
>> > > A driver SHOULD accept VIRTIO_F_ORDER_PLATFORM if it is offered.
>> > > @@ -921,6 +928,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>> > > A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not
>> > > accepted.
>> > >
>> > > +A device MUST NOT offer VIRTIO_F_SWIOTLB if its transport does not
>> > > +provide a Software IOTLB bounce buffer.
>> > > +A device MAY fail to operate further if VIRTIO_F_SWIOTLB is not accepted.
>> > > +
>> > > If VIRTIO_F_IN_ORDER has been negotiated, a device MUST use
>> > > buffers in the same order in which they have been available.
>> > >
>> > > diff --git a/transport-pci.tex b/transport-pci.tex
>> > > index a5c6719..23e0d57 100644
>> > > --- a/transport-pci.tex
>> > > +++ b/transport-pci.tex
>> > > @@ -129,6 +129,7 @@ \subsection{Virtio Structure PCI Capabilities}\label{sec:Virtio Transport Option
>> > > \item ISR Status
>> > > \item Device-specific configuration (optional)
>> > > \item PCI configuration access
>> > > +\item SWIOTLB bounce buffer
>> > > \end{itemize}
>> > >
>> > > Each structure can be mapped by a Base Address register (BAR) belonging to
>> > > @@ -188,6 +189,8 @@ \subsection{Virtio Structure PCI Capabilities}\label{sec:Virtio Transport Option
>> > > #define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
>> > > /* Vendor-specific data */
>> > > #define VIRTIO_PCI_CAP_VENDOR_CFG 9
>> > > +/* Software IOTLB bounce buffer */
>> > > +#define VIRTIO_PCI_CAP_SWIOTLB 10
>> > > \end{lstlisting}
>> > >
>> > > Any other value is reserved for future use.
>> > > @@ -744,6 +747,36 @@ \subsubsection{Vendor data capability}\label{sec:Virtio
>> > > The driver MUST qualify the \field{vendor_id} before
>> > > interpreting or writing into the Vendor data capability.
>> > >
>> > > +\subsubsection{Software IOTLB bounce buffer capability}\label{sec:Virtio
>> > > +Transport Options / Virtio Over PCI Bus / PCI Device Layout /
>> > > +Software IOTLB bounce buffer capability}
>> > > +
>> > > +The optional Software IOTLB bounce buffer capability allows the
>> > > +device to provide a memory region which can be used by the
>> > > +driver for bounce buffering. This allows a device on the PCI
>> > > +transport to operate without DMA access to system memory addresses.
>> > > +
>> > > +The Software IOTLB region is referenced by the
>> > > +VIRTIO_PCI_CAP_SWIOTLB capability. Bus addresses within the referenced
>> > > +range are not subject to the requirements of the VIRTIO_F_ORDER_PLATFORM
>> > > +feature, if negotiated.
>> >
>> >
>> > why not? an optimization?
>> > A mix of swiotlb and system memory might be very challenging from the
>> > POV of ordering.
>>
>> Conceptually, these addresses are *on* the PCI device. If the device is
>> accessing addresses which are local to it, they aren't subject to IOMMU
>> translation/filtering because they never even make it to the PCI bus as
>> memory transactions.
>>
>> >
>> > > +
>> > > +\devicenormative{\paragraph}{Software IOTLB bounce buffer capability}{Virtio
>> > > +Transport Options / Virtio Over PCI Bus / PCI Device Layout /
>> > > +Software IOTLB bounce buffer capability}
>> > > +
>> > > +Devices which present the Software IOTLB bounce buffer capability
>> > > +SHOULD also offer the VIRTIO_F_SWIOTLB feature.
>> > > +
>> > > +\drivernormative{\paragraph}{Software IOTLB bounce buffer capability}{Virtio
>> > > +Transport Options / Virtio Over PCI Bus / PCI Device Layout /
>> > > +Software IOTLB bounce buffer capability}
>> > > +
>> > > +The driver SHOULD use the offered buffer in preference to passing system
>> > > +memory addresses to the device.
>> >
>> > Even if not using VIRTIO_F_SWIOTLB? Is that really necessary?
>>
>> That part isn't strictly necessary, but I think it makes sense for
>> cases where the SWIOTLB support is an *optimisation* rather than a
>> hard requirement.
>>
>> Why might it be an "optimisation"? Well... if we're thinking of a model
>> like pKVM where the VMM can't just arbitrarily access guest memory,
>> using the SWIOTLB is a simple way to avoid that (by using the on-board
>> memory instead, which *can* be shared with the VMM).
>>
>> But if we want to go to extra lengths to support unenlightened guests,
>> an implementation might choose to just *disable* the memory protection
>> if the guest doesn't negotiate VIRTIO_F_SWIOTLB, instead of breaking
>> that guest.
>>
>> Or it might have a complicated emulation/snooping of virtqueues in the
>> trusted part of the hypervisor so that it knows which addresses the
>> guest has truly *asked* the VMM to access. (And yes, of course that's
>> what an IOMMU is for, but when have you seen hardware companies design
>> a two-stage IOMMU which supports actual PCI passthrough *and* get it
>> right for the hypervisor to 'snoop' on the stage1 page tables to
>> support emulated devices too....)
>>
>> Ultimately I think it was natural to advertise the location of the
>> buffer with the VIRTIO_PCI_CAP_SWIOTLB capability and then to have the
>> separate VIRTIO_F_SWIOTLB for negotiation... leaving the obvious
>> question of what a device should do if it sees one but *not* the other.
>>
>> Obviously you can't have VIRTIO_F_SWIOTLB *without* there actually
>> being a buffer advertised with VIRTIO_PCI_CAP_SWIOTLB (or its
>> equivalent for other transports). But the converse seemed reasonable as
>> a *hint* even if the use of the SWIOTLB isn't mandatory.
>
>OK but I feel it's more work than you think, so we really need
>a better reason than just "why not".
>
>For example, it's not at all clear to me how the ordering is
>going to work if buffers are in memory but the ring is swiotlb
>or the reverse. Ordering will all be messed up.
Maybe. Although by the time the driver has *observed* the data written to the swiotlb in the device's BAR, that data has had to cross the same PCI bus.
But sure, we could require all-or-nothing. Or require that the SWIOTLB only be used if the driver negotiates VIRTIO_F_SWIOTLB.
Even in the latter case we can still allow for SWIOTLB to either be a requirement or a hint, purely down to whether the device *allows* the driver not to negotiate `VIRTIO_F_SWIOTLB`.
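
To make the driver-side rule in the quoted draft concrete ("MUST then pass
only addresses within the Software IOTLB bounce buffer"), here is a
minimal, hypothetical sketch of what a guest might do once VIRTIO_F_SWIOTLB
has been negotiated and the region has been located as in the earlier
sketch. struct swiotlb_window, the bump allocator and the bus-address
calculation are assumptions for illustration only, not anything taken from
the patch.

#include <linux/errno.h>
#include <linux/io.h>
#include <linux/pci.h>

/* Hypothetical bounce-buffer window over the region advertised by the
 * capability; names and the trivial bump allocator are invented here. */
struct swiotlb_window {
        void __iomem    *va;    /* CPU mapping of the BAR region */
        u64             bus;    /* address the device expects in descriptors */
        size_t          size;
        size_t          next;   /* allocation cursor; no free path for brevity */
};

static int map_swiotlb_window(struct pci_dev *pdev, u8 bar, u32 offset,
                              u32 length, struct swiotlb_window *w)
{
        w->va = pci_iomap_range(pdev, bar, offset, length);
        if (!w->va)
                return -ENOMEM;
        /* Assumption: the device interprets descriptor addresses that fall
         * inside its own BAR window; the exact addressing rule is something
         * the spec text under discussion would have to pin down. */
        w->bus = pci_resource_start(pdev, bar) + offset;
        w->size = length;
        w->next = 0;
        return 0;
}

/* Copy an outgoing buffer into the window and return the address that
 * would go into a descriptor instead of the system-memory address. */
static u64 swiotlb_bounce_out(struct swiotlb_window *w, const void *src,
                              size_t len)
{
        size_t off = w->next;

        if (off + len > w->size)
                return 0;       /* out of space; real code would stall or fail */

        memcpy_toio(w->va + off, src, len);
        w->next = off + len;
        return w->bus + off;
}

Whether the ring itself must also live inside the window (the all-or-nothing
option) is exactly the open question in this subthread.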