From: David Woodhouse <dwmw@amazon.co.uk>
Device-tree bindings for `restricted-dma-pool` were defined in 2021. They
allow a device to be restricted to a given SWIOTLB pool instead of being
allowed to perform DMA to arbitrary system memory:
https://lore.kernel.org/all/20210624155526.2775863-1-tientzu@chromium.org/

This facility was not specific to virtio-mmio, but it does apply to it. No
attempt was made to ensure backward compatibility for virtio-mmio devices.

Define a VIRTIO_F_SWIOTLB feature which allows the device and driver to
agree on the use of the SWIOTLB, if present. This enables the device to
refuse to operate further if the driver does not support the SWIOTLB
requirement expressed in the device-tree.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
content.tex | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/content.tex b/content.tex
index c17ffa6..63d075f 100644
--- a/content.tex
+++ b/content.tex
@@ -773,6 +773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
Currently these device-independent feature bits are defined:
\begin{description}
+ \item[VIRTIO_F_SWIOTLB (27)] This feature indicates that the device
+ transport provides a memory region which is to be used for bounce
+ buffering, rather than permitting direct memory access to system memory.
\item[VIRTIO_F_INDIRECT_DESC (28)] Negotiating this feature indicates
that the driver can use descriptors with the VIRTQ_DESC_F_INDIRECT
flag set, as described in \ref{sec:Basic Facilities of a Virtio
@@ -807,6 +810,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
the driver. When clear, this overrides any platform-specific description of
whether device access is limited or translated in any way, e.g.
whether an IOMMU may be present.
+ If the device transport provides a software IOTLB bounce buffer,
+ addresses within its range are not subject to the requirements of
+ VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''.
\item[VIRTIO_F_RING_PACKED(34)] This feature indicates
support for the packed virtqueue layout as described in
\ref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}~\nameref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}.
@@ -885,6 +891,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
VIRTIO_F_ACCESS_PLATFORM is not offered, then a driver MUST pass only physical
addresses to the device.
+A driver SHOULD accept VIRTIO_F_SWIOTLB if it is offered, and it MUST
+then pass only addresses within the Software IOTLB bounce buffer to the
+device.
+
A driver SHOULD accept VIRTIO_F_RING_PACKED if it is offered.
A driver SHOULD accept VIRTIO_F_ORDER_PLATFORM if it is offered.
@@ -921,6 +931,8 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not
accepted.
+A device MAY fail to operate further if VIRTIO_F_SWIOTLB is not accepted.
+
If VIRTIO_F_IN_ORDER has been negotiated, a device MUST use
buffers in the same order in which they have been available.
--
2.49.0
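
A minimal sketch of the driver-side rule the patch adds ("pass only addresses
within the Software IOTLB bounce buffer"): once the proposed VIRTIO_F_SWIOTLB
bit is negotiated, every address the driver queues must fall inside the region
the transport provides. The structure and helper names below are illustrative
assumptions, not spec text or any existing API.

#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_SWIOTLB 27ULL /* proposed feature bit */

struct bounce_region {
	uint64_t base; /* bus address of the region provided by the transport */
	uint64_t len;  /* size of the region in bytes */
};

/* Return true if [addr, addr + size) lies entirely within the region. */
static bool addr_in_bounce_region(const struct bounce_region *r,
				  uint64_t addr, uint64_t size)
{
	return addr >= r->base &&
	       size <= r->len &&
	       addr - r->base <= r->len - size; /* overflow-safe upper check */
}

/* Called before queueing a buffer when VIRTIO_F_SWIOTLB was negotiated. */
static bool may_queue(uint64_t negotiated_features,
		      const struct bounce_region *r,
		      uint64_t addr, uint64_t size)
{
	if (!(negotiated_features & (1ULL << VIRTIO_F_SWIOTLB)))
		return true; /* no restriction negotiated */
	return addr_in_bounce_region(r, addr, size);
}
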
On Wed, Apr 02, 2025 at 12:04:45PM +0100, David Woodhouse wrote:
> Define a VIRTIO_F_SWIOTLB feature which allows the device and driver to agree on the use of the SWIOTLB, if present. This enables the device to refuse to operate further if the driver does not support the SWIOTLB requirement expressed in the device-tree.

This makes no sense whatsoever. Swiotlb is a Linux guest implementation detail, virtio is a transport protocol. Mixing them in the same spec doesn't even compute. Please describe the actual on the wire semantics you want, and don't use the term swiotlb.

On Thu, 2025-04-03 at 00:24 -0700, Christoph Hellwig wrote:
> On Wed, Apr 02, 2025 at 12:04:45PM +0100, David Woodhouse wrote:
> > Define a VIRTIO_F_SWIOTLB feature which allows the device and driver to agree on the use of the SWIOTLB, if present. This enables the device to refuse to operate further if the driver does not support the SWIOTLB requirement expressed in the device-tree.
>
> This makes no sense whatsoever. Swiotlb is a Linux guest implementation detail, virtio is a transport protocol. Mixing them in the same spec doesn't even compute. Please describe the actual on the wire semantics you want, and don't use the term swiotlb.

Linux has 'SWIOTLB' support, but I didn't see it as a term specific to that particular implementation; it seemed like a concept that would be understood elsewhere. But sure, I'm happy to rename it to just 'bounce buffers' or something like that. I'll see what I can come up with that is succinct enough to use in VIRTIO_F_xxx and VIRTIO_PCI_CAP_xxx names.

On Thu, Apr 03, 2025 at 08:28:19AM +0100, David Woodhouse wrote:
> Linux has 'SWIOTLB' support, but I didn't see it as a term specific to that particular implementation; it seemed like a concept that would be understood elsewhere. But sure, I'm happy to rename it to just 'bounce buffers' or something like that. I'll see what I can come up with that is succinct enough to use in VIRTIO_F_xxx and VIRTIO_PCI_CAP_xxx names.

I think you still miss the point. What you are describing is either:

1) A BAR (or BAR-like region on MMIO) that can be used to stage indirect I/O

and

2a) a device that doesn't have any DMA capabilities, but instead requires use of the above staging region

and/or

2b) a system configuration where devices are prevented from accessing guest memory

1) makes perfect sense in virtio. It is the equivalent of the NVMe CMB and similar features on various devices.

2a) would make sense if we actually have such devices and not just system configs, which I doubt.

2b) is the job of ACPI or DT to communicate, not that of virtio

On Thu, 2025-04-03 at 00:35 -0700, Christoph Hellwig wrote:
> On Thu, Apr 03, 2025 at 08:28:19AM +0100, David Woodhouse wrote:
> > Linux has 'SWIOTLB' support, but I didn't see it as a term specific to that particular implementation; it seemed like a concept that would be understood elsewhere. But sure, I'm happy to rename it to just 'bounce buffers' or something like that. I'll see what I can come up with that is succinct enough to use in VIRTIO_F_xxx and VIRTIO_PCI_CAP_xxx names.
>
> I think you still miss the point. What you are describing is either:
>
> 1) A BAR (or BAR-like region on MMIO) that can be used to stage indirect I/O
>
> and
>
> 2a) a device that doesn't have any DMA capabilities, but instead requires use of the above staging region
>
> and/or
>
> 2b) a system configuration where devices are prevented from accessing guest memory
>
> 1) makes perfect sense in virtio. It is the equivalent of the NVMe CMB and similar features on various devices.
>
> 2a) would make sense if we actually have such devices and not just system configs, which I doubt.
>
> 2b) is the job of ACPI or DT to communicate, not that of virtio

I am designing virtual systems, with the constraint in your (2b): The virtual devices implemented in the *VMM* are prevented from accessing guest memory. (Physical devices are actually fine because there's a two-stage IOMMU which copes with those.)

I am describing the solution in your (1): Give them an on-device bounce-buffer.

The implementation of (1) is to offer them a device which doesn't have DMA capabilities (because *I* as the system designer know that it can't use them). Which is your (2a).

You are right, in theory, that (2b) is the job of ACPI or DT to communicate. But virtio is presented as just another PCI device, indistinguishable by the guest from other PCI devices which *are* actually able to do DMA through the hardware's two-stage IOMMU.

To my knowledge there isn't a clean way, across platforms, to tell an operating system that a certain device just *can't* do DMA at all. And even if there was, why offer a device model that could *theoretically* do DMA, just to tell the guest through some other channel that it can't?

What's wrong with a simple option to indicate, in the device model that the system designer chooses to build for the guest, that it can't do DMA?

On Thu, Apr 03, 2025 at 09:06:37AM +0100, David Woodhouse wrote:
> You are right, in theory, that (2b) is the job of ACPI or DT to communicate. But virtio is presented as just another PCI device, indistinguishable by the guest from other PCI devices which *are* actually able to do DMA through the hardware's two-stage IOMMU.

Then the virtual device should be behind a different IOMMU (or no IOMMU at all) to clearly differentiate them from the assigned hardware devices.

> To my knowledge there isn't a clean way, across platforms, to tell an operating system that a certain device just *can't* do DMA at all.

Well, it's part of the device specification if it can do DMA or not. We had a bunch of early PCI (and a lot of ISA) devices that could not do P2P. Even on PCIe there might be weirdo devices that just expose a BAR or have some bitbanged i2c or similar. Now if you mean virtio devices - yes, there's no way currently, as virtio is quite a fundamentally DMA-based model.

> And even if there was, why offer a device model that could *theoretically* do DMA, just to tell the guest through some other channel that it can't?

That would be the normal way. Because virtio devices fundamentally can do DMA. It's just that your guest specification doesn't want them to.

> What's wrong with a simple option to indicate, in the device model that the system designer chooses to build for the guest, that it can't do DMA?

As stated above I suspect you are still asking the wrong question and have the wrong mental model. Virtio fundamentally can do DMA, you just don't want it to. And based on the other subthread I also suspect you'd actually be much better off with your bounce buffer in the virtual host memory instead of coming up with this particularly odd case, but even if not it's the system description that matters here, not the device model.

On Thu, 2025-04-03 at 23:35 -0700, Christoph Hellwig wrote:
> As stated above I suspect you are still asking the wrong question and have the wrong mental model. Virtio fundamentally can do DMA, you just don't want it to. And based on the other subthread I also suspect you'd actually be much better off with your bounce buffer in the virtual host memory instead of coming up with this particularly odd case, but even if not it's the system description that matters here, not the device model.

I do agree, this is fundamentally a system issue. In a CoCo model, it's non-trivial for the system to allow *virtual* devices to do "DMA" because that actually means allowing the VMM to access arbitrary guest memory. And most hardware designers screwed up the 2-stage IOMMU which allows protection from *real* devices, and didn't build in a way for the VMM to access guest memory through a stage1 mapping.

Fixing that to allow system DMA to those guest devices in a way which will work across operating systems is impractical, as it involves the guest knowing which PCI device is implemented where, and either core kernel (share/unshare) enlightenments, or emulating a full IOMMU in the *trusted* part of the hypervisor (e.g. pKVM).

So "for the emulated devices, just use a device model that doesn't do arbitrary DMA to system memory" is a nice simple answer, and keeps the guest support restricted to its *own* standalone driver.

And yes, virtio is kind of fundamentally DMA-based as you said, but that's OK if the DMA is only to a specific range which is *known* to be shareable and which the guest would never consider to be 'normal RAM'. Because then the VMM can be *allowed* to access that.

What's annoying is that this should work out of the box *already* with virtio-mmio and a `restricted-dma-pool` — for systems which aren't afflicted by UEFI/ACPI/PCI as their discovery mechanisms. In fact, I may even try offering virtio-mmio in a PRP0001 ACPI device to see how that works, although I think I'll lament the lack of MSI interrupts.

But the cleaner solution seemed (at least when I started this discussion) to be to add the same kind of facility to the virtio-pci transport too. And I still don't think I've quite given up on it; especially as you point out the use case for staged P2P DMA too. I'll go take a closer look at how we can cleanly disambiguate between address spaces.

On Fri, Apr 04, 2025 at 08:50:47AM +0100, David Woodhouse wrote:
> I do agree, this is fundamentally a system issue. In a CoCo model, it's non-trivial for the system to allow *virtual* devices to do "DMA" because that actually means allowing the VMM to access arbitrary guest memory.
>
> So "for the emulated devices, just use a device model that doesn't do arbitrary DMA to system memory" is a nice simple answer, and keeps the guest support restricted to its *own* standalone driver.

It's also one that completely breaks the abstraction. I still don't understand what the problem is with having the paravirtualized devices on a different part of the virtual PCIe topology so that the stage2 IOMMU isn't used for them, but instead just the direct mapping or a stub viommu that blocks all access.

> What's annoying is that this should work out of the box *already* with virtio-mmio and a `restricted-dma-pool` — for systems which aren't afflicted by UEFI/ACPI/PCI as their discovery mechanisms.

Yes. And the fix is to get the equivalent to restricted-dma-pool into UEFI/ACPI. That gives you a portable and device-independent way to describe this limitation, which is much better than hacking around it using an odd device model.

On 4 April 2025 09:23:24 BST, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Apr 04, 2025 at 08:50:47AM +0100, David Woodhouse wrote:
> > I do agree, this is fundamentally a system issue. In a CoCo model, it's non-trivial for the system to allow *virtual* devices to do "DMA" because that actually means allowing the VMM to access arbitrary guest memory.
> >
> > So "for the emulated devices, just use a device model that doesn't do arbitrary DMA to system memory" is a nice simple answer, and keeps the guest support restricted to its *own* standalone driver.
>
> It's also one that completely breaks the abstraction.

Hm? Having a device that simply doesn't *do* any DMA surely doesn't break any system DMA / IOMMU abstractions because it's no longer even relevant to them, which is kind of the point. Which abstraction did you mean?

> I still don't understand what the problem is with having the paravirtualized devices on a different part of the virtual PCIe topology so that the stage2 IOMMU isn't used for them, but instead just the direct mapping or a stub viommu that blocks all access.

It can't have a direct mapping because the VMM can't access guest memory. It has to be blocked, which is fine.

But that's only part of the picture — then how do you actually get data in/out of the device? By having on-device memory and not attempting DMA to system memory, perhaps...? :)

> > What's annoying is that this should work out of the box *already* with virtio-mmio and a `restricted-dma-pool` — for systems which aren't afflicted by UEFI/ACPI/PCI as their discovery mechanisms.
>
> Yes. And the fix is to get the equivalent to restricted-dma-pool into UEFI/ACPI. That gives you a portable and device-independent way to describe this limitation, which is much better than hacking around it using an odd device model.

It still has the problem that existing drivers in all operating systems produced before 2030 will see the device and try to use it as-is, with no comprehension of this new thing.

On Fri, Apr 04, 2025 at 10:39:35AM +0100, David Woodhouse wrote:
> > > So "for the emulated devices, just use a device model that doesn't do arbitrary DMA to system memory" is a nice simple answer, and keeps the guest support restricted to its *own* standalone driver.
> >
> > It's also one that completely breaks the abstraction.
>
> Hm? Having a device that simply doesn't *do* any DMA surely doesn't break any system DMA / IOMMU abstractions because it's no longer even relevant to them, which is kind of the point.
>
> Which abstraction did you mean?

Implementing a DMA-less device is fine. Implementing a DMA-less virtual device to work around the lack of CoCo shared bounce buffers in the system is not.

> > I still don't understand what the problem is with having the paravirtualized devices on a different part of the virtual PCIe topology so that the stage2 IOMMU isn't used for them, but instead just the direct mapping or a stub viommu that blocks all access.
>
> It can't have a direct mapping because the VMM can't access guest memory. It has to be blocked, which is fine.

I only mentioned direct mapping in the case of not having an IOMMU. I fully expect it not to work.

> But that's only part of the picture — then how do you actually get data in/out of the device?
>
> By having on-device memory and not attempting DMA to system memory, perhaps...? :)

By implementing the discovery of a shared pool in host memory as discussed in another branch of this thread?

> > describe this limitation, which is much better than hacking around it using an odd device model.
>
> It still has the problem that existing drivers in all operating systems produced before 2030 will see the device and try to use it as-is, with no comprehension of this new thing.

Given how much enablement the various CoCo schemes need it won't work anyway.

Btw, your mail formatting in the last few days went completely crazy.

On Mon, 2025-04-07 at 00:34 -0700, Christoph Hellwig wrote:
> > > describe this limitation, which is much better than hacking around it using an odd device model.
> >
> > It still has the problem that existing drivers in all operating systems produced before 2030 will see the device and try to use it as-is, with no comprehension of this new thing.
>
> Given how much enablement the various CoCo schemes need it won't work anyway.

Not all CoCo-style schemes need any enablement on the guest side at all. It's perfectly feasible to have a pKVM-like model where the VMM (and host Linux) are absolutely prevented from accessing guest memory, without the guest knowing *anything* about it at all.

It's not about attestation to the guest. Deprivileging the host kernel provides a huge amount of protection against both software and CPU speculation bugs across the whole of that massive code base. And you can do it with *full* compatibility. The guest just sees a standard SMMU for its *actual* network and block devices, which are PCI passthrough. (Yes, the guest does need to *use* the SMMU).

The caveat is when we have *emulated* devices, like virtio-vsock. That brings us to LART the IOMMU designers for building a two-stage IOMMU without considering emulated PCI devices, and then to start this conversation we're having now about how to do it.

> Btw, your mail formatting in the last few days went completely crazy.

Yeah, that's K-9 mail on Android not word wrapping; I'll see if I can fix it. At least it's a mobile client that doesn't send HTML :)

On Fri, Apr 04, 2025 at 08:50:47AM +0100, David Woodhouse wrote:
> What's annoying is that this should work out of the box *already* with virtio-mmio and a `restricted-dma-pool` — for systems which aren't afflicted by UEFI/ACPI/PCI as their discovery mechanisms.

That specifically would be just a driver bugfix then?

On Fri, 2025-04-04 at 04:09 -0400, Michael S. Tsirkin wrote:
> On Fri, Apr 04, 2025 at 08:50:47AM +0100, David Woodhouse wrote:
> > What's annoying is that this should work out of the box *already* with virtio-mmio and a `restricted-dma-pool` — for systems which aren't afflicted by UEFI/ACPI/PCI as their discovery mechanisms.
>
> That specifically would be just a driver bugfix then?

I actually think it works out of the box and there isn't even a bug to fix. Haven't tested yet.

The sad part is that the system does it all automatically *if* it has CONFIG_DMA_RESTRICTED_POOL (e.g. Linux) and the driver never even notices that the dma_ops it's using are the swiotlb ops using the provided buffer.

Which is *kind* of nice... except that when on a guest OS which *isn't* Linux with CONFIG_DMA_RESTRICTED_POOL, the guest will just ignore the `restricted-dma-pool` node and try DMA to system memory anyway, which will fail.

That's why my proposal adds the negotiated VIRTIO_F_SWIOTLB feature, so that the device side can refuse, if the guest *isn't* agreeing to use the bounce buffer in the situations where it must do so.

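A minimal sketch of that device-side refusal, assuming the proposed VIRTIO_F_SWIOTLB semantics: at FEATURES_OK time the device model checks whether the driver acknowledged the bit and, if the bounce buffer is mandatory for this system, declines to operate. The structure and function names are illustrative assumptions, not any existing API.

#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_SWIOTLB 27ULL /* proposed feature bit */

struct vdev_state {
	uint64_t offered_features;  /* what the device offered */
	uint64_t driver_features;   /* what the driver wrote back */
	bool     bounce_mandatory;  /* VMM cannot reach arbitrary guest RAM */
};

/* Decide whether to accept FEATURES_OK from the driver. */
static bool device_accept_features(const struct vdev_state *d)
{
	bool acked = d->driver_features & (1ULL << VIRTIO_F_SWIOTLB);

	/*
	 * "A device MAY fail to operate further if VIRTIO_F_SWIOTLB is not
	 * accepted": a device whose VMM genuinely cannot touch guest memory
	 * would treat the missing ack as fatal; one that can limp along
	 * (e.g. for compatibility with older drivers) might not.
	 */
	if (d->bounce_mandatory && !acked)
		return false;
	return true;
}
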
On Fri, Apr 04, 2025 at 09:16:44AM +0100, David Woodhouse wrote:
> The sad part is that the system does it all automatically *if* it has CONFIG_DMA_RESTRICTED_POOL (e.g. Linux) and the driver never even notices that the dma_ops it's using are the swiotlb ops using the provided buffer.
>
> Which is *kind* of nice... except that when on a guest OS which *isn't* Linux with CONFIG_DMA_RESTRICTED_POOL, the guest will just ignore the `restricted-dma-pool` node and try DMA to system memory anyway, which will fail.
>
> That's why my proposal adds the negotiated VIRTIO_F_SWIOTLB feature, so that the device side can refuse, if the guest *isn't* agreeing to use the bounce buffer in the situations where it must do so.

That all assumes the indirect DMA is a device "feature". It's not. It is a limitation placed on the guest system.

On Fri, Apr 04, 2025 at 09:16:44AM +0100, David Woodhouse wrote:
> On Fri, 2025-04-04 at 04:09 -0400, Michael S. Tsirkin wrote:
> > On Fri, Apr 04, 2025 at 08:50:47AM +0100, David Woodhouse wrote:
> > > What's annoying is that this should work out of the box *already* with virtio-mmio and a `restricted-dma-pool` — for systems which aren't afflicted by UEFI/ACPI/PCI as their discovery mechanisms.
> >
> > That specifically would be just a driver bugfix then?
>
> I actually think it works out of the box and there isn't even a bug to fix. Haven't tested yet.
>
> The sad part is that the system does it all automatically *if* it has CONFIG_DMA_RESTRICTED_POOL (e.g. Linux) and the driver never even notices that the dma_ops it's using are the swiotlb ops using the provided buffer.
>
> Which is *kind* of nice... except that when on a guest OS which *isn't* Linux with CONFIG_DMA_RESTRICTED_POOL, the guest will just ignore the `restricted-dma-pool` node and try DMA to system memory anyway, which will fail.

I mean, it's easy to misconfigure Linux, this is why we love it ;) Why is this such a concern?

> That's why my proposal adds the negotiated VIRTIO_F_SWIOTLB feature, so that the device side can refuse, if the guest *isn't* agreeing to use the bounce buffer in the situations where it must do so.

OTOH, if you set this feature and make the device force it, you are breaking guests using restricted-dma-pool which worked previously, no?

-- MST

On 4 April 2025 09:32:39 BST, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Fri, Apr 04, 2025 at 09:16:44AM +0100, David Woodhouse wrote:
> > [...]
>
> I mean, it's easy to misconfigure Linux, this is why we love it ;) Why is this such a concern?

Because it's incompatible. In the DT world, perhaps this new *non-optional* feature/restriction should have come with a new "compatible" string such as "virtio-mmio-restricted-dma".

Adding it without backwards compatibility wasn't ideal.

> > That's why my proposal adds the negotiated VIRTIO_F_SWIOTLB feature, so that the device side can refuse, if the guest *isn't* agreeing to use the bounce buffer in the situations where it must do so.
>
> OTOH, if you set this feature and make the device force it, you are breaking guests using restricted-dma-pool which worked previously, no?

Yes. So a platform offering virtio-mmio with restricted DMA, if the driver doesn't accept the offered VIRTIO_F_SWIOTLB, may want to accept that negotiation anyway, and *hope* that the driver/OS are going to use the buffer anyway.

I just didn't want to make that same mistake again when formalising and documenting this, and especially when attempting to extend it to PCI.

On Fri, 2025-04-04 at 10:27 +0100, David Woodhouse wrote:
> [...]
>
> Yes. So a platform offering virtio-mmio with restricted DMA, if the driver doesn't accept the offered VIRTIO_F_SWIOTLB, may want to accept that negotiation anyway, and *hope* that the driver/OS are going to use the buffer anyway.
>
> I just didn't want to make that same mistake again when formalising and documenting this, and especially when attempting to extend it to PCI.

Of course, the beauty of the restricted-dma-pool as supported by DT is that it's a *system* memory buffer, which is actually OK as long as it's reserved address space and not just part of normal system memory that an unsuspecting guest might use for general purposes. So the trusted part of the hypervisor (e.g. pKVM) can *allow* the VMM access to that space.

It doesn't *have* to be on-device. That just seemed like the more natural way to do it for PCI.

I suppose we *could* allow for the virtio-pci transport to do it the same way as virtio-mmio though. The VIRTIO_PCI_CAP_SWIOTLB capability¹ could reference a range of system memory space, just like the `restricted-dma-pool` property does.

It's a weird abstraction especially for a physical PCI device to do that because the system memory space is outside its ownership. But in a physical device it could be writable, and you could consider it the responsibility of the system firmware to configure it appropriately, in accordance with the IOMMU and other DMA restrictions of the platform.

That does solve it for the CoCo case without addressing the P2P staging case that Christoph mentions, though.

¹ I will rename it, Christoph, if it survives at all. Probably VIRTIO_F_RESTRICTED_DMA and VIRTIO_PCI_CAP_RESTRICTED_DMA but of course it depends on the semantics we conclude it should have.

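One possible shape for such a capability, sketched as C structs and modeled on the generic struct virtio_pci_cap from the virtio 1.x spec. The VIRTIO_PCI_CAP_SWIOTLB cfg_type and the addr/len fields are hypothetical; they are exactly the semantics still being debated here.

#include <stdint.h>

/* Existing generic virtio PCI vendor capability header (virtio 1.x spec);
 * all multi-byte fields are little-endian on the wire. */
struct virtio_pci_cap {
	uint8_t  cap_vndr;   /* PCI_CAP_ID_VNDR */
	uint8_t  cap_next;   /* next capability pointer */
	uint8_t  cap_len;    /* capability length */
	uint8_t  cfg_type;   /* identifies the structure */
	uint8_t  bar;        /* which BAR (if any) */
	uint8_t  id;         /* multiple caps of the same type */
	uint8_t  padding[2];
	uint32_t offset;     /* offset within the BAR */
	uint32_t length;     /* length of the structure */
};

/* Hypothetical extension discussed above: instead of (or as well as)
 * pointing into a BAR, describe a range of system memory address space
 * that the device is able to reach, like `restricted-dma-pool` does. */
struct virtio_pci_swiotlb_cap {
	struct virtio_pci_cap cap;  /* cfg_type = VIRTIO_PCI_CAP_SWIOTLB (TBD) */
	uint64_t addr;              /* base of the bounce-buffer range */
	uint64_t len;               /* size of the range in bytes */
};
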
On Fri, Apr 04, 2025 at 11:15:33AM +0100, David Woodhouse wrote:
> Of course, the beauty of the restricted-dma-pool as supported by DT is that it's a *system* memory buffer, which is actually OK as long as it's reserved address space and not just part of normal system memory that an unsuspecting guest might use for general purposes. So the trusted part of the hypervisor (e.g. pKVM) can *allow* the VMM access to that space.
>
> It doesn't *have* to be on-device. That just seemed like the more natural way to do it for PCI.

It isn't very natural. For actual PCI devices it is a lot less efficient compared to bounce buffering in special host memory. For virtual devices that's not the case, but at least for Linux guests it still makes the setup a lot more complicated, as we can't just reuse the restricted-dma-pool but need a new version of it that uses MMIO helpers and device-specific setup code.

On Fri, Apr 04, 2025 at 11:15:33AM +0100, David Woodhouse wrote:
> [...]
>
> I suppose we *could* allow for the virtio-pci transport to do it the same way as virtio-mmio though. The VIRTIO_PCI_CAP_SWIOTLB capability¹ could reference a range of system memory space, just like the `restricted-dma-pool` property does.
>
> It's a weird abstraction especially for a physical PCI device to do that because the system memory space is outside its ownership. But in a physical device it could be writable, and you could consider it the responsibility of the system firmware to configure it appropriately, in accordance with the IOMMU and other DMA restrictions of the platform.
>
> That does solve it for the CoCo case without addressing the P2P staging case that Christoph mentions, though.
>
> ¹ I will rename it, Christoph, if it survives at all. Probably VIRTIO_F_RESTRICTED_DMA and VIRTIO_PCI_CAP_RESTRICTED_DMA but of course it depends on the semantics we conclude it should have.

OK. So basically, all this does, is a promise by the driver to only DMA into a range of memory? This part, I get. I wouldn't put it in a capability, just in config space then.

What I don't get, is what does the *device* want, exactly? Here's a vague idea to explain the question: some embedded devices can have addressing restrictions, such as the number of bits of an address. In another example, the legacy balloon device only supports addresses up to 48 bits. These can still be useful if the driver does not use the inaccessible addresses.

So far so good? Does this look like a generalization of your idea?

Now, a question: the DMA API under Linux, at least, can actually work around device limitations using a bounce buffer. It does, however, need to know what to work around. So, maybe the device needs to expose a range? But what is the range in your case?

Maybe the new registers list the range of addresses the device can access, and the driver promises to be within that range by negotiating the feature bit?

-- MST

On Fri, Apr 04, 2025 at 06:37:16AM -0400, Michael S. Tsirkin wrote:
> some embedded devices can have addressing restrictions, such as the number of bits of an address. In another example, the legacy balloon device only supports addresses up to 48 bits. These can still be useful if the driver does not use the inaccessible addresses.

Let's not mix that up. The DMA API handles addressing limitations just fine, which is totally different from only allowing access to a specific region no matter where it sits.

> Now, a question: the DMA API under Linux, at least, can actually work around device limitations using a bounce buffer. It does, however, need to know what to work around. So, maybe the device needs to expose a range? But what is the range in your case?
>
> Maybe the new registers list the range of addresses the device can access, and the driver promises to be within that range by negotiating the feature bit?

We do not have an issue with a device here. We have a system design that does not allow devices or the hypervisor to access random guest memory, and because of that DMA of the virtual or physical kind needs to go through a designated area that is accessible outside the guest. Trying to frame this as a device limitation is already causing so much trouble in this thread and it will just get worse.

On Fri, 2025-04-04 at 06:37 -0400, Michael S. Tsirkin wrote:
> On Fri, Apr 04, 2025 at 11:15:33AM +0100, David Woodhouse wrote:
> > [...]
>
> OK. So basically, all this does, is a promise by the driver to only DMA into a range of memory?

Basically, yes.

> This part, I get. I wouldn't put it in a capability, just in config space then.

Sure... but how? There are some things which are defined to be at fixed locations in config space, like the vendor/device IDs, COMMAND, STATUS, BARs, etc. And for the rest of the optional things which might be in the config space of a given device... isn't that exactly what capabilities are for?

> What I don't get, is what does the *device* want, exactly?

The device wants to know that a driver won't try to use it without understanding the restriction. Because otherwise the driver will just give it system addresses for DMA and be sad, without any coherent error/failure report about why. (You could ask the same question about what the *device* wants with VIRTIO_F_ACCESS_PLATFORM, and the answer is much the same.)

Or maybe not the *device* per se, but the *system integrator* wants to know that only operating systems which understand the restriction described above will attempt to drive the device in question.

We could achieve that by presenting the device with a completely new PCI device/vendor ID so that old drivers don't match, or in the DT model you could make a new "compatible" string for it. I chose to use a VIRTIO_F_ bit for it instead, which seemed natural and allows the device model (under the influence of the system integrator) to *choose* whether a failure to negotiate such a bit is fatal or not.

> Here's a vague idea to explain the question: some embedded devices can have addressing restrictions, such as the number of bits of an address. In another example, the legacy balloon device only supports addresses up to 48 bits. These can still be useful if the driver does not use the inaccessible addresses.
>
> So far so good? Does this look like a generalization of your idea?

Makes sense.

> Now, a question: the DMA API under Linux, at least, can actually work around device limitations using a bounce buffer. It does, however, need to know what to work around. So, maybe the device needs to expose a range? But what is the range in your case?

The main thing in the CoCo case is that the range in question must not contain any memory which an unenlightened guest might treat as normal system RAM (because it has to be accessible by the VMM, and that would make it insecure if it's being used as general-purpose RAM). So on x86 it might be an e820-reserved region for example.

I don't think we want the guest OS just *assuming* that there's usable memory in that e820-reserved region, just because some device says that it's actually capable of DMA to those addresses.

So it would probably want a separate object, like the separate `restricted-dma-pool` in DT, which explicitly identifies that range as a DMA bounce-buffer pool. We probably *can* do that even in ACPI with a PRP0001 device today, using a `restricted-dma-pool` compatible property.

Then the OS would need to spot this range in the config space, and say "oh, I *do* have a swiotlb pool this device can reach", and use that. Which is slightly less pretty than the DT model where the device's node explicitly references the handle of the `restricted-dma-pool` node, but I think that's OK. And it does kind of keep "device" capability separate from "system" properties, in a comprehensible way.

> Maybe the new registers list the range of addresses the device can access, and the driver promises to be within that range by negotiating the feature bit?

That should work for the CoCo use case, yes.

On Fri, Apr 04, 2025 at 12:15:52PM +0100, David Woodhouse wrote:
> > What I don't get, is what does the *device* want, exactly?
>
> The device wants to know that a driver won't try to use it without understanding the restriction. Because otherwise the driver will just give it system addresses for DMA and be sad, without any coherent error/failure report about why. (You could ask the same question about what the *device* wants with VIRTIO_F_ACCESS_PLATFORM, and the answer is much the same.)
>
> Or maybe not the *device* per se, but the *system integrator* wants to know that only operating systems which understand the restriction described above will attempt to drive the device in question.
>
> We could achieve that by presenting the device with a completely new PCI device/vendor ID so that old drivers don't match, or in the DT model you could make a new "compatible" string for it. I chose to use a VIRTIO_F_ bit for it instead, which seemed natural and allows the device model (under the influence of the system integrator) to *choose* whether a failure to negotiate such a bit is fatal or not.

Let's focus on the mmio part, for simplicity. So IIUC there's a devicetree attribute restricted dma, that guests currently simply ignore. You want to fix it in the guest, but you also want to find a clean way to detect that it's fixed. Right?

And if so, my question is, why this specific bug especially? There likely are a ton of bugs, some more catastrophic than just crashing the guest, like data corruption. Is it because we were supposed to add it to the virtio spec but did not?

-- MST

On Mon, 2025-04-07 at 08:14 -0400, Michael S. Tsirkin wrote:
> On Fri, Apr 04, 2025 at 12:15:52PM +0100, David Woodhouse wrote:
> > [...]
>
> Let's focus on the mmio part, for simplicity. So IIUC there's a devicetree attribute restricted dma, that guests currently simply ignore.

No, I think Linux guests with CONFIG_DMA_RESTRICTED_POOL enabled will honour it just fine and do the right thing. For virtio-mmio.

> You want to fix it in the guest, but you also want to find a clean way to detect that it's fixed. Right?

For virtio-mmio I believe there's no bug to fix. Sure, it would have been nice if a feature bit had been added when this attribute was defined, so that a device which *knows* it has this restriction could gracefully decline to operate with a driver which doesn't understand it. But as you pointed out, that ship has sailed. Enforcing a new feature bit now for virtio-mmio devices would break compatibility even with existing drivers in kernels which *do* have CONFIG_DMA_RESTRICTED_POOL enabled and which do work today. There's nothing to fix here for virtio-mmio.

> And if so, my question is, why this specific bug especially? There likely are a ton of bugs, some more catastrophic than just crashing the guest, like data corruption. Is it because we were supposed to add it to the virtio spec but did not?

If I could easily expose virtio-mmio on an ACPI-afflicted guest instead of using virtio-pci, maybe that might suffice¹. But while we might be able to hack that up for Linux guests, I don't think that's feasible for the general case.

So what I'm trying to do here is just bring the same capability to virtio-pci that already exists for virtio-mmio. And if we do so I *would* like to get it right this time and add that feature bit.

¹ It'd want MSI support, and the CONFIG_DMA_RESTRICTED_POOL support in the kernel is a bit OF-specific and probably needs to be made a bit more generic in some places. But I'm prepared to consider those as implementation details, and the latter probably needs to be done anyway for any of the approaches we've considered for supporting this in virtio-pci.

On Fri, Apr 04, 2025 at 12:15:52PM +0100, David Woodhouse wrote:
> We could achieve that by presenting the device with a completely new PCI device/vendor ID so that old drivers don't match, or in the DT model you could make a new "compatible" string for it. I chose to use a VIRTIO_F_ bit for it instead, which seemed natural and allows the device model (under the influence of the system integrator) to *choose* whether a failure to negotiate such a bit is fatal or not.

Stop thinking about devices. Your CoCo VM will have that exact same limitation for all devices, because none of them can DMA into random memory. The solution to the compatibility problem is to just not boot an OS that isn't aware of the exact CoCo scheme into a CoCo VM. Bonus points for figuring out something clever at the system level that lets the boot fail gracefully early on for that case.

> The main thing in the CoCo case is that the range in question must not contain any memory which an unenlightened guest might treat as normal system RAM (because it has to be accessible by the VMM, and that would make it insecure if it's being used as general-purpose RAM).

Yes.

> So on x86 it might be an e820-reserved region for example.

Hasn't e820 been replaced with something more "elaborate" for UEFI systems anyway?

> I don't think we want the guest OS just *assuming* that there's usable memory in that e820-reserved region, just because some device says that it's actually capable of DMA to those addresses.
>
> So it would probably want a separate object, like the separate `restricted-dma-pool` in DT, which explicitly identifies that range as a DMA bounce-buffer pool. We probably *can* do that even in ACPI with a PRP0001 device today, using a `restricted-dma-pool` compatible property.
>
> Then the OS would need to spot this range in the config space, and say "oh, I *do* have a swiotlb pool this device can reach", and use that.

Yes, that's largely how it should work.

On Mon, 2025-04-07 at 00:30 -0700, Christoph Hellwig wrote:
> On Fri, Apr 04, 2025 at 12:15:52PM +0100, David Woodhouse wrote:
> > We could achieve that by presenting the device with a completely new PCI device/vendor ID so that old drivers don't match, or in the DT model you could make a new "compatible" string for it. I chose to use a VIRTIO_F_ bit for it instead, which seemed natural and allows the device model (under the influence of the system integrator) to *choose* whether a failure to negotiate such a bit is fatal or not.
>
> Stop thinking about devices. Your CoCo VM will have that exact same limitation for all devices, because none of them can DMA into random memory.

Nah, most of them are just fine because they're actual passthrough PCI devices behind a proper 2-stage IOMMU.

> > So on x86 it might be an e820-reserved region for example.
>
> Hasn't e820 been replaced with something more "elaborate" for UEFI systems anyway?

Sure, it's no longer literally a BIOS call with 0xe820 in any registers, but Linux still calls it that.

> > I don't think we want the guest OS just *assuming* that there's usable memory in that e820-reserved region, just because some device says that it's actually capable of DMA to those addresses.
> >
> > So it would probably want a separate object, like the separate `restricted-dma-pool` in DT, which explicitly identifies that range as a DMA bounce-buffer pool. We probably *can* do that even in ACPI with a PRP0001 device today, using a `restricted-dma-pool` compatible property.
> >
> > Then the OS would need to spot this range in the config space, and say "oh, I *do* have a swiotlb pool this device can reach", and use that.
>
> Yes, that's largely how it should work.

The problem in ACPI is matching the device to that SWIOTLB pool. I think we can expose a `restricted-dma-pool` node via PRP0001 but then we need to associate a particular device (or set of devices) with that pool. In DT we do that by referencing it from a `memory-region` node of the device itself.

The idea above was that the affected devices would state that they are only capable of DMA to that range, and that's how we'd match them up.

On Mon, Apr 07, 2025 at 08:54:46AM +0100, David Woodhouse wrote:
> On Mon, 2025-04-07 at 00:30 -0700, Christoph Hellwig wrote:
> > Stop thinking about devices. Your CoCo VM will have that exact same limitation for all devices, because none of them can DMA into random memory.
>
> Nah, most of them are just fine because they're actual passthrough PCI devices behind a proper 2-stage IOMMU.

Except for all virtual devices.

> > > Then the OS would need to spot this range in the config space, and say "oh, I *do* have a swiotlb pool this device can reach", and use that.
> >
> > Yes, that's largely how it should work.
>
> The problem in ACPI is matching the device to that SWIOTLB pool. I think we can expose a `restricted-dma-pool` node via PRP0001 but then we need to associate a particular device (or set of devices) with that pool. In DT we do that by referencing it from a `memory-region` node of the device itself.

I don't think you actually _need_ to have an explicit device vs pool match. All pools in host memory (assuming there is more than one) should be usable for all devices bar actual addressing limits that are handled in the DMA layer already. The only things you need are:

a) a way to declare one or more pools

b) a way to distinguish between devices behind a two-stage IOMMU vs not, to figure out if they need to use a pool

On Mon, 2025-04-07 at 02:05 -0700, Christoph Hellwig wrote: > On Mon, Apr 07, 2025 at 08:54:46AM +0100, David Woodhouse wrote: > > On Mon, 2025-04-07 at 00:30 -0700, Christoph Hellwig wrote: > > > On Fri, Apr 04, 2025 at 12:15:52PM +0100, David Woodhouse wrote: > > > > We could achieve that by presenting the device with a completely new > > > > PCI device/vendor ID so that old drivers don't match, or in the DT > > > > model you could make a new "compatible" string for it. I chose to use a > > > > VIRTIO_F_ bit for it instead, which seemed natural and allows the > > > > device model (under the influence of the system integrator) to *choose* > > > > whether a failure to negotiate such bit is fatal or not. > > > > > > Stop thinking about devices. Your CoCo VM will have that exact same > > > limitation for all devices, because none of them can DMA into random > > > memory. > > > > Nah, most of them are just fine because they're actual passthrough PCI > > devices behind a proper 2-stage IOMMU. > > Except for all virtual devices. Yes, that's what I'm saying. And that's also why it's reasonable to have a solution which handles this for virtio devices, without necessarily having to handle it for *arbitrary* emulated PCI devices across the whole system, and without having to change core concepts of DMA handling across all operating systems. This isn't just about Linux guests, and especially not just about Linux guests running running 6.16+ kernels. A solution which can live in a device driver is a *lot* easier to actually get into the hands of users. Not just Windows users, but even the slower-moving Linux distros. > > > > Then the OS would need to spot this range in the config space, and say > > > > "oh, I *do* have a swiotlb pool this device can reach", and use that. > > > > > > Yes, that's largely how it should work. > > > > The problem in ACPI is matching the device to that SWIOTLB pool. I > > think we can expose a `restricted-dma-pool` node via PRP0001 but then > > we need to associate a particular device (or set of devices) to that > > pool. In DT we do that by referencing it from a `memory-region` node of > > the device itself. > > I don't think you actually _need_ to have an explicity device vs pool > match. All pools in host memory (assuming there is more than one) > should be usable for all devices bar actual addressing limits that are > handled in the dma layer already. The only things you need is: > > a) a way to declare one or more pools > b) a way to destinguish between devices behind a two stage IOMMU vs not > to figure out if they need to use a pool I'm not averse to that, but it's different to the `restricted-dma-pool` model that's defined today which has explicit matching. So I'd like to reconcile them — and preferably literally use PRP0001 to expose `restricted-dma-pool` even under ACPI. Maybe it's as simple as a flag/property on the `restricted-dma-pool` node which declares that it's a 'catch-all', and that *all* devices which aren't explicitly bound to an IOMMU or other DMA operations (e.g. explicitly bound to a different restricted-dma-pool) should use it? That's actually what of_dma_configure_id() does today even with the existing `restricted-dma-pool` which *is* explicitly matched to a device; the pool still only gets used if of_configure_iommu() doesn't find a better option. But I would also be entirely OK with following the existing model and having the virtio device itself provide a reference to the restricted- dma-pool. 
Either by its address, or by some other handle/reference. Referencing it by address, which is what Michael and I were discussing, is simple enough. And in this model it *is* a device restriction — that virtio device knows full well that it can only perform DMA to that range of addresses. And it also allows for a standalone device driver to go check for the corresponding ACPI device, claim that memory and set up the bounce buffering for *itself*, in operating systems which don't support it. Although I suppose a standalone device driver can do the same even in the 'catch-all' model. Either way, we'd still ideally want a virtio feature bit to say "don't touch me if you don't understand my DMA restrictions", to prevent older drivers (on older operating systems) from failing. (I know you say that's a 'restriction' not a 'feature', but that's just fine. That's why it's a *negotiation* not simply the device advertising optional features which the driver may choose to use, or ignore, as it sees fit.)
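For illustration, the fallback order being described here (an explicitly bound pool, else a 'catch-all' pool, and in either case only when no real IOMMU is available) amounts to roughly the sketch below. The helper names and the opaque 'struct device' are invented stand-ins, not real kernel APIs; the actual logic lives in of_dma_configure_id() and the IOMMU plumbing.

    #include <stddef.h>
    #include <stdint.h>

    struct device;                          /* opaque stand-in */

    struct restricted_pool {
            uint64_t base;
            size_t size;
    };

    /* All four helpers are hypothetical, for illustration only. */
    extern int platform_setup_iommu_ops(struct device *dev);
    extern int find_pool_bound_to_device(struct device *dev,
                                         struct restricted_pool *out);
    extern int find_catchall_pool(struct restricted_pool *out);
    extern int use_swiotlb_pool(struct device *dev,
                                const struct restricted_pool *pool);

    static int setup_device_dma(struct device *dev)
    {
            struct restricted_pool pool;

            /* A real (two-stage) IOMMU always takes precedence over any pool. */
            if (platform_setup_iommu_ops(dev) == 0)
                    return 0;

            /* A pool explicitly bound to this device, as DT 'memory-region' does today. */
            if (find_pool_bound_to_device(dev, &pool) == 0)
                    return use_swiotlb_pool(dev, &pool);

            /* Otherwise a declared 'catch-all' pool applies to every such device. */
            if (find_catchall_pool(&pool) == 0)
                    return use_swiotlb_pool(dev, &pool);

            /* No IOMMU and no pool: plain direct DMA to system memory. */
            return 0;
    }

Either model (explicit matching or catch-all) only changes which of the two lookup steps succeeds; the IOMMU-first ordering is the same in both.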
On Mon, Apr 07, 2025 at 11:09:54AM +0100, David Woodhouse wrote: > > Except for all virtual devices. > > Yes, that's what I'm saying. > > And that's also why it's reasonable to have a solution which handles > this for virtio devices, without necessarily having to handle it for > *arbitrary* emulated PCI devices across the whole system, and without > having to change core concepts of DMA handling across all operating > systems. Except that you might not have a working two stage iommu everywhere anyway. So fix it for real to cover all these cases and skip your layering violations as a nice little benefit. > This isn't just about Linux guests, and especially not just about Linux > guests running running 6.16+ kernels. That shoudn't matter. No interface should grow stupid hacks ever just to support legacy guests. Get the interface right first and then implement it in all places you need. > A solution which can live in a device driver is a *lot* easier to > actually get into the hands of users. Not just Windows users, but even > the slower-moving Linux distros. Don't go there, that just gets us stupid crap like the Intel hiding NVMe behind AHCI hack or PCIe firmware first error handling. Get the interface right and then work on making it fit later. > > should be usable for all devices bar actual addressing limits that are > > handled in the dma layer already. The only things you need is: > > > > a) a way to declare one or more pools > > b) a way to destinguish between devices behind a two stage IOMMU vs not > > to figure out if they need to use a pool > > I'm not averse to that, but it's different to the `restricted-dma-pool` > model that's defined today which has explicit matching. So I'd like to > reconcile them — and preferably literally use PRP0001 to expose > `restricted-dma-pool` even under ACPI. A per-device flag is probably fine and easier for some things like pool sizing. It also would be worse for other things like eventually implementing percpu pools. > Maybe it's as simple as a flag/property on the `restricted-dma-pool` > node which declares that it's a 'catch-all', and that *all* devices > which aren't explicitly bound to an IOMMU or other DMA operations (e.g. > explicitly bound to a different restricted-dma-pool) should use it? Yeah. Similar what we do with the generic shared-dma-pool. > Either way, we'd still ideally want a virtio feature bit to say "don't > touch me if you don't understand my DMA restrictions", to prevent older > drivers (on older operating systems) from failing. As said before they really need system level awareness of a coco host. That's not something to be hacked into virtio drivers.
On Mon, 2025-04-07 at 07:06 -0700, Christoph Hellwig wrote: > On Mon, Apr 07, 2025 at 11:09:54AM +0100, David Woodhouse wrote: > > > should be usable for all devices bar actual addressing limits that are > > > handled in the dma layer already. The only things you need is: > > > > > > a) a way to declare one or more pools > > > b) a way to destinguish between devices behind a two stage IOMMU vs not > > > to figure out if they need to use a pool > > > > I'm not averse to that, but it's different to the `restricted-dma-pool` > > model that's defined today which has explicit matching. So I'd like to > > reconcile them — and preferably literally use PRP0001 to expose > > `restricted-dma-pool` even under ACPI. > > A per-device flag is probably fine and easier for some things like > pool sizing. It also would be worse for other things like eventually > implementing percpu pools. What exactly are you thinking of when you say a 'per-device' flag? > > Maybe it's as simple as a flag/property on the `restricted-dma-pool` > > node which declares that it's a 'catch-all', and that *all* devices > > which aren't explicitly bound to an IOMMU or other DMA operations (e.g. > > explicitly bound to a different restricted-dma-pool) should use it? > > Yeah. Similar what we do with the generic shared-dma-pool. Yeah, I was trying to work that one out, and it looks like I'm almost describing the 'linux,dma-default' property. Except that seems to be only for coherent allocations on the `shared-dma-pool`. Maybe it makes sense for 'linux,dma-default' on a restricted-dma-pool to mean that it should be used as a swiotlb for *all* DMA? I don't much like the "linux," prefix but can live with that if that's the precedent. > > Either way, we'd still ideally want a virtio feature bit to say "don't > > touch me if you don't understand my DMA restrictions", to prevent older > > drivers (on older operating systems) from failing. > > As said before they really need system level awareness of a coco host. > That's not something to be hacked into virtio drivers. Nah. As long as they use the IOMMU they're offered, guests don't need any awareness at all that they're running in an environment where the VMM can't access all their memory. There is precisely *zero* enablement required in the guest for that. The only problem is virtual devices which are provided by that VMM. If the guests *do* have system level awareness of the particular variant of CoCo host they're running on, then sure, they can explicitly manage sharing/encryption/whatever with a system-specific set of DMA operations and the problem goes away. But that's a long way off. How does this sound as a plan... 1. Define 'linux,dma-default' on a 'restricted-dma-pool' object to mean that it should be used as a fallback SWIOTLB for *all* devices (basically, as if each device explicitly referenced it as they do today). 2. Provide a way to represent `restricted-dma-pool` in ACPI (which ought to work naturally as PRP0001 and it's mostly a matter of untangling it from OF-specific code on the guest side). 3. Add a VIRTIO_F_RESTRICTED_DMA feature to virtio, so that devices which *know* they have such a limitation can gracefully decline to operate with a driver which doesn't support it. 4. Add a way for a virtio-pci device to explicitly reference a given `restricted-dma-pool` like virtio-mmio devices can today. 
And it probably makes sense for a virtio-pci device which requires the VIRTIO_F_RESTRICTED_DMA feature to explicitly reference such a pool even if that pool has the `linux,dma-default` property? With #1 and #2 we provide this facility even for passthrough PCI devices. And with #3 and #4 we make it work *better* for virtio, which is the more important use case (since those two-stage IOMMUs *do* exist in the case of CoCo-supporting hardware; it's just that they don't cover *virtual* devices).
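A minimal sketch of the device-model side of step 3, assuming a feature bit along the lines of the draft's VIRTIO_F_SWIOTLB (27); the final name and number would be whatever the spec eventually defines, and the function here is invented for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    #define VIRTIO_F_RESTRICTED_DMA 27      /* provisional, per this draft */

    /*
     * If the device knows it can only DMA to its restricted region, it
     * declines to complete negotiation unless the driver accepted the bit,
     * so an old driver fails cleanly at FEATURES_OK time rather than
     * issuing DMA the device cannot perform.
     */
    static bool device_accepts_features(uint64_t driver_features,
                                        bool requires_restricted_dma)
    {
            bool accepted = driver_features & (1ULL << VIRTIO_F_RESTRICTED_DMA);

            if (requires_restricted_dma && !accepted)
                    return false;   /* leave FEATURES_OK clear */

            return true;
    }

A device model under a less strict system integrator could instead return true here and merely hope the driver honours the restriction, which is the graceful-degradation choice discussed earlier in the thread.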
On Fri, Apr 04, 2025 at 12:15:52PM +0100, David Woodhouse wrote: > On Fri, 2025-04-04 at 06:37 -0400, Michael S. Tsirkin wrote: > > On Fri, Apr 04, 2025 at 11:15:33AM +0100, David Woodhouse wrote: > > > On Fri, 2025-04-04 at 10:27 +0100, David Woodhouse wrote: > > > > On 4 April 2025 09:32:39 BST, "Michael S. Tsirkin" <mst@redhat.com> > > > > wrote: > > > > > On Fri, Apr 04, 2025 at 09:16:44AM +0100, David Woodhouse wrote: > > > > > > On Fri, 2025-04-04 at 04:09 -0400, Michael S. Tsirkin wrote: > > > > > > > On Fri, Apr 04, 2025 at 08:50:47AM +0100, David Woodhouse > > > > > > > wrote: > > > > > > > > What's annoying is that this should work out of the box > > > > > > > > *already* with > > > > > > > > virtio-mmio and a `restricted-dma-pool` — for systems which > > > > > > > > aren't > > > > > > > > afflicted by UEFI/ACPI/PCI as their discovery mechanisms. > > > > > > > > > > > > > > > > > > > > > That specifically would be just a driver bugfix then? > > > > > > > > > > > > I actually think it works out of the box and there isn't even a > > > > > > bug to > > > > > > fix. Haven't tested yet. > > > > > > > > > > > > The sad part is that the system does it all automatically *if* it > > > > > > has > > > > > > CONFIG_DMA_RESTRICTED_POOL (e.g. Linux) and the driver never even > > > > > > notices that the dma_ops it's using are the swiotlb ops using the > > > > > > provided buffer. > > > > > > > > > > > > Which is *kind* of nice... except that when on a guest OS which > > > > > > *isn't* > > > > > > Linux with CONFIG_DMA_RESTRICTED_POOL, the guest will just ignore > > > > > > the > > > > > > `restricted-dma-pool` node and try DMA to system memory anyway, > > > > > > which > > > > > > will fail. > > > > > > > > > > I mean, it's easy to misconfigure Linux, this is why we love it ;) > > > > > Why > > > > > is this such a concern? > > > > > > > > Because it's incompatible. In the DT world, perhaps this new *non- > > > > optional* feature/restriction should have come with a new > > > > "compatible" string such as "virtio-mmio-restricted-dma". > > > > > > > > Adding it without backwards compatibility wasn't ideal. > > > > > > > > > > That's why my proposal adds the negotiated VIRTIO_F_SWIOTLB > > > > > > feature, so > > > > > > that the device side can refuse, if the guest *isn't* agreeing to > > > > > > use > > > > > > the bounce buffer in the situations where it must do so. > > > > > > > > > > > > > > > OTOH then setting this feature and if you make the device force it, > > > > > you are breaking guests restricted-dma-pool which worked > > > > > previously, no? > > > > > > > > Yes. So a platform offering virtio-mmio with restricted DMA, if the > > > > driver doesn't accept the offered VIRTIO_F_SWIOTLB, may want to > > > > accept that negotiation anyway, and *hope* that the driver/OS are > > > > going to use the buffer anyway. > > > > > > > > I just didn't want to make that same mistake again when formalising > > > > and documenting this, and especially when attempting to extend it to > > > > PCI. > > > > > > Of course, the beauty of the restricted-dma-pool as supported by DT is > > > that it's a *system* memory buffer, which is actually OK as long as > > > it's reserved address space and not just part of normal system memory > > > that an unsuspecting guest might use for general purposes. So the > > > trusted part of the hypervisor (e.g. pKVM) can *allow* the VMM access > > > to that space. > > > > > > It doesn't *have* to be on-device. 
That just seemed like the more > > > natural way to do it for PCI. > > > > > > I suppose we *could* allow for the virtio-pci transport to do it the > > > same way as virtio-mmio though. The VIRTIO_PCI_CAP_SWIOTLB capability¹ > > > could reference a range of system memory space, just like the > > > `restricted-dma-pool` property does. > > > > > > It's a weird abstraction especially for a physical PCI device to do > > > that because the system memory space is outside its ownership. But in a > > > physical device it could be writable, and you could consider it the > > > responsibility of the system firmware to configure it appropriately, in > > > accordance with the IOMMU and other DMA restrictions of the platform. > > > > > > That does solve it for the CoCo case without addressing the P2P staging > > > case that Christoph mentions, though. > > > > > > > > > ¹ I will rename it, Christoph, if it survives at all. Probably > > > VIRTIO_F_RESTRICTED_DMA and VIRTIO_PCI_CAP_RESTRICTED_DMA but of course > > > it depends on the semantics we conclude it should have. > > > > OK. So basically, all this does, is a promise by driver to only > > DMA into a range of memory? > > Basically, yes. > > > This part, I get. I wouldn't put it in a capability, just in config > > space then. > > Sure... but how? There are some things which are defined to be at fixed > locations in config space, like the vendor/device IDs, COMMAND, STATUS, > BARs, etc.. > > And for the rest of the optional things which might be in config space > of a given device... isn't that exactly what capabilities are for? Sorry I am unclear. Not the pci config space. The virtio config space. After admin_queue_num ?
On 6 April 2025 19:28:08 BST, "Michael S. Tsirkin" <mst@redhat.com> wrote: >On Fri, Apr 04, 2025 at 12:15:52PM +0100, David Woodhouse wrote: >> On Fri, 2025-04-04 at 06:37 -0400, Michael S. Tsirkin wrote: >> > On Fri, Apr 04, 2025 at 11:15:33AM +0100, David Woodhouse wrote: >> > > On Fri, 2025-04-04 at 10:27 +0100, David Woodhouse wrote: >> > > > On 4 April 2025 09:32:39 BST, "Michael S. Tsirkin" <mst@redhat.com> >> > > > wrote: >> > > > > On Fri, Apr 04, 2025 at 09:16:44AM +0100, David Woodhouse wrote: >> > > > > > On Fri, 2025-04-04 at 04:09 -0400, Michael S. Tsirkin wrote: >> > > > > > > On Fri, Apr 04, 2025 at 08:50:47AM +0100, David Woodhouse >> > > > > > > wrote: >> > > > > > > > What's annoying is that this should work out of the box >> > > > > > > > *already* with >> > > > > > > > virtio-mmio and a `restricted-dma-pool` — for systems which >> > > > > > > > aren't >> > > > > > > > afflicted by UEFI/ACPI/PCI as their discovery mechanisms. >> > > > > > > >> > > > > > > >> > > > > > > That specifically would be just a driver bugfix then? >> > > > > > >> > > > > > I actually think it works out of the box and there isn't even a >> > > > > > bug to >> > > > > > fix. Haven't tested yet. >> > > > > > >> > > > > > The sad part is that the system does it all automatically *if* it >> > > > > > has >> > > > > > CONFIG_DMA_RESTRICTED_POOL (e.g. Linux) and the driver never even >> > > > > > notices that the dma_ops it's using are the swiotlb ops using the >> > > > > > provided buffer. >> > > > > > >> > > > > > Which is *kind* of nice... except that when on a guest OS which >> > > > > > *isn't* >> > > > > > Linux with CONFIG_DMA_RESTRICTED_POOL, the guest will just ignore >> > > > > > the >> > > > > > `restricted-dma-pool` node and try DMA to system memory anyway, >> > > > > > which >> > > > > > will fail. >> > > > > >> > > > > I mean, it's easy to misconfigure Linux, this is why we love it ;) >> > > > > Why >> > > > > is this such a concern? >> > > > >> > > > Because it's incompatible. In the DT world, perhaps this new *non- >> > > > optional* feature/restriction should have come with a new >> > > > "compatible" string such as "virtio-mmio-restricted-dma". >> > > > >> > > > Adding it without backwards compatibility wasn't ideal. >> > > > >> > > > > > That's why my proposal adds the negotiated VIRTIO_F_SWIOTLB >> > > > > > feature, so >> > > > > > that the device side can refuse, if the guest *isn't* agreeing to >> > > > > > use >> > > > > > the bounce buffer in the situations where it must do so. >> > > > > >> > > > > >> > > > > OTOH then setting this feature and if you make the device force it, >> > > > > you are breaking guests restricted-dma-pool which worked >> > > > > previously, no? >> > > > >> > > > Yes. So a platform offering virtio-mmio with restricted DMA, if the >> > > > driver doesn't accept the offered VIRTIO_F_SWIOTLB, may want to >> > > > accept that negotiation anyway, and *hope* that the driver/OS are >> > > > going to use the buffer anyway. >> > > > >> > > > I just didn't want to make that same mistake again when formalising >> > > > and documenting this, and especially when attempting to extend it to >> > > > PCI. >> > > >> > > Of course, the beauty of the restricted-dma-pool as supported by DT is >> > > that it's a *system* memory buffer, which is actually OK as long as >> > > it's reserved address space and not just part of normal system memory >> > > that an unsuspecting guest might use for general purposes. So the >> > > trusted part of the hypervisor (e.g. 
pKVM) can *allow* the VMM access >> > > to that space. >> > > >> > > It doesn't *have* to be on-device. That just seemed like the more >> > > natural way to do it for PCI. >> > > >> > > I suppose we *could* allow for the virtio-pci transport to do it the >> > > same way as virtio-mmio though. The VIRTIO_PCI_CAP_SWIOTLB capability¹ >> > > could reference a range of system memory space, just like the >> > > `restricted-dma-pool` property does. >> > > >> > > It's a weird abstraction especially for a physical PCI device to do >> > > that because the system memory space is outside its ownership. But in a >> > > physical device it could be writable, and you could consider it the >> > > responsibility of the system firmware to configure it appropriately, in >> > > accordance with the IOMMU and other DMA restrictions of the platform. >> > > >> > > That does solve it for the CoCo case without addressing the P2P staging >> > > case that Christoph mentions, though. >> > > >> > > >> > > ¹ I will rename it, Christoph, if it survives at all. Probably >> > > VIRTIO_F_RESTRICTED_DMA and VIRTIO_PCI_CAP_RESTRICTED_DMA but of course >> > > it depends on the semantics we conclude it should have. >> > >> > OK. So basically, all this does, is a promise by driver to only >> > DMA into a range of memory? >> >> Basically, yes. >> >> > This part, I get. I wouldn't put it in a capability, just in config >> > space then. >> >> Sure... but how? There are some things which are defined to be at fixed >> locations in config space, like the vendor/device IDs, COMMAND, STATUS, >> BARs, etc.. >> >> And for the rest of the optional things which might be in config space >> of a given device... isn't that exactly what capabilities are for? > > >Sorry I am unclear. Not the pci config space. The virtio config space. >After admin_queue_num ? > > Makes sense. Let me knock up a second RFC based on that, and test using PRP0001 as a vehicle for `restricted-dma-pool` in an ACPI-afflicted guest.
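As a rough sketch of what "in the virtio config space, after admin_queue_num" might look like for the PCI common configuration structure: the two new fields and their names below are invented for illustration, and nothing like this is in the spec yet.

    #include <stdint.h>

    typedef uint16_t le16;
    typedef uint64_t le64;

    struct virtio_pci_common_cfg_sketch {
            /* ... existing common configuration fields ... */
            le16 admin_queue_index;
            le16 admin_queue_num;

            /* Provisional addition: the restricted DMA / bounce-buffer range. */
            le64 restricted_dma_addr;   /* guest-physical base of the region */
            le64 restricted_dma_len;    /* length of the region in bytes */
    };

Putting the range here rather than in a new PCI capability keeps it transport-generic in the sense Michael suggests, at the cost of the PCI device describing an address range it does not itself own.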
On 4/2/2025 7:04 PM, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > Device-tree bindings for `restricted-dma-pool` were defined in 2021, which > allow devices to be restricted to a given SWIOTLB pool instead of allowing > DMA to arbitrary system memory: > https://lore.kernel.org/all/20210624155526.2775863-1-tientzu@chromium.org/ > > This facility was not specific to virtio-mmio, but does apply to it. No > attempt was made to ensure backward-compatibility for virtio-mmio devices. > > Define a VIRTIO_F_SWIOTLB feature which allows the device and driver to > agree on the use of the SWIOTLB, if present. This enables the device to > refuse to operate further if the driver does not support the SWIOTLB > requirement expressed in the device-tree. > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > --- > content.tex | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/content.tex b/content.tex > index c17ffa6..63d075f 100644 > --- a/content.tex > +++ b/content.tex > @@ -773,6 +773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > Currently these device-independent feature bits are defined: > > \begin{description} > + \item[VIRTIO_F_SWIOTLB (27)] This feature indicates that the device > + transport provides a memory region which is to be used for bounce > + buffering, rather than permitting direct memory access to system memory. Hello David IMHO, if we need a bounce buffer, why not place it on the host memory? Because if the bounce buffer resides in the device memory, it requires CPU traverse the pci bus, that can be 10 times slower than accessing host memory. This performance overhead can be worse when transmitting small packets, and stress the CPU cores, even lead to IO starvation when CPU over-commit. The device side memory is a host thing to the guest anyway. Thanks Zhu Lingshan > \item[VIRTIO_F_INDIRECT_DESC (28)] Negotiating this feature indicates > that the driver can use descriptors with the VIRTQ_DESC_F_INDIRECT > flag set, as described in \ref{sec:Basic Facilities of a Virtio > @@ -807,6 +810,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > the driver. When clear, this overrides any platform-specific description of > whether device access is limited or translated in any way, e.g. > whether an IOMMU may be present. > + If a the device transport provides a software IOTLB bounce buffer, > + addresses within its range are not subject to the requirements of > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''. > \item[VIRTIO_F_RING_PACKED(34)] This feature indicates > support for the packed virtqueue layout as described in > \ref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}~\nameref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}. > @@ -885,6 +891,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > VIRTIO_F_ACCESS_PLATFORM is not offered, then a driver MUST pass only physical > addresses to the device. > > +A driver SHOULD accept VIRTIO_F_SWIOTLB if it is offered, and it MUST > +then pass only addresses within the Software IOTLB bounce buffer to the > +device. > + > A driver SHOULD accept VIRTIO_F_RING_PACKED if it is offered. > > A driver SHOULD accept VIRTIO_F_ORDER_PLATFORM if it is offered. > @@ -921,6 +931,8 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not > accepted. 
> > +A device MAY fail to operate further if VIRTIO_F_SWIOTLB is not accepted. > + > If VIRTIO_F_IN_ORDER has been negotiated, a device MUST use > buffers in the same order in which they have been available. >
On Thu, Apr 03, 2025 at 03:13:30PM +0800, Zhu Lingshan wrote: > On 4/2/2025 7:04 PM, David Woodhouse wrote: > > From: David Woodhouse <dwmw@amazon.co.uk> > > > > Device-tree bindings for `restricted-dma-pool` were defined in 2021, which > > allow devices to be restricted to a given SWIOTLB pool instead of allowing > > DMA to arbitrary system memory: > > https://lore.kernel.org/all/20210624155526.2775863-1-tientzu@chromium.org/ > > > > This facility was not specific to virtio-mmio, but does apply to it. No > > attempt was made to ensure backward-compatibility for virtio-mmio devices. > > > > Define a VIRTIO_F_SWIOTLB feature which allows the device and driver to > > agree on the use of the SWIOTLB, if present. This enables the device to > > refuse to operate further if the driver does not support the SWIOTLB > > requirement expressed in the device-tree. > > > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > > --- > > content.tex | 12 ++++++++++++ > > 1 file changed, 12 insertions(+) > > > > diff --git a/content.tex b/content.tex > > index c17ffa6..63d075f 100644 > > --- a/content.tex > > +++ b/content.tex > > @@ -773,6 +773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > > Currently these device-independent feature bits are defined: > > > > \begin{description} > > + \item[VIRTIO_F_SWIOTLB (27)] This feature indicates that the device > > + transport provides a memory region which is to be used for bounce > > + buffering, rather than permitting direct memory access to system memory. > Hello David > > IMHO, if we need a bounce buffer, why not place it on the host memory? > Because if the bounce buffer resides in the device memory, it requires CPU traverse the pci bus, > that can be 10 times slower than accessing host memory. > This performance overhead can be worse when transmitting small packets, > and stress the CPU cores, even lead to IO starvation when CPU over-commit. > The device side memory is a host thing to the guest anyway. > > Thanks > Zhu Lingshan Indeed I personally do not exactly get why implement a virtual system without an IOMMU when virtio-iommu is available. I have a feeling it's about lack of windows drivers for virtio-iommu at this point. > > \item[VIRTIO_F_INDIRECT_DESC (28)] Negotiating this feature indicates > > that the driver can use descriptors with the VIRTQ_DESC_F_INDIRECT > > flag set, as described in \ref{sec:Basic Facilities of a Virtio > > @@ -807,6 +810,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > > the driver. When clear, this overrides any platform-specific description of > > whether device access is limited or translated in any way, e.g. > > whether an IOMMU may be present. > > + If a the device transport provides a software IOTLB bounce buffer, > > + addresses within its range are not subject to the requirements of > > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''. > > \item[VIRTIO_F_RING_PACKED(34)] This feature indicates > > support for the packed virtqueue layout as described in > > \ref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}~\nameref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}. > > @@ -885,6 +891,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > > VIRTIO_F_ACCESS_PLATFORM is not offered, then a driver MUST pass only physical > > addresses to the device. 
> > > > +A driver SHOULD accept VIRTIO_F_SWIOTLB if it is offered, and it MUST > > +then pass only addresses within the Software IOTLB bounce buffer to the > > +device. > > + > > A driver SHOULD accept VIRTIO_F_RING_PACKED if it is offered. > > > > A driver SHOULD accept VIRTIO_F_ORDER_PLATFORM if it is offered. > > @@ -921,6 +931,8 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > > A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not > > accepted. > > > > +A device MAY fail to operate further if VIRTIO_F_SWIOTLB is not accepted. > > + > > If VIRTIO_F_IN_ORDER has been negotiated, a device MUST use > > buffers in the same order in which they have been available. > >
On Thu, 2025-04-03 at 03:34 -0400, Michael S. Tsirkin wrote: > > Indeed I personally do not exactly get why implement a virtual system > without an IOMMU when virtio-iommu is available. > > I have a feeling it's about lack of windows drivers for virtio-iommu > at this point. And a pKVM (etc.) implementation of virtio-iommu which would allow the *trusted* part of the hypervisor to know which guest memory should be shared with the VMM implementing the virtio device models? You'd also end up in a situation where you have a virtio-iommu for some devices, and a real two-stage IOMMU (e.g. SMMU or AMD's vIOMMU) for other devices. Are guest operating systems going to cope well with that? Do the available discovery mechanisms for all the relevant IOMMUs even *allow* for that to be expressed?
On Thu, Apr 03, 2025 at 08:54:45AM +0100, David Woodhouse wrote: > On Thu, 2025-04-03 at 03:34 -0400, Michael S. Tsirkin wrote: > > > > Indeed I personally do not exactly get why implement a virtual system > > without an IOMMU when virtio-iommu is available. > > > > I have a feeling it's about lack of windows drivers for virtio-iommu > > at this point. > > And a pKVM (etc.) implementation of virtio-iommu which would allow the > *trusted* part of the hypervisor to know which guest memory should be > shared with the VMM implementing the virtio device models? Is there a blocker here? > You'd also end up in a situation where you have a virtio-iommu for some > devices, and a real two-stage IOMMU (e.g. SMMU or AMD's vIOMMU) for > other devices. Are guest operating systems going to cope well with > that? They should. In particular because systems with multiple IOMMUs already exist. > Do the available discovery mechanisms for all the relevant IOMMUs > even *allow* for that to be expressed? I think yes. But, it's been a while since I played with this, let me check what works, what does not, and get back to you on this. -- MST
On Thu, 2025-04-03 at 04:13 -0400, Michael S. Tsirkin wrote: > On Thu, Apr 03, 2025 at 08:54:45AM +0100, David Woodhouse wrote: > > On Thu, 2025-04-03 at 03:34 -0400, Michael S. Tsirkin wrote: > > > > > > Indeed I personally do not exactly get why implement a virtual system > > > without an IOMMU when virtio-iommu is available. > > > > > > I have a feeling it's about lack of windows drivers for virtio-iommu > > > at this point. > > > > And a pKVM (etc.) implementation of virtio-iommu which would allow the > > *trusted* part of the hypervisor to know which guest memory should be > > shared with the VMM implementing the virtio device models? > > Is there a blocker here? Only the amount of complexity in what should be a minimal Trusted Compute Base. (And ideally subject to formal methods of proving its correctness too.) And frankly, if we were going to accept a virtio-iommu in the TCB why not just implement enough virtqueue knowledge to build something where the trusted part just snoops on the *actual* e.g. virtio-net device to know which buffers the VMM was *invited* to access, and facilitate that? We looked at doing that. It's awful. > > You'd also end up in a situation where you have a virtio-iommu for some > > devices, and a real two-stage IOMMU (e.g. SMMU or AMD's vIOMMU) for > > other devices. Are guest operating systems going to cope well with > > that? > > They should. In particular because systems with multiple IOMMUs already > exist. > > > Do the available discovery mechanisms for all the relevant IOMMUs > > even *allow* for that to be expressed? > > I think yes. But, it's been a while since I played with this, let me > check what works, what does not, and get back to you on this. Even if it could work in theory, I'll be astonished if it actually works in practice across a wide set of operating systems, and if it *ever* works for Windows. Compared with the simple option of presenting a device which conceptually doesn't even *do* DMA, which is confined to its own modular device driver...
On Thu, Apr 03, 2025 at 09:22:57AM +0100, David Woodhouse wrote: > On Thu, 2025-04-03 at 04:13 -0400, Michael S. Tsirkin wrote: > > On Thu, Apr 03, 2025 at 08:54:45AM +0100, David Woodhouse wrote: > > > On Thu, 2025-04-03 at 03:34 -0400, Michael S. Tsirkin wrote: > > > > > > > > Indeed I personally do not exactly get why implement a virtual system > > > > without an IOMMU when virtio-iommu is available. > > > > > > > > I have a feeling it's about lack of windows drivers for virtio-iommu > > > > at this point. > > > > > > And a pKVM (etc.) implementation of virtio-iommu which would allow the > > > *trusted* part of the hypervisor to know which guest memory should be > > > shared with the VMM implementing the virtio device models? > > > > Is there a blocker here? > > Only the amount of complexity in what should be a minimal Trusted > Compute Base. (And ideally subject to formal methods of proving its > correctness too.) Shrug. Does not have to be complex. Could be a "simple mode" for virtio-iommu where it just accepts one buffer. No? > And frankly, if we were going to accept a virtio-iommu in the TCB why > not just implement enough virtqueue knowledge to build something where > the trusted part just snoops on the *actual* e.g. virtio-net device to > know which buffers the VMM was *invited* to access, and facilitate > that? Because it's awful? Buffers are a datapath thing. Stay away from there. > We looked at doing that. It's awful. Indeed. > > > You'd also end up in a situation where you have a virtio-iommu for some > > > devices, and a real two-stage IOMMU (e.g. SMMU or AMD's vIOMMU) for > > > other devices. Are guest operating systems going to cope well with > > > that? > > > > They should. In particular because systems with multiple IOMMUs already > > exist. > > > > > Do the available discovery mechanisms for all the relevant IOMMUs > > > even *allow* for that to be expressed? > > > > I think yes. But, it's been a while since I played with this, let me > > check what works, what does not, and get back to you on this. > > Even if it could work in theory, I'll be astonished if it actually > works in practice across a wide set of operating systems, and if it > *ever* works for Windows. Well it used to work. I won't have time to play with it until sometime next week, if it's relevant. If I poke at my windows system, I see > Compared with the simple option of presenting a device which > conceptually doesn't even *do* DMA, which is confined to its own > modular device driver... I'm not (yet) nacking this hack, though I already heartily dislike the fact that it is mostly a PV-only thing since it can not be offloaded to a real device efficiently *and* requires copies to move data between devices. But, let's see if more issues surface. -- MST
On 4/3/2025 4:22 PM, David Woodhouse wrote: > On Thu, 2025-04-03 at 04:13 -0400, Michael S. Tsirkin wrote: >> On Thu, Apr 03, 2025 at 08:54:45AM +0100, David Woodhouse wrote: >>> On Thu, 2025-04-03 at 03:34 -0400, Michael S. Tsirkin wrote: >>>> Indeed I personally do not exactly get why implement a virtual system >>>> without an IOMMU when virtio-iommu is available. >>>> >>>> I have a feeling it's about lack of windows drivers for virtio-iommu >>>> at this point. >>> And a pKVM (etc.) implementation of virtio-iommu which would allow the >>> *trusted* part of the hypervisor to know which guest memory should be >>> shared with the VMM implementing the virtio device models? >> Is there a blocker here? > Only the amount of complexity in what should be a minimal Trusted > Compute Base. (And ideally subject to formal methods of proving its > correctness too.) > > And frankly, if we were going to accept a virtio-iommu in the TCB why > not just implement enough virtqueue knowledge to build something where > the trusted part just snoops on the *actual* e.g. virtio-net device to > know which buffers the VMM was *invited* to access, and facilitate > that? you trust CPU and its IOMMU, and the virtio-iommu is provided by the hypervisor, emulated by the CPU. If you don't trust virtio-iommu, then you should not trust the bounce buffer, because it is unencrypted, more like a security leak. Actually everything is suspicious even the CPU, but you have to trust a TCB and try to minimize the TCB. I remember there is an attestation mechanism to help examine the infrastructure. We need a balance and a tradeoff. Thanks Zhu Lingshan > > We looked at doing that. It's awful. > >>> You'd also end up in a situation where you have a virtio-iommu for some >>> devices, and a real two-stage IOMMU (e.g. SMMU or AMD's vIOMMU) for >>> other devices. Are guest operating systems going to cope well with >>> that? >> They should. In particular because systems with multiple IOMMUs already >> exist. >> >>> Do the available discovery mechanisms for all the relevant IOMMUs >>> even *allow* for that to be expressed? >> I think yes. But, it's been a while since I played with this, let me >> check what works, what does not, and get back to you on this. > Even if it could work in theory, I'll be astonished if it actually > works in practice across a wide set of operating systems, and if it > *ever* works for Windows. > > Compared with the simple option of presenting a device which > conceptually doesn't even *do* DMA, which is confined to its own > modular device driver...
On Thu, 2025-04-03 at 16:34 +0800, Zhu Lingshan wrote: > On 4/3/2025 4:22 PM, David Woodhouse wrote: > > On Thu, 2025-04-03 at 04:13 -0400, Michael S. Tsirkin wrote: > > > On Thu, Apr 03, 2025 at 08:54:45AM +0100, David Woodhouse wrote: > > > > On Thu, 2025-04-03 at 03:34 -0400, Michael S. Tsirkin wrote: > > > > > Indeed I personally do not exactly get why implement a virtual system > > > > > without an IOMMU when virtio-iommu is available. > > > > > > > > > > I have a feeling it's about lack of windows drivers for virtio-iommu > > > > > at this point. > > > > And a pKVM (etc.) implementation of virtio-iommu which would allow the > > > > *trusted* part of the hypervisor to know which guest memory should be > > > > shared with the VMM implementing the virtio device models? > > > Is there a blocker here? > > Only the amount of complexity in what should be a minimal Trusted > > Compute Base. (And ideally subject to formal methods of proving its > > correctness too.) > > > > And frankly, if we were going to accept a virtio-iommu in the TCB why > > not just implement enough virtqueue knowledge to build something where > > the trusted part just snoops on the *actual* e.g. virtio-net device to > > know which buffers the VMM was *invited* to access, and facilitate > > that? > you trust CPU and its IOMMU, and the virtio-iommu is provided by the hypervisor, > emulated by the CPU. If you don't trust virtio-iommu, then you should not trust > the bounce buffer, because it is unencrypted, more like a security leak. > > Actually everything is suspicious even the CPU, but you have to trust a TCB and > try to minimize the TCB. I remember there is an attestation mechanism to help > examine the infrastructure. We need a balance and a tradeoff. In the pKVM model, we have a minimal trusted part of the hypervisor, which some are calling a "lowvisor", which enforces the property that even the rest of Linux/KVM and the VMM are not able to access arbitrary guest memory. For true PCI passthrough devices, hardware has a two-stage IOMMU which allows the guest to control which parts of its memory are accessible by the devices. The problem is those device models which are emulated in the VMM, because the VMM no longer has blanket access to the guest's memory. The simplest answer is just for the device models presented by the VMM to *not* do DMA access to guest system memory. Stick a bounce-buffer on the device itself, and do I/O through that. Yes, as you say, we have to trust the CPU and its IOMMU. And the microcode and low-level firmware, etc. But we do *not* trust most of Linux/KVM and the VMM. We only trust up to the pKVM "lowvisor" level. So if there is going to be a virtio- iommu, that's where it would have to be implemented. Otherwise, the VMM is just saying to the lowvisor, "please grant me access to <this> guest page because...erm... I said so". Which is not quite how the trust model is supposed to work. As noted in the original cover letter to this series, there are other options too. We could implement enlightenments in the guest to share/unshare pages, and corresponding dma_ops which simply invoke those hypercalls. That's a bit less complexity in the TCB, but does end up with a set of guest enlightenments which need to be supported in the core of every operating system. Compared with the option of a simple device driver for a device which conceptually doesn't do DMA at all.
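For comparison, the "enlightenment" alternative mentioned above would look roughly like the following: dma_map/dma_unmap simply share and unshare the pages with the lowvisor. The hypercall numbers and the hypercall() prototype are invented for illustration; this is not a real pKVM ABI.

    #include <stddef.h>
    #include <stdint.h>

    #define HC_SHARE_PAGE   0x101   /* invented hypercall numbers */
    #define HC_UNSHARE_PAGE 0x102

    extern long hypercall(unsigned long nr, uint64_t arg);  /* assumed */

    /* Grant the VMM access to each page backing the buffer, then map 1:1. */
    static uint64_t enlightened_dma_map(uint64_t gpa, size_t len)
    {
            uint64_t first = gpa & ~(uint64_t)4095;
            uint64_t last  = (gpa + len - 1) & ~(uint64_t)4095;

            for (uint64_t page = first; page <= last; page += 4096)
                    hypercall(HC_SHARE_PAGE, page);

            return gpa;     /* no translation: the DMA address is the GPA */
    }

    /* Revoke access again once the I/O has completed. */
    static void enlightened_dma_unmap(uint64_t gpa, size_t len)
    {
            uint64_t first = gpa & ~(uint64_t)4095;
            uint64_t last  = (gpa + len - 1) & ~(uint64_t)4095;

            for (uint64_t page = first; page <= last; page += 4096)
                    hypercall(HC_UNSHARE_PAGE, page);
    }

The objection raised above is not that this is hard to write, but that every guest operating system would need these hooks in its core DMA layer, whereas the bounce-buffer approach stays inside one device driver.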
On 4/3/2025 4:57 PM, David Woodhouse wrote: > On Thu, 2025-04-03 at 16:34 +0800, Zhu Lingshan wrote: >> On 4/3/2025 4:22 PM, David Woodhouse wrote: >>> On Thu, 2025-04-03 at 04:13 -0400, Michael S. Tsirkin wrote: >>>> On Thu, Apr 03, 2025 at 08:54:45AM +0100, David Woodhouse wrote: >>>>> On Thu, 2025-04-03 at 03:34 -0400, Michael S. Tsirkin wrote: >>>>>> Indeed I personally do not exactly get why implement a virtual system >>>>>> without an IOMMU when virtio-iommu is available. >>>>>> >>>>>> I have a feeling it's about lack of windows drivers for virtio-iommu >>>>>> at this point. >>>>> And a pKVM (etc.) implementation of virtio-iommu which would allow the >>>>> *trusted* part of the hypervisor to know which guest memory should be >>>>> shared with the VMM implementing the virtio device models? >>>> Is there a blocker here? >>> Only the amount of complexity in what should be a minimal Trusted >>> Compute Base. (And ideally subject to formal methods of proving its >>> correctness too.) >>> >>> And frankly, if we were going to accept a virtio-iommu in the TCB why >>> not just implement enough virtqueue knowledge to build something where >>> the trusted part just snoops on the *actual* e.g. virtio-net device to >>> know which buffers the VMM was *invited* to access, and facilitate >>> that? >> you trust CPU and its IOMMU, and the virtio-iommu is provided by the hypervisor, >> emulated by the CPU. If you don't trust virtio-iommu, then you should not trust >> the bounce buffer, because it is unencrypted, more like a security leak. >> >> Actually everything is suspicious even the CPU, but you have to trust a TCB and >> try to minimize the TCB. I remember there is an attestation mechanism to help >> examine the infrastructure. We need a balance and a tradeoff. > In the pKVM model, we have a minimal trusted part of the hypervisor, > which some are calling a "lowvisor", which enforces the property that > even the rest of Linux/KVM and the VMM are not able to access arbitrary > guest memory. > > For true PCI passthrough devices, hardware has a two-stage IOMMU which > allows the guest to control which parts of its memory are accessible by > the devices. > > The problem is those device models which are emulated in the VMM, > because the VMM no longer has blanket access to the guest's memory. > > The simplest answer is just for the device models presented by the VMM > to *not* do DMA access to guest system memory. Stick a bounce-buffer on > the device itself, and do I/O through that. > > Yes, as you say, we have to trust the CPU and its IOMMU. And the > microcode and low-level firmware, etc. > > But we do *not* trust most of Linux/KVM and the VMM. We only trust up > to the pKVM "lowvisor" level. So if there is going to be a virtio- > iommu, that's where it would have to be implemented. Otherwise, the VMM > is just saying to the lowvisor, "please grant me access to <this> guest > page because...erm... I said so". Which is not quite how the trust > model is supposed to work. > > As noted in the original cover letter to this series, there are other > options too. We could implement enlightenments in the guest to > share/unshare pages, and corresponding dma_ops which simply invoke > those hypercalls. That's a bit less complexity in the TCB, but does end > up with a set of guest enlightenments which need to be supported in the > core of every operating system. > > Compared with the option of a simple device driver for a device which > conceptually doesn't do DMA at all. 
Hello David, I see you want to limit the device's DMA range for security reasons; one possible solution is not to do DMA at all and instead provide a bounce buffer on the device side. However, a bounce buffer that resides in device memory is unencrypted, which means it is itself insecure. I have a rough proposal. First, some assumptions: 1) we trust the guest because it belongs to the tenant, and we cannot change any guest drivers; only the tenant can upgrade the guest kernel. 2) we only trust a minimal TCB, and don't trust most of the hypervisor code. 3) an emulated device is part of the hypervisor. 4) even when the host / hypervisor is compromised, we still want to protect the guest data. 5) virtio devices and virtio drivers exchange data through the virtqueues, which means virtio devices initiate DMA against queue buffers. So here is a draft design: 1) queue buffer encryption, per queue or per device. 2) this feature is controlled by a feature bit. 3) the driver initializes (by writing to device-side registers / fields) an encryption algorithm and keys before DRIVER_OK; these can be null, meaning unencrypted. The device-side registers are WRITE ONLY, so they can be overwritten in the worst case but cannot be read back. 4) the write side (the producer, e.g. the driver filling data into available queue buffers) encrypts, and the read side (the consumer) decrypts. 5) provide multiple encryption algorithms for different types of workloads and resource expenditure. The pros: 1) a natural design, proven to work for many years. 2) even when the host / hypervisor is compromised, the DMA buffers are still encrypted and secure. 3) no need to change any virtio-iommu implementation. 4) compatible with, and actually orthogonal to, CoCo; no conflicts. 5) this solution works for both emulated and real pass-through devices. 6) unlike TDX-IO, this is implemented in the virtio device, so this design does not conflict with any existing use cases and supports live migration. The cons: 1) the driver needs to change, but new features always need new implementations in the driver code. 2) it consumes more CPU cycles, but bounce buffers also cost more CPU, and CoCo attestation costs more CPU cycles. So, does this low-hanging fruit sound good to you? This is just my brainstorming and a rough proposal. Please let me know your thoughts. Thanks Zhu Lingshan > >
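To make the queue-buffer-encryption idea above concrete, a very rough sketch follows; the cipher interface, key setup and queue helper are invented placeholders rather than any existing virtio facility, and this is only an illustration of the proposal, not an endorsement of it.

    #include <stddef.h>
    #include <stdint.h>

    struct vq_cipher {
            uint8_t key[32];    /* negotiated before DRIVER_OK per the proposal */
    };

    /* Hypothetical cipher and queue primitives, for illustration only. */
    extern void cipher_encrypt(const struct vq_cipher *c, void *buf, size_t len);
    extern void cipher_decrypt(const struct vq_cipher *c, void *buf, size_t len);
    extern int  queue_add_outbuf(void *vq, void *buf, size_t len);

    static int producer_send(void *vq, const struct vq_cipher *c,
                             void *buf, size_t len)
    {
            /* Write side (here: the driver) encrypts in place... */
            cipher_encrypt(c, buf, len);

            /* ...then exposes the buffer to the device as usual. */
            return queue_add_outbuf(vq, buf, len);
    }

    static void consumer_receive(const struct vq_cipher *c, void *buf, size_t len)
    {
            /* Read side (here: the device) decrypts before using the data. */
            cipher_decrypt(c, buf, len);
    }

Note this protects only the payload in the queue buffers; the descriptors and ring metadata remain visible to the device model, which is part of what the rest of the thread is debating.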
On Thu, 2025-04-03 at 15:13 +0800, Zhu Lingshan wrote: > > Hello David > > IMHO, if we need a bounce buffer, why not place it on the host memory? As explained in the cover letter, this is designed for the case where the device *cannot* easily access host memory.
On 4/3/2025 3:24 PM, David Woodhouse wrote: > On Thu, 2025-04-03 at 15:13 +0800, Zhu Lingshan wrote: >> Hello David >> >> IMHO, if we need a bounce buffer, why not place it on the host memory? > As explained in the cover letter, this is designed for the case where > the device *cannot* easily access host memory. Do you mean CoCo encrypted memory? If so, what is the difference between the host memory side bounce buffer and a device memory bounce buffer to a guest except the transport(DDR or PCI)? The device should be able to access host memory side bounce buffer. Thanks Zhu Lingshan >
On Thu, 2025-04-03 at 15:31 +0800, Zhu Lingshan wrote: > On 4/3/2025 3:24 PM, David Woodhouse wrote: > > On Thu, 2025-04-03 at 15:13 +0800, Zhu Lingshan wrote: > > > Hello David > > > > > > IMHO, if we need a bounce buffer, why not place it on the host memory? > > As explained in the cover letter, this is designed for the case where > > the device *cannot* easily access host memory. > > Do you mean CoCo encrypted memory? If so, what is the difference between > the host memory side bounce buffer and a device memory bounce buffer > to a guest except the transport(DDR or PCI)? The device should be > able to access host memory side bounce buffer. In the CoCo case, yes. As long as it's *known* to be memory which is shared with the hypervisor, and unsuspecting guests won't accidentally use it as general purpose RAM, then pKVM (or whatever) can happily allow Linux and the VMM to access it. That's basically how it works for virtio-mmio, as I just mentioned in a different subthread. The region referenced by `restricted-dma-pool` is in the guest memory address space, and the "platform" is configured to allow DMA *only* there. Doing the same on the PCI transport (referencing an address range in system address space rather than on the device itself) is slightly unnatural, but as I said if we consider it the responsibility of the firmware to *program* that then we could maybe squint and accept it even in the case of physical devices. I'm not sure I like it much though.
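However the range is conveyed (a BAR, or a firmware-programmed window in system address space as described above), the device-side enforcement reduces to a containment check along these lines; the structure and function names are invented for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    struct restricted_region {
            uint64_t base;      /* programmed by firmware / VMM configuration */
            uint64_t len;
    };

    /* Overflow-safe check that [addr, addr + len) lies within the region. */
    static bool dma_address_allowed(const struct restricted_region *r,
                                    uint64_t addr, uint64_t len)
    {
            return addr >= r->base &&
                   len <= r->len &&
                   addr - r->base <= r->len - len;
    }

A device model would apply this to every descriptor address and fail the request (or the device) on a violation, which is exactly the "platform allows DMA *only* there" behaviour being described.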
On Wed, Apr 02, 2025 at 12:04:45PM +0100, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > Device-tree bindings for `restricted-dma-pool` were defined in 2021, which > allow devices to be restricted to a given SWIOTLB pool instead of allowing > DMA to arbitrary system memory: > https://lore.kernel.org/all/20210624155526.2775863-1-tientzu@chromium.org/ > > This facility was not specific to virtio-mmio, but does apply to it. No > attempt was made to ensure backward-compatibility for virtio-mmio devices. > > Define a VIRTIO_F_SWIOTLB feature which allows the device and driver to > agree on the use of the SWIOTLB, if present. This enables the device to > refuse to operate further if the driver does not support the SWIOTLB > requirement expressed in the device-tree. > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > --- > content.tex | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/content.tex b/content.tex > index c17ffa6..63d075f 100644 > --- a/content.tex > +++ b/content.tex > @@ -773,6 +773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > Currently these device-independent feature bits are defined: > > \begin{description} > + \item[VIRTIO_F_SWIOTLB (27)] This feature indicates that the device > + transport provides a memory region which is to be used for bounce > + buffering, rather than permitting direct memory access to system memory. > \item[VIRTIO_F_INDIRECT_DESC (28)] Negotiating this feature indicates > that the driver can use descriptors with the VIRTQ_DESC_F_INDIRECT > flag set, as described in \ref{sec:Basic Facilities of a Virtio > @@ -807,6 +810,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > the driver. When clear, this overrides any platform-specific description of > whether device access is limited or translated in any way, e.g. > whether an IOMMU may be present. > + If a the device transport provides a software IOTLB bounce buffer, > + addresses within its range are not subject to the requirements of > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''. I don't get this part. the system designers currently have a choice whether to have these controlled by VIRTIO_F_ACCESS_PLATFORM or not. with PCI, for example, BAR on the same device is naturally not behind an iommu. > \item[VIRTIO_F_RING_PACKED(34)] This feature indicates > support for the packed virtqueue layout as described in > \ref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}~\nameref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}. > @@ -885,6 +891,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > VIRTIO_F_ACCESS_PLATFORM is not offered, then a driver MUST pass only physical > addresses to the device. > > +A driver SHOULD accept VIRTIO_F_SWIOTLB if it is offered, and it MUST > +then pass only addresses within the Software IOTLB bounce buffer to the > +device. > + > A driver SHOULD accept VIRTIO_F_RING_PACKED if it is offered. > > A driver SHOULD accept VIRTIO_F_ORDER_PLATFORM if it is offered. > @@ -921,6 +931,8 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not > accepted. > > +A device MAY fail to operate further if VIRTIO_F_SWIOTLB is not accepted. > + > If VIRTIO_F_IN_ORDER has been negotiated, a device MUST use > buffers in the same order in which they have been available. > > -- > 2.49.0 >
On Wed, 2025-04-02 at 10:54 -0400, Michael S. Tsirkin wrote: > > + If a the device transport provides a software IOTLB bounce buffer, > > + addresses within its range are not subject to the requirements of > > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''. > > I don't get this part. the system designers currently have a choice > whether to have these controlled by VIRTIO_F_ACCESS_PLATFORM or not. > with PCI, for example, BAR on the same device is naturally not > behind an iommu. In the PCI case this *is* a BAR on the same device, and is naturally not behind an IOMMU as you say. This is just stating the obvious, for clarity. For virtio-mmio it also isn't translated by an IOMMU; that was the *point* of the `restricted-dma-pool` support.
On Wed, Apr 02, 2025 at 04:12:39PM +0100, David Woodhouse wrote: > On Wed, 2025-04-02 at 10:54 -0400, Michael S. Tsirkin wrote: > > > + If a the device transport provides a software IOTLB bounce buffer, > > > + addresses within its range are not subject to the requirements of > > > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''. > > > > I don't get this part. the system designers currently have a choice > > whether to have these controlled by VIRTIO_F_ACCESS_PLATFORM or not. > > with PCI, for example, BAR on the same device is naturally not > > behind an iommu. > > In the PCI case this *is* a BAR on the same device, and is naturally > not behind an IOMMU as you say. This is just stating the obvious, for > clarity. Then the platform already does this right, and it's better not to try and override it in the spec. > For virtio-mmio it also isn't translated by an IOMMU; that was the > *point* of the `restricted-dma-pool` support. > Clear VIRTIO_F_ACCESS_PLATFORM then? Generally, it is preferable to keep all features orthogonal if at all possible.
On Wed, 2025-04-02 at 11:20 -0400, Michael S. Tsirkin wrote: > On Wed, Apr 02, 2025 at 04:12:39PM +0100, David Woodhouse wrote: > > On Wed, 2025-04-02 at 10:54 -0400, Michael S. Tsirkin wrote: > > > > + If a the device transport provides a software IOTLB bounce buffer, > > > > + addresses within its range are not subject to the requirements of > > > > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''. > > > > > > I don't get this part. the system designers currently have a choice > > > whether to have these controlled by VIRTIO_F_ACCESS_PLATFORM or not. > > > with PCI, for example, BAR on the same device is naturally not > > > behind an iommu. > > > > In the PCI case this *is* a BAR on the same device, and is naturally > > not behind an IOMMU as you say. This is just stating the obvious, for > > clarity. > > Then the platform already does this right, and it's better not to > try and override it in the spec. It isn't intended as an "override". This *is* what will happen if the platform does it right. By mandating it in the spec, the intent is to reduce the chances of platforms doing it *wrong*? (Or of drivers making wrong assumptions). > > For virtio-mmio it also isn't translated by an IOMMU; that was the > > *point* of the `restricted-dma-pool` support. > > > > Clear VIRTIO_F_ACCESS_PLATFORM then? > I don't want to say that VIRTIO_F_ACCESS_PLATFORM and VIRTIO_F_SWIOTLB > are mutually exclusive... > Generally, it is preferable to keep all features orthogonal if > at all possible. ...precisely because they *should* be orthogonal. VIRTIO_F_ACCESS_PLATFORM defines how system memory is accessed; basically whether DMA goes through an IOMMU or not. And as you point out, the "on-device" buffer used with VIRTIO_F_SWIOTLB should never pass through the IOMMU anyway, so it *is* naturally orthogonal. And I think it does make sense for both to be set in some cases, for both physical and virtual devices. For physical devices that would mean "Got an IOMMU? Sure, go ahead and use it. If not, if you don't trust me, you can just disable my bus mastering and just use the SWIOTLB". It's basically the same for a virtual device. In a confidential compute model, the device model (in the VMM) might not be *able* to access the guest memory unless the core guest OS explicitly permits that, through some kind of pKVM enlightenment to allow pages to be shared, or a vIOMMU, or marking hardware pages unencrypted. So having both bits set would mean "Know how to drive that enlightenment? Sure, go ahead and use it. Otherwise, use the SWIOTLB". In fact that's exactly what that Linux code for `restricted-dma-pool` already does — when setting up the dma_ops for the device, if it finds an actual set of IOMMU operations, it'll use those. And if not, that's when it falls back to using the provided SWIOTLB.
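A minimal driver-side sketch of the "both features offered" case described above: VIRTIO_F_ACCESS_PLATFORM really is bit 33, while the SWIOTLB/restricted-DMA bit (27 in this draft) and the platform-probe helper are placeholders.

    #include <stdbool.h>
    #include <stdint.h>

    #define VIRTIO_F_SWIOTLB         27    /* provisional, from this draft */
    #define VIRTIO_F_ACCESS_PLATFORM 33

    /* Hypothetical: does the platform give us IOMMU or CoCo-sharing dma_ops? */
    extern bool platform_gives_usable_dma_ops(void);

    enum dma_mode { DMA_DIRECT, DMA_VIA_PLATFORM, DMA_VIA_SWIOTLB };

    static enum dma_mode choose_dma_mode(uint64_t features)
    {
            /* If the platform provides real IOMMU/sharing ops, just use them... */
            if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)) &&
                platform_gives_usable_dma_ops())
                    return DMA_VIA_PLATFORM;

            /* ...otherwise fall back to bouncing through the offered SWIOTLB... */
            if (features & (1ULL << VIRTIO_F_SWIOTLB))
                    return DMA_VIA_SWIOTLB;

            /* ...and with neither, it is plain physical-address DMA. */
            return DMA_DIRECT;
    }

This mirrors the Linux `restricted-dma-pool` behaviour described above: a usable set of IOMMU operations wins, and the bounce buffer is only the fallback.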
On Wed, Apr 02, 2025 at 04:47:18PM +0100, David Woodhouse wrote: > On Wed, 2025-04-02 at 11:20 -0400, Michael S. Tsirkin wrote: > > On Wed, Apr 02, 2025 at 04:12:39PM +0100, David Woodhouse wrote: > > > On Wed, 2025-04-02 at 10:54 -0400, Michael S. Tsirkin wrote: > > > > > + If a the device transport provides a software IOTLB bounce buffer, > > > > > + addresses within its range are not subject to the requirements of > > > > > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''. > > > > > > > > I don't get this part. the system designers currently have a choice > > > > whether to have these controlled by VIRTIO_F_ACCESS_PLATFORM or not. > > > > with PCI, for example, BAR on the same device is naturally not > > > > behind an iommu. > > > > > > In the PCI case this *is* a BAR on the same device, and is naturally > > > not behind an IOMMU as you say. This is just stating the obvious, for > > > clarity. > > > > Then the platform already does this right, and it's better not to > > try and override it in the spec. > > It isn't intended as an "override". This *is* what will happen if the > platform does it right. By mandating it in the spec, the intent is to > reduce the chances of platforms doing it *wrong*? (Or of drivers making > wrong assumptions). The text you wrote makes it seem that even if the platform says use an IOMMU, it should be bypassed. > > > For virtio-mmio it also isn't translated by an IOMMU; that was the > > > *point* of the `restricted-dma-pool` support. > > > > > > > Clear VIRTIO_F_ACCESS_PLATFORM then? > > > I don't want to say that VIRTIO_F_ACCESS_PLATFORM and VIRTIO_F_SWIOTLB > > > are mutually exclusive... > > > > Generally, it is preferable to keep all features orthogonal if > > at all possible. > > ...precisely because they *should* be orthogonal. > > VIRTIO_F_ACCESS_PLATFORM defines how system memory is accessed; > basically whether DMA goes through an IOMMU or not. And as you point > out, the "on-device" buffer used with VIRTIO_F_SWIOTLB should never > pass through the IOMMU anyway, so it *is* naturally orthogonal. > > > And I think it does make sense for both to be set in some cases, for > both physical and virtual devices. > > > For physical devices that would mean "Got an IOMMU? Sure, go ahead and > use it. If not, if you don't trust me, you can just disable my bus > mastering and just use the SWIOTLB". > > It's basically the same for a virtual device. In a confidential compute > model, the device model (in the VMM) might not be *able* to access the > guest memory unless the core guest OS explicitly permits that, through > some kind of pKVM enlightenment to allow pages to be shared, or a > vIOMMU, or marking hardware pages unencrypted. So having both bits set > would mean "Know how to drive that enlightenment? Sure, go ahead and > use it. Otherwise, use the SWIOTLB". > > In fact that's exactly what that Linux code for `restricted-dma-pool` > already does — when setting up the dma_ops for the device, if it finds > an actual set of IOMMU operations, it'll use those. And if not, that's > when it falls back to using the provided SWIOTLB. > I just feel your text for VIRTIO_F_ACCESS_PLATFORM make it seem like that is not the case. VIRTIO_F_ACCESS_PLATFORM says a lot of things, e.g. it has implications for encrypted VMs, and so on. I would drop this text, and maybe add some clarification in the mmio transport, as needed. -- MST
On Wed, 2025-04-02 at 11:51 -0400, Michael S. Tsirkin wrote:
> On Wed, Apr 02, 2025 at 04:47:18PM +0100, David Woodhouse wrote:
> > On Wed, 2025-04-02 at 11:20 -0400, Michael S. Tsirkin wrote:
> > > On Wed, Apr 02, 2025 at 04:12:39PM +0100, David Woodhouse wrote:
> > > > On Wed, 2025-04-02 at 10:54 -0400, Michael S. Tsirkin wrote:
> > > > > > + If a the device transport provides a software IOTLB bounce buffer,
> > > > > > + addresses within its range are not subject to the requirements of
> > > > > > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''

...

> The text you wrote makes it seem that even if the platform says use
> an IOMMU, it should be bypassed.

It was trying just to state the obvious, that addresses within the
range of the *on-device* memory buffer are not handled by the IOMMU.

> I would drop this text, and maybe add some clarification in the mmio transport,
> as needed.

It would be PCI too. I guess we could move the "obvious" comment that
'addresses within the range of the SWIOTLB bounce buffer region are
considered to be "on-device" and are thus not affected by the
requirements of VIRTIO_F_ACCESS_PLATFORM' into *both* the MMIO and PCI
transport docs? But then it's basically just saying the same thing in
two different locations?

I don't think we're debating what the actual implementations should
*do* ... are we? To me it's obvious that what I'm trying to say here
*should* always be true.

We're just debating the wording and where to put it, yes?
On Wed, Apr 02, 2025 at 05:16:28PM +0100, David Woodhouse wrote:
> On Wed, 2025-04-02 at 11:51 -0400, Michael S. Tsirkin wrote:
> > On Wed, Apr 02, 2025 at 04:47:18PM +0100, David Woodhouse wrote:
> > > On Wed, 2025-04-02 at 11:20 -0400, Michael S. Tsirkin wrote:
> > > > On Wed, Apr 02, 2025 at 04:12:39PM +0100, David Woodhouse wrote:
> > > > > On Wed, 2025-04-02 at 10:54 -0400, Michael S. Tsirkin wrote:
> > > > > > > + If a the device transport provides a software IOTLB bounce buffer,
> > > > > > > + addresses within its range are not subject to the requirements of
> > > > > > > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''
> >
> > ...
> >
> > > The text you wrote makes it seem that even if the platform says use
> > > an IOMMU, it should be bypassed.
> >
> > It was trying just to state the obvious, that addresses within the
> > range of the *on-device* memory buffer are not handled by the IOMMU.
> >
> > > I would drop this text, and maybe add some clarification in the mmio transport,
> > > as needed.
> >
> > It would be PCI too. I guess we could move the "obvious" comment that
> > 'addresses within the range of the SWIOTLB bounce buffer region are
> > considered to be "on-device" and are thus not affected by the
> > requirements of VIRTIO_F_ACCESS_PLATFORM' into *both* the MMIO and PCI
> > transport docs? But then it's basically just saying the same thing in
> > two different locations?
> >
> > I don't think we're debating what the actual implementations should
> > *do* ... are we? To me it's obvious that what I'm trying to say here
> > *should* always be true.
> >
> > We're just debating the wording and where to put it, yes?

yes.

I know a bit more about PCI, and for PCI I prefer just not saying
anything. The platform already defines whether it is behind an iommu
or not, and duplication is not good.

For mmio it is my understanding that the "restricted" does the same
already? or is it required in the spec for some reason?
On Wed, 2025-04-02 at 12:43 -0400, Michael S. Tsirkin wrote:
>
> yes.
>
> I know a bit more about PCI, and for PCI I prefer just not saying
> anything. The platform already defines whether it is behind an iommu
> or not, and duplication is not good.

Not a hill for me to die on I suppose, but I would personally prefer to
spell it out in words of one syllable or fewer, to make *sure* that
device and driver authors get it right even though it's "obvious".

After all, if we could trust them to do their thinking, we would never
have had the awful situation that led to VIRTIO_F_ACCESS_PLATFORM
existing in the first place; the legacy behaviour we get when that bit
*isn't* set would never have happened.

> For mmio it is my understanding that the "restricted" does the same
> already? or is it required in the spec for some reason?

No, it's exactly the same. But I still don't trust driver authors to
realise the obvious, or VMM implementations either for that matter.

I'm not sure I see the *harm* in spelling out explicitly for the
hard-of-thinking.
On Wed, Apr 02, 2025 at 06:10:53PM +0100, David Woodhouse wrote:
> On Wed, 2025-04-02 at 12:43 -0400, Michael S. Tsirkin wrote:
> >
> > yes.
> >
> > I know a bit more about PCI, and for PCI I prefer just not saying
> > anything. The platform already defines whether it is behind an iommu
> > or not, and duplication is not good.
>
> Not a hill for me to die on I suppose, but I would personally prefer to
> spell it out in words of one syllable or fewer, to make *sure* that
> device and driver authors get it right even though it's "obvious".
>
> After all, if we could trust them to do their thinking, we would never
> have had the awful situation that led to VIRTIO_F_ACCESS_PLATFORM
> existing in the first place; the legacy behaviour we get when that bit
> *isn't* set would never have happened.

Oh, you are wrong here. It's not the implementer's fault.
virtio just was not designed with real DMA
in mind, and micro-optimizing by bypassing the DMA API
was very much intentional.

> > For mmio it is my understanding that the "restricted" does the same
> > already? or is it required in the spec for some reason?
>
> No, it's exactly the same. But I still don't trust driver authors to
> realise the obvious, or VMM implementations either for that matter.
>
> I'm not sure I see the *harm* in spelling out explicitly for the hard-
> of-thinking.

I don't want people to make assumptions, like crashing if device is
behind an iommu or whatnot.

We can go with something informative.

"It is expected that for most implementations, when using this feature,
the behaviour with and without ACCESS_PLATFORM is the same"

-- 
MST
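To make that informative note concrete, here is one possible reading of the
negotiation when both bits are offered, sketched in C. The bit positions and
the helper are illustrative only, not spec text: a driver that can drive the
platform's IOMMU or enlightenment takes VIRTIO_F_ACCESS_PLATFORM, one that
cannot falls back to VIRTIO_F_SWIOTLB, and the observable behaviour ends up
much the same either way.

/* Illustrative sketch of the driver-side negotiation discussed above.
 * Bit values and logic are assumptions for the example, not normative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define F_SWIOTLB          (1ULL << 27)
#define F_ACCESS_PLATFORM  (1ULL << 33)

static uint64_t negotiate(uint64_t offered, bool can_drive_platform_dma)
{
    uint64_t accepted = 0;

    if ((offered & F_ACCESS_PLATFORM) && can_drive_platform_dma)
        accepted |= F_ACCESS_PLATFORM;  /* use the IOMMU / enlightenment */
    else if (offered & F_SWIOTLB)
        accepted |= F_SWIOTLB;          /* otherwise bounce via the on-device buffer */

    /* A device offering the bounce buffer may refuse to operate further
     * if the driver ends up accepting neither path. */
    return accepted;
}

int main(void)
{
    printf("accepted = 0x%llx\n",
           (unsigned long long)negotiate(F_ACCESS_PLATFORM | F_SWIOTLB, false));
    return 0;
}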
On Thu, 2025-04-03 at 03:31 -0400, Michael S. Tsirkin wrote:
> On Wed, Apr 02, 2025 at 06:10:53PM +0100, David Woodhouse wrote:
> > On Wed, 2025-04-02 at 12:43 -0400, Michael S. Tsirkin wrote:
> > >
> > > yes.
> > >
> > > I know a bit more about PCI, and for PCI I prefer just not saying
> > > anything. The platform already defines whether it is behind an iommu
> > > or not, and duplication is not good.
> >
> > Not a hill for me to die on I suppose, but I would personally prefer to
> > spell it out in words of one syllable or fewer, to make *sure* that
> > device and driver authors get it right even though it's "obvious".
> >
> > After all, if we could trust them to do their thinking, we would never
> > have had the awful situation that led to VIRTIO_F_ACCESS_PLATFORM
> > existing in the first place; the legacy behaviour we get when that bit
> > *isn't* set would never have happened.
>
> Oh, you are wrong here. It's not the implementer's fault.
> virtio just was not designed with real DMA
> in mind, and micro-optimizing by bypassing the DMA API
> was very much intentional.

That's one point of view, I suppose. In the early days of IOMMUs, and
DMA ops coming to mainstream platforms, we found a *lot* of device
drivers that had the same "micro-optimisation" of just handing physical
addresses directly to devices. We called them 'bugs' though.

And the thing that was different for virtio-pci was that the host side
had the *same* bug, as I recall, so we had to introduce a feature bit
to declare/negotiate the "natural" behaviour.

But we're a long way from the original topic here.

> > > For mmio it is my understanding that the "restricted" does the same
> > > already? or is it required in the spec for some reason?
> >
> > No, it's exactly the same. But I still don't trust driver authors to
> > realise the obvious, or VMM implementations either for that matter.
> >
> > I'm not sure I see the *harm* in spelling out explicitly for the hard-
> > of-thinking.
>
> I don't want people to make assumptions, like crashing if device is
> behind an iommu or whatnot.

Why would that happen? If we explicitly document that "on-device memory
accesses don't go through an external IOMMU that sits all the way over
the other side of the PCI bus between the device and system memory",
which is what I was trying to say... it doesn't *matter* if the device
is behind an IOMMU or not. It doesn't ever *do* any DMA on the PCI bus.

> We can go with something informative.
>
> "It is expected that for most implementations, when using this feature,
> the behaviour with and without ACCESS_PLATFORM is the same"

I'd prefer to say nothing. Saying nothing relies on people to do their
thinking and realise that on-device access doesn't cross the PCI bus.
This version actually seems to hint that it's a *choice*, and hints
that it might be OK if the external IOMMU *does* intervene in on-device
memory accesses.
On Thu, Apr 03, 2025 at 08:45:22AM +0100, David Woodhouse wrote:
> On Thu, 2025-04-03 at 03:31 -0400, Michael S. Tsirkin wrote:
> > On Wed, Apr 02, 2025 at 06:10:53PM +0100, David Woodhouse wrote:
> > > On Wed, 2025-04-02 at 12:43 -0400, Michael S. Tsirkin wrote:
> > > >
> > > > yes.
> > > >
> > > > I know a bit more about PCI, and for PCI I prefer just not saying
> > > > anything. The platform already defines whether it is behind an iommu
> > > > or not, and duplication is not good.
> > >
> > > Not a hill for me to die on I suppose, but I would personally prefer to
> > > spell it out in words of one syllable or fewer, to make *sure* that
> > > device and driver authors get it right even though it's "obvious".
> > >
> > > After all, if we could trust them to do their thinking, we would never
> > > have had the awful situation that led to VIRTIO_F_ACCESS_PLATFORM
> > > existing in the first place; the legacy behaviour we get when that bit
> > > *isn't* set would never have happened.
> >
> > Oh, you are wrong here. It's not the implementer's fault.
> > virtio just was not designed with real DMA
> > in mind, and micro-optimizing by bypassing the DMA API
> > was very much intentional.
>
> That's one point of view, I suppose. In the early days of IOMMUs, and
> DMA ops coming to mainstream platforms, we found a *lot* of device
> drivers that had the same "micro-optimisation" of just handing physical
> addresses directly to devices. We called them 'bugs' though.

Indeed. virtio was developed way after those days though. We just
thought we were being clever.

> And the thing that was different for virtio-pci was that the host side
> had the *same* bug, as I recall, so we had to introduce a feature bit
> to declare/negotiate the "natural" behaviour.
>
> But we're a long way from the original topic here.

In a sense. But see the cache mode discussion: if this proposed
interface cannot be implemented efficiently on actual hardware, it does
begin to look a little bit like repeating the same mistake.

> > > > For mmio it is my understanding that the "restricted" does the same
> > > > already? or is it required in the spec for some reason?
> > >
> > > No, it's exactly the same. But I still don't trust driver authors to
> > > realise the obvious, or VMM implementations either for that matter.
> > >
> > > I'm not sure I see the *harm* in spelling out explicitly for the hard-
> > > of-thinking.
> >
> > I don't want people to make assumptions, like crashing if device is
> > behind an iommu or whatnot.
>
> Why would that happen? If we explicitly document that "on-device memory
> accesses don't go through an external IOMMU that sits all the way over
> the other side of the PCI bus between the device and system memory",
> which is what I was trying to say... it doesn't *matter* if the device
> is behind an IOMMU or not. It doesn't ever *do* any DMA on the PCI bus.

Saying this explicitly in the pci transport like you write here is fine.

> > We can go with something informative.
> >
> > "It is expected that for most implementations, when using this feature,
> > the behaviour with and without ACCESS_PLATFORM is the same"
>
> I'd prefer to say nothing. Saying nothing relies on people to do their
> thinking and realise that on-device access doesn't cross the PCI bus.
> This version actually seems to hint that it's a *choice*, and hints
> that it might be OK if the external IOMMU *does* intervene in on-device
> memory accesses.

Nothing is fine, too.
On Wed, Apr 02, 2025 at 06:10:53PM +0100, David Woodhouse wrote:
> > I know a bit more about PCI, and for PCI I prefer just not saying
> > anything. The platform already defines whether it is behind an iommu
> > or not, and duplication is not good.
>
> Not a hill for me to die on I suppose, but I would personally prefer to
> spell it out in words of one syllable or fewer, to make *sure* that
> device and driver authors get it right even though it's "obvious".
>
> After all, if we could trust them to do their thinking, we would never
> have had the awful situation that led to VIRTIO_F_ACCESS_PLATFORM
> existing in the first place; the legacy behaviour we get when that bit
> *isn't* set would never have happened.

You'll need to define the semantics for VIRTIO_F_ACCESS_PLATFORM only
then. And the only sane answer there is: don't allow non-translated
regions at all, and in a broader sense stop people from using
VIRTIO_F_ACCESS_PLATFORM at all, or at least for anything that requires
a new feature bit.

> > For mmio it is my understanding that the "restricted" does the same
> > already? or is it required in the spec for some reason?
>
> No, it's exactly the same. But I still don't trust driver authors to
> realise the obvious, or VMM implementations either for that matter.
>
> I'm not sure I see the *harm* in spelling out explicitly for the hard-
> of-thinking.

Write a whitepaper then and explain how it maps to the existing
perfectly working features. Note that VIRTIO_F_ACCESS_PLATFORM, just
like everything in virtio, would actually benefit from being turned
into proper spec language, but anecdotes about random use cases are not
helpful.
On Thu, 2025-04-03 at 00:29 -0700, Christoph Hellwig wrote:
> On Wed, Apr 02, 2025 at 06:10:53PM +0100, David Woodhouse wrote:
> > > I know a bit more about PCI, and for PCI I prefer just not saying
> > > anything. The platform already defines whether it is behind an iommu
> > > or not, and duplication is not good.
> >
> > Not a hill for me to die on I suppose, but I would personally prefer to
> > spell it out in words of one syllable or fewer, to make *sure* that
> > device and driver authors get it right even though it's "obvious".
> >
> > After all, if we could trust them to do their thinking, we would never
> > have had the awful situation that led to VIRTIO_F_ACCESS_PLATFORM
> > existing in the first place; the legacy behaviour we get when that bit
> > *isn't* set would never have happened.
>
> You'll need to define the semantics for VIRTIO_F_ACCESS_PLATFORM only
> then.

You mean the semantics for VIRTIO_F_ACCESS_PLATFORM only, without
VIRTIO_F_SWIOTLB? Are those not defined already?

> And the only sane answer there is: don't allow non-translated
> regions at all, and in a broader sense stop people from using
> VIRTIO_F_ACCESS_PLATFORM at all, or at least for anything that requires
> a new feature bit.
>
> > > For mmio it is my understanding that the "restricted" does the same
> > > already? or is it required in the spec for some reason?
> >
> > No, it's exactly the same. But I still don't trust driver authors to
> > realise the obvious, or VMM implementations either for that matter.
> >
> > I'm not sure I see the *harm* in spelling out explicitly for the hard-
> > of-thinking.
>
> Write a whitepaper then and explain how it maps to the existing
> perfectly working features. Note that VIRTIO_F_ACCESS_PLATFORM, just
> like everything in virtio, would actually benefit from being turned
> into proper spec language, but anecdotes about random use cases are not
> helpful.

Hm. I was just trying to point out what seemed obvious, that when a PCI
device does 'DMA' to an address region which is actually within one of
its *own* BARs, it isn't going to reach the PCI bus and get translated
by an IOMMU. If it's causing this much contention, I'll just drop it.
It didn't *change* anything anyway, except hopefully avoiding bugs in
implementations.
On Thu, Apr 03, 2025 at 08:37:20AM +0100, David Woodhouse wrote:
> On Thu, 2025-04-03 at 00:29 -0700, Christoph Hellwig wrote:
> > On Wed, Apr 02, 2025 at 06:10:53PM +0100, David Woodhouse wrote:
> > > > I know a bit more about PCI, and for PCI I prefer just not saying
> > > > anything. The platform already defines whether it is behind an iommu
> > > > or not, and duplication is not good.
> > >
> > > Not a hill for me to die on I suppose, but I would personally prefer to
> > > spell it out in words of one syllable or fewer, to make *sure* that
> > > device and driver authors get it right even though it's "obvious".
> > >
> > > After all, if we could trust them to do their thinking, we would never
> > > have had the awful situation that led to VIRTIO_F_ACCESS_PLATFORM
> > > existing in the first place; the legacy behaviour we get when that bit
> > > *isn't* set would never have happened.
> >
> > You'll need to define the semantics for VIRTIO_F_ACCESS_PLATFORM only
> > then.
>
> You mean the semantics for VIRTIO_F_ACCESS_PLATFORM only, without
> VIRTIO_F_SWIOTLB? Are those not defined already?
>
> > And the only sane answer there is: don't allow non-translated
> > regions at all, and in a broader sense stop people from using
> > VIRTIO_F_ACCESS_PLATFORM at all, or at least for anything that requires
> > a new feature bit.
> >
> > > > For mmio it is my understanding that the "restricted" does the same
> > > > already? or is it required in the spec for some reason?
> > >
> > > No, it's exactly the same. But I still don't trust driver authors to
> > > realise the obvious, or VMM implementations either for that matter.
> > >
> > > I'm not sure I see the *harm* in spelling out explicitly for the hard-
> > > of-thinking.
> >
> > Write a whitepaper then and explain how it maps to the existing
> > perfectly working features. Note that VIRTIO_F_ACCESS_PLATFORM, just
> > like everything in virtio, would actually benefit from being turned
> > into proper spec language, but anecdotes about random use cases are not
> > helpful.
>
> Hm. I was just trying to point out what seemed obvious, that when a PCI
> device does 'DMA' to an address region which is actually within one of
> its *own* BARs, it isn't going to reach the PCI bus and get translated
> by an IOMMU. If it's causing this much contention, I'll just drop it.
> It didn't *change* anything anyway, except hopefully avoiding bugs in
> implementations.

If you want, I would just spell this out in the transport text then.

"Note that on most existing platforms, and since the BAR is part of the
device itself, access controls generally do not apply to device
accesses there, and this is true even when VIRTIO_F_ACCESS_PLATFORM has
been negotiated".
On Thu, Apr 03, 2025 at 08:37:20AM +0100, David Woodhouse wrote:
> Hm. I was just trying to point out what seemed obvious, that when a PCI
> device does 'DMA' to an address region which is actually within one of
> its *own* BARs,

PCIe devices can't do DMA to their own BARs by definition, see the route
to self rule.

Pretending that they do it by parsing the addresses is bound to fail
because the addresses seen by the driver and the device can be
different.

NVMe got this wrong not just once but twice and is still suffering from
this misunderstanding. If you want to enhance a protocol to support
addressing a local indirection buffer do not treat it as fake DMA
but rather use explicit addressing for it, or you will be in a world of
trouble.
On Thu, 2025-04-03 at 00:39 -0700, Christoph Hellwig wrote:
> On Thu, Apr 03, 2025 at 08:37:20AM +0100, David Woodhouse wrote:
> > Hm. I was just trying to point out what seemed obvious, that when a PCI
> > device does 'DMA' to an address region which is actually within one of
> > its *own* BARs,
>
> PCIe devices can't do DMA to their own BARs by definition, see the route
> to self rule.
>
> Pretending that they do it by parsing the addresses is bound to fail
> because the addresses seen by the driver and the device can be
> different.
>
> NVMe got this wrong not just once but twice and is still suffering from
> this misunderstanding. If you want to enhance a protocol to support
> addressing a local indirection buffer do not treat it as fake DMA
> but rather use explicit addressing for it, or you will be in a world of
> trouble.

This is, of course, the other benefit of pointing out the "obvious".
Because you can get corrected when you've got it wrong :)

Thanks. I'll take a closer look at handling that. I think it's
reasonable for the negotiation of the VIRTIO_F_SWIOTLB feature to be
the thing that switches *all* addresses to be on-device, and the
on-device buffer can't be accessed unless VIRTIO_F_SWIOTLB has been
negotiated.

Which neatly sidesteps the original thing I was trying to clarify
anyway.
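As a rough illustration of what "all addresses are on-device once
VIRTIO_F_SWIOTLB is negotiated" could mean on the driver side, here is a
hypothetical bump allocator that only ever hands out descriptor addresses
inside the transport-provided window. All names and the layout are invented
for the sketch; this is not spec text or real driver code.

/* Hypothetical sketch: with VIRTIO_F_SWIOTLB negotiated, every address
 * placed in a descriptor is carved out of the on-device bounce-buffer
 * window; nothing ever points at system RAM. */
#include <stdint.h>
#include <stdio.h>

struct swiotlb_window {
    uint64_t base;   /* device-visible base of the bounce buffer */
    uint64_t size;
    uint64_t next;   /* trivial bump allocator, no freeing */
};

static int bounce_alloc(struct swiotlb_window *w, uint64_t len, uint64_t *addr)
{
    uint64_t aligned = (w->next + 63) & ~63ULL;   /* arbitrary 64-byte alignment */

    if (aligned + len > w->size)
        return -1;                 /* window exhausted: caller must wait or recycle */
    *addr = w->base + aligned;     /* always within [base, base + size) */
    w->next = aligned + len;
    return 0;
}

int main(void)
{
    struct swiotlb_window w = { .base = 0xfe000000, .size = 1 << 20, .next = 0 };
    uint64_t a;

    if (bounce_alloc(&w, 4096, &a) == 0)
        printf("descriptor addr = 0x%llx\n", (unsigned long long)a);
    return 0;
}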
On Thu, Apr 03, 2025 at 09:10:41AM +0100, David Woodhouse wrote:
> Thanks. I'll take a closer look at handling that. I think it's
> reasonable for the negotiation of the VIRTIO_F_SWIOTLB feature to be
> the thing that switches *all* addresses to be on-device, and the on-
> device buffer can't be accessed unless VIRTIO_F_SWIOTLB has been
> negotiated.
>
> Which neatly sidesteps the original thing I was trying to clarify
> anyway.

Switching all addressing does not sound like a good idea. The main
thing these indirect buffers are used for is as staging points for
P2P DMA, in which case they often are only used for some transfers or
even parts of a transfer. At least for physical virtio devices P2P
is probably not far off with the current GPU craze, and I wouldn't
be surprised if people found uses for paravirt P2P as well.
On 4 April 2025 07:29:02 BST, Christoph Hellwig <hch@infradead.org> wrote:
>On Thu, Apr 03, 2025 at 09:10:41AM +0100, David Woodhouse wrote:
>> Thanks. I'll take a closer look at handling that. I think it's
>> reasonable for the negotiation of the VIRTIO_F_SWIOTLB feature to be
>> the thing that switches *all* addresses to be on-device, and the on-
>> device buffer can't be accessed unless VIRTIO_F_SWIOTLB has been
>> negotiated.
>>
>> Which neatly sidesteps the original thing I was trying to clarify
>> anyway.
>
>Switching all addressing does not sound like a good idea. The main
>thing these indirect buffers are used for is as staging points for
>P2P DMA, in which case they often are only used for some transfers or
>even parts of a transfer. At least for physical virtio devices P2P
>is probably not far off with the current GPU craze, and I wouldn't
>be surprised if people found uses for paravirt P2P as well.

That isn't the original use case I had for it in virtio but that's
useful input, thanks.

Plumbing a 65th address space bit through all descriptors seems
impractical. Would it suffice for the driver to *specify* the location
in the device's DMA address space at which the on-device buffer
appears, to allow it to avoid conflicts with system memory addresses?
Better ideas?
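For what the "driver specifies the location" idea could look like in
practice, here is a purely hypothetical sketch: a config field through which
the driver places the on-device window somewhere that can never collide with
guest RAM. No such field exists in the spec or in this proposal; it is only
an illustration of the question above.

/* Hypothetical config layout: the driver tells the device where, in the
 * device's DMA address space, the on-device buffer should appear.
 * Nothing like this exists in the spec today. */
#include <stdint.h>
#include <stdio.h>

struct hypothetical_swiotlb_cfg {
    uint64_t window_len;   /* read-only: size of the on-device buffer */
    uint64_t window_base;  /* read-write: driver-chosen placement     */
};

/* Pick a base at or above the end of guest RAM, so bounce addresses can
 * never be confused with system-memory addresses. */
static uint64_t place_window(uint64_t guest_ram_end, uint64_t align)
{
    return (guest_ram_end + align - 1) & ~(align - 1);
}

int main(void)
{
    struct hypothetical_swiotlb_cfg cfg = { .window_len = 1 << 20 };

    cfg.window_base = place_window(4ULL << 30 /* 4 GiB of RAM */, 1ULL << 30);
    printf("window at 0x%llx, len 0x%llx\n",
           (unsigned long long)cfg.window_base,
           (unsigned long long)cfg.window_len);
    return 0;
}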
On Fri, Apr 04, 2025 at 07:39:02AM +0100, David Woodhouse wrote:
> Plumbing a 65th address space bit through all descriptors seems
> impractical. Would it suffice for the driver to *specify* the location
> in the device's DMA address space at which the on-device buffer
> appears, to allow it to avoid conflicts with system memory addresses?
> Better ideas?

That's what NVMe currently does in its second attempt to get things
wrong. It mostly works for the current use cases, but there might
actually not be a single canonical address where it is mapped. This
was mostly theoretical, but with SVM and especially SIOV it is becoming
a real thing now.

In the worst case you might have to limit yourself to one address space
per request. Or have a bitmap of which buffer in the request uses which
address space.
On Thu, Apr 03, 2025 at 11:44:40PM -0700, Christoph Hellwig wrote:
> That's what NVMe currently does in its second attempt to get things
> wrong. It mostly works for the current use cases, but there might
> actually not be a single canonical address where it is mapped. This
> was mostly theoretical, but with SVM and especially SIOV it is becoming
> a real thing now.
>
> In the worst case you might have to limit yourself to one address space
> per request. Or have a bitmap of which buffer in the request uses which
> address space.

Also with unordered I/O in the newer PCIe specs we might need another
such address space bit to force use of that for certain ranges, although
system use cases for UIO are still not finalized.
On Thu, Apr 03, 2025 at 12:39:39AM -0700, Christoph Hellwig wrote:
> On Thu, Apr 03, 2025 at 08:37:20AM +0100, David Woodhouse wrote:
> > Hm. I was just trying to point out what seemed obvious, that when a PCI
> > device does 'DMA' to an address region which is actually within one of
> > its *own* BARs,
>
> PCIe devices can't do DMA to their own BARs by definition, see the route
> to self rule.
>
> Pretending that they do it by parsing the addresses is bound to fail
> because the addresses seen by the driver and the device can be
> different.
>
> NVMe got this wrong not just once but twice and is still suffering from
> this misunderstanding. If you want to enhance a protocol to support
> addressing a local indirection buffer do not treat it as fake DMA
> but rather use explicit addressing for it, or you will be in a world of
> trouble.

Hm. This is what this proposal does, indeed. Can be changed though -
since all accesses are now within the SWIOTLB, they can be treated
as offsets instead.

-- 
MST
On Thu, Apr 03, 2025 at 03:43:31AM -0400, Michael S. Tsirkin wrote:
> Hm. This is what this proposal does, indeed. Can be changed though -
> since all accesses are now within the SWIOTLB, they can be treated
> as offsets instead.

Yes, marking them as offsets into the BAR (including potentially
picking a specific BAR or region if there is more than one) is the
right interface, and is implemented by various hardware that way.
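A sketch of the addressing model the thread converges on here, with a
hypothetical structure that is not spec text: descriptors carry a region
selector plus an offset into that BAR or region rather than a bus address,
so driver and device never have to agree on an absolute address for the
on-device buffer.

/* Hypothetical descriptor addressing: (region, offset) instead of a bus
 * address. Region 0 could mean "system memory under the normal DMA
 * rules", region 1 "the on-device bounce buffer"; offsets are relative
 * to the start of the selected region. Not spec text. */
#include <stdint.h>
#include <stdio.h>

enum addr_region {
    REGION_SYSTEM_MEMORY = 0,
    REGION_ONDEV_BUFFER  = 1,
};

struct desc_addr {
    uint16_t region;   /* which address space the offset refers to */
    uint64_t offset;   /* offset within that region */
    uint32_t len;
};

static struct desc_addr ondev_desc(uint64_t window_off, uint32_t len)
{
    struct desc_addr d = {
        .region = REGION_ONDEV_BUFFER,
        .offset = window_off,   /* relative to the BAR/region start */
        .len    = len,
    };
    return d;
}

int main(void)
{
    struct desc_addr d = ondev_desc(0x1000, 4096);

    printf("region=%u offset=0x%llx len=%u\n",
           (unsigned)d.region, (unsigned long long)d.offset, (unsigned)d.len);
    return 0;
}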