From: David Woodhouse <dwmw@amazon.co.uk>
Device-tree bindings for `restricted-dma-pool` were defined in 2021. They
allow a device to be restricted to a given SWIOTLB pool instead of being
allowed to perform DMA to arbitrary system memory:
https://lore.kernel.org/all/20210624155526.2775863-1-tientzu@chromium.org/

This facility was not specific to virtio-mmio, but it does apply to it. No
attempt was made to ensure backward compatibility for virtio-mmio devices.

Define a VIRTIO_F_SWIOTLB feature which allows the device and driver to
agree on the use of the SWIOTLB, if present. This enables the device to
refuse to operate further if the driver does not support the SWIOTLB
requirement expressed in the device-tree.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
content.tex | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/content.tex b/content.tex
index c17ffa6..63d075f 100644
--- a/content.tex
+++ b/content.tex
@@ -773,6 +773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
Currently these device-independent feature bits are defined:
\begin{description}
+ \item[VIRTIO_F_SWIOTLB (27)] This feature indicates that the device
+ transport provides a memory region which is to be used for bounce
+ buffering, rather than permitting direct memory access to system memory.
\item[VIRTIO_F_INDIRECT_DESC (28)] Negotiating this feature indicates
that the driver can use descriptors with the VIRTQ_DESC_F_INDIRECT
flag set, as described in \ref{sec:Basic Facilities of a Virtio
@@ -807,6 +810,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
the driver. When clear, this overrides any platform-specific description of
whether device access is limited or translated in any way, e.g.
whether an IOMMU may be present.
+ If the device transport provides a software IOTLB bounce buffer,
+ addresses within its range are not subject to the requirements of
+ VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''.
\item[VIRTIO_F_RING_PACKED(34)] This feature indicates
support for the packed virtqueue layout as described in
\ref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}~\nameref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}.
@@ -885,6 +891,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
VIRTIO_F_ACCESS_PLATFORM is not offered, then a driver MUST pass only physical
addresses to the device.
+A driver SHOULD accept VIRTIO_F_SWIOTLB if it is offered, and it MUST
+then pass only addresses within the Software IOTLB bounce buffer to the
+device.
+
A driver SHOULD accept VIRTIO_F_RING_PACKED if it is offered.
A driver SHOULD accept VIRTIO_F_ORDER_PLATFORM if it is offered.
@@ -921,6 +931,8 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not
accepted.
+A device MAY fail to operate further if VIRTIO_F_SWIOTLB is not accepted.
+
If VIRTIO_F_IN_ORDER has been negotiated, a device MUST use
buffers in the same order in which they have been available.
--
2.49.0
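
A minimal sketch of the driver-side rule the patch adds ("pass only addresses
within the Software IOTLB bounce buffer"): once the proposed VIRTIO_F_SWIOTLB
bit is negotiated, every address the driver queues must fall inside the region
the transport provides. The structure and helper names below are illustrative
assumptions, not spec text or any existing API.

#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_SWIOTLB 27ULL /* proposed feature bit */

struct bounce_region {
	uint64_t base; /* bus address of the region provided by the transport */
	uint64_t len;  /* size of the region in bytes */
};

/* Return true if [addr, addr + size) lies entirely within the region. */
static bool addr_in_bounce_region(const struct bounce_region *r,
				  uint64_t addr, uint64_t size)
{
	return addr >= r->base &&
	       size <= r->len &&
	       addr - r->base <= r->len - size; /* overflow-safe upper check */
}

/* Called before queueing a buffer when VIRTIO_F_SWIOTLB was negotiated. */
static bool may_queue(uint64_t negotiated_features,
		      const struct bounce_region *r,
		      uint64_t addr, uint64_t size)
{
	if (!(negotiated_features & (1ULL << VIRTIO_F_SWIOTLB)))
		return true; /* no restriction negotiated */
	return addr_in_bounce_region(r, addr, size);
}
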
On Wed, Apr 02, 2025 at 12:04:45PM +0100, David Woodhouse wrote:
> Define a VIRTIO_F_SWIOTLB feature which allows the device and driver to agree on the use of the SWIOTLB, if present. This enables the device to refuse to operate further if the driver does not support the SWIOTLB requirement expressed in the device-tree.

This makes no sense whatsoever. Swiotlb is a Linux guest implementation detail, virtio is a transport protocol. Mixing them in the same spec doesn't even compute. Please describe the actual on the wire semantics you want, and don't use the term swiotlb.

On Thu, 2025-04-03 at 00:24 -0700, Christoph Hellwig wrote:
> On Wed, Apr 02, 2025 at 12:04:45PM +0100, David Woodhouse wrote:
> > Define a VIRTIO_F_SWIOTLB feature which allows the device and driver to agree on the use of the SWIOTLB, if present. This enables the device to refuse to operate further if the driver does not support the SWIOTLB requirement expressed in the device-tree.
>
> This makes no sense whatsoever. Swiotlb is a Linux guest implementation detail, virtio is a transport protocol. Mixing them in the same spec doesn't even compute. Please describe the actual on the wire semantics you want, and don't use the term swiotlb.

Linux has 'SWIOTLB' support, but I didn't see it as a term specific to that particular implementation; it seemed like a concept that would be understood elsewhere. But sure, I'm happy to rename it to just 'bounce buffers' or something like that. I'll see what I can come up with that is succinct enough to use in VIRTIO_F_xxx and VIRTIO_PCI_CAP_xxx names.

On Thu, Apr 03, 2025 at 08:28:19AM +0100, David Woodhouse wrote:
> Linux has 'SWIOTLB' support, but I didn't see it as a term specific to that particular implementation; it seemed like a concept that would be understood elsewhere. But sure, I'm happy to rename it to just 'bounce buffers' or something like that. I'll see what I can come up with that is succinct enough to use in VIRTIO_F_xxx and VIRTIO_PCI_CAP_xxx names.

I think you still miss the point. What you are describing is either:

1) A BAR (or BAR-like region on MMIO) that can be used to stage indirect I/O

and

2a) a device that doesn't have any DMA capabilities, but instead requires use of the above staging region

and/or

2b) a system configuration where devices are prevented from accessing guest memory

1) makes perfect sense in virtio. It is the equivalent of the NVMe CMB and similar features on various devices.

2a) would make sense if we actually have such devices and not just system configs, which I doubt.

2b) is the job of ACPI or DT to communicate, not that of virtio

On Thu, 2025-04-03 at 00:35 -0700, Christoph Hellwig wrote:
> On Thu, Apr 03, 2025 at 08:28:19AM +0100, David Woodhouse wrote:
> > Linux has 'SWIOTLB' support, but I didn't see it as a term specific to that particular implementation; it seemed like a concept that would be understood elsewhere. But sure, I'm happy to rename it to just 'bounce buffers' or something like that. I'll see what I can come up with that is succinct enough to use in VIRTIO_F_xxx and VIRTIO_PCI_CAP_xxx names.
>
> I think you still miss the point. What you are describing is either:
>
> 1) A BAR (or BAR-like region on MMIO) that can be used to stage indirect I/O
>
> and
>
> 2a) a device that doesn't have any DMA capabilities, but instead requires use of the above staging region
>
> and/or
>
> 2b) a system configuration where devices are prevented from accessing guest memory
>
> 1) makes perfect sense in virtio. It is the equivalent of the NVMe CMB and similar features on various devices.
>
> 2a) would make sense if we actually have such devices and not just system configs, which I doubt.
>
> 2b) is the job of ACPI or DT to communicate, not that of virtio

I am designing virtual systems, with the constraint in your (2b): The virtual devices implemented in the *VMM* are prevented from accessing guest memory. (Physical devices are actually fine because there's a two-stage IOMMU which copes with those.)

I am describing the solution in your (1): Give them an on-device bounce-buffer.

The implementation of (1) is to offer them a device which doesn't have DMA capabilities (because *I* as the system designer know that it can't use them). Which is your (2a).

You are right, in theory, that (2b) is the job of ACPI or DT to communicate. But virtio is presented as just another PCI device, indistinguishable by the guest from other PCI devices which *are* actually able to do DMA through the hardware's two-stage IOMMU.

To my knowledge there isn't a clean way, across platforms, to tell an operating system that a certain device just *can't* do DMA at all. And even if there was, why offer a device model that could *theoretically* do DMA, just to tell the guest through some other channel that it can't?

What's wrong with a simple option to indicate, in the device model that the system designer chooses to build for the guest, that it can't do DMA?

On Thu, Apr 03, 2025 at 09:06:37AM +0100, David Woodhouse wrote:
> You are right, in theory, that (2b) is the job of ACPI or DT to communicate. But virtio is presented as just another PCI device, indistinguishable by the guest from other PCI devices which *are* actually able to do DMA through the hardware's two-stage IOMMU.

Then the virtual device should be behind a different IOMMU (or no IOMMU at all) to clearly differentiate them from the assigned hardware devices.

> To my knowledge there isn't a clean way, across platforms, to tell an operating system that a certain device just *can't* do DMA at all.

Well, it's part of the device specification if it can do DMA or not. We had a bunch of early PCI (and a lot of ISA) devices that could not do P2P. Even on PCIe there might be weirdo devices that just expose a BAR or have some bitbanged i2c or similar. Now if you mean virtio devices - yes, there's no way currently, as virtio is quite a fundamentally DMA-based model.

> And even if there was, why offer a device model that could *theoretically* do DMA, just to tell the guest through some other channel that it can't?

That would be the normal way. Because virtio devices fundamentally can do DMA. It's just that your guest specification doesn't want them to.

> What's wrong with a simple option to indicate, in the device model that the system designer chooses to build for the guest, that it can't do DMA?

As stated above I suspect you are still asking the wrong question and have the wrong mental model. Virtio fundamentally can do DMA, you just don't want it to. And based on the other subthread I also suspect you'd actually be much better off with your bounce buffer in the virtual host memory instead of coming up with this particularly odd case, but even if not it's the system description that matters here, not the device model.

On Thu, 2025-04-03 at 23:35 -0700, Christoph Hellwig wrote:
> As stated above I suspect you are still asking the wrong question and have the wrong mental model. Virtio fundamentally can do DMA, you just don't want it to. And based on the other subthread I also suspect you'd actually be much better off with your bounce buffer in the virtual host memory instead of coming up with this particularly odd case, but even if not it's the system description that matters here, not the device model.

I do agree, this is fundamentally a system issue. In a CoCo model, it's non-trivial for the system to allow *virtual* devices to do "DMA" because that actually means allowing the VMM to access arbitrary guest memory. And most hardware designers screwed up the 2-stage IOMMU which allows protection from *real* devices, and didn't build in a way for the VMM to access guest memory through a stage1 mapping.

Fixing that to allow system DMA to those guest devices in a way which will work across operating systems is impractical, as it involves the guest knowing which PCI device is implemented where, and either core kernel (share/unshare) enlightenments, or emulating a full IOMMU in the *trusted* part of the hypervisor (e.g. pKVM).

So "for the emulated devices, just use a device model that doesn't do arbitrary DMA to system memory" is a nice simple answer, and keeps the guest support restricted to its *own* standalone driver.

And yes, virtio is kind of fundamentally DMA-based as you said, but that's OK if the DMA is only to a specific range which is *known* to be shareable and which the guest would never consider to be 'normal RAM'. Because then the VMM can be *allowed* to access that.

What's annoying is that this should work out of the box *already* with virtio-mmio and a `restricted-dma-pool` — for systems which aren't afflicted by UEFI/ACPI/PCI as their discovery mechanisms. In fact, I may even try offering virtio-mmio in a PRP0001 ACPI device to see how that works, although I think I'll lament the lack of MSI interrupts.

But the cleaner solution seemed (at least when I started this discussion) to be to add the same kind of facility to the virtio-pci transport too. And I still don't think I've quite given up on it; especially as you point out the use case for staged P2P DMA too. I'll go take a closer look at how we can cleanly disambiguate between address spaces.

On Fri, Apr 04, 2025 at 08:50:47AM +0100, David Woodhouse wrote:
> I do agree, this is fundamentally a system issue. In a CoCo model, it's non-trivial for the system to allow *virtual* devices to do "DMA" because that actually means allowing the VMM to access arbitrary guest memory.
>
> So "for the emulated devices, just use a device model that doesn't do arbitrary DMA to system memory" is a nice simple answer, and keeps the guest support restricted to its *own* standalone driver.

It's also one that completely breaks the abstraction. I still don't understand what the problem is with having the paravirtualized devices on a different part of the virtual PCIe topology so that the stage2 IOMMU isn't used for them, but instead just the direct mapping or a stub viommu that blocks all access.

> What's annoying is that this should work out of the box *already* with virtio-mmio and a `restricted-dma-pool` — for systems which aren't afflicted by UEFI/ACPI/PCI as their discovery mechanisms.

Yes. And the fix is to get the equivalent to restricted-dma-pool into UEFI/ACPI. That gives you a portable and device-independent way to describe this limitation, which is much better than hacking around it using an odd device model.

On 4 April 2025 09:23:24 BST, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Apr 04, 2025 at 08:50:47AM +0100, David Woodhouse wrote:
> > I do agree, this is fundamentally a system issue. In a CoCo model, it's non-trivial for the system to allow *virtual* devices to do "DMA" because that actually means allowing the VMM to access arbitrary guest memory.
> >
> > So "for the emulated devices, just use a device model that doesn't do arbitrary DMA to system memory" is a nice simple answer, and keeps the guest support restricted to its *own* standalone driver.
>
> It's also one that completely breaks the abstraction.

Hm? Having a device that simply doesn't *do* any DMA surely doesn't break any system DMA / IOMMU abstractions because it's no longer even relevant to them, which is kind of the point. Which abstraction did you mean?

> I still don't understand what the problem is with having the paravirtualized devices on a different part of the virtual PCIe topology so that the stage2 IOMMU isn't used for them, but instead just the direct mapping or a stub viommu that blocks all access.

It can't have a direct mapping because the VMM can't access guest memory. It has to be blocked, which is fine.

But that's only part of the picture — then how do you actually get data in/out of the device? By having on-device memory and not attempting DMA to system memory, perhaps...? :)

> > What's annoying is that this should work out of the box *already* with virtio-mmio and a `restricted-dma-pool` — for systems which aren't afflicted by UEFI/ACPI/PCI as their discovery mechanisms.
>
> Yes. And the fix is to get the equivalent to restricted-dma-pool into UEFI/ACPI. That gives you a portable and device-independent way to describe this limitation, which is much better than hacking around it using an odd device model.

It still has the problem that existing drivers in all operating systems produced before 2030 will see the device and try to use it as-is, with no comprehension of this new thing.

On Fri, Apr 04, 2025 at 10:39:35AM +0100, David Woodhouse wrote:
> > > So "for the emulated devices, just use a device model that doesn't do arbitrary DMA to system memory" is a nice simple answer, and keeps the guest support restricted to its *own* standalone driver.
> >
> > It's also one that completely breaks the abstraction.
>
> Hm? Having a device that simply doesn't *do* any DMA surely doesn't break any system DMA / IOMMU abstractions because it's no longer even relevant to them, which is kind of the point.
>
> Which abstraction did you mean?

Implementing a DMA-less device is fine. Implementing a DMA-less virtual device to work around the lack of CoCo shared bounce buffers in the system is not.

> > I still don't understand what the problem is with having the paravirtualized devices on a different part of the virtual PCIe topology so that the stage2 IOMMU isn't used for them, but instead just the direct mapping or a stub viommu that blocks all access.
>
> It can't have a direct mapping because the VMM can't access guest memory. It has to be blocked, which is fine.

I only mentioned direct mapping in the case of not having an IOMMU. I fully expect it not to work.

> But that's only part of the picture — then how do you actually get data in/out of the device?
>
> By having on-device memory and not attempting DMA to system memory, perhaps...? :)

By implementing the discovery of a shared pool in host memory as discussed in another branch of this thread?

> > describe this limitation, which is much better than hacking around it using an odd device model.
>
> It still has the problem that existing drivers in all operating systems produced before 2030 will see the device and try to use it as-is, with no comprehension of this new thing.

Given how much enablement the various CoCo schemes need it won't work anyway.

Btw, your mail formatting in the last few days went completely crazy.

On Mon, 2025-04-07 at 00:34 -0700, Christoph Hellwig wrote:
> > > describe this limitation, which is much better than hacking around it using an odd device model.
> >
> > It still has the problem that existing drivers in all operating systems produced before 2030 will see the device and try to use it as-is, with no comprehension of this new thing.
>
> Given how much enablement the various CoCo schemes need it won't work anyway.

Not all CoCo-style schemes need any enablement on the guest side at all. It's perfectly feasible to have a pKVM-like model where the VMM (and host Linux) are absolutely prevented from accessing guest memory, without the guest knowing *anything* about it at all.

It's not about attestation to the guest. Deprivileging the host kernel provides a huge amount of protection against both software and CPU speculation bugs across the whole of that massive code base. And you can do it with *full* compatibility. The guest just sees a standard SMMU for its *actual* network and block devices, which are PCI passthrough. (Yes, the guest does need to *use* the SMMU).

The caveat is when we have *emulated* devices, like virtio-vsock. That brings us to LART the IOMMU designers for building a two-stage IOMMU without considering emulated PCI devices, and then to start this conversation we're having now about how to do it.

> Btw, your mail formatting in the last few days went completely crazy.

Yeah, that's K-9 mail on Android not word wrapping; I'll see if I can fix it. At least it's a mobile client that doesn't send HTML :)

On Fri, Apr 04, 2025 at 08:50:47AM +0100, David Woodhouse wrote:
> What's annoying is that this should work out of the box *already* with virtio-mmio and a `restricted-dma-pool` — for systems which aren't afflicted by UEFI/ACPI/PCI as their discovery mechanisms.

That specifically would be just a driver bugfix then?

On Fri, 2025-04-04 at 04:09 -0400, Michael S. Tsirkin wrote:
> On Fri, Apr 04, 2025 at 08:50:47AM +0100, David Woodhouse wrote:
> > What's annoying is that this should work out of the box *already* with virtio-mmio and a `restricted-dma-pool` — for systems which aren't afflicted by UEFI/ACPI/PCI as their discovery mechanisms.
>
> That specifically would be just a driver bugfix then?

I actually think it works out of the box and there isn't even a bug to fix. Haven't tested yet.

The sad part is that the system does it all automatically *if* it has CONFIG_DMA_RESTRICTED_POOL (e.g. Linux) and the driver never even notices that the dma_ops it's using are the swiotlb ops using the provided buffer.

Which is *kind* of nice... except that when on a guest OS which *isn't* Linux with CONFIG_DMA_RESTRICTED_POOL, the guest will just ignore the `restricted-dma-pool` node and try DMA to system memory anyway, which will fail.

That's why my proposal adds the negotiated VIRTIO_F_SWIOTLB feature, so that the device side can refuse, if the guest *isn't* agreeing to use the bounce buffer in the situations where it must do so.

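A minimal sketch of that device-side refusal, assuming the proposed VIRTIO_F_SWIOTLB semantics: at FEATURES_OK time the device model checks whether the driver acknowledged the bit and, if the bounce buffer is mandatory for this system, declines to operate. The structure and function names are illustrative assumptions, not any existing API.

#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_SWIOTLB 27ULL /* proposed feature bit */

struct vdev_state {
	uint64_t offered_features;  /* what the device offered */
	uint64_t driver_features;   /* what the driver wrote back */
	bool     bounce_mandatory;  /* VMM cannot reach arbitrary guest RAM */
};

/* Decide whether to accept FEATURES_OK from the driver. */
static bool device_accept_features(const struct vdev_state *d)
{
	bool acked = d->driver_features & (1ULL << VIRTIO_F_SWIOTLB);

	/*
	 * "A device MAY fail to operate further if VIRTIO_F_SWIOTLB is not
	 * accepted": a device whose VMM genuinely cannot touch guest memory
	 * would treat the missing ack as fatal; one that can limp along
	 * (e.g. for compatibility with older drivers) might not.
	 */
	if (d->bounce_mandatory && !acked)
		return false;
	return true;
}
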
On Fri, Apr 04, 2025 at 09:16:44AM +0100, David Woodhouse wrote:
> The sad part is that the system does it all automatically *if* it has CONFIG_DMA_RESTRICTED_POOL (e.g. Linux) and the driver never even notices that the dma_ops it's using are the swiotlb ops using the provided buffer.
>
> Which is *kind* of nice... except that when on a guest OS which *isn't* Linux with CONFIG_DMA_RESTRICTED_POOL, the guest will just ignore the `restricted-dma-pool` node and try DMA to system memory anyway, which will fail.
>
> That's why my proposal adds the negotiated VIRTIO_F_SWIOTLB feature, so that the device side can refuse, if the guest *isn't* agreeing to use the bounce buffer in the situations where it must do so.

That all assumes the indirect DMA is a device "feature". It's not. It is a limitation placed on the guest system.

On Fri, Apr 04, 2025 at 09:16:44AM +0100, David Woodhouse wrote:
> On Fri, 2025-04-04 at 04:09 -0400, Michael S. Tsirkin wrote:
> > On Fri, Apr 04, 2025 at 08:50:47AM +0100, David Woodhouse wrote:
> > > What's annoying is that this should work out of the box *already* with virtio-mmio and a `restricted-dma-pool` — for systems which aren't afflicted by UEFI/ACPI/PCI as their discovery mechanisms.
> >
> > That specifically would be just a driver bugfix then?
>
> I actually think it works out of the box and there isn't even a bug to fix. Haven't tested yet.
>
> The sad part is that the system does it all automatically *if* it has CONFIG_DMA_RESTRICTED_POOL (e.g. Linux) and the driver never even notices that the dma_ops it's using are the swiotlb ops using the provided buffer.
>
> Which is *kind* of nice... except that when on a guest OS which *isn't* Linux with CONFIG_DMA_RESTRICTED_POOL, the guest will just ignore the `restricted-dma-pool` node and try DMA to system memory anyway, which will fail.

I mean, it's easy to misconfigure Linux, this is why we love it ;) Why is this such a concern?

> That's why my proposal adds the negotiated VIRTIO_F_SWIOTLB feature, so that the device side can refuse, if the guest *isn't* agreeing to use the bounce buffer in the situations where it must do so.

OTOH, if you set this feature and make the device force it, you are breaking guests using restricted-dma-pool which worked previously, no?

-- MST

On 4 April 2025 09:32:39 BST, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Fri, Apr 04, 2025 at 09:16:44AM +0100, David Woodhouse wrote:
> > [...]
>
> I mean, it's easy to misconfigure Linux, this is why we love it ;) Why is this such a concern?

Because it's incompatible. In the DT world, perhaps this new *non-optional* feature/restriction should have come with a new "compatible" string such as "virtio-mmio-restricted-dma".

Adding it without backwards compatibility wasn't ideal.

> > That's why my proposal adds the negotiated VIRTIO_F_SWIOTLB feature, so that the device side can refuse, if the guest *isn't* agreeing to use the bounce buffer in the situations where it must do so.
>
> OTOH, if you set this feature and make the device force it, you are breaking guests using restricted-dma-pool which worked previously, no?

Yes. So a platform offering virtio-mmio with restricted DMA, if the driver doesn't accept the offered VIRTIO_F_SWIOTLB, may want to accept that negotiation anyway, and *hope* that the driver/OS are going to use the buffer anyway.

I just didn't want to make that same mistake again when formalising and documenting this, and especially when attempting to extend it to PCI.

On Fri, 2025-04-04 at 10:27 +0100, David Woodhouse wrote:
> [...]
>
> Yes. So a platform offering virtio-mmio with restricted DMA, if the driver doesn't accept the offered VIRTIO_F_SWIOTLB, may want to accept that negotiation anyway, and *hope* that the driver/OS are going to use the buffer anyway.
>
> I just didn't want to make that same mistake again when formalising and documenting this, and especially when attempting to extend it to PCI.

Of course, the beauty of the restricted-dma-pool as supported by DT is that it's a *system* memory buffer, which is actually OK as long as it's reserved address space and not just part of normal system memory that an unsuspecting guest might use for general purposes. So the trusted part of the hypervisor (e.g. pKVM) can *allow* the VMM access to that space.

It doesn't *have* to be on-device. That just seemed like the more natural way to do it for PCI.

I suppose we *could* allow for the virtio-pci transport to do it the same way as virtio-mmio though. The VIRTIO_PCI_CAP_SWIOTLB capability¹ could reference a range of system memory space, just like the `restricted-dma-pool` property does.

It's a weird abstraction especially for a physical PCI device to do that because the system memory space is outside its ownership. But in a physical device it could be writable, and you could consider it the responsibility of the system firmware to configure it appropriately, in accordance with the IOMMU and other DMA restrictions of the platform.

That does solve it for the CoCo case without addressing the P2P staging case that Christoph mentions, though.

¹ I will rename it, Christoph, if it survives at all. Probably VIRTIO_F_RESTRICTED_DMA and VIRTIO_PCI_CAP_RESTRICTED_DMA but of course it depends on the semantics we conclude it should have.

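One possible shape for such a capability, sketched as C structs and modeled on the generic struct virtio_pci_cap from the virtio 1.x spec. The VIRTIO_PCI_CAP_SWIOTLB cfg_type and the addr/len fields are hypothetical; they are exactly the semantics still being debated here.

#include <stdint.h>

/* Existing generic virtio PCI vendor capability header (virtio 1.x spec);
 * all multi-byte fields are little-endian on the wire. */
struct virtio_pci_cap {
	uint8_t  cap_vndr;   /* PCI_CAP_ID_VNDR */
	uint8_t  cap_next;   /* next capability pointer */
	uint8_t  cap_len;    /* capability length */
	uint8_t  cfg_type;   /* identifies the structure */
	uint8_t  bar;        /* which BAR (if any) */
	uint8_t  id;         /* multiple caps of the same type */
	uint8_t  padding[2];
	uint32_t offset;     /* offset within the BAR */
	uint32_t length;     /* length of the structure */
};

/* Hypothetical extension discussed above: instead of (or as well as)
 * pointing into a BAR, describe a range of system memory address space
 * that the device is able to reach, like `restricted-dma-pool` does. */
struct virtio_pci_swiotlb_cap {
	struct virtio_pci_cap cap;  /* cfg_type = VIRTIO_PCI_CAP_SWIOTLB (TBD) */
	uint64_t addr;              /* base of the bounce-buffer range */
	uint64_t len;               /* size of the range in bytes */
};
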
On Fri, Apr 04, 2025 at 11:15:33AM +0100, David Woodhouse wrote:
> Of course, the beauty of the restricted-dma-pool as supported by DT is that it's a *system* memory buffer, which is actually OK as long as it's reserved address space and not just part of normal system memory that an unsuspecting guest might use for general purposes. So the trusted part of the hypervisor (e.g. pKVM) can *allow* the VMM access to that space.
>
> It doesn't *have* to be on-device. That just seemed like the more natural way to do it for PCI.

It isn't very natural. For actual PCI devices it is a lot less efficient compared to bounce buffering in special host memory. For virtual devices that's not the case, but at least for Linux guests it still makes the setup a lot more complicated, as we can't just reuse the restricted-dma-pool but need a new version of it that uses MMIO helpers and device-specific setup code.

On Fri, Apr 04, 2025 at 11:15:33AM +0100, David Woodhouse wrote:
> [...]
>
> I suppose we *could* allow for the virtio-pci transport to do it the same way as virtio-mmio though. The VIRTIO_PCI_CAP_SWIOTLB capability¹ could reference a range of system memory space, just like the `restricted-dma-pool` property does.
>
> It's a weird abstraction especially for a physical PCI device to do that because the system memory space is outside its ownership. But in a physical device it could be writable, and you could consider it the responsibility of the system firmware to configure it appropriately, in accordance with the IOMMU and other DMA restrictions of the platform.
>
> That does solve it for the CoCo case without addressing the P2P staging case that Christoph mentions, though.
>
> ¹ I will rename it, Christoph, if it survives at all. Probably VIRTIO_F_RESTRICTED_DMA and VIRTIO_PCI_CAP_RESTRICTED_DMA but of course it depends on the semantics we conclude it should have.

OK. So basically, all this does, is a promise by the driver to only DMA into a range of memory? This part, I get. I wouldn't put it in a capability, just in config space then.

What I don't get, is what does the *device* want, exactly? Here's a vague idea to explain the question: some embedded devices can have addressing restrictions, such as the number of bits of an address. In another example, the legacy balloon device only supports addresses up to 48 bits. These can still be useful if the driver does not use the inaccessible addresses.

So far so good? Does this look like a generalization of your idea?

Now, a question: the DMA API under Linux, at least, can actually work around device limitations using a bounce buffer. It does, however, need to know what to work around. So, maybe the device needs to expose a range? But what is the range in your case?

Maybe the new registers list the range of addresses the device can access, and the driver promises to be within that range by negotiating the feature bit?

-- MST

On Fri, Apr 04, 2025 at 06:37:16AM -0400, Michael S. Tsirkin wrote:
> some embedded devices can have addressing restrictions, such as the number of bits of an address. In another example, the legacy balloon device only supports addresses up to 48 bits. These can still be useful if the driver does not use the inaccessible addresses.

Let's not mix that up. The DMA API handles addressing limitations just fine, which is totally different from only allowing access to a specific region no matter where it sits.

> Now, a question: the DMA API under Linux, at least, can actually work around device limitations using a bounce buffer. It does, however, need to know what to work around. So, maybe the device needs to expose a range? But what is the range in your case?
>
> Maybe the new registers list the range of addresses the device can access, and the driver promises to be within that range by negotiating the feature bit?

We do not have an issue with a device here. We have a system design that does not allow devices or the hypervisor to access random guest memory, and because of that DMA of the virtual or physical kind needs to go through a designated area that is accessible outside the guest. Trying to frame this as a device limitation is already causing so much trouble in this thread and it will just get worse.

On Fri, 2025-04-04 at 06:37 -0400, Michael S. Tsirkin wrote:
> On Fri, Apr 04, 2025 at 11:15:33AM +0100, David Woodhouse wrote:
> > [...]
>
> OK. So basically, all this does, is a promise by the driver to only DMA into a range of memory?

Basically, yes.

> This part, I get. I wouldn't put it in a capability, just in config space then.

Sure... but how? There are some things which are defined to be at fixed locations in config space, like the vendor/device IDs, COMMAND, STATUS, BARs, etc. And for the rest of the optional things which might be in the config space of a given device... isn't that exactly what capabilities are for?

> What I don't get, is what does the *device* want, exactly?

The device wants to know that a driver won't try to use it without understanding the restriction. Because otherwise the driver will just give it system addresses for DMA and be sad, without any coherent error/failure report about why. (You could ask the same question about what the *device* wants with VIRTIO_F_ACCESS_PLATFORM, and the answer is much the same.)

Or maybe not the *device* per se, but the *system integrator* wants to know that only operating systems which understand the restriction described above will attempt to drive the device in question.

We could achieve that by presenting the device with a completely new PCI device/vendor ID so that old drivers don't match, or in the DT model you could make a new "compatible" string for it. I chose to use a VIRTIO_F_ bit for it instead, which seemed natural and allows the device model (under the influence of the system integrator) to *choose* whether a failure to negotiate such a bit is fatal or not.

> Here's a vague idea to explain the question: some embedded devices can have addressing restrictions, such as the number of bits of an address. In another example, the legacy balloon device only supports addresses up to 48 bits. These can still be useful if the driver does not use the inaccessible addresses.
>
> So far so good? Does this look like a generalization of your idea?

Makes sense.

> Now, a question: the DMA API under Linux, at least, can actually work around device limitations using a bounce buffer. It does, however, need to know what to work around. So, maybe the device needs to expose a range? But what is the range in your case?

The main thing in the CoCo case is that the range in question must not contain any memory which an unenlightened guest might treat as normal system RAM (because it has to be accessible by the VMM, and that would make it insecure if it's being used as general-purpose RAM). So on x86 it might be an e820-reserved region for example.

I don't think we want the guest OS just *assuming* that there's usable memory in that e820-reserved region, just because some device says that it's actually capable of DMA to those addresses.

So it would probably want a separate object, like the separate `restricted-dma-pool` in DT, which explicitly identifies that range as a DMA bounce-buffer pool. We probably *can* do that even in ACPI with a PRP0001 device today, using a `restricted-dma-pool` compatible property.

Then the OS would need to spot this range in the config space, and say "oh, I *do* have a swiotlb pool this device can reach", and use that. Which is slightly less pretty than the DT model where the device's node explicitly references the handle of the `restricted-dma-pool` node, but I think that's OK. And it does kind of keep "device" capability separate from "system" properties, in a comprehensible way.

> Maybe the new registers list the range of addresses the device can access, and the driver promises to be within that range by negotiating the feature bit?

That should work for the CoCo use case, yes.

On Fri, Apr 04, 2025 at 12:15:52PM +0100, David Woodhouse wrote:
> > What I don't get, is what does the *device* want, exactly?
>
> The device wants to know that a driver won't try to use it without understanding the restriction. Because otherwise the driver will just give it system addresses for DMA and be sad, without any coherent error/failure report about why. (You could ask the same question about what the *device* wants with VIRTIO_F_ACCESS_PLATFORM, and the answer is much the same.)
>
> Or maybe not the *device* per se, but the *system integrator* wants to know that only operating systems which understand the restriction described above will attempt to drive the device in question.
>
> We could achieve that by presenting the device with a completely new PCI device/vendor ID so that old drivers don't match, or in the DT model you could make a new "compatible" string for it. I chose to use a VIRTIO_F_ bit for it instead, which seemed natural and allows the device model (under the influence of the system integrator) to *choose* whether a failure to negotiate such a bit is fatal or not.

Let's focus on the mmio part, for simplicity. So IIUC there's a devicetree attribute restricted dma, that guests currently simply ignore. You want to fix it in the guest, but you also want to find a clean way to detect that it's fixed. Right?

And if so, my question is, why this specific bug especially? There likely are a ton of bugs, some more catastrophic than just crashing the guest, like data corruption. Is it because we were supposed to add it to the virtio spec but did not?

-- MST

On Mon, 2025-04-07 at 08:14 -0400, Michael S. Tsirkin wrote:
> On Fri, Apr 04, 2025 at 12:15:52PM +0100, David Woodhouse wrote:
> > [...]
>
> Let's focus on the mmio part, for simplicity. So IIUC there's a devicetree attribute restricted dma, that guests currently simply ignore.

No, I think Linux guests with CONFIG_DMA_RESTRICTED_POOL enabled will honour it just fine and do the right thing. For virtio-mmio.

> You want to fix it in the guest, but you also want to find a clean way to detect that it's fixed. Right?

For virtio-mmio I believe there's no bug to fix. Sure, it would have been nice if a feature bit had been added when this attribute was defined, so that a device which *knows* it has this restriction could gracefully decline to operate with a driver which doesn't understand it. But as you pointed out, that ship has sailed. Enforcing a new feature bit now for virtio-mmio devices would break compatibility even with existing drivers in kernels which *do* have CONFIG_DMA_RESTRICTED_POOL enabled and which do work today. There's nothing to fix here for virtio-mmio.

> And if so, my question is, why this specific bug especially? There likely are a ton of bugs, some more catastrophic than just crashing the guest, like data corruption. Is it because we were supposed to add it to the virtio spec but did not?

If I could easily expose virtio-mmio on an ACPI-afflicted guest instead of using virtio-pci, maybe that might suffice¹. But while we might be able to hack that up for Linux guests, I don't think that's feasible for the general case.

So what I'm trying to do here is just bring the same capability to virtio-pci that already exists for virtio-mmio. And if we do so I *would* like to get it right this time and add that feature bit.

¹ It'd want MSI support, and the CONFIG_DMA_RESTRICTED_POOL support in the kernel is a bit OF-specific and probably needs to be made a bit more generic in some places. But I'm prepared to consider those as implementation details, and the latter probably needs to be done anyway for any of the approaches we've considered for supporting this in virtio-pci.

On Fri, Apr 04, 2025 at 12:15:52PM +0100, David Woodhouse wrote:
> We could achieve that by presenting the device with a completely new PCI device/vendor ID so that old drivers don't match, or in the DT model you could make a new "compatible" string for it. I chose to use a VIRTIO_F_ bit for it instead, which seemed natural and allows the device model (under the influence of the system integrator) to *choose* whether a failure to negotiate such a bit is fatal or not.

Stop thinking about devices. Your CoCo VM will have that exact same limitation for all devices, because none of them can DMA into random memory. The solution to the compatibility problem is to just not boot an OS that isn't aware of the exact CoCo scheme into a CoCo VM. Bonus points for figuring out something clever at the system level that lets the boot fail gracefully early on for that case.

> The main thing in the CoCo case is that the range in question must not contain any memory which an unenlightened guest might treat as normal system RAM (because it has to be accessible by the VMM, and that would make it insecure if it's being used as general-purpose RAM).

Yes.

> So on x86 it might be an e820-reserved region for example.

Hasn't e820 been replaced with something more "elaborate" for UEFI systems anyway?

> I don't think we want the guest OS just *assuming* that there's usable memory in that e820-reserved region, just because some device says that it's actually capable of DMA to those addresses.
>
> So it would probably want a separate object, like the separate `restricted-dma-pool` in DT, which explicitly identifies that range as a DMA bounce-buffer pool. We probably *can* do that even in ACPI with a PRP0001 device today, using a `restricted-dma-pool` compatible property.
>
> Then the OS would need to spot this range in the config space, and say "oh, I *do* have a swiotlb pool this device can reach", and use that.

Yes, that's largely how it should work.

On Mon, 2025-04-07 at 00:30 -0700, Christoph Hellwig wrote:
> On Fri, Apr 04, 2025 at 12:15:52PM +0100, David Woodhouse wrote:
> > We could achieve that by presenting the device with a completely new PCI device/vendor ID so that old drivers don't match, or in the DT model you could make a new "compatible" string for it. I chose to use a VIRTIO_F_ bit for it instead, which seemed natural and allows the device model (under the influence of the system integrator) to *choose* whether a failure to negotiate such a bit is fatal or not.
>
> Stop thinking about devices. Your CoCo VM will have that exact same limitation for all devices, because none of them can DMA into random memory.

Nah, most of them are just fine because they're actual passthrough PCI devices behind a proper 2-stage IOMMU.

> > So on x86 it might be an e820-reserved region for example.
>
> Hasn't e820 been replaced with something more "elaborate" for UEFI systems anyway?

Sure, it's no longer literally a BIOS call with 0xe820 in any registers, but Linux still calls it that.

> > I don't think we want the guest OS just *assuming* that there's usable memory in that e820-reserved region, just because some device says that it's actually capable of DMA to those addresses.
> >
> > So it would probably want a separate object, like the separate `restricted-dma-pool` in DT, which explicitly identifies that range as a DMA bounce-buffer pool. We probably *can* do that even in ACPI with a PRP0001 device today, using a `restricted-dma-pool` compatible property.
> >
> > Then the OS would need to spot this range in the config space, and say "oh, I *do* have a swiotlb pool this device can reach", and use that.
>
> Yes, that's largely how it should work.

The problem in ACPI is matching the device to that SWIOTLB pool. I think we can expose a `restricted-dma-pool` node via PRP0001 but then we need to associate a particular device (or set of devices) with that pool. In DT we do that by referencing it from a `memory-region` node of the device itself.

The idea above was that the affected devices would state that they are only capable of DMA to that range, and that's how we'd match them up.

On Mon, Apr 07, 2025 at 08:54:46AM +0100, David Woodhouse wrote:
> On Mon, 2025-04-07 at 00:30 -0700, Christoph Hellwig wrote:
> > Stop thinking about devices. Your CoCo VM will have that exact same limitation for all devices, because none of them can DMA into random memory.
>
> Nah, most of them are just fine because they're actual passthrough PCI devices behind a proper 2-stage IOMMU.

Except for all virtual devices.

> > > Then the OS would need to spot this range in the config space, and say "oh, I *do* have a swiotlb pool this device can reach", and use that.
> >
> > Yes, that's largely how it should work.
>
> The problem in ACPI is matching the device to that SWIOTLB pool. I think we can expose a `restricted-dma-pool` node via PRP0001 but then we need to associate a particular device (or set of devices) with that pool. In DT we do that by referencing it from a `memory-region` node of the device itself.

I don't think you actually _need_ to have an explicit device vs pool match. All pools in host memory (assuming there is more than one) should be usable for all devices bar actual addressing limits that are handled in the DMA layer already. The only things you need are:

a) a way to declare one or more pools

b) a way to distinguish between devices behind a two-stage IOMMU vs not, to figure out if they need to use a pool

On Mon, 2025-04-07 at 02:05 -0700, Christoph Hellwig wrote: > On Mon, Apr 07, 2025 at 08:54:46AM +0100, David Woodhouse wrote: > > On Mon, 2025-04-07 at 00:30 -0700, Christoph Hellwig wrote: > > > On Fri, Apr 04, 2025 at 12:15:52PM +0100, David Woodhouse wrote: > > > > We could achieve that by presenting the device with a completely new > > > > PCI device/vendor ID so that old drivers don't match, or in the DT > > > > model you could make a new "compatible" string for it. I chose to use a > > > > VIRTIO_F_ bit for it instead, which seemed natural and allows the > > > > device model (under the influence of the system integrator) to *choose* > > > > whether a failure to negotiate such bit is fatal or not. > > > > > > Stop thinking about devices. Your CoCo VM will have that exact same > > > limitation for all devices, because none of them can DMA into random > > > memory. > > > > Nah, most of them are just fine because they're actual passthrough PCI > > devices behind a proper 2-stage IOMMU. > > Except for all virtual devices. Yes, that's what I'm saying. And that's also why it's reasonable to have a solution which handles this for virtio devices, without necessarily having to handle it for *arbitrary* emulated PCI devices across the whole system, and without having to change core concepts of DMA handling across all operating systems. This isn't just about Linux guests, and especially not just about Linux guests running running 6.16+ kernels. A solution which can live in a device driver is a *lot* easier to actually get into the hands of users. Not just Windows users, but even the slower-moving Linux distros. > > > > Then the OS would need to spot this range in the config space, and say > > > > "oh, I *do* have a swiotlb pool this device can reach", and use that. > > > > > > Yes, that's largely how it should work. > > > > The problem in ACPI is matching the device to that SWIOTLB pool. I > > think we can expose a `restricted-dma-pool` node via PRP0001 but then > > we need to associate a particular device (or set of devices) to that > > pool. In DT we do that by referencing it from a `memory-region` node of > > the device itself. > > I don't think you actually _need_ to have an explicity device vs pool > match. All pools in host memory (assuming there is more than one) > should be usable for all devices bar actual addressing limits that are > handled in the dma layer already. The only things you need is: > > a) a way to declare one or more pools > b) a way to destinguish between devices behind a two stage IOMMU vs not > to figure out if they need to use a pool I'm not averse to that, but it's different to the `restricted-dma-pool` model that's defined today which has explicit matching. So I'd like to reconcile them — and preferably literally use PRP0001 to expose `restricted-dma-pool` even under ACPI. Maybe it's as simple as a flag/property on the `restricted-dma-pool` node which declares that it's a 'catch-all', and that *all* devices which aren't explicitly bound to an IOMMU or other DMA operations (e.g. explicitly bound to a different restricted-dma-pool) should use it? That's actually what of_dma_configure_id() does today even with the existing `restricted-dma-pool` which *is* explicitly matched to a device; the pool still only gets used if of_configure_iommu() doesn't find a better option. But I would also be entirely OK with following the existing model and having the virtio device itself provide a reference to the restricted- dma-pool. 
Either by its address, or by some other handle/reference. Referencing it by address, which is what Michael and I were discussing, is simple enough. And in this model it *is* a device restriction — that virtio device knows full well that it can only perform DMA to that range of addresses. And it also allows for a standalone device driver to go check for the corresponding ACPI device, claim that memory and set up the bounce buffering for *itself*, in operating systems which don't support it. Although I suppose a standalone device driver can do the same even in the 'catch-all' model. Either way, we'd still ideally want a virtio feature bit to say "don't touch me if you don't understand my DMA restrictions", to prevent older drivers (on older operating systems) from failing. (I know you say that's a 'restriction' not a 'feature', but that's just fine. That's why it's a *negotiation* not simply the device advertising optional features which the driver may choose to use, or ignore, as it sees fit.)
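For illustration, the fallback order being described here (an explicitly bound pool, else a 'catch-all' pool, and in either case only when no real IOMMU is available) amounts to roughly the sketch below. The helper names and the opaque 'struct device' are invented stand-ins, not real kernel APIs; the actual logic lives in of_dma_configure_id() and the IOMMU plumbing.

    #include <stddef.h>
    #include <stdint.h>

    struct device;                          /* opaque stand-in */

    struct restricted_pool {
            uint64_t base;
            size_t size;
    };

    /* All four helpers are hypothetical, for illustration only. */
    extern int platform_setup_iommu_ops(struct device *dev);
    extern int find_pool_bound_to_device(struct device *dev,
                                         struct restricted_pool *out);
    extern int find_catchall_pool(struct restricted_pool *out);
    extern int use_swiotlb_pool(struct device *dev,
                                const struct restricted_pool *pool);

    static int setup_device_dma(struct device *dev)
    {
            struct restricted_pool pool;

            /* A real (two-stage) IOMMU always takes precedence over any pool. */
            if (platform_setup_iommu_ops(dev) == 0)
                    return 0;

            /* A pool explicitly bound to this device, as DT 'memory-region' does today. */
            if (find_pool_bound_to_device(dev, &pool) == 0)
                    return use_swiotlb_pool(dev, &pool);

            /* Otherwise a declared 'catch-all' pool applies to every such device. */
            if (find_catchall_pool(&pool) == 0)
                    return use_swiotlb_pool(dev, &pool);

            /* No IOMMU and no pool: plain direct DMA to system memory. */
            return 0;
    }

Either model (explicit matching or catch-all) only changes which of the two lookup steps succeeds; the IOMMU-first ordering is the same in both.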
On Mon, Apr 07, 2025 at 11:09:54AM +0100, David Woodhouse wrote: > > Except for all virtual devices. > > Yes, that's what I'm saying. > > And that's also why it's reasonable to have a solution which handles > this for virtio devices, without necessarily having to handle it for > *arbitrary* emulated PCI devices across the whole system, and without > having to change core concepts of DMA handling across all operating > systems. Except that you might not have a working two stage iommu everywhere anyway. So fix it for real to cover all these cases and skip your layering violations as a nice little benefit. > This isn't just about Linux guests, and especially not just about Linux > guests running running 6.16+ kernels. That shoudn't matter. No interface should grow stupid hacks ever just to support legacy guests. Get the interface right first and then implement it in all places you need. > A solution which can live in a device driver is a *lot* easier to > actually get into the hands of users. Not just Windows users, but even > the slower-moving Linux distros. Don't go there, that just gets us stupid crap like the Intel hiding NVMe behind AHCI hack or PCIe firmware first error handling. Get the interface right and then work on making it fit later. > > should be usable for all devices bar actual addressing limits that are > > handled in the dma layer already. The only things you need is: > > > > a) a way to declare one or more pools > > b) a way to destinguish between devices behind a two stage IOMMU vs not > > to figure out if they need to use a pool > > I'm not averse to that, but it's different to the `restricted-dma-pool` > model that's defined today which has explicit matching. So I'd like to > reconcile them — and preferably literally use PRP0001 to expose > `restricted-dma-pool` even under ACPI. A per-device flag is probably fine and easier for some things like pool sizing. It also would be worse for other things like eventually implementing percpu pools. > Maybe it's as simple as a flag/property on the `restricted-dma-pool` > node which declares that it's a 'catch-all', and that *all* devices > which aren't explicitly bound to an IOMMU or other DMA operations (e.g. > explicitly bound to a different restricted-dma-pool) should use it? Yeah. Similar what we do with the generic shared-dma-pool. > Either way, we'd still ideally want a virtio feature bit to say "don't > touch me if you don't understand my DMA restrictions", to prevent older > drivers (on older operating systems) from failing. As said before they really need system level awareness of a coco host. That's not something to be hacked into virtio drivers.
On Mon, 2025-04-07 at 07:06 -0700, Christoph Hellwig wrote: > On Mon, Apr 07, 2025 at 11:09:54AM +0100, David Woodhouse wrote: > > > should be usable for all devices bar actual addressing limits that are > > > handled in the dma layer already. The only things you need is: > > > > > > a) a way to declare one or more pools > > > b) a way to destinguish between devices behind a two stage IOMMU vs not > > > to figure out if they need to use a pool > > > > I'm not averse to that, but it's different to the `restricted-dma-pool` > > model that's defined today which has explicit matching. So I'd like to > > reconcile them — and preferably literally use PRP0001 to expose > > `restricted-dma-pool` even under ACPI. > > A per-device flag is probably fine and easier for some things like > pool sizing. It also would be worse for other things like eventually > implementing percpu pools. What exactly are you thinking of when you say a 'per-device' flag? > > Maybe it's as simple as a flag/property on the `restricted-dma-pool` > > node which declares that it's a 'catch-all', and that *all* devices > > which aren't explicitly bound to an IOMMU or other DMA operations (e.g. > > explicitly bound to a different restricted-dma-pool) should use it? > > Yeah. Similar what we do with the generic shared-dma-pool. Yeah, I was trying to work that one out, and it looks like I'm almost describing the 'linux,dma-default' property. Except that seems to be only for coherent allocations on the `shared-dma-pool`. Maybe it makes sense for 'linux,dma-default' on a restricted-dma-pool to mean that it should be used as a swiotlb for *all* DMA? I don't much like the "linux," prefix but can live with that if that's the precedent. > > Either way, we'd still ideally want a virtio feature bit to say "don't > > touch me if you don't understand my DMA restrictions", to prevent older > > drivers (on older operating systems) from failing. > > As said before they really need system level awareness of a coco host. > That's not something to be hacked into virtio drivers. Nah. As long as they use the IOMMU they're offered, guests don't need any awareness at all that they're running in an environment where the VMM can't access all their memory. There is precisely *zero* enablement required in the guest for that. The only problem is virtual devices which are provided by that VMM. If the guests *do* have system level awareness of the particular variant of CoCo host they're running on, then sure, they can explicitly manage sharing/encryption/whatever with a system-specific set of DMA operations and the problem goes away. But that's a long way off. How does this sound as a plan... 1. Define 'linux,dma-default' on a 'restricted-dma-pool' object to mean that it should be used as a fallback SWIOTLB for *all* devices (basically, as if each device explicitly referenced it as they do today). 2. Provide a way to represent `restricted-dma-pool` in ACPI (which ought to work naturally as PRP0001 and it's mostly a matter of untangling it from OF-specific code on the guest side). 3. Add a VIRTIO_F_RESTRICTED_DMA feature to virtio, so that devices which *know* they have such a limitation can gracefully decline to operate with a driver which doesn't support it. 4. Add a way for a virtio-pci device to explicitly reference a given `restricted-dma-pool` like virtio-mmio devices can today. 
And it probably makes sense for a virtio-pci device which requires the VIRTIO_F_RESTRICTED_DMA feature to explicitly reference such a pool even if that pool has the `linux,dma-default` property? With #1 and #2 we provide this facility even for passthrough PCI devices. And with #3 and #4 we make it work *better* for virtio, which is the more important use case (since those two-stage IOMMUs *do* exist in the case of CoCo-supporting hardware; it's just that they don't cover *virtual* devices).
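A minimal sketch of the device-model side of step 3, assuming a feature bit along the lines of the draft's VIRTIO_F_SWIOTLB (27); the final name and number would be whatever the spec eventually defines, and the function here is invented for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    #define VIRTIO_F_RESTRICTED_DMA 27      /* provisional, per this draft */

    /*
     * If the device knows it can only DMA to its restricted region, it
     * declines to complete negotiation unless the driver accepted the bit,
     * so an old driver fails cleanly at FEATURES_OK time rather than
     * issuing DMA the device cannot perform.
     */
    static bool device_accepts_features(uint64_t driver_features,
                                        bool requires_restricted_dma)
    {
            bool accepted = driver_features & (1ULL << VIRTIO_F_RESTRICTED_DMA);

            if (requires_restricted_dma && !accepted)
                    return false;   /* leave FEATURES_OK clear */

            return true;
    }

A device model under a less strict system integrator could instead return true here and merely hope the driver honours the restriction, which is the graceful-degradation choice discussed earlier in the thread.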
On Fri, Apr 04, 2025 at 12:15:52PM +0100, David Woodhouse wrote: > On Fri, 2025-04-04 at 06:37 -0400, Michael S. Tsirkin wrote: > > On Fri, Apr 04, 2025 at 11:15:33AM +0100, David Woodhouse wrote: > > > On Fri, 2025-04-04 at 10:27 +0100, David Woodhouse wrote: > > > > On 4 April 2025 09:32:39 BST, "Michael S. Tsirkin" <mst@redhat.com> > > > > wrote: > > > > > On Fri, Apr 04, 2025 at 09:16:44AM +0100, David Woodhouse wrote: > > > > > > On Fri, 2025-04-04 at 04:09 -0400, Michael S. Tsirkin wrote: > > > > > > > On Fri, Apr 04, 2025 at 08:50:47AM +0100, David Woodhouse > > > > > > > wrote: > > > > > > > > What's annoying is that this should work out of the box > > > > > > > > *already* with > > > > > > > > virtio-mmio and a `restricted-dma-pool` — for systems which > > > > > > > > aren't > > > > > > > > afflicted by UEFI/ACPI/PCI as their discovery mechanisms. > > > > > > > > > > > > > > > > > > > > > That specifically would be just a driver bugfix then? > > > > > > > > > > > > I actually think it works out of the box and there isn't even a > > > > > > bug to > > > > > > fix. Haven't tested yet. > > > > > > > > > > > > The sad part is that the system does it all automatically *if* it > > > > > > has > > > > > > CONFIG_DMA_RESTRICTED_POOL (e.g. Linux) and the driver never even > > > > > > notices that the dma_ops it's using are the swiotlb ops using the > > > > > > provided buffer. > > > > > > > > > > > > Which is *kind* of nice... except that when on a guest OS which > > > > > > *isn't* > > > > > > Linux with CONFIG_DMA_RESTRICTED_POOL, the guest will just ignore > > > > > > the > > > > > > `restricted-dma-pool` node and try DMA to system memory anyway, > > > > > > which > > > > > > will fail. > > > > > > > > > > I mean, it's easy to misconfigure Linux, this is why we love it ;) > > > > > Why > > > > > is this such a concern? > > > > > > > > Because it's incompatible. In the DT world, perhaps this new *non- > > > > optional* feature/restriction should have come with a new > > > > "compatible" string such as "virtio-mmio-restricted-dma". > > > > > > > > Adding it without backwards compatibility wasn't ideal. > > > > > > > > > > That's why my proposal adds the negotiated VIRTIO_F_SWIOTLB > > > > > > feature, so > > > > > > that the device side can refuse, if the guest *isn't* agreeing to > > > > > > use > > > > > > the bounce buffer in the situations where it must do so. > > > > > > > > > > > > > > > OTOH then setting this feature and if you make the device force it, > > > > > you are breaking guests restricted-dma-pool which worked > > > > > previously, no? > > > > > > > > Yes. So a platform offering virtio-mmio with restricted DMA, if the > > > > driver doesn't accept the offered VIRTIO_F_SWIOTLB, may want to > > > > accept that negotiation anyway, and *hope* that the driver/OS are > > > > going to use the buffer anyway. > > > > > > > > I just didn't want to make that same mistake again when formalising > > > > and documenting this, and especially when attempting to extend it to > > > > PCI. > > > > > > Of course, the beauty of the restricted-dma-pool as supported by DT is > > > that it's a *system* memory buffer, which is actually OK as long as > > > it's reserved address space and not just part of normal system memory > > > that an unsuspecting guest might use for general purposes. So the > > > trusted part of the hypervisor (e.g. pKVM) can *allow* the VMM access > > > to that space. > > > > > > It doesn't *have* to be on-device. 
That just seemed like the more > > > natural way to do it for PCI. > > > > > > I suppose we *could* allow for the virtio-pci transport to do it the > > > same way as virtio-mmio though. The VIRTIO_PCI_CAP_SWIOTLB capability¹ > > > could reference a range of system memory space, just like the > > > `restricted-dma-pool` property does. > > > > > > It's a weird abstraction especially for a physical PCI device to do > > > that because the system memory space is outside its ownership. But in a > > > physical device it could be writable, and you could consider it the > > > responsibility of the system firmware to configure it appropriately, in > > > accordance with the IOMMU and other DMA restrictions of the platform. > > > > > > That does solve it for the CoCo case without addressing the P2P staging > > > case that Christoph mentions, though. > > > > > > > > > ¹ I will rename it, Christoph, if it survives at all. Probably > > > VIRTIO_F_RESTRICTED_DMA and VIRTIO_PCI_CAP_RESTRICTED_DMA but of course > > > it depends on the semantics we conclude it should have. > > > > OK. So basically, all this does, is a promise by driver to only > > DMA into a range of memory? > > Basically, yes. > > > This part, I get. I wouldn't put it in a capability, just in config > > space then. > > Sure... but how? There are some things which are defined to be at fixed > locations in config space, like the vendor/device IDs, COMMAND, STATUS, > BARs, etc.. > > And for the rest of the optional things which might be in config space > of a given device... isn't that exactly what capabilities are for? Sorry I am unclear. Not the pci config space. The virtio config space. After admin_queue_num ?
On 6 April 2025 19:28:08 BST, "Michael S. Tsirkin" <mst@redhat.com> wrote: >On Fri, Apr 04, 2025 at 12:15:52PM +0100, David Woodhouse wrote: >> On Fri, 2025-04-04 at 06:37 -0400, Michael S. Tsirkin wrote: >> > On Fri, Apr 04, 2025 at 11:15:33AM +0100, David Woodhouse wrote: >> > > On Fri, 2025-04-04 at 10:27 +0100, David Woodhouse wrote: >> > > > On 4 April 2025 09:32:39 BST, "Michael S. Tsirkin" <mst@redhat.com> >> > > > wrote: >> > > > > On Fri, Apr 04, 2025 at 09:16:44AM +0100, David Woodhouse wrote: >> > > > > > On Fri, 2025-04-04 at 04:09 -0400, Michael S. Tsirkin wrote: >> > > > > > > On Fri, Apr 04, 2025 at 08:50:47AM +0100, David Woodhouse >> > > > > > > wrote: >> > > > > > > > What's annoying is that this should work out of the box >> > > > > > > > *already* with >> > > > > > > > virtio-mmio and a `restricted-dma-pool` — for systems which >> > > > > > > > aren't >> > > > > > > > afflicted by UEFI/ACPI/PCI as their discovery mechanisms. >> > > > > > > >> > > > > > > >> > > > > > > That specifically would be just a driver bugfix then? >> > > > > > >> > > > > > I actually think it works out of the box and there isn't even a >> > > > > > bug to >> > > > > > fix. Haven't tested yet. >> > > > > > >> > > > > > The sad part is that the system does it all automatically *if* it >> > > > > > has >> > > > > > CONFIG_DMA_RESTRICTED_POOL (e.g. Linux) and the driver never even >> > > > > > notices that the dma_ops it's using are the swiotlb ops using the >> > > > > > provided buffer. >> > > > > > >> > > > > > Which is *kind* of nice... except that when on a guest OS which >> > > > > > *isn't* >> > > > > > Linux with CONFIG_DMA_RESTRICTED_POOL, the guest will just ignore >> > > > > > the >> > > > > > `restricted-dma-pool` node and try DMA to system memory anyway, >> > > > > > which >> > > > > > will fail. >> > > > > >> > > > > I mean, it's easy to misconfigure Linux, this is why we love it ;) >> > > > > Why >> > > > > is this such a concern? >> > > > >> > > > Because it's incompatible. In the DT world, perhaps this new *non- >> > > > optional* feature/restriction should have come with a new >> > > > "compatible" string such as "virtio-mmio-restricted-dma". >> > > > >> > > > Adding it without backwards compatibility wasn't ideal. >> > > > >> > > > > > That's why my proposal adds the negotiated VIRTIO_F_SWIOTLB >> > > > > > feature, so >> > > > > > that the device side can refuse, if the guest *isn't* agreeing to >> > > > > > use >> > > > > > the bounce buffer in the situations where it must do so. >> > > > > >> > > > > >> > > > > OTOH then setting this feature and if you make the device force it, >> > > > > you are breaking guests restricted-dma-pool which worked >> > > > > previously, no? >> > > > >> > > > Yes. So a platform offering virtio-mmio with restricted DMA, if the >> > > > driver doesn't accept the offered VIRTIO_F_SWIOTLB, may want to >> > > > accept that negotiation anyway, and *hope* that the driver/OS are >> > > > going to use the buffer anyway. >> > > > >> > > > I just didn't want to make that same mistake again when formalising >> > > > and documenting this, and especially when attempting to extend it to >> > > > PCI. >> > > >> > > Of course, the beauty of the restricted-dma-pool as supported by DT is >> > > that it's a *system* memory buffer, which is actually OK as long as >> > > it's reserved address space and not just part of normal system memory >> > > that an unsuspecting guest might use for general purposes. So the >> > > trusted part of the hypervisor (e.g. 
pKVM) can *allow* the VMM access >> > > to that space. >> > > >> > > It doesn't *have* to be on-device. That just seemed like the more >> > > natural way to do it for PCI. >> > > >> > > I suppose we *could* allow for the virtio-pci transport to do it the >> > > same way as virtio-mmio though. The VIRTIO_PCI_CAP_SWIOTLB capability¹ >> > > could reference a range of system memory space, just like the >> > > `restricted-dma-pool` property does. >> > > >> > > It's a weird abstraction especially for a physical PCI device to do >> > > that because the system memory space is outside its ownership. But in a >> > > physical device it could be writable, and you could consider it the >> > > responsibility of the system firmware to configure it appropriately, in >> > > accordance with the IOMMU and other DMA restrictions of the platform. >> > > >> > > That does solve it for the CoCo case without addressing the P2P staging >> > > case that Christoph mentions, though. >> > > >> > > >> > > ¹ I will rename it, Christoph, if it survives at all. Probably >> > > VIRTIO_F_RESTRICTED_DMA and VIRTIO_PCI_CAP_RESTRICTED_DMA but of course >> > > it depends on the semantics we conclude it should have. >> > >> > OK. So basically, all this does, is a promise by driver to only >> > DMA into a range of memory? >> >> Basically, yes. >> >> > This part, I get. I wouldn't put it in a capability, just in config >> > space then. >> >> Sure... but how? There are some things which are defined to be at fixed >> locations in config space, like the vendor/device IDs, COMMAND, STATUS, >> BARs, etc.. >> >> And for the rest of the optional things which might be in config space >> of a given device... isn't that exactly what capabilities are for? > > >Sorry I am unclear. Not the pci config space. The virtio config space. >After admin_queue_num ? > > Makes sense. Let me knock up a second RFC based on that, and test using PRP0001 as a vehicle for `restricted-dma-pool` in an ACPI-afflicted guest.
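As a rough sketch of what "in the virtio config space, after admin_queue_num" might look like for the PCI common configuration structure: the two new fields and their names below are invented for illustration, and nothing like this is in the spec yet.

    #include <stdint.h>

    typedef uint16_t le16;
    typedef uint64_t le64;

    struct virtio_pci_common_cfg_sketch {
            /* ... existing common configuration fields ... */
            le16 admin_queue_index;
            le16 admin_queue_num;

            /* Provisional addition: the restricted DMA / bounce-buffer range. */
            le64 restricted_dma_addr;   /* guest-physical base of the region */
            le64 restricted_dma_len;    /* length of the region in bytes */
    };

Putting the range here rather than in a new PCI capability keeps it transport-generic in the sense Michael suggests, at the cost of the PCI device describing an address range it does not itself own.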
On 4/2/2025 7:04 PM, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > Device-tree bindings for `restricted-dma-pool` were defined in 2021, which > allow devices to be restricted to a given SWIOTLB pool instead of allowing > DMA to arbitrary system memory: > https://lore.kernel.org/all/20210624155526.2775863-1-tientzu@chromium.org/ > > This facility was not specific to virtio-mmio, but does apply to it. No > attempt was made to ensure backward-compatibility for virtio-mmio devices. > > Define a VIRTIO_F_SWIOTLB feature which allows the device and driver to > agree on the use of the SWIOTLB, if present. This enables the device to > refuse to operate further if the driver does not support the SWIOTLB > requirement expressed in the device-tree. > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > --- > content.tex | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/content.tex b/content.tex > index c17ffa6..63d075f 100644 > --- a/content.tex > +++ b/content.tex > @@ -773,6 +773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > Currently these device-independent feature bits are defined: > > \begin{description} > + \item[VIRTIO_F_SWIOTLB (27)] This feature indicates that the device > + transport provides a memory region which is to be used for bounce > + buffering, rather than permitting direct memory access to system memory. Hello David IMHO, if we need a bounce buffer, why not place it on the host memory? Because if the bounce buffer resides in the device memory, it requires CPU traverse the pci bus, that can be 10 times slower than accessing host memory. This performance overhead can be worse when transmitting small packets, and stress the CPU cores, even lead to IO starvation when CPU over-commit. The device side memory is a host thing to the guest anyway. Thanks Zhu Lingshan > \item[VIRTIO_F_INDIRECT_DESC (28)] Negotiating this feature indicates > that the driver can use descriptors with the VIRTQ_DESC_F_INDIRECT > flag set, as described in \ref{sec:Basic Facilities of a Virtio > @@ -807,6 +810,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > the driver. When clear, this overrides any platform-specific description of > whether device access is limited or translated in any way, e.g. > whether an IOMMU may be present. > + If a the device transport provides a software IOTLB bounce buffer, > + addresses within its range are not subject to the requirements of > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''. > \item[VIRTIO_F_RING_PACKED(34)] This feature indicates > support for the packed virtqueue layout as described in > \ref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}~\nameref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}. > @@ -885,6 +891,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > VIRTIO_F_ACCESS_PLATFORM is not offered, then a driver MUST pass only physical > addresses to the device. > > +A driver SHOULD accept VIRTIO_F_SWIOTLB if it is offered, and it MUST > +then pass only addresses within the Software IOTLB bounce buffer to the > +device. > + > A driver SHOULD accept VIRTIO_F_RING_PACKED if it is offered. > > A driver SHOULD accept VIRTIO_F_ORDER_PLATFORM if it is offered. > @@ -921,6 +931,8 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not > accepted. 
> > +A device MAY fail to operate further if VIRTIO_F_SWIOTLB is not accepted. > + > If VIRTIO_F_IN_ORDER has been negotiated, a device MUST use > buffers in the same order in which they have been available. >
On Thu, Apr 03, 2025 at 03:13:30PM +0800, Zhu Lingshan wrote: > On 4/2/2025 7:04 PM, David Woodhouse wrote: > > From: David Woodhouse <dwmw@amazon.co.uk> > > > > Device-tree bindings for `restricted-dma-pool` were defined in 2021, which > > allow devices to be restricted to a given SWIOTLB pool instead of allowing > > DMA to arbitrary system memory: > > https://lore.kernel.org/all/20210624155526.2775863-1-tientzu@chromium.org/ > > > > This facility was not specific to virtio-mmio, but does apply to it. No > > attempt was made to ensure backward-compatibility for virtio-mmio devices. > > > > Define a VIRTIO_F_SWIOTLB feature which allows the device and driver to > > agree on the use of the SWIOTLB, if present. This enables the device to > > refuse to operate further if the driver does not support the SWIOTLB > > requirement expressed in the device-tree. > > > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > > --- > > content.tex | 12 ++++++++++++ > > 1 file changed, 12 insertions(+) > > > > diff --git a/content.tex b/content.tex > > index c17ffa6..63d075f 100644 > > --- a/content.tex > > +++ b/content.tex > > @@ -773,6 +773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > > Currently these device-independent feature bits are defined: > > > > \begin{description} > > + \item[VIRTIO_F_SWIOTLB (27)] This feature indicates that the device > > + transport provides a memory region which is to be used for bounce > > + buffering, rather than permitting direct memory access to system memory. > Hello David > > IMHO, if we need a bounce buffer, why not place it on the host memory? > Because if the bounce buffer resides in the device memory, it requires CPU traverse the pci bus, > that can be 10 times slower than accessing host memory. > This performance overhead can be worse when transmitting small packets, > and stress the CPU cores, even lead to IO starvation when CPU over-commit. > The device side memory is a host thing to the guest anyway. > > Thanks > Zhu Lingshan Indeed I personally do not exactly get why implement a virtual system without an IOMMU when virtio-iommu is available. I have a feeling it's about lack of windows drivers for virtio-iommu at this point. > > \item[VIRTIO_F_INDIRECT_DESC (28)] Negotiating this feature indicates > > that the driver can use descriptors with the VIRTQ_DESC_F_INDIRECT > > flag set, as described in \ref{sec:Basic Facilities of a Virtio > > @@ -807,6 +810,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > > the driver. When clear, this overrides any platform-specific description of > > whether device access is limited or translated in any way, e.g. > > whether an IOMMU may be present. > > + If a the device transport provides a software IOTLB bounce buffer, > > + addresses within its range are not subject to the requirements of > > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''. > > \item[VIRTIO_F_RING_PACKED(34)] This feature indicates > > support for the packed virtqueue layout as described in > > \ref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}~\nameref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}. > > @@ -885,6 +891,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > > VIRTIO_F_ACCESS_PLATFORM is not offered, then a driver MUST pass only physical > > addresses to the device. 
> > > > +A driver SHOULD accept VIRTIO_F_SWIOTLB if it is offered, and it MUST > > +then pass only addresses within the Software IOTLB bounce buffer to the > > +device. > > + > > A driver SHOULD accept VIRTIO_F_RING_PACKED if it is offered. > > > > A driver SHOULD accept VIRTIO_F_ORDER_PLATFORM if it is offered. > > @@ -921,6 +931,8 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > > A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not > > accepted. > > > > +A device MAY fail to operate further if VIRTIO_F_SWIOTLB is not accepted. > > + > > If VIRTIO_F_IN_ORDER has been negotiated, a device MUST use > > buffers in the same order in which they have been available. > >
On Thu, 2025-04-03 at 03:34 -0400, Michael S. Tsirkin wrote: > > Indeed I personally do not exactly get why implement a virtual system > without an IOMMU when virtio-iommu is available. > > I have a feeling it's about lack of windows drivers for virtio-iommu > at this point. And a pKVM (etc.) implementation of virtio-iommu which would allow the *trusted* part of the hypervisor to know which guest memory should be shared with the VMM implementing the virtio device models? You'd also end up in a situation where you have a virtio-iommu for some devices, and a real two-stage IOMMU (e.g. SMMU or AMD's vIOMMU) for other devices. Are guest operating systems going to cope well with that? Do the available discovery mechanisms for all the relevant IOMMUs even *allow* for that to be expressed?
On Thu, Apr 03, 2025 at 08:54:45AM +0100, David Woodhouse wrote: > On Thu, 2025-04-03 at 03:34 -0400, Michael S. Tsirkin wrote: > > > > Indeed I personally do not exactly get why implement a virtual system > > without an IOMMU when virtio-iommu is available. > > > > I have a feeling it's about lack of windows drivers for virtio-iommu > > at this point. > > And a pKVM (etc.) implementation of virtio-iommu which would allow the > *trusted* part of the hypervisor to know which guest memory should be > shared with the VMM implementing the virtio device models? Is there a blocker here? > You'd also end up in a situation where you have a virtio-iommu for some > devices, and a real two-stage IOMMU (e.g. SMMU or AMD's vIOMMU) for > other devices. Are guest operating systems going to cope well with > that? They should. In particular because systems with multiple IOMMUs already exist. > Do the available discovery mechanisms for all the relevant IOMMUs > even *allow* for that to be expressed? I think yes. But, it's been a while since I played with this, let me check what works, what does not, and get back to you on this. -- MST
On Thu, 2025-04-03 at 04:13 -0400, Michael S. Tsirkin wrote: > On Thu, Apr 03, 2025 at 08:54:45AM +0100, David Woodhouse wrote: > > On Thu, 2025-04-03 at 03:34 -0400, Michael S. Tsirkin wrote: > > > > > > Indeed I personally do not exactly get why implement a virtual system > > > without an IOMMU when virtio-iommu is available. > > > > > > I have a feeling it's about lack of windows drivers for virtio-iommu > > > at this point. > > > > And a pKVM (etc.) implementation of virtio-iommu which would allow the > > *trusted* part of the hypervisor to know which guest memory should be > > shared with the VMM implementing the virtio device models? > > Is there a blocker here? Only the amount of complexity in what should be a minimal Trusted Compute Base. (And ideally subject to formal methods of proving its correctness too.) And frankly, if we were going to accept a virtio-iommu in the TCB why not just implement enough virtqueue knowledge to build something where the trusted part just snoops on the *actual* e.g. virtio-net device to know which buffers the VMM was *invited* to access, and facilitate that? We looked at doing that. It's awful. > > You'd also end up in a situation where you have a virtio-iommu for some > > devices, and a real two-stage IOMMU (e.g. SMMU or AMD's vIOMMU) for > > other devices. Are guest operating systems going to cope well with > > that? > > They should. In particular because systems with multiple IOMMUs already > exist. > > > Do the available discovery mechanisms for all the relevant IOMMUs > > even *allow* for that to be expressed? > > I think yes. But, it's been a while since I played with this, let me > check what works, what does not, and get back to you on this. Even if it could work in theory, I'll be astonished if it actually works in practice across a wide set of operating systems, and if it *ever* works for Windows. Compared with the simple option of presenting a device which conceptually doesn't even *do* DMA, which is confined to its own modular device driver...
On Thu, Apr 03, 2025 at 09:22:57AM +0100, David Woodhouse wrote: > On Thu, 2025-04-03 at 04:13 -0400, Michael S. Tsirkin wrote: > > On Thu, Apr 03, 2025 at 08:54:45AM +0100, David Woodhouse wrote: > > > On Thu, 2025-04-03 at 03:34 -0400, Michael S. Tsirkin wrote: > > > > > > > > Indeed I personally do not exactly get why implement a virtual system > > > > without an IOMMU when virtio-iommu is available. > > > > > > > > I have a feeling it's about lack of windows drivers for virtio-iommu > > > > at this point. > > > > > > And a pKVM (etc.) implementation of virtio-iommu which would allow the > > > *trusted* part of the hypervisor to know which guest memory should be > > > shared with the VMM implementing the virtio device models? > > > > Is there a blocker here? > > Only the amount of complexity in what should be a minimal Trusted > Compute Base. (And ideally subject to formal methods of proving its > correctness too.) Shrug. Does not have to be complex. Could be a "simple mode" for virtio-iommu where it just accepts one buffer. No? > And frankly, if we were going to accept a virtio-iommu in the TCB why > not just implement enough virtqueue knowledge to build something where > the trusted part just snoops on the *actual* e.g. virtio-net device to > know which buffers the VMM was *invited* to access, and facilitate > that? Because it's awful? Buffers are a datapath thing. Stay away from there. > We looked at doing that. It's awful. Indeed. > > > You'd also end up in a situation where you have a virtio-iommu for some > > > devices, and a real two-stage IOMMU (e.g. SMMU or AMD's vIOMMU) for > > > other devices. Are guest operating systems going to cope well with > > > that? > > > > They should. In particular because systems with multiple IOMMUs already > > exist. > > > > > Do the available discovery mechanisms for all the relevant IOMMUs > > > even *allow* for that to be expressed? > > > > I think yes. But, it's been a while since I played with this, let me > > check what works, what does not, and get back to you on this. > > Even if it could work in theory, I'll be astonished if it actually > works in practice across a wide set of operating systems, and if it > *ever* works for Windows. Well it used to work. I won't have time to play with it until sometime next week, if it's relevant. If I poke at my windows system, I see > Compared with the simple option of presenting a device which > conceptually doesn't even *do* DMA, which is confined to its own > modular device driver... I'm not (yet) nacking this hack, though I already heartily dislike the fact that it is mostly a PV-only thing since it can not be offloaded to a real device efficiently *and* requires copies to move data between devices. But, let's see if more issues surface. -- MST
On 4/3/2025 4:22 PM, David Woodhouse wrote: > On Thu, 2025-04-03 at 04:13 -0400, Michael S. Tsirkin wrote: >> On Thu, Apr 03, 2025 at 08:54:45AM +0100, David Woodhouse wrote: >>> On Thu, 2025-04-03 at 03:34 -0400, Michael S. Tsirkin wrote: >>>> Indeed I personally do not exactly get why implement a virtual system >>>> without an IOMMU when virtio-iommu is available. >>>> >>>> I have a feeling it's about lack of windows drivers for virtio-iommu >>>> at this point. >>> And a pKVM (etc.) implementation of virtio-iommu which would allow the >>> *trusted* part of the hypervisor to know which guest memory should be >>> shared with the VMM implementing the virtio device models? >> Is there a blocker here? > Only the amount of complexity in what should be a minimal Trusted > Compute Base. (And ideally subject to formal methods of proving its > correctness too.) > > And frankly, if we were going to accept a virtio-iommu in the TCB why > not just implement enough virtqueue knowledge to build something where > the trusted part just snoops on the *actual* e.g. virtio-net device to > know which buffers the VMM was *invited* to access, and facilitate > that? you trust CPU and its IOMMU, and the virtio-iommu is provided by the hypervisor, emulated by the CPU. If you don't trust virtio-iommu, then you should not trust the bounce buffer, because it is unencrypted, more like a security leak. Actually everything is suspicious even the CPU, but you have to trust a TCB and try to minimize the TCB. I remember there is an attestation mechanism to help examine the infrastructure. We need a balance and a tradeoff. Thanks Zhu Lingshan > > We looked at doing that. It's awful. > >>> You'd also end up in a situation where you have a virtio-iommu for some >>> devices, and a real two-stage IOMMU (e.g. SMMU or AMD's vIOMMU) for >>> other devices. Are guest operating systems going to cope well with >>> that? >> They should. In particular because systems with multiple IOMMUs already >> exist. >> >>> Do the available discovery mechanisms for all the relevant IOMMUs >>> even *allow* for that to be expressed? >> I think yes. But, it's been a while since I played with this, let me >> check what works, what does not, and get back to you on this. > Even if it could work in theory, I'll be astonished if it actually > works in practice across a wide set of operating systems, and if it > *ever* works for Windows. > > Compared with the simple option of presenting a device which > conceptually doesn't even *do* DMA, which is confined to its own > modular device driver...
On Thu, 2025-04-03 at 16:34 +0800, Zhu Lingshan wrote: > On 4/3/2025 4:22 PM, David Woodhouse wrote: > > On Thu, 2025-04-03 at 04:13 -0400, Michael S. Tsirkin wrote: > > > On Thu, Apr 03, 2025 at 08:54:45AM +0100, David Woodhouse wrote: > > > > On Thu, 2025-04-03 at 03:34 -0400, Michael S. Tsirkin wrote: > > > > > Indeed I personally do not exactly get why implement a virtual system > > > > > without an IOMMU when virtio-iommu is available. > > > > > > > > > > I have a feeling it's about lack of windows drivers for virtio-iommu > > > > > at this point. > > > > And a pKVM (etc.) implementation of virtio-iommu which would allow the > > > > *trusted* part of the hypervisor to know which guest memory should be > > > > shared with the VMM implementing the virtio device models? > > > Is there a blocker here? > > Only the amount of complexity in what should be a minimal Trusted > > Compute Base. (And ideally subject to formal methods of proving its > > correctness too.) > > > > And frankly, if we were going to accept a virtio-iommu in the TCB why > > not just implement enough virtqueue knowledge to build something where > > the trusted part just snoops on the *actual* e.g. virtio-net device to > > know which buffers the VMM was *invited* to access, and facilitate > > that? > you trust CPU and its IOMMU, and the virtio-iommu is provided by the hypervisor, > emulated by the CPU. If you don't trust virtio-iommu, then you should not trust > the bounce buffer, because it is unencrypted, more like a security leak. > > Actually everything is suspicious even the CPU, but you have to trust a TCB and > try to minimize the TCB. I remember there is an attestation mechanism to help > examine the infrastructure. We need a balance and a tradeoff. In the pKVM model, we have a minimal trusted part of the hypervisor, which some are calling a "lowvisor", which enforces the property that even the rest of Linux/KVM and the VMM are not able to access arbitrary guest memory. For true PCI passthrough devices, hardware has a two-stage IOMMU which allows the guest to control which parts of its memory are accessible by the devices. The problem is those device models which are emulated in the VMM, because the VMM no longer has blanket access to the guest's memory. The simplest answer is just for the device models presented by the VMM to *not* do DMA access to guest system memory. Stick a bounce-buffer on the device itself, and do I/O through that. Yes, as you say, we have to trust the CPU and its IOMMU. And the microcode and low-level firmware, etc. But we do *not* trust most of Linux/KVM and the VMM. We only trust up to the pKVM "lowvisor" level. So if there is going to be a virtio- iommu, that's where it would have to be implemented. Otherwise, the VMM is just saying to the lowvisor, "please grant me access to <this> guest page because...erm... I said so". Which is not quite how the trust model is supposed to work. As noted in the original cover letter to this series, there are other options too. We could implement enlightenments in the guest to share/unshare pages, and corresponding dma_ops which simply invoke those hypercalls. That's a bit less complexity in the TCB, but does end up with a set of guest enlightenments which need to be supported in the core of every operating system. Compared with the option of a simple device driver for a device which conceptually doesn't do DMA at all.
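For comparison, the "enlightenment" alternative mentioned above would look roughly like the following: dma_map/dma_unmap simply share and unshare the pages with the lowvisor. The hypercall numbers and the hypercall() prototype are invented for illustration; this is not a real pKVM ABI.

    #include <stddef.h>
    #include <stdint.h>

    #define HC_SHARE_PAGE   0x101   /* invented hypercall numbers */
    #define HC_UNSHARE_PAGE 0x102

    extern long hypercall(unsigned long nr, uint64_t arg);  /* assumed */

    /* Grant the VMM access to each page backing the buffer, then map 1:1. */
    static uint64_t enlightened_dma_map(uint64_t gpa, size_t len)
    {
            uint64_t first = gpa & ~(uint64_t)4095;
            uint64_t last  = (gpa + len - 1) & ~(uint64_t)4095;

            for (uint64_t page = first; page <= last; page += 4096)
                    hypercall(HC_SHARE_PAGE, page);

            return gpa;     /* no translation: the DMA address is the GPA */
    }

    /* Revoke access again once the I/O has completed. */
    static void enlightened_dma_unmap(uint64_t gpa, size_t len)
    {
            uint64_t first = gpa & ~(uint64_t)4095;
            uint64_t last  = (gpa + len - 1) & ~(uint64_t)4095;

            for (uint64_t page = first; page <= last; page += 4096)
                    hypercall(HC_UNSHARE_PAGE, page);
    }

The objection raised above is not that this is hard to write, but that every guest operating system would need these hooks in its core DMA layer, whereas the bounce-buffer approach stays inside one device driver.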
On 4/3/2025 4:57 PM, David Woodhouse wrote: > On Thu, 2025-04-03 at 16:34 +0800, Zhu Lingshan wrote: >> On 4/3/2025 4:22 PM, David Woodhouse wrote: >>> On Thu, 2025-04-03 at 04:13 -0400, Michael S. Tsirkin wrote: >>>> On Thu, Apr 03, 2025 at 08:54:45AM +0100, David Woodhouse wrote: >>>>> On Thu, 2025-04-03 at 03:34 -0400, Michael S. Tsirkin wrote: >>>>>> Indeed I personally do not exactly get why implement a virtual system >>>>>> without an IOMMU when virtio-iommu is available. >>>>>> >>>>>> I have a feeling it's about lack of windows drivers for virtio-iommu >>>>>> at this point. >>>>> And a pKVM (etc.) implementation of virtio-iommu which would allow the >>>>> *trusted* part of the hypervisor to know which guest memory should be >>>>> shared with the VMM implementing the virtio device models? >>>> Is there a blocker here? >>> Only the amount of complexity in what should be a minimal Trusted >>> Compute Base. (And ideally subject to formal methods of proving its >>> correctness too.) >>> >>> And frankly, if we were going to accept a virtio-iommu in the TCB why >>> not just implement enough virtqueue knowledge to build something where >>> the trusted part just snoops on the *actual* e.g. virtio-net device to >>> know which buffers the VMM was *invited* to access, and facilitate >>> that? >> you trust CPU and its IOMMU, and the virtio-iommu is provided by the hypervisor, >> emulated by the CPU. If you don't trust virtio-iommu, then you should not trust >> the bounce buffer, because it is unencrypted, more like a security leak. >> >> Actually everything is suspicious even the CPU, but you have to trust a TCB and >> try to minimize the TCB. I remember there is an attestation mechanism to help >> examine the infrastructure. We need a balance and a tradeoff. > In the pKVM model, we have a minimal trusted part of the hypervisor, > which some are calling a "lowvisor", which enforces the property that > even the rest of Linux/KVM and the VMM are not able to access arbitrary > guest memory. > > For true PCI passthrough devices, hardware has a two-stage IOMMU which > allows the guest to control which parts of its memory are accessible by > the devices. > > The problem is those device models which are emulated in the VMM, > because the VMM no longer has blanket access to the guest's memory. > > The simplest answer is just for the device models presented by the VMM > to *not* do DMA access to guest system memory. Stick a bounce-buffer on > the device itself, and do I/O through that. > > Yes, as you say, we have to trust the CPU and its IOMMU. And the > microcode and low-level firmware, etc. > > But we do *not* trust most of Linux/KVM and the VMM. We only trust up > to the pKVM "lowvisor" level. So if there is going to be a virtio- > iommu, that's where it would have to be implemented. Otherwise, the VMM > is just saying to the lowvisor, "please grant me access to <this> guest > page because...erm... I said so". Which is not quite how the trust > model is supposed to work. > > As noted in the original cover letter to this series, there are other > options too. We could implement enlightenments in the guest to > share/unshare pages, and corresponding dma_ops which simply invoke > those hypercalls. That's a bit less complexity in the TCB, but does end > up with a set of guest enlightenments which need to be supported in the > core of every operating system. > > Compared with the option of a simple device driver for a device which > conceptually doesn't do DMA at all. 
Hello David, I see you want to limit the device's DMA range for security reasons; one possible solution is not to do DMA at all and instead provide a bounce buffer on the device side. However, a bounce buffer that resides in device memory is unencrypted, which means it is itself insecure. I have a rough proposal. First, some assumptions: 1) we trust the guest because it belongs to the tenant, and we cannot change any guest drivers; only the tenant can upgrade the guest kernel. 2) we only trust a minimal TCB, and don't trust most of the hypervisor code. 3) an emulated device is part of the hypervisor. 4) even when the host / hypervisor is compromised, we still want to protect the guest data. 5) virtio devices and virtio drivers exchange data through the virtqueues, which means virtio devices initiate DMA against queue buffers. So here is a draft design: 1) queue buffer encryption, per queue or per device. 2) this feature is controlled by a feature bit. 3) the driver initializes (by writing to device-side registers / fields) an encryption algorithm and keys before DRIVER_OK; these can be null, meaning unencrypted. The device-side registers are WRITE ONLY, so they can be overwritten in the worst case but cannot be read back. 4) the write side (the producer, e.g. the driver filling data into available queue buffers) encrypts, and the read side (the consumer) decrypts. 5) provide multiple encryption algorithms for different types of workloads and resource expenditure. The pros: 1) a natural design, proven to work for many years. 2) even when the host / hypervisor is compromised, the DMA buffers are still encrypted and secure. 3) no need to change any virtio-iommu implementation. 4) compatible with, and actually orthogonal to, CoCo; no conflicts. 5) this solution works for both emulated and real pass-through devices. 6) unlike TDX-IO, this is implemented in the virtio device, so this design does not conflict with any existing use cases and supports live migration. The cons: 1) the driver needs to change, but new features always need new implementations in the driver code. 2) it consumes more CPU cycles, but bounce buffers also cost more CPU, and CoCo attestation costs more CPU cycles. So, does this low-hanging fruit sound good to you? This is just my brainstorming and a rough proposal. Please let me know your thoughts. Thanks Zhu Lingshan > >
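To make the queue-buffer-encryption idea above concrete, a very rough sketch follows; the cipher interface, key setup and queue helper are invented placeholders rather than any existing virtio facility, and this is only an illustration of the proposal, not an endorsement of it.

    #include <stddef.h>
    #include <stdint.h>

    struct vq_cipher {
            uint8_t key[32];    /* negotiated before DRIVER_OK per the proposal */
    };

    /* Hypothetical cipher and queue primitives, for illustration only. */
    extern void cipher_encrypt(const struct vq_cipher *c, void *buf, size_t len);
    extern void cipher_decrypt(const struct vq_cipher *c, void *buf, size_t len);
    extern int  queue_add_outbuf(void *vq, void *buf, size_t len);

    static int producer_send(void *vq, const struct vq_cipher *c,
                             void *buf, size_t len)
    {
            /* Write side (here: the driver) encrypts in place... */
            cipher_encrypt(c, buf, len);

            /* ...then exposes the buffer to the device as usual. */
            return queue_add_outbuf(vq, buf, len);
    }

    static void consumer_receive(const struct vq_cipher *c, void *buf, size_t len)
    {
            /* Read side (here: the device) decrypts before using the data. */
            cipher_decrypt(c, buf, len);
    }

Note this protects only the payload in the queue buffers; the descriptors and ring metadata remain visible to the device model, which is part of what the rest of the thread is debating.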
On Thu, 2025-04-03 at 15:13 +0800, Zhu Lingshan wrote: > > Hello David > > IMHO, if we need a bounce buffer, why not place it on the host memory? As explained in the cover letter, this is designed for the case where the device *cannot* easily access host memory.
On 4/3/2025 3:24 PM, David Woodhouse wrote: > On Thu, 2025-04-03 at 15:13 +0800, Zhu Lingshan wrote: >> Hello David >> >> IMHO, if we need a bounce buffer, why not place it on the host memory? > As explained in the cover letter, this is designed for the case where > the device *cannot* easily access host memory. Do you mean CoCo encrypted memory? If so, what is the difference between the host memory side bounce buffer and a device memory bounce buffer to a guest except the transport(DDR or PCI)? The device should be able to access host memory side bounce buffer. Thanks Zhu Lingshan >
On Thu, 2025-04-03 at 15:31 +0800, Zhu Lingshan wrote: > On 4/3/2025 3:24 PM, David Woodhouse wrote: > > On Thu, 2025-04-03 at 15:13 +0800, Zhu Lingshan wrote: > > > Hello David > > > > > > IMHO, if we need a bounce buffer, why not place it on the host memory? > > As explained in the cover letter, this is designed for the case where > > the device *cannot* easily access host memory. > > Do you mean CoCo encrypted memory? If so, what is the difference between > the host memory side bounce buffer and a device memory bounce buffer > to a guest except the transport(DDR or PCI)? The device should be > able to access host memory side bounce buffer. In the CoCo case, yes. As long as it's *known* to be memory which is shared with the hypervisor, and unsuspecting guests won't accidentally use it as general purpose RAM, then pKVM (or whatever) can happily allow Linux and the VMM to access it. That's basically how it works for virtio-mmio, as I just mentioned in a different subthread. The region referenced by `restricted-dma-pool` is in the guest memory address space, and the "platform" is configured to allow DMA *only* there. Doing the same on the PCI transport (referencing an address range in system address space rather than on the device itself) is slightly unnatural, but as I said if we consider it the responsibility of the firmware to *program* that then we could maybe squint and accept it even in the case of physical devices. I'm not sure I like it much though.
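However the range is conveyed (a BAR, or a firmware-programmed window in system address space as described above), the device-side enforcement reduces to a containment check along these lines; the structure and function names are invented for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    struct restricted_region {
            uint64_t base;      /* programmed by firmware / VMM configuration */
            uint64_t len;
    };

    /* Overflow-safe check that [addr, addr + len) lies within the region. */
    static bool dma_address_allowed(const struct restricted_region *r,
                                    uint64_t addr, uint64_t len)
    {
            return addr >= r->base &&
                   len <= r->len &&
                   addr - r->base <= r->len - len;
    }

A device model would apply this to every descriptor address and fail the request (or the device) on a violation, which is exactly the "platform allows DMA *only* there" behaviour being described.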
On Wed, Apr 02, 2025 at 12:04:45PM +0100, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > Device-tree bindings for `restricted-dma-pool` were defined in 2021, which > allow devices to be restricted to a given SWIOTLB pool instead of allowing > DMA to arbitrary system memory: > https://lore.kernel.org/all/20210624155526.2775863-1-tientzu@chromium.org/ > > This facility was not specific to virtio-mmio, but does apply to it. No > attempt was made to ensure backward-compatibility for virtio-mmio devices. > > Define a VIRTIO_F_SWIOTLB feature which allows the device and driver to > agree on the use of the SWIOTLB, if present. This enables the device to > refuse to operate further if the driver does not support the SWIOTLB > requirement expressed in the device-tree. > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > --- > content.tex | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/content.tex b/content.tex > index c17ffa6..63d075f 100644 > --- a/content.tex > +++ b/content.tex > @@ -773,6 +773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > Currently these device-independent feature bits are defined: > > \begin{description} > + \item[VIRTIO_F_SWIOTLB (27)] This feature indicates that the device > + transport provides a memory region which is to be used for bounce > + buffering, rather than permitting direct memory access to system memory. > \item[VIRTIO_F_INDIRECT_DESC (28)] Negotiating this feature indicates > that the driver can use descriptors with the VIRTQ_DESC_F_INDIRECT > flag set, as described in \ref{sec:Basic Facilities of a Virtio > @@ -807,6 +810,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > the driver. When clear, this overrides any platform-specific description of > whether device access is limited or translated in any way, e.g. > whether an IOMMU may be present. > + If a the device transport provides a software IOTLB bounce buffer, > + addresses within its range are not subject to the requirements of > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''. I don't get this part. the system designers currently have a choice whether to have these controlled by VIRTIO_F_ACCESS_PLATFORM or not. with PCI, for example, BAR on the same device is naturally not behind an iommu. > \item[VIRTIO_F_RING_PACKED(34)] This feature indicates > support for the packed virtqueue layout as described in > \ref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}~\nameref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}. > @@ -885,6 +891,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > VIRTIO_F_ACCESS_PLATFORM is not offered, then a driver MUST pass only physical > addresses to the device. > > +A driver SHOULD accept VIRTIO_F_SWIOTLB if it is offered, and it MUST > +then pass only addresses within the Software IOTLB bounce buffer to the > +device. > + > A driver SHOULD accept VIRTIO_F_RING_PACKED if it is offered. > > A driver SHOULD accept VIRTIO_F_ORDER_PLATFORM if it is offered. > @@ -921,6 +931,8 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} > A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not > accepted. > > +A device MAY fail to operate further if VIRTIO_F_SWIOTLB is not accepted. > + > If VIRTIO_F_IN_ORDER has been negotiated, a device MUST use > buffers in the same order in which they have been available. > > -- > 2.49.0 >
On Wed, 2025-04-02 at 10:54 -0400, Michael S. Tsirkin wrote: > > + If a the device transport provides a software IOTLB bounce buffer, > > + addresses within its range are not subject to the requirements of > > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''. > > I don't get this part. the system designers currently have a choice > whether to have these controlled by VIRTIO_F_ACCESS_PLATFORM or not. > with PCI, for example, BAR on the same device is naturally not > behind an iommu. In the PCI case this *is* a BAR on the same device, and is naturally not behind an IOMMU as you say. This is just stating the obvious, for clarity. For virtio-mmio it also isn't translated by an IOMMU; that was the *point* of the `restricted-dma-pool` support.
On Wed, Apr 02, 2025 at 04:12:39PM +0100, David Woodhouse wrote: > On Wed, 2025-04-02 at 10:54 -0400, Michael S. Tsirkin wrote: > > > + If a the device transport provides a software IOTLB bounce buffer, > > > + addresses within its range are not subject to the requirements of > > > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''. > > > > I don't get this part. the system designers currently have a choice > > whether to have these controlled by VIRTIO_F_ACCESS_PLATFORM or not. > > with PCI, for example, BAR on the same device is naturally not > > behind an iommu. > > In the PCI case this *is* a BAR on the same device, and is naturally > not behind an IOMMU as you say. This is just stating the obvious, for > clarity. Then the platform already does this right, and it's better not to try and override it in the spec. > For virtio-mmio it also isn't translated by an IOMMU; that was the > *point* of the `restricted-dma-pool` support. > Clear VIRTIO_F_ACCESS_PLATFORM then? Generally, it is preferable to keep all features orthogonal if at all possible.
On Wed, 2025-04-02 at 11:20 -0400, Michael S. Tsirkin wrote: > On Wed, Apr 02, 2025 at 04:12:39PM +0100, David Woodhouse wrote: > > On Wed, 2025-04-02 at 10:54 -0400, Michael S. Tsirkin wrote: > > > > + If a the device transport provides a software IOTLB bounce buffer, > > > > + addresses within its range are not subject to the requirements of > > > > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''. > > > > > > I don't get this part. the system designers currently have a choice > > > whether to have these controlled by VIRTIO_F_ACCESS_PLATFORM or not. > > > with PCI, for example, BAR on the same device is naturally not > > > behind an iommu. > > > > In the PCI case this *is* a BAR on the same device, and is naturally > > not behind an IOMMU as you say. This is just stating the obvious, for > > clarity. > > Then the platform already does this right, and it's better not to > try and override it in the spec. It isn't intended as an "override". This *is* what will happen if the platform does it right. By mandating it in the spec, the intent is to reduce the chances of platforms doing it *wrong*? (Or of drivers making wrong assumptions). > > For virtio-mmio it also isn't translated by an IOMMU; that was the > > *point* of the `restricted-dma-pool` support. > > > > Clear VIRTIO_F_ACCESS_PLATFORM then? > I don't want to say that VIRTIO_F_ACCESS_PLATFORM and VIRTIO_F_SWIOTLB > are mutually exclusive... > Generally, it is preferable to keep all features orthogonal if > at all possible. ...precisely because they *should* be orthogonal. VIRTIO_F_ACCESS_PLATFORM defines how system memory is accessed; basically whether DMA goes through an IOMMU or not. And as you point out, the "on-device" buffer used with VIRTIO_F_SWIOTLB should never pass through the IOMMU anyway, so it *is* naturally orthogonal. And I think it does make sense for both to be set in some cases, for both physical and virtual devices. For physical devices that would mean "Got an IOMMU? Sure, go ahead and use it. If not, if you don't trust me, you can just disable my bus mastering and just use the SWIOTLB". It's basically the same for a virtual device. In a confidential compute model, the device model (in the VMM) might not be *able* to access the guest memory unless the core guest OS explicitly permits that, through some kind of pKVM enlightenment to allow pages to be shared, or a vIOMMU, or marking hardware pages unencrypted. So having both bits set would mean "Know how to drive that enlightenment? Sure, go ahead and use it. Otherwise, use the SWIOTLB". In fact that's exactly what that Linux code for `restricted-dma-pool` already does — when setting up the dma_ops for the device, if it finds an actual set of IOMMU operations, it'll use those. And if not, that's when it falls back to using the provided SWIOTLB.
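A minimal driver-side sketch of the "both features offered" case described above: VIRTIO_F_ACCESS_PLATFORM really is bit 33, while the SWIOTLB/restricted-DMA bit (27 in this draft) and the platform-probe helper are placeholders.

    #include <stdbool.h>
    #include <stdint.h>

    #define VIRTIO_F_SWIOTLB         27    /* provisional, from this draft */
    #define VIRTIO_F_ACCESS_PLATFORM 33

    /* Hypothetical: does the platform give us IOMMU or CoCo-sharing dma_ops? */
    extern bool platform_gives_usable_dma_ops(void);

    enum dma_mode { DMA_DIRECT, DMA_VIA_PLATFORM, DMA_VIA_SWIOTLB };

    static enum dma_mode choose_dma_mode(uint64_t features)
    {
            /* If the platform provides real IOMMU/sharing ops, just use them... */
            if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)) &&
                platform_gives_usable_dma_ops())
                    return DMA_VIA_PLATFORM;

            /* ...otherwise fall back to bouncing through the offered SWIOTLB... */
            if (features & (1ULL << VIRTIO_F_SWIOTLB))
                    return DMA_VIA_SWIOTLB;

            /* ...and with neither, it is plain physical-address DMA. */
            return DMA_DIRECT;
    }

This mirrors the Linux `restricted-dma-pool` behaviour described above: a usable set of IOMMU operations wins, and the bounce buffer is only the fallback.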
On Wed, Apr 02, 2025 at 04:47:18PM +0100, David Woodhouse wrote: > On Wed, 2025-04-02 at 11:20 -0400, Michael S. Tsirkin wrote: > > On Wed, Apr 02, 2025 at 04:12:39PM +0100, David Woodhouse wrote: > > > On Wed, 2025-04-02 at 10:54 -0400, Michael S. Tsirkin wrote: > > > > > + If a the device transport provides a software IOTLB bounce buffer, > > > > > + addresses within its range are not subject to the requirements of > > > > > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''. > > > > > > > > I don't get this part. the system designers currently have a choice > > > > whether to have these controlled by VIRTIO_F_ACCESS_PLATFORM or not. > > > > with PCI, for example, BAR on the same device is naturally not > > > > behind an iommu. > > > > > > In the PCI case this *is* a BAR on the same device, and is naturally > > > not behind an IOMMU as you say. This is just stating the obvious, for > > > clarity. > > > > Then the platform already does this right, and it's better not to > > try and override it in the spec. > > It isn't intended as an "override". This *is* what will happen if the > platform does it right. By mandating it in the spec, the intent is to > reduce the chances of platforms doing it *wrong*? (Or of drivers making > wrong assumptions). The text you wrote makes it seem that even if the platform says use an IOMMU, it should be bypassed. > > > For virtio-mmio it also isn't translated by an IOMMU; that was the > > > *point* of the `restricted-dma-pool` support. > > > > > > > Clear VIRTIO_F_ACCESS_PLATFORM then? > > > I don't want to say that VIRTIO_F_ACCESS_PLATFORM and VIRTIO_F_SWIOTLB > > > are mutually exclusive... > > > > Generally, it is preferable to keep all features orthogonal if > > at all possible. > > ...precisely because they *should* be orthogonal. > > VIRTIO_F_ACCESS_PLATFORM defines how system memory is accessed; > basically whether DMA goes through an IOMMU or not. And as you point > out, the "on-device" buffer used with VIRTIO_F_SWIOTLB should never > pass through the IOMMU anyway, so it *is* naturally orthogonal. > > > And I think it does make sense for both to be set in some cases, for > both physical and virtual devices. > > > For physical devices that would mean "Got an IOMMU? Sure, go ahead and > use it. If not, if you don't trust me, you can just disable my bus > mastering and just use the SWIOTLB". > > It's basically the same for a virtual device. In a confidential compute > model, the device model (in the VMM) might not be *able* to access the > guest memory unless the core guest OS explicitly permits that, through > some kind of pKVM enlightenment to allow pages to be shared, or a > vIOMMU, or marking hardware pages unencrypted. So having both bits set > would mean "Know how to drive that enlightenment? Sure, go ahead and > use it. Otherwise, use the SWIOTLB". > > In fact that's exactly what that Linux code for `restricted-dma-pool` > already does — when setting up the dma_ops for the device, if it finds > an actual set of IOMMU operations, it'll use those. And if not, that's > when it falls back to using the provided SWIOTLB. > I just feel your text for VIRTIO_F_ACCESS_PLATFORM make it seem like that is not the case. VIRTIO_F_ACCESS_PLATFORM says a lot of things, e.g. it has implications for encrypted VMs, and so on. I would drop this text, and maybe add some clarification in the mmio transport, as needed. -- MST
On Wed, 2025-04-02 at 11:51 -0400, Michael S. Tsirkin wrote:
> On Wed, Apr 02, 2025 at 04:47:18PM +0100, David Woodhouse wrote:
> > On Wed, 2025-04-02 at 11:20 -0400, Michael S. Tsirkin wrote:
> > > On Wed, Apr 02, 2025 at 04:12:39PM +0100, David Woodhouse wrote:
> > > > On Wed, 2025-04-02 at 10:54 -0400, Michael S. Tsirkin wrote:
> > > > > > + If a the device transport provides a software IOTLB bounce buffer,
> > > > > > + addresses within its range are not subject to the requirements of
> > > > > > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''

...

> The text you wrote makes it seem that even if the platform says use
> an IOMMU, it should be bypassed.

It was trying just to state the obvious, that addresses within the
range of the *on-device* memory buffer are not handled by the IOMMU.

> I would drop this text, and maybe add some clarification in the mmio transport,
> as needed.

It would be PCI too. I guess we could move the "obvious" comment that
'addresses within the range of the SWIOTLB bounce buffer region are
considered to be "on-device" and are thus not affected by the
requirements of VIRTIO_F_ACCESS_PLATFORM' into *both* the MMIO and PCI
transport docs? But then it's basically just saying the same thing in
two different locations?

I don't think we're debating what the actual implementations should
*do* ... are we? To me it's obvious that what I'm trying to say here
*should* always be true.

We're just debating the wording and where to put it, yes?
On Wed, Apr 02, 2025 at 05:16:28PM +0100, David Woodhouse wrote:
> On Wed, 2025-04-02 at 11:51 -0400, Michael S. Tsirkin wrote:
> > On Wed, Apr 02, 2025 at 04:47:18PM +0100, David Woodhouse wrote:
> > > On Wed, 2025-04-02 at 11:20 -0400, Michael S. Tsirkin wrote:
> > > > On Wed, Apr 02, 2025 at 04:12:39PM +0100, David Woodhouse wrote:
> > > > > On Wed, 2025-04-02 at 10:54 -0400, Michael S. Tsirkin wrote:
> > > > > > > + If a the device transport provides a software IOTLB bounce buffer,
> > > > > > > + addresses within its range are not subject to the requirements of
> > > > > > > + VIRTIO_F_ACCESS_PLATFORM as they are considered to be ``on-device''
> >
> > ...
> >
> > > The text you wrote makes it seem that even if the platform says use
> > > an IOMMU, it should be bypassed.
> >
> > It was trying just to state the obvious, that addresses within the
> > range of the *on-device* memory buffer are not handled by the IOMMU.
> >
> > > I would drop this text, and maybe add some clarification in the mmio transport,
> > > as needed.
> >
> > It would be PCI too. I guess we could move the "obvious" comment that
> > 'addresses within the range of the SWIOTLB bounce buffer region are
> > considered to be "on-device" and are thus not affected by the
> > requirements of VIRTIO_F_ACCESS_PLATFORM' into *both* the MMIO and PCI
> > transport docs? But then it's basically just saying the same thing in
> > two different locations?
> >
> > I don't think we're debating what the actual implementations should
> > *do* ... are we? To me it's obvious that what I'm trying to say here
> > *should* always be true.
> >
> > We're just debating the wording and where to put it, yes?

yes.

I know a bit more about PCI, and for PCI I prefer just not saying
anything. The platform already defines whether it is behind an iommu
or not, and duplication is not good.

For mmio it is my understanding that the "restricted" does the same
already? or is it required in the spec for some reason?
On Wed, 2025-04-02 at 12:43 -0400, Michael S. Tsirkin wrote:
>
> yes.
>
> I know a bit more about PCI, and for PCI I prefer just not saying
> anything. The platform already defines whether it is behind an iommu
> or not, and duplication is not good.

Not a hill for me to die on I suppose, but I would personally prefer to
spell it out in words of one syllable or fewer, to make *sure* that
device and driver authors get it right even though it's "obvious".

After all, if we could trust them to do their thinking, we would never
have had the awful situation that led to VIRTIO_F_ACCESS_PLATFORM
existing in the first place; the legacy behaviour we get when that bit
*isn't* set would never have happened.

> For mmio it is my understanding that the "restricted" does the same
> already? or is it required in the spec for some reason?

No, it's exactly the same. But I still don't trust driver authors to
realise the obvious, or VMM implementations either for that matter.

I'm not sure I see the *harm* in spelling out explicitly for the
hard-of-thinking.
On Wed, Apr 02, 2025 at 06:10:53PM +0100, David Woodhouse wrote:
> On Wed, 2025-04-02 at 12:43 -0400, Michael S. Tsirkin wrote:
> >
> > yes.
> >
> > I know a bit more about PCI, and for PCI I prefer just not saying
> > anything. The platform already defines whether it is behind an iommu
> > or not, and duplication is not good.
>
> Not a hill for me to die on I suppose, but I would personally prefer to
> spell it out in words of one syllable or fewer, to make *sure* that
> device and driver authors get it right even though it's "obvious".
>
> After all, if we could trust them to do their thinking, we would never
> have had the awful situation that led to VIRTIO_F_ACCESS_PLATFORM
> existing in the first place; the legacy behaviour we get when that bit
> *isn't* set would never have happened.

Oh, you are wrong here. It's not the implementer's fault.
virtio just was not designed with real DMA
in mind, and micro-optimizing by bypassing the DMA API
was very much intentional.

> > For mmio it is my understanding that the "restricted" does the same
> > already? or is it required in the spec for some reason?
>
> No, it's exactly the same. But I still don't trust driver authors to
> realise the obvious, or VMM implementations either for that matter.
>
> I'm not sure I see the *harm* in spelling out explicitly for the hard-
> of-thinking.

I don't want people to make assumptions, like crashing if device is
behind an iommu or whatnot.

We can go with something informative.

"It is expected that for most implementations, when using this feature,
the behaviour with and without ACCESS_PLATFORM is the same"

-- 
MST
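To make that informative note concrete, here is one possible reading of the
negotiation when both bits are offered, sketched in C. The bit positions and
the helper are illustrative only, not spec text: a driver that can drive the
platform's IOMMU or enlightenment takes VIRTIO_F_ACCESS_PLATFORM, one that
cannot falls back to VIRTIO_F_SWIOTLB, and the observable behaviour ends up
much the same either way.

/* Illustrative sketch of the driver-side negotiation discussed above.
 * Bit values and logic are assumptions for the example, not normative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define F_SWIOTLB          (1ULL << 27)
#define F_ACCESS_PLATFORM  (1ULL << 33)

static uint64_t negotiate(uint64_t offered, bool can_drive_platform_dma)
{
    uint64_t accepted = 0;

    if ((offered & F_ACCESS_PLATFORM) && can_drive_platform_dma)
        accepted |= F_ACCESS_PLATFORM;  /* use the IOMMU / enlightenment */
    else if (offered & F_SWIOTLB)
        accepted |= F_SWIOTLB;          /* otherwise bounce via the on-device buffer */

    /* A device offering the bounce buffer may refuse to operate further
     * if the driver ends up accepting neither path. */
    return accepted;
}

int main(void)
{
    printf("accepted = 0x%llx\n",
           (unsigned long long)negotiate(F_ACCESS_PLATFORM | F_SWIOTLB, false));
    return 0;
}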
On Thu, 2025-04-03 at 03:31 -0400, Michael S. Tsirkin wrote:
> On Wed, Apr 02, 2025 at 06:10:53PM +0100, David Woodhouse wrote:
> > On Wed, 2025-04-02 at 12:43 -0400, Michael S. Tsirkin wrote:
> > >
> > > yes.
> > >
> > > I know a bit more about PCI, and for PCI I prefer just not saying
> > > anything. The platform already defines whether it is behind an iommu
> > > or not, and duplication is not good.
> >
> > Not a hill for me to die on I suppose, but I would personally prefer to
> > spell it out in words of one syllable or fewer, to make *sure* that
> > device and driver authors get it right even though it's "obvious".
> >
> > After all, if we could trust them to do their thinking, we would never
> > have had the awful situation that led to VIRTIO_F_ACCESS_PLATFORM
> > existing in the first place; the legacy behaviour we get when that bit
> > *isn't* set would never have happened.
>
> Oh, you are wrong here. It's not the implementer's fault.
> virtio just was not designed with real DMA
> in mind, and micro-optimizing by bypassing the DMA API
> was very much intentional.

That's one point of view, I suppose. In the early days of IOMMUs, and
DMA ops coming to mainstream platforms, we found a *lot* of device
drivers that had the same "micro-optimisation" of just handing physical
addresses directly to devices. We called them 'bugs' though.

And the thing that was different for virtio-pci was that the host side
had the *same* bug, as I recall, so we had to introduce a feature bit
to declare/negotiate the "natural" behaviour.

But we're a long way from the original topic here.

> > > For mmio it is my understanding that the "restricted" does the same
> > > already? or is it required in the spec for some reason?
> >
> > No, it's exactly the same. But I still don't trust driver authors to
> > realise the obvious, or VMM implementations either for that matter.
> >
> > I'm not sure I see the *harm* in spelling out explicitly for the hard-
> > of-thinking.
>
> I don't want people to make assumptions, like crashing if device is
> behind an iommu or whatnot.

Why would that happen? If we explicitly document that "on-device memory
accesses don't go through an external IOMMU that sits all the way over
the other side of the PCI bus between the device and system memory",
which is what I was trying to say... it doesn't *matter* if the device
is behind an IOMMU or not. It doesn't ever *do* any DMA on the PCI bus.

> We can go with something informative.
>
> "It is expected that for most implementations, when using this feature,
> the behaviour with and without ACCESS_PLATFORM is the same"

I'd prefer to say nothing. Saying nothing relies on people to do their
thinking and realise that on-device access doesn't cross the PCI bus.
This version actually seems to hint that it's a *choice*, and hints
that it might be OK if the external IOMMU *does* intervene in on-device
memory accesses.
On Thu, Apr 03, 2025 at 08:45:22AM +0100, David Woodhouse wrote:
> On Thu, 2025-04-03 at 03:31 -0400, Michael S. Tsirkin wrote:
> > On Wed, Apr 02, 2025 at 06:10:53PM +0100, David Woodhouse wrote:
> > > On Wed, 2025-04-02 at 12:43 -0400, Michael S. Tsirkin wrote:
> > > >
> > > > yes.
> > > >
> > > > I know a bit more about PCI, and for PCI I prefer just not saying
> > > > anything. The platform already defines whether it is behind an iommu
> > > > or not, and duplication is not good.
> > >
> > > Not a hill for me to die on I suppose, but I would personally prefer to
> > > spell it out in words of one syllable or fewer, to make *sure* that
> > > device and driver authors get it right even though it's "obvious".
> > >
> > > After all, if we could trust them to do their thinking, we would never
> > > have had the awful situation that led to VIRTIO_F_ACCESS_PLATFORM
> > > existing in the first place; the legacy behaviour we get when that bit
> > > *isn't* set would never have happened.
> >
> > Oh, you are wrong here. It's not the implementer's fault.
> > virtio just was not designed with real DMA
> > in mind, and micro-optimizing by bypassing the DMA API
> > was very much intentional.
>
> That's one point of view, I suppose. In the early days of IOMMUs, and
> DMA ops coming to mainstream platforms, we found a *lot* of device
> drivers that had the same "micro-optimisation" of just handing physical
> addresses directly to devices. We called them 'bugs' though.

Indeed. virtio was developed way after those days though. We just
thought we were being clever.

> And the thing that was different for virtio-pci was that the host side
> had the *same* bug, as I recall, so we had to introduce a feature bit
> to declare/negotiate the "natural" behaviour.
>
> But we're a long way from the original topic here.

In a sense. But see the cache mode discussion: if this proposed
interface cannot be implemented efficiently on actual hardware, it does
begin to look a little bit like repeating the same mistake.

> > > > For mmio it is my understanding that the "restricted" does the same
> > > > already? or is it required in the spec for some reason?
> > >
> > > No, it's exactly the same. But I still don't trust driver authors to
> > > realise the obvious, or VMM implementations either for that matter.
> > >
> > > I'm not sure I see the *harm* in spelling out explicitly for the hard-
> > > of-thinking.
> >
> > I don't want people to make assumptions, like crashing if device is
> > behind an iommu or whatnot.
>
> Why would that happen? If we explicitly document that "on-device memory
> accesses don't go through an external IOMMU that sits all the way over
> the other side of the PCI bus between the device and system memory",
> which is what I was trying to say... it doesn't *matter* if the device
> is behind an IOMMU or not. It doesn't ever *do* any DMA on the PCI bus.

Saying this explicitly in the pci transport like you write here is fine.

> > We can go with something informative.
> >
> > "It is expected that for most implementations, when using this feature,
> > the behaviour with and without ACCESS_PLATFORM is the same"
>
> I'd prefer to say nothing. Saying nothing relies on people to do their
> thinking and realise that on-device access doesn't cross the PCI bus.
> This version actually seems to hint that it's a *choice*, and hints
> that it might be OK if the external IOMMU *does* intervene in on-device
> memory accesses.

Nothing is fine, too.
On Wed, Apr 02, 2025 at 06:10:53PM +0100, David Woodhouse wrote:
> > I know a bit more about PCI, and for PCI I prefer just not saying
> > anything. The platform already defines whether it is behind an iommu
> > or not, and duplication is not good.
>
> Not a hill for me to die on I suppose, but I would personally prefer to
> spell it out in words of one syllable or fewer, to make *sure* that
> device and driver authors get it right even though it's "obvious".
>
> After all, if we could trust them to do their thinking, we would never
> have had the awful situation that led to VIRTIO_F_ACCESS_PLATFORM
> existing in the first place; the legacy behaviour we get when that bit
> *isn't* set would never have happened.

You'll need to define the semantics for VIRTIO_F_ACCESS_PLATFORM only
then. And the only sane answer there is: don't allow non-translated
regions at all, and in a broader sense stop people from using
VIRTIO_F_ACCESS_PLATFORM at all, or at least for anything that requires
a new feature bit.

> > For mmio it is my understanding that the "restricted" does the same
> > already? or is it required in the spec for some reason?
>
> No, it's exactly the same. But I still don't trust driver authors to
> realise the obvious, or VMM implementations either for that matter.
>
> I'm not sure I see the *harm* in spelling out explicitly for the hard-
> of-thinking.

Write a whitepaper then and explain how it maps to the existing
perfectly working features. Note that VIRTIO_F_ACCESS_PLATFORM, just
like everything in virtio, would actually benefit from being turned
into proper spec language, but anecdotes about random use cases are not
helpful.
On Thu, 2025-04-03 at 00:29 -0700, Christoph Hellwig wrote:
> On Wed, Apr 02, 2025 at 06:10:53PM +0100, David Woodhouse wrote:
> > > I know a bit more about PCI, and for PCI I prefer just not saying
> > > anything. The platform already defines whether it is behind an iommu
> > > or not, and duplication is not good.
> >
> > Not a hill for me to die on I suppose, but I would personally prefer to
> > spell it out in words of one syllable or fewer, to make *sure* that
> > device and driver authors get it right even though it's "obvious".
> >
> > After all, if we could trust them to do their thinking, we would never
> > have had the awful situation that led to VIRTIO_F_ACCESS_PLATFORM
> > existing in the first place; the legacy behaviour we get when that bit
> > *isn't* set would never have happened.
>
> You'll need to define the semantics for VIRTIO_F_ACCESS_PLATFORM only
> then.

You mean the semantics for VIRTIO_F_ACCESS_PLATFORM only, without
VIRTIO_F_SWIOTLB? Are those not defined already?

> And the only sane answer there is: don't allow non-translated
> regions at all, and in a broader sense stop people from using
> VIRTIO_F_ACCESS_PLATFORM at all, or at least for anything that requires
> a new feature bit.
>
> > > For mmio it is my understanding that the "restricted" does the same
> > > already? or is it required in the spec for some reason?
> >
> > No, it's exactly the same. But I still don't trust driver authors to
> > realise the obvious, or VMM implementations either for that matter.
> >
> > I'm not sure I see the *harm* in spelling out explicitly for the hard-
> > of-thinking.
>
> Write a whitepaper then and explain how it maps to the existing
> perfectly working features. Note that VIRTIO_F_ACCESS_PLATFORM, just
> like everything in virtio, would actually benefit from being turned
> into proper spec language, but anecdotes about random use cases are not
> helpful.

Hm. I was just trying to point out what seemed obvious, that when a PCI
device does 'DMA' to an address region which is actually within one of
its *own* BARs, it isn't going to reach the PCI bus and get translated
by an IOMMU. If it's causing this much contention, I'll just drop it.
It didn't *change* anything anyway, except hopefully avoiding bugs in
implementations.
On Thu, Apr 03, 2025 at 08:37:20AM +0100, David Woodhouse wrote:
> On Thu, 2025-04-03 at 00:29 -0700, Christoph Hellwig wrote:
> > On Wed, Apr 02, 2025 at 06:10:53PM +0100, David Woodhouse wrote:
> > > > I know a bit more about PCI, and for PCI I prefer just not saying
> > > > anything. The platform already defines whether it is behind an iommu
> > > > or not, and duplication is not good.
> > >
> > > Not a hill for me to die on I suppose, but I would personally prefer to
> > > spell it out in words of one syllable or fewer, to make *sure* that
> > > device and driver authors get it right even though it's "obvious".
> > >
> > > After all, if we could trust them to do their thinking, we would never
> > > have had the awful situation that led to VIRTIO_F_ACCESS_PLATFORM
> > > existing in the first place; the legacy behaviour we get when that bit
> > > *isn't* set would never have happened.
> >
> > You'll need to define the semantics for VIRTIO_F_ACCESS_PLATFORM only
> > then.
>
> You mean the semantics for VIRTIO_F_ACCESS_PLATFORM only, without
> VIRTIO_F_SWIOTLB? Are those not defined already?
>
> > And the only sane answer there is: don't allow non-translated
> > regions at all, and in a broader sense stop people from using
> > VIRTIO_F_ACCESS_PLATFORM at all, or at least for anything that requires
> > a new feature bit.
> >
> > > > For mmio it is my understanding that the "restricted" does the same
> > > > already? or is it required in the spec for some reason?
> > >
> > > No, it's exactly the same. But I still don't trust driver authors to
> > > realise the obvious, or VMM implementations either for that matter.
> > >
> > > I'm not sure I see the *harm* in spelling out explicitly for the hard-
> > > of-thinking.
> >
> > Write a whitepaper then and explain how it maps to the existing
> > perfectly working features. Note that VIRTIO_F_ACCESS_PLATFORM, just
> > like everything in virtio, would actually benefit from being turned
> > into proper spec language, but anecdotes about random use cases are not
> > helpful.
>
> Hm. I was just trying to point out what seemed obvious, that when a PCI
> device does 'DMA' to an address region which is actually within one of
> its *own* BARs, it isn't going to reach the PCI bus and get translated
> by an IOMMU. If it's causing this much contention, I'll just drop it.
> It didn't *change* anything anyway, except hopefully avoiding bugs in
> implementations.

If you want, I would just spell this out in the transport text then.

"Note that on most existing platforms, and since the BAR is part of the
device itself, access controls generally do not apply to device
accesses there, and this is true even when VIRTIO_F_ACCESS_PLATFORM has
been negotiated".
On Thu, Apr 03, 2025 at 08:37:20AM +0100, David Woodhouse wrote:
> Hm. I was just trying to point out what seemed obvious, that when a PCI
> device does 'DMA' to an address region which is actually within one of
> its *own* BARs,

PCIe devices can't do DMA to their own BARs by definition, see the route
to self rule.

Pretending that they do it by parsing the addresses is bound to fail
because the addresses seen by the driver and the device can be
different.

NVMe got this wrong not just once but twice and is still suffering from
this misunderstanding. If you want to enhance a protocol to support
addressing a local indirection buffer do not treat it as fake DMA
but rather use explicit addressing for it, or you will be in a world of
trouble.
On Thu, 2025-04-03 at 00:39 -0700, Christoph Hellwig wrote:
> On Thu, Apr 03, 2025 at 08:37:20AM +0100, David Woodhouse wrote:
> > Hm. I was just trying to point out what seemed obvious, that when a PCI
> > device does 'DMA' to an address region which is actually within one of
> > its *own* BARs,
>
> PCIe devices can't do DMA to their own BARs by definition, see the route
> to self rule.
>
> Pretending that they do it by parsing the addresses is bound to fail
> because the addresses seen by the driver and the device can be
> different.
>
> NVMe got this wrong not just once but twice and is still suffering from
> this misunderstanding. If you want to enhance a protocol to support
> addressing a local indirection buffer do not treat it as fake DMA
> but rather use explicit addressing for it, or you will be in a world of
> trouble.

This is, of course, the other benefit of pointing out the "obvious".
Because you can get corrected when you've got it wrong :)

Thanks. I'll take a closer look at handling that. I think it's
reasonable for the negotiation of the VIRTIO_F_SWIOTLB feature to be
the thing that switches *all* addresses to be on-device, and the
on-device buffer can't be accessed unless VIRTIO_F_SWIOTLB has been
negotiated.

Which neatly sidesteps the original thing I was trying to clarify
anyway.
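As a rough illustration of what "all addresses are on-device once
VIRTIO_F_SWIOTLB is negotiated" could mean on the driver side, here is a
hypothetical bump allocator that only ever hands out descriptor addresses
inside the transport-provided window. All names and the layout are invented
for the sketch; this is not spec text or real driver code.

/* Hypothetical sketch: with VIRTIO_F_SWIOTLB negotiated, every address
 * placed in a descriptor is carved out of the on-device bounce-buffer
 * window; nothing ever points at system RAM. */
#include <stdint.h>
#include <stdio.h>

struct swiotlb_window {
    uint64_t base;   /* device-visible base of the bounce buffer */
    uint64_t size;
    uint64_t next;   /* trivial bump allocator, no freeing */
};

static int bounce_alloc(struct swiotlb_window *w, uint64_t len, uint64_t *addr)
{
    uint64_t aligned = (w->next + 63) & ~63ULL;   /* arbitrary 64-byte alignment */

    if (aligned + len > w->size)
        return -1;                 /* window exhausted: caller must wait or recycle */
    *addr = w->base + aligned;     /* always within [base, base + size) */
    w->next = aligned + len;
    return 0;
}

int main(void)
{
    struct swiotlb_window w = { .base = 0xfe000000, .size = 1 << 20, .next = 0 };
    uint64_t a;

    if (bounce_alloc(&w, 4096, &a) == 0)
        printf("descriptor addr = 0x%llx\n", (unsigned long long)a);
    return 0;
}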
On Thu, Apr 03, 2025 at 09:10:41AM +0100, David Woodhouse wrote:
> Thanks. I'll take a closer look at handling that. I think it's
> reasonable for the negotiation of the VIRTIO_F_SWIOTLB feature to be
> the thing that switches *all* addresses to be on-device, and the on-
> device buffer can't be accessed unless VIRTIO_F_SWIOTLB has been
> negotiated.
>
> Which neatly sidesteps the original thing I was trying to clarify
> anyway.

Switching all addressing does not sound like a good idea. The main
thing these indirect buffers are used for is as staging points for
P2P DMA, in which case they often are only used for some transfers or
even parts of a transfer. At least for physical virtio devices P2P
is probably not far off with the current GPU craze, and I wouldn't
be surprised if people found uses for paravirt P2P as well.
On 4 April 2025 07:29:02 BST, Christoph Hellwig <hch@infradead.org> wrote:
>On Thu, Apr 03, 2025 at 09:10:41AM +0100, David Woodhouse wrote:
>> Thanks. I'll take a closer look at handling that. I think it's
>> reasonable for the negotiation of the VIRTIO_F_SWIOTLB feature to be
>> the thing that switches *all* addresses to be on-device, and the on-
>> device buffer can't be accessed unless VIRTIO_F_SWIOTLB has been
>> negotiated.
>>
>> Which neatly sidesteps the original thing I was trying to clarify
>> anyway.
>
>Switching all addressing does not sound like a good idea. The main
>thing these indirect buffers are used for is as staging points for
>P2P DMA, in which case they often are only used for some transfers or
>even parts of a transfer. At least for physical virtio devices P2P
>is probably not far off with the current GPU craze, and I wouldn't
>be surprised if people found uses for paravirt P2P as well.

That isn't the original use case I had for it in virtio but that's
useful input, thanks.

Plumbing a 65th address space bit through all descriptors seems
impractical. Would it suffice for the driver to *specify* the location
in the device's DMA address space at which the on-device buffer
appears, to allow it to avoid conflicts with system memory addresses?
Better ideas?
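For what the "driver specifies the location" idea could look like in
practice, here is a purely hypothetical sketch: a config field through which
the driver places the on-device window somewhere that can never collide with
guest RAM. No such field exists in the spec or in this proposal; it is only
an illustration of the question above.

/* Hypothetical config layout: the driver tells the device where, in the
 * device's DMA address space, the on-device buffer should appear.
 * Nothing like this exists in the spec today. */
#include <stdint.h>
#include <stdio.h>

struct hypothetical_swiotlb_cfg {
    uint64_t window_len;   /* read-only: size of the on-device buffer */
    uint64_t window_base;  /* read-write: driver-chosen placement     */
};

/* Pick a base at or above the end of guest RAM, so bounce addresses can
 * never be confused with system-memory addresses. */
static uint64_t place_window(uint64_t guest_ram_end, uint64_t align)
{
    return (guest_ram_end + align - 1) & ~(align - 1);
}

int main(void)
{
    struct hypothetical_swiotlb_cfg cfg = { .window_len = 1 << 20 };

    cfg.window_base = place_window(4ULL << 30 /* 4 GiB of RAM */, 1ULL << 30);
    printf("window at 0x%llx, len 0x%llx\n",
           (unsigned long long)cfg.window_base,
           (unsigned long long)cfg.window_len);
    return 0;
}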
On Fri, Apr 04, 2025 at 07:39:02AM +0100, David Woodhouse wrote:
> Plumbing a 65th address space bit through all descriptors seems
> impractical. Would it suffice for the driver to *specify* the location
> in the device's DMA address space at which the on-device buffer
> appears, to allow it to avoid conflicts with system memory addresses?
> Better ideas?

That's what NVMe currently does in its second attempt to get things
wrong. It mostly works for the current use cases, but there might
actually not be a single canonical address where it is mapped. This
was mostly theoretical, but with SVM and especially SIOV it is becoming
a real thing now.

In the worst case you might have to limit yourself to one address space
per request. Or have a bitmap of which buffer in the request uses which
address space.
On Thu, Apr 03, 2025 at 11:44:40PM -0700, Christoph Hellwig wrote:
> That's what NVMe currently does in its second attempt to get things
> wrong. It mostly works for the current use cases, but there might
> actually not be a single canonical address where it is mapped. This
> was mostly theoretical, but with SVM and especially SIOV it is becoming
> a real thing now.
>
> In the worst case you might have to limit yourself to one address space
> per request. Or have a bitmap of which buffer in the request uses which
> address space.

Also with unordered I/O in the newer PCIe specs we might need another
such address space bit to force use of that for certain ranges, although
system use cases for UIO are still not finalized.
On Thu, Apr 03, 2025 at 12:39:39AM -0700, Christoph Hellwig wrote:
> On Thu, Apr 03, 2025 at 08:37:20AM +0100, David Woodhouse wrote:
> > Hm. I was just trying to point out what seemed obvious, that when a PCI
> > device does 'DMA' to an address region which is actually within one of
> > its *own* BARs,
>
> PCIe devices can't do DMA to their own BARs by definition, see the route
> to self rule.
>
> Pretending that they do it by parsing the addresses is bound to fail
> because the addresses seen by the driver and the device can be
> different.
>
> NVMe got this wrong not just once but twice and is still suffering from
> this misunderstanding. If you want to enhance a protocol to support
> addressing a local indirection buffer do not treat it as fake DMA
> but rather use explicit addressing for it, or you will be in a world of
> trouble.

Hm. This is what this proposal does, indeed. Can be changed though -
since all accesses are now within the SWIOTLB, they can be treated
as offsets instead.

-- 
MST
On Thu, Apr 03, 2025 at 03:43:31AM -0400, Michael S. Tsirkin wrote:
> Hm. This is what this proposal does, indeed. Can be changed though -
> since all accesses are now within the SWIOTLB, they can be treated
> as offsets instead.

Yes, marking them as offsets into the BAR (including potentially
picking a specific BAR or region if there is more than one) is the
right interface, and is implemented by various hardware that way.
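A sketch of the addressing model the thread converges on here, with a
hypothetical structure that is not spec text: descriptors carry a region
selector plus an offset into that BAR or region rather than a bus address,
so driver and device never have to agree on an absolute address for the
on-device buffer.

/* Hypothetical descriptor addressing: (region, offset) instead of a bus
 * address. Region 0 could mean "system memory under the normal DMA
 * rules", region 1 "the on-device bounce buffer"; offsets are relative
 * to the start of the selected region. Not spec text. */
#include <stdint.h>
#include <stdio.h>

enum addr_region {
    REGION_SYSTEM_MEMORY = 0,
    REGION_ONDEV_BUFFER  = 1,
};

struct desc_addr {
    uint16_t region;   /* which address space the offset refers to */
    uint64_t offset;   /* offset within that region */
    uint32_t len;
};

static struct desc_addr ondev_desc(uint64_t window_off, uint32_t len)
{
    struct desc_addr d = {
        .region = REGION_ONDEV_BUFFER,
        .offset = window_off,   /* relative to the BAR/region start */
        .len    = len,
    };
    return d;
}

int main(void)
{
    struct desc_addr d = ondev_desc(0x1000, 4096);

    printf("region=%u offset=0x%llx len=%u\n",
           (unsigned)d.region, (unsigned long long)d.offset, (unsigned)d.len);
    return 0;
}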