This series adds VFIO PCI device passthrough support for Apple Silicon
Macs running macOS, using a DriverKit extension (dext) as the host
backend instead of the Linux VFIO kernel driver.
I'm sending this as an RFC because I'd like feedback before investing
further in upstreaming. The code is functional. I've tested it with
an NVIDIA RTX 5090 in a Thunderbolt dock on an M4 MacBook Air. GPU
gaming works but is slow (~30 fps on high settings in Cyberpunk 2077
[1]), likely due to the BAR access penalty described below. AI
inference workloads appear less affected. Ollama with Qwen 3.5
generates around 140 tok/sec on the same setup [2].
How it works:
On Linux, VFIO relies on kernel-managed IOMMU groups and /dev/vfio
for device access and DMA mapping. On macOS, there is no equivalent
kernel interface. Instead, a userspace DriverKit extension
(VFIOUserPCIDriver) mediates access to the physical PCI device through
IOKit's IOUserClient and PCIDriverKit APIs.
The series keeps the existing VFIOPCIDevice model and reuses QEMU's
passthrough infrastructure. A few ioctl callsites are refactored into
io_ops callbacks, the build system is extended for Darwin, and the
Apple-specific backend plugs in behind those abstractions.
The guest sees two PCI devices: the passthrough device itself
(vfio-apple-pci, which subclasses VFIOPCIDevice) and a companion
DMA mapping device (apple-dma-pci). On the QEMU side, an
AppleVFIOContainer implements the IOMMU backend, and a C client
library wraps the IOUserClient calls to the dext for config space,
BAR MMIO, interrupts, reset, and DMA.
DMA limitations:
This is the biggest platform constraint. Unlike a typical IOMMU
mapping operation where the caller specifies the IOVA, the
PCIDriverKit API (IODMACommand::PrepareForDMA) returns a
system-assigned IOVA. There is no way to request a specific address.
This means the guest's requested DMA addresses cannot be used
directly. The guest kernel module must intercept DMA mapping calls
and forward them through the companion device to get the actual
hardware IOVA.
There are also hard platform limits: approximately 1.5 GB total
mapped memory and roughly 64k concurrent mappings. Not all
workloads will fit within these limits, though GPU gaming and LLM
inference have worked in practice.
BAR access has performance issues as well. HVF does not expose
controls to map device memory as cacheable in the guest, creating a
significant performance penalty on BAR MMIO. Uncached mappings work
correctly but slowly compared to what the hardware could do.
What works:
- PCI config space passthrough
- BAR MMIO via direct-mapped device memory
- MSI/MSI-X interrupts via async notification from the dext
- Device reset (FLR with hot-reset fallback)
- DMA mapping for guest device drivers
What doesn't work:
- Expansion ROM / VBIOS passthrough
- PCI BAR quirks
- VGA region passthrough
- Migration and dirty page tracking
- Hot-unplug
Questions for reviewers:
1. Is this something the VFIO maintainers would consider carrying
upstream? The refactoring patches (3-6) are benign, but the Apple
backend is a new platform with real limitations. That said, if Apple
lifts some of the DART/HVF restrictions in a future macOS release, the
code changes to take advantage would likely be minor. I'd like to
understand whether this is in scope before doing the work to
address review feedback on the full series.
2. The apple-dma-pci companion device: should this be a virtio device
instead? I went with a simple custom PCI device because the virtio
infrastructure didn't buy much for what is essentially a {map, unmap}
register interface, but if virtio is preferred, what is the process
for allocating a device ID? If a custom PCI device is the right
approach, I've tentatively allocated 1b36:0015. Is there a process
for reserving a device ID under the Red Hat PCI vendor, or is
claiming it in pci-ids.rst sufficient? The guest-side kernel module
hooks all DMA mapping functions for passed-through devices, which is
unusual enough that I'm not sure it's upstreamable in the Linux
kernel. I can maintain it out of tree if needed.
3. Should the macOS host-side DriverKit extension live in the QEMU
tree? It's not included in this series and requires Apple code
signing. I'm happy to keep it out of tree if that's preferred,
or include the source if reviewers want it co-located.
4. The existing VFIO code includes <linux/vfio.h> from the
linux-headers/ tree, which is intended to track upstream Linux
UAPI headers. To make this compile on macOS, I added minimal
stub headers (include/compat/linux/types.h and linux/ioctl.h)
so the existing vfio.h parses on macOS without modification. An
alternative would be to move an approximation of vfio.h into
standard-headers/, but that felt against the spirit of tracking
the latest upstream headers, and the standard-headers import
process strips ioctls which the VFIO code relies on. I felt
the stub approach was the least invasive, but I'm open to
changing it if there's a preferred way to handle this.
[1] https://imgur.com/a/xoRS9kT
[2] https://imgur.com/a/ui4pYF0
Scott J. Goldman (10):
vfio/pci: Use the write side of EventNotifier for IRQ signaling
accel/hvf: avoid executable mappings for RAM-device memory
vfio: Allow building on Darwin hosts
vfio: Prepare existing code for Apple VFIO backend
vfio: Add region_map and region_unmap callbacks to VFIODeviceIOOps
vfio: Add device_reset callback to VFIODeviceIOOps
vfio/apple: Add DriverKit dext client library
vfio/apple: Add IOMMU container and PCI device
vfio/apple: Add apple-dma-pci companion device
docs: Add vfio-apple documentation and MAINTAINERS entry
Kconfig.host | 3 +
MAINTAINERS | 11 +
accel/hvf/hvf-all.c | 10 +-
backends/Kconfig | 2 +-
docs/specs/pci-ids.rst | 3 +
docs/system/device-emulation.rst | 1 +
docs/system/devices/vfio-apple.rst | 160 +++++
hw/vfio-user/device.c | 16 +-
hw/vfio/Kconfig | 4 +-
hw/vfio/ap.c | 4 +-
hw/vfio/apple-device.c | 945 +++++++++++++++++++++++++++++
hw/vfio/apple-dext-client.c | 681 +++++++++++++++++++++
hw/vfio/apple-dext-client.h | 253 ++++++++
hw/vfio/apple-dma.c | 540 +++++++++++++++++
hw/vfio/apple.h | 74 +++
hw/vfio/ccw.c | 2 +-
hw/vfio/container-apple.c | 241 ++++++++
hw/vfio/device.c | 42 ++
hw/vfio/meson.build | 12 +-
hw/vfio/migration.c | 5 +-
hw/vfio/pci.c | 50 +-
hw/vfio/pci.h | 1 +
hw/vfio/region.c | 108 ++--
hw/vfio/types.h | 2 +
hw/vfio/vfio-helpers.h | 2 +-
hw/vfio/vfio-migration-internal.h | 4 +-
hw/vfio/vfio-region.h | 4 +
include/compat/linux/ioctl.h | 2 +
include/compat/linux/types.h | 26 +
include/hw/pci/pci.h | 1 +
include/hw/vfio/vfio-container.h | 1 +
include/hw/vfio/vfio-device.h | 40 +-
meson.build | 10 +-
util/event_notifier-posix.c | 5 +-
34 files changed, 3197 insertions(+), 68 deletions(-)
create mode 100644 docs/system/devices/vfio-apple.rst
create mode 100644 hw/vfio/apple-device.c
create mode 100644 hw/vfio/apple-dext-client.c
create mode 100644 hw/vfio/apple-dext-client.h
create mode 100644 hw/vfio/apple-dma.c
create mode 100644 hw/vfio/apple.h
create mode 100644 hw/vfio/container-apple.c
create mode 100644 include/compat/linux/ioctl.h
create mode 100644 include/compat/linux/types.h
--
2.50.1 (Apple Git-155)
On Sun, 5 Apr 2026, Scott J. Goldman wrote:
> This series adds VFIO PCI device passthrough support for Apple Silicon
> Macs running macOS, using a DriverKit extension (dext) as the host
> backend instead of the Linux VFIO kernel driver.
> [...]
> This means the guest's requested DMA addresses cannot be used
> directly. The guest kernel module must intercept DMA mapping calls
> and forward them through the companion device to get the actual
> hardware IOVA.

I don't know this, so what I say might not make sense, but I think there
is IOMMU emulation in QEMU. Could that be used to do this in QEMU and
avoid needing a kernel module for it in the guest?

Regards,
BALATON Zoltan
On Sun Apr 5, 2026 at 3:36 AM PDT, BALATON Zoltan wrote:
> On Sun, 5 Apr 2026, Scott J. Goldman wrote:
>> [...]
>> This means the guest's requested DMA addresses cannot be used
>> directly. The guest kernel module must intercept DMA mapping calls
>> and forward them through the companion device to get the actual
>> hardware IOVA.
>
> I don't know this so what I say might not make sense but I think there is
> iommu emulation in QEMU so could that be used to do this in QEMU and avoid
> needing a kernel module for it in the guest?

I think the challenge is that this is a passthrough device doing DMA
directly on the physical PCIe bus. The device's DMA transactions go
through the real hardware IOMMU (DART), not through QEMU. If the guest
programs the device with IOVA 0x1000 (assigned by a virtual IOMMU), the
device will issue a DMA read for 0x1000 on the physical bus. But DART
only knows about the IOVAs that PrepareForDMA assigned, so the
transaction would fault.

My understanding is that on other platforms, this is handled in a
simpler way because the host IOMMU can be programmed directly:

1. Guest boots with 2 GB of RAM. QEMU maps guest physical address (GPA)
   0x10000000-0x90000000 to host physical memory at, say,
   0x110000000-0x190000000.
2. QEMU programs the host IOMMU so the device's view of
   0x10000000-0x90000000 translates to the real host addresses.
3. Guest programs the PCI device to DMA to GPA 0x20000000.
4. The device issues the transaction, the IOMMU translates it to
   0x120000000, and it hits the right physical memory.

On macOS, step 2 is missing. The DriverKit APIs don't provide a way to
program arbitrary IOVA-to-HPA translations into DART. You can only hand
a buffer to PrepareForDMA and get back whatever IOVA the system assigns.

On top of that, the platform limits the total amount of DMA-mapped
memory to roughly 1.5 GB across ~64k mappings, so you can't even map
all of a 2 GB guest's RAM. I believe this limit comes from the host
device tree and isn't modifiable by users, though it could potentially
be changed by Apple in firmware.

A vIOMMU could reduce the amount of memory that needs to be mapped
(only what the guest actually uses for DMA, not all of guest RAM), but
fundamentally you still need something akin to step 2 to make the
device's physical DMA transactions land at the right addresses, and we
don't have that on this platform.
> On 5. Apr 2026, at 09:28, Scott J. Goldman <scottjgo@gmail.com> wrote:
>
> [...]
> DMA limitations:
>
> This is the biggest platform constraint. Unlike a typical IOMMU
> mapping operation where the caller specifies the IOVA, the
> PCIDriverKit API (IODMACommand::PrepareForDMA) returns a
> system-assigned IOVA. There is no way to request a specific address.
> This means the guest's requested DMA addresses cannot be used
> directly. The guest kernel module must intercept DMA mapping calls
> and forward them through the companion device to get the actual
> hardware IOVA.
Hello,
Ugh, this one is not great. By the way, Apple has a private PCIe passthrough
API used by Virtualization.framework, but that’s a different design.

Would bounce buffering work, using something akin to the confidential
compute path: a pre-defined chunk of host memory accessible from the
device, with the guest address map managed on top of it? (see swiotlb)

If the last part isn’t possible, something minimal that exports an
swiotlb window through the device tree, giving the IOVA there, would be
good too.

That would also get rid of the need for an apple-dma-pci device.
> There are also hard platform limits: approximately 1.5 GB total
> mapped memory and roughly 64k concurrent mappings. Not all
> workloads will fit within these limits, though GPU gaming and LLM
> inference have worked in practice.
That’s not too dissimilar from the confidential compute limitations.
>
> BAR access has performance issues as well. HVF does not expose
> controls to map device memory as cacheable in the guest, creating a
> significant performance penalty on BAR MMIO. Uncached mappings work
> correctly but slowly compared to what the hardware could do.
That’s not a macOS limitation and not an Apple hardware limitation, but
it’s more fundamental to how PCIe works.
Unlike CXL, PCIe doesn’t have a coherency protocol story, and the alternative
of uncached mappings with manual, software-managed flushes isn’t really tenable.
>
> What works:
> - PCI config space passthrough
> - BAR MMIO via direct-mapped device memory
> - MSI/MSI-X interrupts via async notification from the dext
> - Device reset (FLR with hot-reset fallback)
> - DMA mapping for guest device drivers
>
This is very interesting to see :)
> What doesn't work:
> - Expansion ROM / VBIOS passthrough
> - PCI BAR quirks
> - VGA region passthrough
> - Migration and dirty page tracking
> - Hot-unplug
>
> Questions for reviewers:
>
> 1. Is this something the VFIO maintainers would consider carrying
> upstream? The refactoring patches (3-6) are benign, but the Apple
> backend is a new platform with real limitations. That said, if Apple
> lifts some of the DART/HVF restrictions in a future macOS release, the
> code changes to take advantage would likely be minor. I'd like to
> understand whether this is in scope before doing the work to
> address review feedback on the full series.
>
> 2. The apple-dma-pci companion device: should this be a virtio device
> instead? I went with a simple custom PCI device because the virtio
> infrastructure didn't buy much for what is essentially a {map, unmap}
> register interface, but if virtio is preferred, what is the process
> for allocating a device ID? If a custom PCI device is the right
> approach, I've tentatively allocated 1b36:0015. Is there a process
> for reserving a device ID under the Red Hat PCI vendor, or is
> claiming it in pci-ids.rst sufficient? The guest-side kernel module
> hooks all DMA mapping functions for passed-through devices, which is
> unusual enough that I'm not sure it's upstreamable in the Linux
> kernel. I can maintain it out of tree if needed.
I’d recommend using bounce buffers like the CoCo case if possible. I don’t
think that the apple-dma-pci definitely-not-an-IOMMU is a good idea.
>
> 3. Should the macOS host-side DriverKit extension live in the QEMU
> tree? It's not included in this series and requires Apple code
> signing. I'm happy to keep it out of tree if that's preferred,
> or include the source if reviewers want it co-located.
Both are fine I think. Could you share compatibility with the tinygrad
one at https://github.com/tinygrad/tinygrad/tree/7e54992bf600789dbe5d37b99fe12a19c32e36a1/extra/usbgpu/tbgpu/installer and prebuilt at https://raw.githubusercontent.com/tinygrad/tinygpu_releases/refs/heads/main/TinyGPU.zip?
> [...]
> On 5. Apr 2026, at 10:01, Mohamed Mediouni <mohamed@unpredictable.fr> wrote:
>
>>
>> On 5. Apr 2026, at 09:28, Scott J. Goldman <scottjgo@gmail.com> wrote:
>>
>> [...]
>> DMA limitations:
>>
>> This is the biggest platform constraint. Unlike a typical IOMMU
>> mapping operation where the caller specifies the IOVA, the
>> PCIDriverKit API (IODMACommand::PrepareForDMA) returns a
>> system-assigned IOVA. There is no way to request a specific address.
>> This means the guest's requested DMA addresses cannot be used
>> directly. The guest kernel module must intercept DMA mapping calls
>> and forward them through the companion device to get the actual
>> hardware IOVA.
>
> Hello,
>
> Ugh this one is not great. By the way, Apple has a private PCIe passthrough
> API used by Virtualization.framework but that’s a different design.
>
> Would bounce buffering using something akin the confidential compute path and
> a pre-defined chunk of host memory accessible from the device, and then managing
> the guest address map work? (see swiotlb).
see restricted-dma-pool
I think in this specific case ACPI support isn’t worth it and FDT
will be good enough.
The limitation that I can see there is: if you can’t match IOVA and GPA
for that restricted DMA pool, then you’ll need a small (and hopefully
easy-to-merge) kernel change.
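For reference, the upstream restricted-dma-pool binding this refers to is expressed in the guest FDT roughly as follows (a sketch; node names, addresses, and sizes here are illustrative):

```dts
/* Sketch of a restricted-dma-pool reservation, per the Linux
 * reserved-memory bindings; addresses and sizes are made up. */
reserved-memory {
    #address-cells = <2>;
    #size-cells = <2>;
    ranges;

    restricted_dma: restricted-dma@50000000 {
        compatible = "restricted-dma-pool";
        reg = <0x0 0x50000000 0x0 0x4000000>;   /* 64 MB bounce pool */
    };
};

/* Devices referencing the pool bounce all their DMA through it. */
pcie {
    memory-region = <&restricted_dma>;
};
```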
> [...]
On Sun Apr 5, 2026 at 1:14 AM PDT, Mohamed Mediouni wrote:
>
>
>> On 5. Apr 2026, at 10:01, Mohamed Mediouni <mohamed@unpredictable.fr> wrote:
>>
>>>
>>> On 5. Apr 2026, at 09:28, Scott J. Goldman <scottjgo@gmail.com> wrote:
>>>
>>> This series adds VFIO PCI device passthrough support for Apple Silicon
>>> Macs running macOS, using a DriverKit extension (dext) as the host
>>> backend instead of the Linux VFIO kernel driver.
>>>
>>> I'm sending this as an RFC because I'd like feedback before investing
>>> further in upstreaming. The code is functional. I've tested it with
>>> an NVIDIA RTX 5090 in a Thunderbolt dock on an M4 MacBook Air. GPU
>>> gaming works but is slow (~30 fps on high settings in Cyberpunk 2077
>>> [1]), likely due to the BAR access penalty described below. AI
>>> inference workloads appear less affected. Ollama with Qwen 3.5
>>> generates around 140 tok/sec on the same setup [2].
>>>
>>> How it works:
>>>
>>> On Linux, VFIO relies on kernel-managed IOMMU groups and /dev/vfio
>>> for device access and DMA mapping. On macOS, there is no equivalent
>>> kernel interface. Instead, a userspace DriverKit extension
>>> (VFIOUserPCIDriver) mediates access to the physical PCI device through
>>> IOKit's IOUserClient and PCIDriverKit APIs.
>>>
>>> The series keeps the existing VFIOPCIDevice model and reuses QEMU's
>>> passthrough infrastructure. A few ioctl callsites are refactored into
>>> io_ops callbacks, the build system is extended for Darwin, and the
>>> Apple-specific backend plugs in behind those abstractions.
>>>
>>> The guest sees two PCI devices: the passthrough device itself
>>> (vfio-apple-pci, which subclasses VFIOPCIDevice) and a companion
>>> DMA mapping device (apple-dma-pci). On the QEMU side, an
>>> AppleVFIOContainer implements the IOMMU backend, and a C client
>>> library wraps the IOUserClient calls to the dext for config space,
>>> BAR MMIO, interrupts, reset, and DMA.
>>>
>>> DMA limitations:
>>>
>>> This is the biggest platform constraint. Unlike a typical IOMMU
>>> mapping operation where the caller specifies the IOVA, the
>>> PCIDriverKit API (IODMACommand::PrepareForDMA) returns a
>>> system-assigned IOVA. There is no way to request a specific address.
>>> This means the guest's requested DMA addresses cannot be used
>>> directly. The guest kernel module must intercept DMA mapping calls
>>> and forward them through the companion device to get the actual
>>> hardware IOVA.
>>
>> Hello,
>>
>> Ugh this one is not great. By the way, Apple has a private PCIe passthrough
>> API used by Virtualization.framework but that’s a different design.
This is really interesting and I had not heard about this. Are you
able to elaborate on this one at all? Maybe this is something where an
internal API to manipulate the DART is available inside
Virtualization.framework?
>> Would bounce buffering using something akin the confidential compute path and
>> a pre-defined chunk of host memory accessible from the device, and then managing
>> the guest address map work? (see swiotlb).
I tested this approach early on, but ran into a couple issues:
1. Not only does PrepareForDMA() limit the total size of the pool, but
it also limits the size of individual allocations. IIRC it was not
very large, around 16MB. Thankfully, I found that the allocator seemed
to just keep allocating contiguously across multiple allocations, so
maybe that's fine?
2. Linux swiotlb default configuration is too small for GPU drivers. The
max single mapping is 256KB and the total pool size is 64MB. The
overall pool size is configurable but the max single mapping is
derived from IO_TLB_SEGSIZE and IO_TLB_SIZE which are compile-time
constants. During games, I have seen roughly ~900MB of active DMA
mappings, and individual mappings much larger than 256KB.
I abandoned this approach because it seemed like the CPU penalty of
bouncing all the DMA buffers would be pretty severe and the swiotlb
allocator just didn't seem designed for this much memory pressure. I
also was hoping to avoid the requirement of recompiling the entire guest
kernel as a prerequisite for guests to use this passthrough feature. On
top of that, I wasn't sure if upstream would even be willing to take
changes to support this use case, since it's so far outside what the
existing swiotlb allocator would normally be doing.
That said, you were saying that CoCo is fine with this restriction? Do
other devices just not have drivers that are doing so much allocation? I
didn't actually try changing the swiotlb constants and recompiling the
guest kernel to make the pool big enough for it to really work at all
with the nvidia guest driver; I will have to see what happens.
>
> see restricted-dma-pool
>
> I think in this specific case that ACPI support isn’t worth it and that FDT
> will be good enough.
Yes, this seems fine to me as well if we went the swiotlb route. It
could be a different `-machine` type or perhaps a machine-specific
parameter.
>
> The limitation that I can see there is: if you can’t match IOVA and GPA for that
> restricted DMA pool, then you’ll need a small (and hopefully easy to merge) kernel
> change.
>> If the last part isn’t possible, something minimal to export an swiotlb window
>> through device tree with giving the IOVA there would be good too.
>>
>> And that will get rid of a need for a apple-dma-pci device.
I am not 100% sure since I didn't try this exactly, but it seems like
you could have the DriverKit side allocate a big DMA buffer before the
guest starts, and then identity map the region somewhere inside the
guest with the `restricted-dma-pool` attribute attached to it. The
caveat being that you might have to pray that the region is contiguous
or introduce a much more complicated swiotlb subsystem allocator.
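If that path pans out, the guest-visible piece would just be a reserved-memory node in the FDT. A sketch, with made-up addresses (the real window would have to match wherever the dext's pre-allocated DMA buffer actually lands; see the shared-dma-pool reserved-memory binding):

```dts
/ {
    reserved-memory {
        #address-cells = <2>;
        #size-cells = <2>;
        ranges;

        /* Hypothetical window: base/size must match the IOVA range
         * returned by the dext's PrepareForDMA() allocation. */
        restricted_dma: restricted-dma@800000000 {
            compatible = "restricted-dma-pool";
            reg = <0x8 0x00000000 0x0 0x40000000>; /* 1 GB */
        };
    };

    pcie@10000000 {
        /* ... */
        memory-region = <&restricted_dma>;
    };
};
```

With `memory-region` pointing at the pool, the kernel routes that device's streaming DMA through the restricted pool instead of the global swiotlb, so other devices are unaffected.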
WRT a kernel patch to make it easier, can you elaborate on what you
were thinking there?
>>> There are also hard platform limits: approximately 1.5 GB total
>>> mapped memory and roughly 64k concurrent mappings. Not all
>>> workloads will fit within these limits, though GPU gaming and LLM
>>> inference have worked in practice.
>>
>> That’s not too dissimilar from the confidential compute limitations.
>>
>>>
>>> BAR access has performance issues as well. HVF does not expose
>>> controls to map device memory as cacheable in the guest, creating a
>>> significant performance penalty on BAR MMIO. Uncached mappings work
>>> correctly but slowly compared to what the hardware could do.
>>
>> That’s not a macOS limitation and not an Apple hardware limitation, but
>> it’s more fundamental to how PCIe works.
>>
>> Unlike CXL, PCIe doesn’t have a coherency protocol story, and the alternative
>> of uncached and doing manual software-managed flushes isn’t really tenable.
Apologies, I misspoke. It's not cacheability that's the issue. I think
it's write-combining. Specifically the question is how the HVF sets the
attributes in the stage-2 page tables. The behavior is observable by
looking at the performance of sweeping writes across the BARs.
As part of the work to implement and test this change I wrote such a
benchmark as a client of the dext in the host, and a Linux kernel module
that runs in the guest. It takes BAR1 (VRAM aperture) and does a write
sweep of 8MB with 4 passes and measures the results.
Host (mapped with kIOWriteCombineCache): 386 MB/s
Host (mapped with kIOInhibitCache): 46 MB/s
Guest (mapped with ioremap_wc): 31 MB/s
Guest (mapped with ioremap): 31 MB/s
In the case of BAR1, it is marked prefetchable so I believe you would
usually want to map it with write-combining. I'm not sure why the case
without write-combining is worse in the guest, but it's the same order
of magnitude. I think the real interesting thing there is that the
write-combining map in the guest performs identically to the one
without. To me, that indicates that perhaps the stage-2 bits are not set
properly. Even though the host has mapped the memory with
kIOWriteCombineCache, this wasn't propagated when HVF maps this into the
guest, which probably falls back to the lesser of the stage-1 vs stage-2
mappings (i.e. disabling write-combining).
>>
>>>
>>> What works:
>>> - PCI config space passthrough
>>> - BAR MMIO via direct-mapped device memory
>>> - MSI/MSI-X interrupts via async notification from the dext
>>> - Device reset (FLR with hot-reset fallback)
>>> - DMA mapping for guest device drivers
>>>
>> This is very interesting to see :)
Thanks! It's always nice to catch some interest/advice for a strange
project like this.
>>
>>> What doesn't work:
>>> - Expansion ROM / VBIOS passthrough
>>> - PCI BAR quirks
>>> - VGA region passthrough
>>> - Migration and dirty page tracking
>>> - Hot-unplug
>>>
>>
>>
>>
>>> Questions for reviewers:
>>>
>>> 1. Is this something the VFIO maintainers would consider carrying
>>> upstream? The refactoring patches (3-6) are benign, but the Apple
>>> backend is a new platform with real limitations. That said, if Apple
>>> lifts some of the DART/HVF restrictions in a future macOS release, the
>>> code changes to take advantage would likely be minor. I'd like to
>>> understand whether this is in scope before doing the work to
>>> address review feedback on the full series.
>>>
>>> 2. The apple-dma-pci companion device: should this be a virtio device
>>> instead? I went with a simple custom PCI device because the virtio
>>> infrastructure didn't buy much for what is essentially a {map, unmap}
>>> register interface, but if virtio is preferred, what is the process
>>> for allocating a device ID? If a custom PCI device is the right
>>> approach, I've tentatively allocated 1b36:0015. Is there a process
>>> for reserving a device ID under the Red Hat PCI vendor, or is
>>> claiming it in pci-ids.rst sufficient? The guest-side kernel module
>>> hooks all DMA mapping functions for passed-through devices, which is
>>> unusual enough that I'm not sure it's upstreamable in the Linux
>>> kernel. I can maintain it out of tree if needed.
>>
>> I’d recommend using bounce buffers like the CoCo case if possible. I don’t
>> think that the apple-dma-pci definitely-not-an-IOMMU is a good idea.
To be clear, it definitely is weird and bad, but it was seemingly the
least bad option that I was able to get working with minimal guest
changes (just one guest kmod).
>>
>>>
>>> 3. Should the macOS host-side DriverKit extension live in the QEMU
>>> tree? It's not included in this series and requires Apple code
>>> signing. I'm happy to keep it out of tree if that's preferred,
>>> or include the source if reviewers want it co-located.
>>
>> Both are fine I think. Could you share compatibility with the tinygrad
>> one at https://github.com/tinygrad/tinygrad/tree/7e54992bf600789dbe5d37b99fe12a19c32e36a1/extra/usbgpu/tbgpu/installer and prebuilt at https://raw.githubusercontent.com/tinygrad/tinygpu_releases/refs/heads/main/TinyGPU.zip?
This is a good question and not something I had considered. My module
probably works a little differently than theirs. It's possible I'm
wrong but my understanding was:
1. They got apple entitlements for AMD/NVIDIA driver vendor ids only.
That said, if it became compatible with QEMU, I suppose it would be
an easy case to make that it could be expanded to wildcard (another
developer indicated to me that Apple was willing to grant the
wildcard entitlement if the use case was justifiable)
2. The architecture of their driver is a little different. I believe
they are allocating DMA-able memory in the driver and mapping it down
to userland, so it's kind of the reverse of what I'm doing now. I
guess, conceivably they could change how they are doing this to unify
our efforts.
Thanks,
-sjg
> On 6. Apr 2026, at 01:20, Scott J. Goldman <scottjgo@gmail.com> wrote:
>
> On Sun Apr 5, 2026 at 1:14 AM PDT, Mohamed Mediouni wrote:
>>
>>
>>> On 5. Apr 2026, at 10:01, Mohamed Mediouni <mohamed@unpredictable.fr> wrote:
>>>
>>>>
>>>> On 5. Apr 2026, at 09:28, Scott J. Goldman <scottjgo@gmail.com> wrote:
>>>>
>>>> This series adds VFIO PCI device passthrough support for Apple Silicon
>>>> Macs running macOS, using a DriverKit extension (dext) as the host
>>>> backend instead of the Linux VFIO kernel driver.
>>>>
>>>> I'm sending this as an RFC because I'd like feedback before investing
>>>> further in upstreaming. The code is functional. I've tested it with
>>>> an NVIDIA RTX 5090 in a Thunderbolt dock on an M4 MacBook Air. GPU
>>>> gaming works but is slow (~30 fps on high settings in Cyberpunk 2077
>>>> [1]), likely due to the BAR access penalty described below. AI
>>>> inference workloads appear less affected. Ollama with Qwen 3.5
>>>> generates around 140 tok/sec on the same setup [2].
>>>>
>>>> How it works:
>>>>
>>>> On Linux, VFIO relies on kernel-managed IOMMU groups and /dev/vfio
>>>> for device access and DMA mapping. On macOS, there is no equivalent
>>>> kernel interface. Instead, a userspace DriverKit extension
>>>> (VFIOUserPCIDriver) mediates access to the physical PCI device through
>>>> IOKit's IOUserClient and PCIDriverKit APIs.
>>>>
>>>> The series keeps the existing VFIOPCIDevice model and reuses QEMU's
>>>> passthrough infrastructure. A few ioctl callsites are refactored into
>>>> io_ops callbacks, the build system is extended for Darwin, and the
>>>> Apple-specific backend plugs in behind those abstractions.
>>>>
>>>> The guest sees two PCI devices: the passthrough device itself
>>>> (vfio-apple-pci, which subclasses VFIOPCIDevice) and a companion
>>>> DMA mapping device (apple-dma-pci). On the QEMU side, an
>>>> AppleVFIOContainer implements the IOMMU backend, and a C client
>>>> library wraps the IOUserClient calls to the dext for config space,
>>>> BAR MMIO, interrupts, reset, and DMA.
>>>>
>>>> DMA limitations:
>>>>
>>>> This is the biggest platform constraint. Unlike a typical IOMMU
>>>> mapping operation where the caller specifies the IOVA, the
>>>> PCIDriverKit API (IODMACommand::PrepareForDMA) returns a
>>>> system-assigned IOVA. There is no way to request a specific address.
>>>> This means the guest's requested DMA addresses cannot be used
>>>> directly. The guest kernel module must intercept DMA mapping calls
>>>> and forward them through the companion device to get the actual
>>>> hardware IOVA.
>>>
>>> Hello,
>>>
>>> Ugh this one is not great. By the way, Apple has a private PCIe passthrough
>>> API used by Virtualization.framework but that’s a different design.
>
> This is really interesting and I had not heard about this. Are you
> able to elaborate on this one at all? Maybe this is something where an
> internal API to manipulate the DART is available inside
> Virtualization.framework?
Hello,
All of it needs private entitlements currently.
It’s _VZPCIPassthroughDeviceConfiguration, a private class needing com.apple.private.virtualization to use.
The VMM process itself then uses the com.apple.private.PCIPassthrough.access entitlement. I’m not
sure whether OS versions even have all the code currently though.
>>> Would bounce buffering using something akin the confidential compute path and
>>> a pre-defined chunk of host memory accessible from the device, and then managing
>>> the guest address map work? (see swiotlb).
>
> I tested this approach early on, but ran into a couple issues:
>
> 1. Not only does PrepareForDMA() limit the total size of the pool, but
> it also limits the size of individual allocations. IIRC it was not
> very large, around 16MB.
Sigh.
> Thankfully, I found that the allocator seemed
> to just keep allocating contiguously across multiple allocations, so
> maybe that's fine?
That’s good… but it sounds brittle…
> 2. Linux swiotlb default configuration is too small for GPU drivers. The
> max single mapping is 256KB and the total pool size is 64MB. The
> overall pool size is configurable but the max single mapping is
> derived from IO_TLB_SEGSIZE and IO_TLB_SIZE which are compile-time
> constants. During games, I have seen roughly ~900MB of active DMA
> mappings and mappings much larger than 256kb.
Pre-defined mappings with restricted-dma-pool sound like a good idea there.
>
> I abandoned this approach because it seemed like the CPU penalty of
> bouncing all the DMA buffers would be pretty severe and the swiotlb
> allocator just didn't seem designed for this much memory pressure. I
> also was hoping to avoid the requirement of recompiling the entire guest
> kernel as a prerequisite for guests to use this passthrough feature. On
> top of that, I wasn't sure if upstream would even be willing to take
> changes to support this use case, since it's so far outside what the
> existing swiotlb allocator would normally be doing.
>
> That said, you were saying that CoCo is fine with this restriction? Do
> other devices just not have drivers that are doing so much allocation? I
> didn't actually try changing the swiotlb constants and recompiling the
> guest kernel to make the pool big enough for it to really work at all
> with the nvidia guest driver; I will have to see what happens.
CoCo with bounce buffering works with NVIDIA GPUs. It had to be done because
there was no trusted I/O path (and implementing that is a quagmire).
A recent Intel post about it claiming production-readiness:
https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Confidential-AI-with-GPU-Acceleration-Bounce-Buffers-Offer-a/post/1740417
>
>>
>> see restricted-dma-pool
>>
>> I think in this specific case that ACPI support isn’t worth it and that FDT
>> will be good enough.
>
> Yes, this seems fine to me as well if we went the swiotlb route. It
> could be a different `-machine` type or perhaps a machine-specific param
> if we went this route, maybe.
>
>>
>> The limitation that I can see there is: if you can’t match IOVA and GPA for that
>> restricted DMA pool, then you’ll need a small (and hopefully easy to merge) kernel
>> change.
>>> If the last part isn’t possible, something minimal to export an swiotlb window
>>> through device tree with giving the IOVA there would be good too.
>>>
>>> And that will get rid of a need for a apple-dma-pci device.
>
> I am not 100% sure since I didn't try this exactly, but it seems like
> you could have the DriverKit side allocate a big DMA buffer before the
> guest starts, and then identity map the region somewhere inside the
> guest with the `restricted-dma-pool` attribute attached to it. The
> caveat being that you might have to pray that the region is contiguous
> or introduce a much more complicated swiotlb subsystem allocator.
>
> WRT a kernel patch to make it easier, can you elaborate on what you were thinking there?
Using restricted-dma-pool, with changes to be able to specify the IOVA base if necessary.
>
>>>> There are also hard platform limits: approximately 1.5 GB total
>>>> mapped memory and roughly 64k concurrent mappings. Not all
>>>> workloads will fit within these limits, though GPU gaming and LLM
>>>> inference have worked in practice.
>>>
>>> That’s not too dissimilar from the confidential compute limitations.
>>>
>>>>
>>>> BAR access has performance issues as well. HVF does not expose
>>>> controls to map device memory as cacheable in the guest, creating a
>>>> significant performance penalty on BAR MMIO. Uncached mappings work
>>>> correctly but slowly compared to what the hardware could do.
>>>
>>> That’s not a macOS limitation and not an Apple hardware limitation, but
>>> it’s more fundamental to how PCIe works.
>>>
>>> Unlike CXL, PCIe doesn’t have a coherency protocol story, and the alternative
>>> of uncached and doing manual software-managed flushes isn’t really tenable.
>
> Apologies, I misspoke. It's not cacheability that's the issue. I think
> it's write-combining. Specifically the question is how the HVF sets the
> attributes in the stage-2 page tables. The behavior is observable by
> looking at the performance of sweeping writes across the BARs.
>
Hello,
Oh that makes a lot more sense.
There are also other oddities going on there. For device memory, macOS does this thing:
https://github.com/apple-oss-distributions/xnu/blob/main/osfmk/arm64/sleh.c#L1756C1-L1756C33
This function is empty in open-source XNU, but it’s very much *not* empty in closed-source XNU (sigh).
> As part of the work to implement and test this change I wrote such a
> benchmark as a client of the dext in the host, and a Linux kernel module
> that runs in the guest. It takes BAR1 (VRAM aperture) and does a write
> sweep of 8MB with 4 passes and measures the results.
>
> Host (mapped with kIOWriteCombineCache): 386 MB/s
> Host (mapped with kIOInhibitCache): 46 MB/s
>
> Guest (mapped with ioremap_wc): 31 MB/s
> Guest (mapped with ioremap): 31 MB/s
>
> In the case of BAR1, it is marked prefetchable so I believe you would
> usually want to map it with write-combining. I'm not sure why the case
> without write-combining is worse in the guest, but it's the same order
> of magnitude. I think the real interesting thing there is that the
> write-combining map in the guest performs identically to the one
> without. To me, that indicates that perhaps the stage-2 bits are not set
> properly.
That indeed looks like that...
> Even though the host has mapped the memory with
> kIOWriteCombineCache, this wasn't propagated when HVF maps this into the
> guest, which probably falls back to the lesser of the stage-1 vs stage-2
> mappings (i.e. disabling write-combining).
>
>
>>>
>>>>
>>>> What works:
>>>> - PCI config space passthrough
>>>> - BAR MMIO via direct-mapped device memory
>>>> - MSI/MSI-X interrupts via async notification from the dext
>>>> - Device reset (FLR with hot-reset fallback)
>>>> - DMA mapping for guest device drivers
>>>>
>>> This is very interesting to see :)
>
> Thanks! It's always nice to catch some interest/advice for a strange
> project like this.
>
>>>
>>>> What doesn't work:
>>>> - Expansion ROM / VBIOS passthrough
>>>> - PCI BAR quirks
>>>> - VGA region passthrough
>>>> - Migration and dirty page tracking
>>>> - Hot-unplug
>>>>
>>>
>>>
>>>
>>>> Questions for reviewers:
>>>>
>>>> 1. Is this something the VFIO maintainers would consider carrying
>>>> upstream? The refactoring patches (3-6) are benign, but the Apple
>>>> backend is a new platform with real limitations. That said, if Apple
>>>> lifts some of the DART/HVF restrictions in a future macOS release, the
>>>> code changes to take advantage would likely be minor. I'd like to
>>>> understand whether this is in scope before doing the work to
>>>> address review feedback on the full series.
>>>>
>>>> 2. The apple-dma-pci companion device: should this be a virtio device
>>>> instead? I went with a simple custom PCI device because the virtio
>>>> infrastructure didn't buy much for what is essentially a {map, unmap}
>>>> register interface, but if virtio is preferred, what is the process
>>>> for allocating a device ID? If a custom PCI device is the right
>>>> approach, I've tentatively allocated 1b36:0015. Is there a process
>>>> for reserving a device ID under the Red Hat PCI vendor, or is
>>>> claiming it in pci-ids.rst sufficient? The guest-side kernel module
>>>> hooks all DMA mapping functions for passed-through devices, which is
>>>> unusual enough that I'm not sure it's upstreamable in the Linux
>>>> kernel. I can maintain it out of tree if needed.
>>>
>>> I’d recommend using bounce buffers like the CoCo case if possible. I don’t
>>> think that the apple-dma-pci definitely-not-an-IOMMU is a good idea.
>
> To be clear, it definitely is weird and bad, but it was seemingly the
> least bad option that I was able to get working with minimal guest
> changes (just one guest kmod).
>
>>>
>>>>
>>>> 3. Should the macOS host-side DriverKit extension live in the QEMU
>>>> tree? It's not included in this series and requires Apple code
>>>> signing. I'm happy to keep it out of tree if that's preferred,
>>>> or include the source if reviewers want it co-located.
>>>
>>> Both are fine I think. Could you share compatibility with the tinygrad
>>> one at https://github.com/tinygrad/tinygrad/tree/7e54992bf600789dbe5d37b99fe12a19c32e36a1/extra/usbgpu/tbgpu/installer and prebuilt at https://raw.githubusercontent.com/tinygrad/tinygpu_releases/refs/heads/main/TinyGPU.zip?
>
> This is a good question and not something I had considered. My module
> probably works a little differently than theirs. It's possible I'm
> wrong but my understanding was:
>
> 1. They got apple entitlements for AMD/NVIDIA driver vendor ids only.
> That said, if it became compatible with QEMU, I suppose it would be
> an easy case to make that it could be expanded to wildcard (another
> developer indicated to me that Apple was willing to grant the
> wildcard entitlement if the use case was justifiable)
> 2. The architecture of their driver is a little different. I believe
> they are allocating DMA-able memory in the driver and mapping it down
> to userland, so it's kind of the reverse of what I'm doing now. I
> guess, conceivably they could change how they are doing this to unify
> our efforts.
>
> Thanks,
> -sjg
>
On Sun Apr 5, 2026 at 5:16 PM PDT, Mohamed Mediouni wrote: > > >> On 6. Apr 2026, at 01:20, Scott J. Goldman <scottjgo@gmail.com> wrote: >> >> On Sun Apr 5, 2026 at 1:14 AM PDT, Mohamed Mediouni wrote: >>> >>> >>>> On 5. Apr 2026, at 10:01, Mohamed Mediouni <mohamed@unpredictable.fr> wrote: >>>> >>>>> >>>>> On 5. Apr 2026, at 09:28, Scott J. Goldman <scottjgo@gmail.com> wrote: >>>>> >>>>> This series adds VFIO PCI device passthrough support for Apple Silicon >>>>> Macs running macOS, using a DriverKit extension (dext) as the host >>>>> backend instead of the Linux VFIO kernel driver. >>>>> >>>>> I'm sending this as an RFC because I'd like feedback before investing >>>>> further in upstreaming. The code is functional. I've tested it with >>>>> an NVIDIA RTX 5090 in a Thunderbolt dock on an M4 MacBook Air. GPU >>>>> gaming works but is slow (~30 fps on high settings in Cyberpunk 2077 >>>>> [1]), likely due to the BAR access penalty described below. AI >>>>> inference workloads appear less affected. Ollama with Qwen 3.5 >>>>> generates around 140 tok/sec on the same setup [2]. >>>>> >>>>> How it works: >>>>> >>>>> On Linux, VFIO relies on kernel-managed IOMMU groups and /dev/vfio >>>>> for device access and DMA mapping. On macOS, there is no equivalent >>>>> kernel interface. Instead, a userspace DriverKit extension >>>>> (VFIOUserPCIDriver) mediates access to the physical PCI device through >>>>> IOKit's IOUserClient and PCIDriverKit APIs. >>>>> >>>>> The series keeps the existing VFIOPCIDevice model and reuses QEMU's >>>>> passthrough infrastructure. A few ioctl callsites are refactored into >>>>> io_ops callbacks, the build system is extended for Darwin, and the >>>>> Apple-specific backend plugs in behind those abstractions. >>>>> >>>>> The guest sees two PCI devices: the passthrough device itself >>>>> (vfio-apple-pci, which subclasses VFIOPCIDevice) and a companion >>>>> DMA mapping device (apple-dma-pci). 
>>>>> On the QEMU side, an AppleVFIOContainer implements the IOMMU
>>>>> backend, and a C client library wraps the IOUserClient calls to the
>>>>> dext for config space, BAR MMIO, interrupts, reset, and DMA.
>>>>>
>>>>> DMA limitations:
>>>>>
>>>>> This is the biggest platform constraint. Unlike a typical IOMMU
>>>>> mapping operation where the caller specifies the IOVA, the
>>>>> PCIDriverKit API (IODMACommand::PrepareForDMA) returns a
>>>>> system-assigned IOVA. There is no way to request a specific address.
>>>>> This means the guest's requested DMA addresses cannot be used
>>>>> directly. The guest kernel module must intercept DMA mapping calls
>>>>> and forward them through the companion device to get the actual
>>>>> hardware IOVA.
>>>>
>>>> Hello,
>>>>
>>>> Ugh this one is not great. By the way, Apple has a private PCIe passthrough
>>>> API used by Virtualization.framework but that's a different design.
>>
>> This is really interesting and I had not heard about this. Are you
>> able to elaborate on this one at all? Maybe this is something where an
>> internal API to manipulate the DART is available inside
>> Virtualization.framework?
>
> Hello,
>
> All of it needs using private entitlements currently.
>
> It's _VZPCIPassthroughDeviceConfiguration, a private class needing
> com.apple.private.virtualization to use.
>
> The VMM process itself then uses the com.apple.private.PCIPassthrough.access
> entitlement. I'm not sure whether OS versions even have all the code
> currently though.

Appreciate the pointers here. It looks like, as you said, the framework
taps into a bunch of code that isn't shipped to us mere mortals. I can
see from some of the code in Virtualization.framework the general shape
of what they're doing, though.

It looks like they implement a virtio-iommu device that ultimately calls
into the host kernel with some internal APIs to do the DART mappings.

>>>> Would bounce buffering using something akin the confidential compute path and
>>>> a pre-defined chunk of host memory accessible from the device, and then managing
>>>> the guest address map work? (see swiotlb).
>>
>> I tested this approach early on, but ran into a couple issues:
>>
>> 1. Not only does PrepareForDMA() limit the total size of the pool, but
>> it also limits the size of individual allocations. IIRC it not very
>> large at around 16MB.
>
> Sigh.
>
>> Thankfully, I found that the allocator seemed to just keep allocating
>> continguously across multiple allocations, so maybe that's fine?
>
> That's good… but it sounds brittle…
>
>> 2. Linux swiotlb default configuration is too small for GPU drivers. The
>> max single mapping is 256KB and the total pool size is 64MB. The
>> overall pool size is configurable but the max single mapping is
>> derived from IO_TLB_SEGSIZE and IO_TLB_SIZE which are compile-time
>> constants. During games, I have seen roughly ~900MB of active DMA
>> mappings and mappings much larger than 256kb.
>
> Pre-defined mappings with restricted-dma-pool sound like a good idea there.
>
>> I abandoned this approach because it seemed like the CPU penalty of
>> bouncing all the DMA buffers would be pretty severe and the swiotlb
>> allocator just didn't seem designed for this much memory pressure. I
>> also was hoping to avoid the requirement of recompiling the entire guest
>> kernel as a prerequisite for guests to use this passthrough feature. On
>> top of that, I wasn't sure if upstream would even be willing to take
>> changes to support this use case, since it's so far outside what the
>> existing swiotlb allocator would normally be doing.
>>
>> That said, you were saying that CoCo is fine with this restriction? Do
>> other devices just not have drivers that are doing so much allocation? I
>> didn't actually try changing the constants and recompiling the guest
>> kernel in swiotlb to make the pool big enough for it to really work at
>> all with the nvidia guest driver, I will have to see what happens.
>
> CoCo with bounce buffering works with NVIDIA GPUs. It had to be done because
> no trusted I/O path (and implementing that is a quagmire).
>
> A recent Intel post about it claiming production-readiness:
>
> https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Confidential-AI-with-GPU-Acceleration-Bounce-Buffers-Offer-a/post/1740417

I dug in here and implemented the restricted-dma-pool solution. Still
needs some cleanups but it's working enough to test. To start with the
bad news:

1. As I mentioned previously, the mainline kernel has a max 256k limit
for any single swiotlb mapping. This has been debated a few times on
LKML, but the consensus has generally been that it should not be changed
or made configurable. You can see the threads:
- https://lkml.org/lkml/2015/3/3/84
- https://patchwork.kernel.org/project/linux-mips/patch/20210914151016.3174924-1-Roman_Skakun@epam.com/

2. NVIDIA drivers immediately make a contiguous 528384 byte allocation,
at least on my hardware (NVIDIA RTX 5090), which is required as part of
initializing the firmware on the card. This obviously fails immediately.
It happens on both the NVIDIA-provided "open" drivers [1] and the
in-tree `nouveau` [2], so it's more a hardware-specific issue than just
a driver problem. If you hack around that (allocate 3 smaller buffers
and hope they are contiguous), you'll see that both drivers assume
coherent DMA memory (more so in the nvidia driver than nouveau, but
it's a problem in both). They map DMA buffers and then write data into
the buffers afterward. So you end up sending empty swiotlb buffers to
the card and it'll ultimately fail to initialize.
It's possible the press release was referring to using the closed NVIDIA
drivers, but those are now deprecated and don't support my newer GPU.

But, there is good news:

1. The IOVA range that seems to always come from PCIDriverKit is pretty
far outside the default qemu mapping from `-machine virt`, so the range
can be cleanly identity mapped in the VM without overlap. One of the
restrictions I noted earlier (16MB max contiguous mapping) was actually
just a bug in my code. A large contiguous mapping seems to work fine,
though the ~1.5GB limit is still real.

2. The restricted-dma-pool DT attribute can be assigned per-device. So
it doesn't affect other drivers on the system, and potentially that
means you can have different pools for multiple devices (have not
actually tried this yet, but seems like it would work).

3. More normal devices can work. I purchased a thunderbolt nvme
enclosure and it works with the swiotlb bounce buffering with no kernel
modifications.

4. With a sufficient amount of hacks in the driver, the NVIDIA "open"
driver can be made to work, albeit with already slow gaming performance
reduced to about 30% (~10fps) vs paravirt dma mapping (~30fps). I
wasn't able to get CUDA working, but presumably that just needs more
elbow grease.

After sleeping on this a bit, I think my proposal would be:

- The `restricted-dma-pool` method can be the default. For most devices
this will work seamlessly, though users may have to specify a size for
the pool, since the optimal size will vary for each device.

- The apple-dma-pci thing can be downgraded from an actual device to an
out-of-tree workaround. I have not yet tested it, but presumably it can
use ivshmem or a virtual serial port to communicate the mappings. It's
mostly a guest-side hack so it doesn't really need qemu involvement
necessarily.

- I doubt Apple will actually approve this for distribution, but I can
write a kext that uses the kernel API to manipulate the DART directly.
I didn't realize this was an option before. This can act as kind of a
companion for my dext and, as a follow-on to this patchset, I can teach
the vIOMMU device to use it. Eventually, if Apple exposes this as
something you can use in a dext, then the functionality can be moved
into the dext and all of these concerns become moot. Until then, it can
be an optimization if you're willing to run without SIP.

If you think this is OK, I can prepare a new version of the patchset.

Thanks,
-sjg

[1] https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/nvidia/src/kernel/gpu/gsp/kernel_gsp.c#L5404
[2] https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c#L1827
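[Editor's note: the per-device pool idea above would look roughly like
this in a guest device tree, based on the upstream `restricted-dma-pool`
reserved-memory binding. The node names, base address, and size below
are placeholders for illustration, not values from the series.]

```dts
/ {
    reserved-memory {
        #address-cells = <2>;
        #size-cells = <2>;
        ranges;

        /* Placeholder base address and size; the pool must be large
         * enough for the device's peak bounce-buffer usage. */
        pcie_pool: restricted-dma@50000000 {
            compatible = "restricted-dma-pool";
            reg = <0x0 0x50000000 0x0 0x4000000>; /* 64 MiB */
        };
    };

    /* Only the node that references the pool is steered through it;
     * other devices on the system keep normal DMA mapping. */
    pcie@40000000 {
        memory-region = <&pcie_pool>;
    };
};
```

Because the pool is attached via `memory-region` on a single device
node, multiple passthrough devices could in principle each get their
own pool, matching point 2 of the good news above.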
> On 8. Apr 2026, at 09:02, Scott J. Goldman <scottjgo@gmail.com> wrote:
>
> [...]
>
> Appreciate the pointers here. It looks like, as you said, the framework
> taps into a bunch of code that isn't shipped to us mere mortals. I can
> see from some of the code in Virtualization.framework the general shape
> of what they're doing, though.
>
> It looks like they implement a virtio-iommu device that ultimately calls
> into the host kernel with some internal APIs to do the DART mappings.

Hello,

Some more details:

The VMM side when using Virtualization.framework is at
/System/Library/Frameworks/Virtualization.framework/XPCServices/com.apple.Virtualization.VirtualMachine.xpc/Contents/MacOS/com.apple.Virtualization.VirtualMachine
as Virtualization.framework

And that directly communicates with IOPCIDevice...

And the source code side of PCIDriverKit is at
https://github.com/apple-oss-distributions/IOPCIFamily/tree/main/PCIDriverKit

And for PrepareForDMA at
https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IOUserServer.cpp#L1001

IOMemoryDescriptor has this option:
https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/DriverKit/IOMemoryDescriptor.iig#L56
- kIOMemoryMapFixedAddress

But not sure whether that’s allowed for user-mode drivers

Hopefully that helps.

Thank you,
-Mohamed
On Wed Apr 8, 2026 at 12:09 PM PDT, Mohamed Mediouni wrote:
>
>
>> On 8. Apr 2026, at 09:02, Scott J. Goldman <scottjgo@gmail.com> wrote:
>>
>> On Sun Apr 5, 2026 at 5:16 PM PDT, Mohamed Mediouni wrote:
>>>
>>>
>>>> On 6. Apr 2026, at 01:20, Scott J. Goldman <scottjgo@gmail.com> wrote:
>>>>
>>>> On Sun Apr 5, 2026 at 1:14 AM PDT, Mohamed Mediouni wrote:
>>>>>
>>>>>
>>>>>> On 5. Apr 2026, at 10:01, Mohamed Mediouni <mohamed@unpredictable.fr> wrote:
>>>>>>
>>>>>>>
>>>>>>> On 5. Apr 2026, at 09:28, Scott J. Goldman <scottjgo@gmail.com> wrote:
>>>>>>>
>>>>>>> This series adds VFIO PCI device passthrough support for Apple Silicon
>>>>>>> Macs running macOS, using a DriverKit extension (dext) as the host
>>>>>>> backend instead of the Linux VFIO kernel driver.
>>>>>>>
>>>>>>> I'm sending this as an RFC because I'd like feedback before investing
>>>>>>> further in upstreaming. The code is functional. I've tested it with
>>>>>>> an NVIDIA RTX 5090 in a Thunderbolt dock on an M4 MacBook Air. GPU
>>>>>>> gaming works but is slow (~30 fps on high settings in Cyberpunk 2077
>>>>>>> [1]), likely due to the BAR access penalty described below. AI
>>>>>>> inference workloads appear less affected. Ollama with Qwen 3.5
>>>>>>> generates around 140 tok/sec on the same setup [2].
>>>>>>>
>>>>>>> How it works:
>>>>>>>
>>>>>>> On Linux, VFIO relies on kernel-managed IOMMU groups and /dev/vfio
>>>>>>> for device access and DMA mapping. On macOS, there is no equivalent
>>>>>>> kernel interface. Instead, a userspace DriverKit extension
>>>>>>> (VFIOUserPCIDriver) mediates access to the physical PCI device through
>>>>>>> IOKit's IOUserClient and PCIDriverKit APIs.
>>>>>>>
>>>>>>> The series keeps the existing VFIOPCIDevice model and reuses QEMU's
>>>>>>> passthrough infrastructure. A few ioctl callsites are refactored into
>>>>>>> io_ops callbacks, the build system is extended for Darwin, and the
>>>>>>> Apple-specific backend plugs in behind those abstractions.
>>>>>>>
>>>>>>> The guest sees two PCI devices: the passthrough device itself
>>>>>>> (vfio-apple-pci, which subclasses VFIOPCIDevice) and a companion
>>>>>>> DMA mapping device (apple-dma-pci). On the QEMU side, an
>>>>>>> AppleVFIOContainer implements the IOMMU backend, and a C client
>>>>>>> library wraps the IOUserClient calls to the dext for config space,
>>>>>>> BAR MMIO, interrupts, reset, and DMA.
>>>>>>>
>>>>>>> DMA limitations:
>>>>>>>
>>>>>>> This is the biggest platform constraint. Unlike a typical IOMMU
>>>>>>> mapping operation where the caller specifies the IOVA, the
>>>>>>> PCIDriverKit API (IODMACommand::PrepareForDMA) returns a
>>>>>>> system-assigned IOVA. There is no way to request a specific address.
>>>>>>> This means the guest's requested DMA addresses cannot be used
>>>>>>> directly. The guest kernel module must intercept DMA mapping calls
>>>>>>> and forward them through the companion device to get the actual
>>>>>>> hardware IOVA.
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Ugh this one is not great. By the way, Apple has a private PCIe passthrough
>>>>>> API used by Virtualization.framework but that's a different design.
>>>>
>>>> This is really interesting and I had not heard about this. Are you
>>>> able to elaborate on this one at all? Maybe this is something where an
>>>> internal API to manipulate the DART is available inside
>>>> Virtualization.framework?
>>>
>>> Hello,
>>>
>>> All of it needs using private entitlements currently.
>>>
>>> It's _VZPCIPassthroughDeviceConfiguration, a private class needing com.apple.private.virtualization to use.
>>>
>>> The VMM process itself then uses the com.apple.private.PCIPassthrough.access entitlement. I'm not
>>> sure whether OS versions even have all the code currently though.
>>>
>>
>> Appreciate the pointers here. It looks like, as you said, the framework
>> taps into a bunch of code that isn't shipped to us mere mortals. I can
>> see from some of the code in Virtualization.framework the general shape
>> of what they're doing, though.
>>
>> It looks like they implement a virtio-iommu device that ultimately calls
>> into the host kernel with some internal APIs to do the DART mappings.
>>
>
> Hello,
>
> Some more details:
>
> The VMM side when using Virtualization.framework is at /System/Library/Frameworks/Virtualization.framework/XPCServices/com.apple.Virtualization.VirtualMachine.xpc/Contents/MacOS/com.apple.Virtualization.VirtualMachine
> as Virtualization.framework
>
> And that directly communicates with IOPCIDevice...
>
> And the source code side of PCIDriverKit is at https://github.com/apple-oss-distributions/IOPCIFamily/tree/main/PCIDriverKit
>
> And for PrepareForDMA at https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IOUserServer.cpp#L1001
>
> IOMemoryDescriptor has this option: https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/DriverKit/IOMemoryDescriptor.iig#L56 - kIOMemoryMapFixedAddress
>
> But not sure whether that’s allowed for user-mode drivers
>
> Hopefully that helps.
>
Appreciate the pointers. It seems like the flag does work on DriverKit
user-level drivers. Unfortunately it controls the virtual address
placement in the process, not the IOVA for DMA. If you follow the path
through:
PrepareForDMA_Impl(options, memDesc, offset, length, ...)
-> IODMACommand::prepare(offset, length)
-> md->dmaCommandOperation(kIOMDDMAMap, &mapArgs)
-> IOGeneralMemoryDescriptor::dmaMap(mapper, ..., &mapArgs.fAlloc)
You can see that the code doesn't pass these flags through to the
ultimate call to iovmMapMemory. The base class path is at:
https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IOMemoryDescriptor.cpp#L4581-L4587C54
and the IOGeneralMemoryDescriptor override (which is the path taken for client memory) has the same gap:
https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IOMemoryDescriptor.cpp#L4671-L4742
where the mapOptions are also set in IODMACommand::prepare():
https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IODMACommand.cpp#L985C1-L996C1
I think the flag that would have to be in the path would be
kIODMAMapFixedAddress.
Thanks,
-sjg
> On 8. Apr 2026, at 22:45, Scott J. Goldman <scottjgo@gmail.com> wrote:
>
> [...]
>
> Appreciate the pointers. It seems like the flag does work on DriverKit
> user-level drivers. Unfortunately it controls the virtual address
> placement in the process, not the IOVA for DMA. If you follow the path
> through:
>
That’s indeed the case…
> PrepareForDMA_Impl(options, memDesc, offset, length, ...)
> -> IODMACommand::prepare(offset, length)
> -> md->dmaCommandOperation(kIOMDDMAMap, &mapArgs)
> -> IOGeneralMemoryDescriptor::dmaMap(mapper, ..., &mapArgs.fAlloc)
>
> You can see that the code doesn't pass these flags through to the
> ultimate call to iovmMapMemory. The base class path is at:
>
> https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IOMemoryDescriptor.cpp#L4581-L4587C54
>
> and the IOGeneralMemoryDescriptor override (which is the path taken for
> client memory) has the same gap:
>
> https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IOMemoryDescriptor.cpp#L4671-L4742
>
> where the mapOptions are also set in IODMACommand::prepare():
>
> https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IODMACommand.cpp#L985C1-L996C1
>
> I think the flag that would have to be in the path would be
> kIODMAMapFixedAddress.

Intriguingly, mapOptions is defined at
https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IODMACommand.cpp#L943C39-L943C49
and set there in
https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IODMACommand.cpp#L985C1-L996C1
but then not used afterwards…

And the flags passed are:
https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IOUserServer.cpp#L979
so it’s not in IODMACommandSpecification either.

Makes me wonder even more what Apple does for their own VM thing, or
maybe the simplest answer is that they’re not using something public -
especially as they’re doing this from user-mode using the IOPCIDevice
interface…

Or… the Apple code in Virtualization.framework is unfinished and I’m
just thinking too hard about this
On Wed Apr 8, 2026 at 3:12 PM PDT, Mohamed Mediouni wrote:
>
>
>> On 8. Apr 2026, at 22:45, Scott J. Goldman <scottjgo@gmail.com> wrote:
>>
>> On Wed Apr 8, 2026 at 12:09 PM PDT, Mohamed Mediouni wrote:
>>>
>>>
>>>> On 8. Apr 2026, at 09:02, Scott J. Goldman <scottjgo@gmail.com> wrote:
>>>>
>>>> On Sun Apr 5, 2026 at 5:16 PM PDT, Mohamed Mediouni wrote:
>>>>>
>>>>>
>>>>>> On 6. Apr 2026, at 01:20, Scott J. Goldman <scottjgo@gmail.com> wrote:
>>>>>>
>>>>>> On Sun Apr 5, 2026 at 1:14 AM PDT, Mohamed Mediouni wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On 5. Apr 2026, at 10:01, Mohamed Mediouni <mohamed@unpredictable.fr> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 5. Apr 2026, at 09:28, Scott J. Goldman <scottjgo@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> This series adds VFIO PCI device passthrough support for Apple Silicon
>>>>>>>>> Macs running macOS, using a DriverKit extension (dext) as the host
>>>>>>>>> backend instead of the Linux VFIO kernel driver.
>>>>>>>>>
>>>>>>>>> I'm sending this as an RFC because I'd like feedback before investing
>>>>>>>>> further in upstreaming. The code is functional. I've tested it with
>>>>>>>>> an NVIDIA RTX 5090 in a Thunderbolt dock on an M4 MacBook Air. GPU
>>>>>>>>> gaming works but is slow (~30 fps on high settings in Cyberpunk 2077
>>>>>>>>> [1]), likely due to the BAR access penalty described below. AI
>>>>>>>>> inference workloads appear less affected. Ollama with Qwen 3.5
>>>>>>>>> generates around 140 tok/sec on the same setup [2].
>>>>>>>>>
>>>>>>>>> How it works:
>>>>>>>>>
>>>>>>>>> On Linux, VFIO relies on kernel-managed IOMMU groups and /dev/vfio
>>>>>>>>> for device access and DMA mapping. On macOS, there is no equivalent
>>>>>>>>> kernel interface. Instead, a userspace DriverKit extension
>>>>>>>>> (VFIOUserPCIDriver) mediates access to the physical PCI device through
>>>>>>>>> IOKit's IOUserClient and PCIDriverKit APIs.
>>>>>>>>>
>>>>>>>>> The series keeps the existing VFIOPCIDevice model and reuses QEMU's
>>>>>>>>> passthrough infrastructure. A few ioctl callsites are refactored into
>>>>>>>>> io_ops callbacks, the build system is extended for Darwin, and the
>>>>>>>>> Apple-specific backend plugs in behind those abstractions.
>>>>>>>>>
>>>>>>>>> The guest sees two PCI devices: the passthrough device itself
>>>>>>>>> (vfio-apple-pci, which subclasses VFIOPCIDevice) and a companion
>>>>>>>>> DMA mapping device (apple-dma-pci). On the QEMU side, an
>>>>>>>>> AppleVFIOContainer implements the IOMMU backend, and a C client
>>>>>>>>> library wraps the IOUserClient calls to the dext for config space,
>>>>>>>>> BAR MMIO, interrupts, reset, and DMA.
>>>>>>>>>
>>>>>>>>> DMA limitations:
>>>>>>>>>
>>>>>>>>> This is the biggest platform constraint. Unlike a typical IOMMU
>>>>>>>>> mapping operation where the caller specifies the IOVA, the
>>>>>>>>> PCIDriverKit API (IODMACommand::PrepareForDMA) returns a
>>>>>>>>> system-assigned IOVA. There is no way to request a specific address.
>>>>>>>>> This means the guest's requested DMA addresses cannot be used
>>>>>>>>> directly. The guest kernel module must intercept DMA mapping calls
>>>>>>>>> and forward them through the companion device to get the actual
>>>>>>>>> hardware IOVA.
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Ugh this one is not great. By the way, Apple has a private PCIe passthrough
>>>>>>>> API used by Virtualization.framework but that's a different design.
>>>>>>
>>>>>> This is really interesting and I had not heard about this. Are you
>>>>>> able to elaborate on this one at all? Maybe this is something where an
>>>>>> internal API to manipulate the DART is available inside
>>>>>> Virtualization.framework?
>>>>>
>>>>> Hello,
>>>>>
>>>>> All of it needs using private entitlements currently.
>>>>>
>>>>> It's _VZPCIPassthroughDeviceConfiguration, a private class needing com.apple.private.virtualization to use.
>>>>>
>>>>> The VMM process itself then uses the com.apple.private.PCIPassthrough.access entitlement. I'm not
>>>>> sure whether OS versions even have all the code currently though.
>>>>>
>>>>
>>>> Appreciate the pointers here. It looks like, as you said, the framework
>>>> taps into a bunch of code that isn't shipped to us mere mortals. I can
>>>> see from some of the code in Virtualization.framework the general shape
>>>> of what they're doing, though.
>>>>
>>>> It looks like they implement a virtio-iommu device that ultimately calls
>>>> into the host kernel with some internal APIs to do the DART mappings.
>>>>
>>>
>>> Hello,
>>>
>>> Some more details:
>>>
>>> The VMM side when using Virtualization.framework is at /System/Library/Frameworks/Virtualization.framework/XPCServices/com.apple.Virtualization.VirtualMachine.xpc/Contents/MacOS/com.apple.Virtualization.VirtualMachine
>>> as Virtualization.framework
>>>
>>> And that directly communicates with IOPCIDevice...
>>>
>>> And the source code side of PCIDriverKit is at https://github.com/apple-oss-distributions/IOPCIFamily/tree/main/PCIDriverKit
>>>
>>> And for PrepareForDMA at https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IOUserServer.cpp#L1001
>>>
>>> IOMemoryDescriptor has this option: https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/DriverKit/IOMemoryDescriptor.iig#L56 - kIOMemoryMapFixedAddress
>>>
>>> But not sure whether that’s allowed for user-mode drivers
>>>
>>> Hopefully that helps.
>>>
>>
>> Appreciate the pointers. It seems like the flag does work on DriverKit
>> user-level drivers. Unfortunately it controls the virtual address
>> placement in the process, not the IOVA for DMA. If you follow the path
>> through:
>>
> That’s indeed the case…
>> PrepareForDMA_Impl(options, memDesc, offset, length, ...)
>> -> IODMACommand::prepare(offset, length)
>> -> md->dmaCommandOperation(kIOMDDMAMap, &mapArgs)
>> -> IOGeneralMemoryDescriptor::dmaMap(mapper, ..., &mapArgs.fAlloc)
>>
>> You can see that the code doesn't pass these flags through to the
>> ultimate call to iovmMapMemory. The base class path is at:
>>
>> https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IOMemoryDescriptor.cpp#L4581-L4587C54
>>
>> and the IOGeneralMemoryDescriptor override (which is the path taken for client memory) has the same gap:
>>
>> https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IOMemoryDescriptor.cpp#L4671-L4742
>>
>> where the mapOptions are also set in IODMACommand::prepare():
>>
>> https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IODMACommand.cpp#L985C1-L996C1
>>
>> I think the flag that would have to be in the path would be
>> kIODMAMapFixedAddress.
>
> Intriguingly, mapOptions is defined at https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IODMACommand.cpp#L943C39-L943C49 and set there in https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IODMACommand.cpp#L985C1-L996C1 but then not used afterwards…
>
> And the flags passed are:
> https://github.com/apple-oss-distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/iokit/Kernel/IOUserServer.cpp#L979
> so it’s not in IODMACommandSpecification either.
>
> Makes me wonder even more what Apple does for their own VM thing, or maybe
> the simplest answer is that they’re not using something public - especially
> as they’re doing this from user-mode using the IOPCIDevice interface…
> Or… the Apple code in Virtualization.framework is unfinished and I’m just thinking too hard about this
My possibly incorrect read of the strings and some disassembly of
Virtualization.framework is that they look for
IOServiceMatching("PCIPassthrough") which is simply not exported by
anything I can find on my Mac. For internal builds maybe they ship
another kext for this?
> On 9. Apr 2026, at 01:33, Scott J. Goldman <scottjgo@gmail.com> wrote:
>
> My possibly incorrect read of the strings and some disassembly of
> Virtualization.framework is that they look for
> IOServiceMatching("PCIPassthrough") which is simply not exported by
> anything I can find on my Mac. For internal builds maybe they ship
> another kext for this?
>
Hello,
Looked further and tried to run that code…
The related service names that com.apple.Virtualization.VirtualMachine
queries at runtime and passes to IOServiceMatching are IOPCIDevice…
and PCIPassthroughController.

As the internet has no mention of PCIPassthroughController, it's not
present on my machine running macOS 26.4, and I don't see any reference
to the com.apple.private.PCIPassthrough.access entitlement anywhere
outside of the VirtualMachine service entitlements list, I think we can
safely conclude that the kernel side of this isn't shipped by Apple at
this point.

And that's pretty unfortunate, but oh well :/ Hopefully it was worth
looking at though.
> On 8. Apr 2026, at 09:02, Scott J. Goldman <scottjgo@gmail.com> wrote:
>
> I dug in here and implemented the restricted-dma-pool solution. Still
> needs some cleanups but it's working enough to test. To start with the
> bad news:
>
> 1. As I mentioned previously, the mainline kernel has a max 256k limit
> for any single swiotlb mapping. This has been debated a few times on
> LKML, but the consensus has generally been that it should not be changed
> or made configurable. You can see the threads:
> - https://lkml.org/lkml/2015/3/3/84
> - https://patchwork.kernel.org/project/linux-mips/patch/20210914151016.3174924-1-Roman_Skakun@epam.com/
>
> 2. NVIDIA drivers immediately make a contiguous 528384-byte allocation,
> at least on my hardware (NVIDIA RTX 5090), which is required as part of
> initializing the firmware on the card. This obviously fails immediately.
> It happens on both the NVIDIA-provided "open" drivers [1] and the
> in-tree `nouveau` [2], so it's more a hardware-specific issue than just
> a driver problem. If you hack around that (allocate 3 smaller buffers
> and hope they are contiguous), you'll see that both drivers assume
> coherent DMA memory (more so in the NVIDIA driver than in nouveau, but
> it's a problem in both). They map DMA buffers and then write data into
> the buffers afterward. So you end up sending empty swiotlb buffers to
> the card and it'll ultimately fail to initialize.

Hello,

Interesting; the bounce buffer situation on x86 is a bit different from
just conventional swiotlb. It looks like some of the NVIDIA bounce
buffering support code is behind checks that enable it on x86_64 only:
https://github.com/NVIDIA/open-gpu-kernel-modules/blob/db0c4e65c8e34c678d745ddb1317f53f90d1072b/src/nvidia/src/kernel/gpu/ce/arch/blackwell/kernel_ce_gb100.c#L1833

> It's possible the press release was referring to using the closed NVIDIA
> drivers, but those are now deprecated and don't support my newer GPU.
> They’re using the open ones on x86, Hopper doesn’t support the closed ones.

> But, there is good news:
>
> 1. The IOVA range that seems to always come from PCIDriverKit is pretty
> far outside the default qemu mapping from `-machine virt`, so the range
> can be cleanly identity mapped in the VM without overlap. One of the
> restrictions I noted earlier (16MB max contiguous mapping) was actually
> just a bug in my code. A large contiguous mapping seems to work fine,
> though the ~1.5GB limit is still real.

There’s a catch there for early platforms with small IPA space (i.e.
early non-Pro/Max chips), which only had a 64GB IPA space. But
documenting those as not supported is probably fine.

> 2. The restricted-dma-pool DT attribute can be assigned per-device, so
> it doesn't affect other drivers on the system, and potentially that
> means you can have different pools for multiple devices (I have not
> actually tried this yet, but it seems like it would work).

Yes.

> 3. More normal devices can work. I purchased a Thunderbolt NVMe
> enclosure and it works with the swiotlb bounce buffering with no kernel
> modifications.
>
> 4. With a sufficient amount of hacks in the driver, the NVIDIA "open"
> driver can be made to work, albeit with already slow gaming performance
> reduced to about 30% (~10fps) vs paravirt DMA mapping (~30fps). I
> wasn't able to get CUDA working, but presumably that just needs more
> elbow grease.

UVM for CUDA uses its own separate memory allocator, so changes for CUDA
are expected - except if you force-disable UVM, which you can do by not
loading nvidia-uvm and using this:
https://gist.githubusercontent.com/shkhln/40ef290463e78fb2b0000c60f4ad797e/raw/0e1fd8e8ea52b7445c3d33f5e5975efd20388dcb/uvm_ioctl_override.c

> After sleeping on this a bit, I think my proposal would be:
>
> - The `restricted-dma-pool` method can be the default. For most devices
> this will work seamlessly, though users may have to specify a size for
> the pool, since the optimal size will vary for each device.
>
> - The apple-dma-pci thing can be downgraded from an actual device to an
> out-of-tree workaround. I have not yet tested it, but presumably it
> can use ivshmem or a virtual serial port to communicate the mappings.
> It's mostly a guest-side hack, so it doesn't necessarily need qemu
> involvement.

If it’s going to ship, I think keeping it an actual device is a good idea.

> - I doubt Apple will actually approve this for distribution, but I can
> write a kext that uses the kernel API to manipulate the DART directly.

Unfortunately the kext situation is pretty much a hard no for signing on
the Apple side. :/

> I didn't realize this was an option before. This can act as kind of a
> companion for my dext, and as a follow-on to this patchset I can teach
> the vIOMMU device to use it. Eventually, if Apple exposes this as
> something you can use in a dext, the functionality can be moved into
> the dext and all of these concerns become moot. Until then, it can be
> an optimization if you're willing to run without SIP.

Yeah, a proper virtio-iommu can be available as an option when SIP is
off with a custom kext, and I think having support for that in-tree
would be a good idea.

> If you think this is OK, I can prepare a new version of the patchset.

This all looks cool :)

> Thanks,
> -sjg
>
> [1] https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/nvidia/src/kernel/gpu/gsp/kernel_gsp.c#L5404
> [2] https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c#L1827
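[Editor's note] For reference, a per-device restricted pool as described
in point 2 above would be wired up in the guest device tree roughly as
follows. This is a hedged sketch: the node names, addresses, and the
64 MiB pool size are made up for illustration, while the
`restricted-dma-pool` compatible string and the `memory-region` linkage
follow the kernel's reserved-memory binding
(Documentation/devicetree/bindings/reserved-memory/shared-dma-pool.yaml):

```dts
/* Hypothetical guest DT fragment; addresses and sizes are examples only. */
reserved-memory {
    #address-cells = <2>;
    #size-cells = <2>;
    ranges;

    /* Bounce-buffer pool used only by devices that reference it. */
    pcie_pool: restricted-dma@80000000 {
        compatible = "restricted-dma-pool";
        reg = <0x0 0x80000000 0x0 0x4000000>; /* 64 MiB */
    };
};

pcie@40000000 {
    /* ... usual PCIe host bridge properties ... */
    memory-region = <&pcie_pool>; /* opts this device into the pool */
};
```

Because the pool is attached via `memory-region` on one device node, other
devices keep using normal coherent DMA, which is why point 2 notes it
doesn't affect other drivers on the system.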