> -----Original Message-----
> From: Jonathan Cameron <jic23@kernel.org>
> Sent: 19 May 2026 18:43
> To: Manish Honap <mhonap@nvidia.com>
> Cc: Alex Williamson <alwilliamson@nvidia.com>; Shameer Kolothum Thodi
> <skolothumtho@nvidia.com>; Ankit Agrawal <ankita@nvidia.com>;
> mst@redhat.com; imammedo@redhat.com; anisinha@redhat.com;
> eric.auger@redhat.com; peter.maydell@linaro.org;
> shannon.zhaosl@gmail.com; jonathan.cameron@huawei.com;
> fan.ni@samsung.com; pbonzini@redhat.com; richard.henderson@linaro.org;
> marcel.apfelbaum@gmail.com; clg@redhat.com; cohuck@redhat.com;
> dan.j.williams@intel.com; dave.jiang@intel.com; alejandro.lucero-
> palau@amd.com; Vikram Sethi <vsethi@nvidia.com>; Neo Jia
> <cjia@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>; Zhi Wang
> <zhiw@nvidia.com>; Krishnakant Jaju <kjaju@nvidia.com>; linux-
> cxl@vger.kernel.org; kvm@vger.kernel.org; qemu-devel@nongnu.org; qemu-
> arm@nongnu.org
> Subject: Re: [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci
>
> On Mon, 27 Apr 2026 23:42:26 +0530
> <mhonap@nvidia.com> wrote:
>
> > From: Manish Honap <mhonap@nvidia.com>
> >
> > This series adds QEMU-side support for passing CXL Type-2 devices
> > (GPUs and accelerators with host-managed device memory) to VMs via
> > vfio-pci.
> Hi Manish,
>
> Having read this description I'm not sure I really understand all the
> constraints this is operating under.
>
> There are two basic paths to getting safe virtualisation of CXL.mem.
> 1) Lock down decoders etc against preconfigured mappings.
> 2) Constrain the more insane stuff via topology care - basically direct
>    connect only for now and let the guest program stuff appropriately.
>
> One possible constraint is making it all look like early CXL host stuff
> where the bios does the heavy lifting. I don't mind if we have to do
> that today but I'd like to understand if it's a long term requirement or
> not.
>
> I think this is currently doing a mix of 1 and 2. You lock endpoint
> decoders but ignore host bridge ones. I think we need to handle those.
> Maybe I'm missing a check that a pass through decoder is in use (there
> is a commandline control to turn that on / off for single RP HB).
>
>
> >
> > It pairs with the kernel series "vfio/pci: CXL Type-2 passthrough"[1]
> > posted to the vfio mailing list. Patches 3-7 need that kernel series
> > present to do anything useful. I am new to QEMU development, so please
> > forgive and point me in the right direction for correct infrastructure
> > decisions.
> >
> > Background
> > ----------
> >
> > CXL Type-2 devices expose device memory (CXL.mem) through HDM
> > decoders.
> > The kernel vfio-pci driver shadows the HDM Decoder Capability
> > registers so userspace can observe and control decoder commits without
> > touching the hardware register page directly.
> >
> > Without this series, the guest never sees the device memory range and
> > the HDM decoder goes unconfigured. The device shows up but its memory
> > is unreachable.
> >
> > Design decisions
> > ----------------
> >
> > CXL.mem is exposed to the guest as a dedicated GPA window declared in
> > ACPI (CEDT/CFMWS) rather than a PCI BAR. The HDM decoder BASE must match
> > the CFMWS base and remain stable; BAR assignment is not stable.
>
> Note a lot of my thinking here is based on the related but not identical
> question of how to do CXL type 3 emulation for virtualization cases.
>
> My thinking for type 2 is we'd do similar tricks to those for memory in
> constraining the topology so things like interleave don't cause us
> problems and CFMWS to particular device mappings are fixed (though in
> theory offsets within CFMWS could change).
>
> I may be missing the point entirely but why does HDM decoder base (here
> the GPA address where the routing is configured to put the DPA of the
> EP) need to match the CFMWS base or for that matter remain stable (i.e.
> Guest forced to put it in the same place)? Is it just so the type 2
> driver knows where to find it? (There are easy ways around that - or is
> this something baked into an existing driver?)
>
> There are a few easy ways to constrain things to reduce flexibility (e.g.
> we don't want interleave to be possible etc). If we absolutely have to
> we can also lock decoders. The only reason I was avoiding that for type
> 3 device was that it isn't needed + can get complex fast!
>
> I think you are constraining the topology for this to work to PXB with
> 1 RP with pass through HDM decoders and no switches etc.
> That's fine for now but I didn't immediately spot the checks that
> enforce that.
>
> Can we have a command line example in this cover letter.
To restate the two paths for CXL.mem virtualization, as scope for v2:
  Path 1: lock down decoders against preconfigured mappings.
  Path 2: constrain the topology (direct-connect only, no switches,
          no interleave) and let the guest program the decoders.

For v2, I will:
  (a) Document the Path 1 / Path 2 split explicitly in the cover
      letter, and state that this series implements Path 1 for the
      endpoint inside a Path-2-constrained topology.
  (b) Add a command-line example to the cover letter (you asked for
      this on patch 7/9 as well).
  (c) Add realize-time checks in vfio_cxl_setup() that refuse to
      start unless:
        - cxl_get_hb_passthrough(hb) is true
        - pcie_count_ds_ports(hb->bus) == 1
        - no switch is present between the cxl-rp and the vfio-pci
          device
  (d) Use the existing CFMWS infrastructure and drop
      VIRT_HIGH_CXL_MMIO.
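
A rough sketch of the checks in (c), written as standalone C so the
logic can be read outside the QEMU tree. The CxlTopology struct and
cxl_topology_check() are hypothetical names used only for illustration;
in QEMU the three inputs would presumably come from
cxl_get_hb_passthrough(), pcie_count_ds_ports() and a walk of the bus
hierarchy above the vfio-pci device.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the topology above a vfio-pci CXL endpoint. */
typedef struct {
    bool hb_passthrough;   /* host bridge decodes in pass-through mode */
    int ds_ports;          /* root ports found under the host bridge */
    int switches_in_path;  /* switches between the cxl-rp and the endpoint */
} CxlTopology;

/*
 * Return NULL if the topology is acceptable, otherwise a message
 * naming the first violated constraint.
 */
static const char *cxl_topology_check(const CxlTopology *t)
{
    if (!t->hb_passthrough) {
        return "host bridge HDM decoders are not in pass-through mode";
    }
    if (t->ds_ports != 1) {
        return "host bridge must have exactly one root port";
    }
    if (t->switches_in_path != 0) {
        return "switches between cxl-rp and vfio-pci are not supported";
    }
    return NULL;
}
```

In the real realize path, a non-NULL result would be turned into an
error via error_setg() so that the device refuses to start.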
>
> > A separate
> > VIRT_HIGH_CXL_MMIO window in the ARM virt memory map carries this GPA
> > range, independent of the existing PCIe MMIO slots.
>
> I'm not immediately understanding why the CXL.mem path isn't via existing
> CFMWS support. Why do we need a new memory window? Ultimately you seem
> to map to normal CFMWS regions. Also note you need to be very careful
> adding stuff to the memory map as a bunch of other stuff above that will
> move - so it must simply not be there by default to avoid breaking
> migration of existing VMs. Maybe that is fine, I didn't check carefully.
>
> >
> > The Component Register BAR contains two distinct ranges. Accelerator
> > register windows are passed through as direct hardware mmaps via
> > VFIO_REGION_INFO_CAP_SPARSE_MMAP. The HDM Decoder Capability block is
> > excluded from that sparse list by the kernel and must be intercepted
> > by QEMU to track decoder state. A single priority-1 COMP_REGS overlay
> > placed at hdm_regs_offset inside the BAR container wins over any
> > hardware-backed alias at the same offset, with no per-window aliasing
> > required.
> >
> > The guest has no mechanism to remap host physical mappings. QEMU
> > programs decoder 0 with the CFMWS base through the kernel's COMP_REGS
> > shadow at machine_done time, after all devices are realized and before
> > the guest starts.
>
> Ah so this is locking down the decoder? I'm not necessarily against
> that but would like to understand what the constraints are that lead to
> that being necessary as opposed to guest programming them later (and
> late set up of the MemoryRegion like we do for type 3)
Yes. Decoder locking is required for this series because the kernel-side
endpoint driver expects a committed decoder before the vfio-cxl probe
completes. The host CXL stack commits the endpoint HDM decoder before
vfio binds; at machine_done, QEMU then programs the GPA into the decoder
and locks it so the guest cannot move it.
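
To make the intended guest-visible behaviour concrete, here is a
minimal standalone sketch, not the code from the series. The HdmShadow
struct and helper names are hypothetical. The register offsets and
control bits follow my reading of the HDM Decoder Capability layout as
used by the Linux CXL driver, and should be checked against the spec.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoder 0 register offsets, relative to the HDM capability block. */
#define HDM_DEC0_BASE_LO  0x10
#define HDM_DEC0_BASE_HI  0x14
#define HDM_DEC0_SIZE_LO  0x18
#define HDM_DEC0_SIZE_HI  0x1c
#define HDM_DEC0_CTRL     0x20

/* HDM Decoder Control register bits. */
#define HDM_CTRL_LOCK_ON_COMMIT  (1u << 8)
#define HDM_CTRL_COMMIT          (1u << 9)
#define HDM_CTRL_COMMITTED       (1u << 10)

/* Hypothetical shadow of the HDM decoder block, indexed by dword. */
typedef struct {
    uint32_t regs[0x40 / 4];
    bool locked;
} HdmShadow;

static uint32_t hdm_read(const HdmShadow *s, uint32_t off)
{
    return s->regs[off / 4];
}

/* Host side: place decoder 0 at the given GPA, commit it and lock it. */
static void hdm_program_and_lock(HdmShadow *s, uint64_t gpa, uint64_t size)
{
    /* Only bits 31:28 of the low dwords are address bits (256 MiB units). */
    s->regs[HDM_DEC0_BASE_LO / 4] = (uint32_t)(gpa & 0xf0000000u);
    s->regs[HDM_DEC0_BASE_HI / 4] = (uint32_t)(gpa >> 32);
    s->regs[HDM_DEC0_SIZE_LO / 4] = (uint32_t)(size & 0xf0000000u);
    s->regs[HDM_DEC0_SIZE_HI / 4] = (uint32_t)(size >> 32);
    s->regs[HDM_DEC0_CTRL / 4] =
        HDM_CTRL_LOCK_ON_COMMIT | HDM_CTRL_COMMIT | HDM_CTRL_COMMITTED;
    s->locked = true;
}

/*
 * Guest side: writes to decoder 0 are dropped once it is locked.
 * Returns true if the write was applied to the shadow.
 */
static bool hdm_guest_write(HdmShadow *s, uint32_t off, uint32_t val)
{
    bool dec0 = off >= HDM_DEC0_BASE_LO && off <= HDM_DEC0_CTRL;

    if (off % 4 != 0 || off >= sizeof(s->regs)) {
        return false;
    }
    if (s->locked && dec0) {
        return false;
    }
    s->regs[off / 4] = val;
    return true;
}
```

The point of the sketch is only the ordering: the base is fixed before
the guest runs, and later guest writes to decoder 0 have no effect.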
>
> > The notifier is registered only for devices the kernel reports as
> > firmware-committed (VFIO_CXL_CAP_FIRMWARE_COMMITTED).
> >
> > The CXL.mem MemoryRegion is a mmap-backed RAM-device region backed by
> > a VM_IO|VM_PFNMAP VMA. The VFIO MemoryListener would attempt an IOMMU
> > DMA mapping for it when it is added to system_memory, which always
> > fails: pin_user_pages() refuses VM_IO pages. No IOMMU mapping is
> > needed for these regions - CPU access goes via KVM Stage-2 page faults
> > and device DMA to RAM uses separate per-RAM-section IOMMU entries. The
> > listener is extended to skip the mapping attempt for VFIO-owned
> > RAM-device regions.
> >
> > pxb-cxl bridges had no _DSM method. Without _DSM function 5 the OS
> > defaults to treating PCI configuration as reassignable.
>
> That one is annoyingly controversial. I see you have Shameer's patch so
> he can go into history of why. I tried to land that in the past and I
> was far from the first.
Okay, I will look into the history of this for more details.
>
> > On machines with firmware-committed HDM decoders that reassignment
> > breaks the CXL.mem mapping, so the _DSM is added with
> > preserve_config=true for ARM and false for x86.
>
> What has that got to do with CXL.mem mapping? It can move the BARs
> around but those aren't related to the CXL.mem mapping.
>
> Can we break this description up into two separate parts as it feels
> like BAR mappings and CXL.mem mappings are getting confused.
>
> 1) Deal with how the configuration stuff works - bars etc.
> 2) Deal with the CXL.mem mapping into a CFMWS.
>
Yes, agreed, these two are independent.
For v2, I will split the cover-letter narrative accordingly and reorder
the patches so that the CXL.mem story (HDM decoder, CFMWS, decoder lock)
sits in one block and the PCI/BAR story (_DSM preserve_config) sits in
another.
> >
> > Known issues:
> > - The bios-tables test will fail due to the _DSM addition.
> > A fix will be provided in a follow-up round.
>
> Unless I'm missing some checks, a whole load more than that will fail as
> device memory / memory hotplug region + probably the CFMWS regions all
> move.
>
> > - VFIO_CXL_CAP_CACHE_CAPABLE will require additional handling.
> > - Devices with multiple firmware-committed HDM decoders are not fully
> > supported.
> > - Non-firmware-committed devices are not supported.
> > - linux-headers sync is manual and temporary; once the kernel series
> >   is merged, this patch will be replaced with a script-generated update.
> >
> > [1]
> > https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidi
> > a.com
> >
> Anyhow, seems like a good starting point for discussion.
Thank you for your suggestions.
>
> As we evolve the CXL support into including real virtualization use
> cases I'm sure we'll also throw up random corners around resetting etc.
>
> Note I mostly skipped the VFIO stuff on basis others know a lot more on
> that side of things than I do!
>
> Thanks,
>
> Jonathan
>
> > Manish Honap (9):
> > hw/arm/virt: Add CXL FMWS PA window for device memory
> > cxl: Add preserve_config to pxb-cxl OSC method
> > linux-headers: Update vfio.h for CXL Type-2 device passthrough
> > hw/vfio/region: Add vfio_region_setup_with_ops() for custom region
> > ops
> > hw/vfio/pci: Add CXL Type-2 device detection and region setup
> > hw/vfio/pci: Wire CXL component-register BAR with COMP_REGS overlay
> > hw/vfio+cxl: Program HDM decoder 0 at machine_done for
> > firmware-committed devices
> > hw/arm/smmu-common: Allow pxb-cxl as SMMUv3 primary bus
> > vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions
> >
> > hw/acpi/cxl-stub.c | 2 +-
> > hw/acpi/cxl.c | 4 +-
> > hw/arm/smmu-common.c | 17 +-
> > hw/arm/virt-acpi-build.c | 5 +
> > hw/arm/virt.c | 7 +
> > hw/cxl/cxl-host-stubs.c | 2 +
> > hw/cxl/cxl-host.c | 8 +
> > hw/i386/acpi-build.c | 2 +-
> > hw/pci-host/gpex-acpi.c | 43 +++-
> > hw/vfio/listener.c | 14 ++
> > hw/vfio/pci.c | 411 +++++++++++++++++++++++++++++++++++++
> > hw/vfio/pci.h | 15 ++
> > hw/vfio/region.c | 15 +-
> > hw/vfio/trace-events | 6 +
> > hw/vfio/vfio-region.h | 3 +
> > include/hw/acpi/cxl.h | 2 +-
> > include/hw/arm/virt.h | 2 +
> > include/hw/cxl/cxl_host.h | 10 +
> > include/hw/pci-host/gpex.h | 2 +
> > linux-headers/linux/vfio.h | 18 ++
> > 20 files changed, 570 insertions(+), 18 deletions(-)
> >
> > --
> > 2.25.1
> >
> >
> -----Original Message-----
> From: Manish Honap <mhonap@nvidia.com>
> Sent: 01 June 2026 08:56
> To: Jonathan Cameron <jic23@kernel.org>
> Cc: Alex Williamson <alwilliamson@nvidia.com>; Shameer Kolothum Thodi
> <skolothumtho@nvidia.com>; Ankit Agrawal <ankita@nvidia.com>;
> mst@redhat.com; imammedo@redhat.com; anisinha@redhat.com;
> eric.auger@redhat.com; peter.maydell@linaro.org;
> shannon.zhaosl@gmail.com; jonathan.cameron@huawei.com;
> fan.ni@samsung.com; pbonzini@redhat.com; richard.henderson@linaro.org;
> marcel.apfelbaum@gmail.com; clg@redhat.com; cohuck@redhat.com;
> dave.jiang@intel.com; alejandro.lucero-palau@amd.com; Vikram Sethi
> <vsethi@nvidia.com>; Neo Jia <cjia@nvidia.com>; Tarun Gupta (SW-GPU)
> <targupta@nvidia.com>; Zhi Wang <zhiw@nvidia.com>; Krishnakant Jaju
> <kjaju@nvidia.com>; linux-cxl@vger.kernel.org; kvm@vger.kernel.org;
> qemu-devel@nongnu.org; qemu-arm@nongnu.org; Manish Honap
> <mhonap@nvidia.com>
> Subject: RE: [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci
>
[...]

> > > pxb-cxl bridges had no _DSM method. Without _DSM function 5 the OS
> > > defaults to treating PCI configuration as reassignable.
> >
> > That one is annoyingly controversial. I see you have Shameer's patch
> > so he can go into history of why. I tried to land that in the past and
> > I was far from the first.
>
> okay, I will check this part for additional details.

I guess the concern here is the QEMU regression reported with certain
devices with legacy IO port BARs having issues on arm64 when _DSM #5 is
specified:

https://lore.kernel.org/all/20210724185234.GA2265457@roeck-us.net/

For accel SMMUv3 this is fine as we restrict the devices to vfio-pci only.
One option, until there is a proper fix in the Linux kernel or EDK2, might
be to maintain a checklist so that devices known to have issues are not
attached when _DSM #5 is advertised.

Thanks,
Shameer