[PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device passthrough

mhonap@nvidia.com posted 20 patches 3 weeks, 5 days ago
From: Manish Honap <mhonap@nvidia.com>

Add a driver-api document describing the architecture, interfaces, and
operational constraints of CXL Type-2 device passthrough via vfio-pci-core.

CXL Type-2 devices (cache-coherent accelerators such as GPUs with attached
device memory) present unique passthrough requirements not covered by the
existing vfio-pci documentation:

- The host kernel retains ownership of the HDM decoder hardware through
  the CXL subsystem, so the guest cannot program decoders directly.
- Two additional VFIO device regions expose the emulated HDM register
  state (COMP_REGS) and the DPA memory window (DPA region) to userspace.
- DVSEC configuration space writes are intercepted and virtualized so
  that the guest cannot alter host-owned CXL.io / CXL.mem enable bits.
- Device reset (FLR) is coordinated through vfio_pci_ioctl_reset(): all
  DPA PTEs are zapped before the reset and restored afterward.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 Documentation/driver-api/index.rst        |   1 +
 Documentation/driver-api/vfio-pci-cxl.rst | 216 ++++++++++++++++++++++
 2 files changed, 217 insertions(+)
 create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst

diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index 1833e6a0687e..7ec661846f6b 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -47,6 +47,7 @@ of interest to most developers working on device drivers.
    vfio-mediated-device
    vfio
    vfio-pci-device-specific-driver-acceptance
+   vfio-pci-cxl
 
 Bus-level documentation
 =======================
diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
new file mode 100644
index 000000000000..f2cbe2fdb036
--- /dev/null
+++ b/Documentation/driver-api/vfio-pci-cxl.rst
@@ -0,0 +1,216 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================================
+VFIO PCI CXL Type-2 Device Passthrough
+====================================================
+
+Overview
+--------
+
+CXL (Compute Express Link) Type-2 devices are cache-coherent PCIe accelerators,
+such as GPUs, that attach their own volatile memory (addressed as Device
+Physical Address space, or DPA) to the host memory fabric via the CXL
+protocol.
+
+When such a device is passed through to a virtual machine using ``vfio-pci``,
+the kernel CXL subsystem must remain in control of the Host-managed Device
+Memory (HDM) decoders that map the device's DPA into the host physical address
+(HPA) space.  A VMM such as QEMU cannot program HDM decoders directly; instead
+it uses a set of VFIO-specific regions and UAPI extensions described here.
+
+This support is compiled in when ``CONFIG_VFIO_CXL_CORE=y``.  It can be
+disabled at module load time for all devices bound to ``vfio-pci`` with::
+
+    modprobe vfio-pci disable_cxl=1
+
+Variant drivers can disable CXL extensions for individual devices by setting
+``vdev->disable_cxl = true`` in their probe function before registration.
+
+Device Detection
+----------------
+
+CXL Type-2 detection happens automatically when ``vfio-pci`` registers a
+device that has:
+
+1. A CXL Device DVSEC capability (PCIe DVSEC Vendor ID 0x1E98, ID 0x0000).
+2. Bit 2 (Mem_Capable) set in the CXL Capability register within that DVSEC.
+3. A PCI class code that is **not** ``0x050210`` (CXL Type-3 memory device).
+4. An HDM Decoder block discoverable via the Register Locator DVSEC.
+5. A pre-committed HDM decoder (BIOS/firmware programmed) with non-zero size.
+
+On successful detection ``VFIO_DEVICE_FLAGS_CXL`` is set in
+``vfio_device_info.flags`` alongside ``VFIO_DEVICE_FLAGS_PCI``.
+
+UAPI Extensions
+---------------
+
+VFIO_DEVICE_GET_INFO Capability: VFIO_DEVICE_INFO_CAP_CXL
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When ``VFIO_DEVICE_FLAGS_CXL`` is set the device info capability chain
+contains a ``vfio_device_info_cap_cxl`` structure (cap ID 6)::
+
+    struct vfio_device_info_cap_cxl {
+        struct vfio_info_cap_header header; /* id=6, version=1 */
+        __u8  hdm_count;          /* number of HDM decoders */
+        __u8  hdm_regs_bar_index; /* PCI BAR containing component registers */
+        __u16 pad;
+        __u32 flags;              /* VFIO_CXL_CAP_* flags */
+        __u64 hdm_regs_size;      /* size in bytes of the HDM decoder block */
+        __u64 hdm_regs_offset;    /* byte offset within the BAR to HDM block */
+        __u64 dpa_size;           /* total DPA size in bytes */
+        __u32 dpa_region_index;   /* index of the DPA device region */
+        __u32 comp_regs_region_index; /* index of the COMP_REGS device region */
+    };
+
+Flags:
+
+``VFIO_CXL_CAP_COMMITTED`` (bit 0)
+    The HDM decoder was committed by the kernel CXL subsystem.
+
+``VFIO_CXL_CAP_PRECOMMITTED`` (bit 1)
+    The HDM decoder was pre-committed by host firmware/BIOS.  The VMM does
+    not need to allocate CXL HPA space; the mapping is already live.
+
+VFIO Regions
+~~~~~~~~~~~~~
+
+A CXL Type-2 device exposes two additional device regions beyond the standard
+PCI BAR regions.  Their indices are reported in ``dpa_region_index`` and
+``comp_regs_region_index`` in the capability structure.
+
+**DPA Region** (subtype ``VFIO_REGION_SUBTYPE_CXL``)
+    Flags: ``VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE |
+    VFIO_REGION_INFO_FLAG_MMAP``
+
+    Represents the device's DPA memory mapped at the kernel-assigned HPA.
+    The VMM should map this region with mmap() to expose device memory to the
+    guest.  Page faults are handled lazily; the kernel inserts PFNs on first
+    access rather than at mmap() time.  During FLR/reset all PTEs are
+    invalidated and the region becomes inaccessible until the reset completes.
+
+    Read and write access via the region file descriptor is also supported and
+    routes through a kernel-managed virtual address established with
+    ``ioremap_cache()``.
+
+**COMP_REGS Region** (subtype ``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``)
+    Flags: ``VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE``
+    (no mmap).
+
+    An emulated region, accessible only via read() and write() (no mmap),
+    exposing the HDM decoder registers.
+    The kernel shadows the hardware HDM register state and enforces all
+    bit-field rules (reserved bits, read-only bits, commit semantics) on every
+    write.  Only 32-bit aligned, 32-bit wide accesses are permitted, matching
+    the hardware requirement.
+
+    The VMM uses this region to read and write HDM decoder BASE, SIZE, and
+    CTRL registers.  Setting the COMMIT bit (bit 9) in a CTRL register causes
+    the kernel to immediately set the COMMITTED bit (bit 10) in the emulated
+    shadow state, allowing the VMM to detect the transition via a
+    ``notify_change`` callback.
+
+    The component register BAR itself (``hdm_regs_bar_index``) is hidden:
+    ``VFIO_DEVICE_GET_REGION_INFO`` for that BAR index returns ``size = 0``.
+    All HDM access must go through the COMP_REGS region.
+
+Region Type Identifiers::
+
+    /* type = PCI_VENDOR_ID_CXL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE (0x80001e98) */
+    #define VFIO_REGION_SUBTYPE_CXL           1   /* DPA memory region */
+    #define VFIO_REGION_SUBTYPE_CXL_COMP_REGS 2   /* HDM register region */
+
+DVSEC Configuration Space Emulation
+-------------------------------------
+
+When ``CONFIG_VFIO_CXL_CORE=y`` the kernel installs a CXL-aware write handler
+for the ``PCI_EXT_CAP_ID_DVSEC`` (0x23) extended capability entry in the vfio-pci
+configuration space permission table.  This handler runs for every device
+opened under ``vfio-pci``; for non-CXL devices it falls through to the
+hardware write path unchanged.
+
+For CXL devices, writes to the following DVSEC registers are intercepted and
+emulated in ``vdev->vconfig`` (the per-device shadow configuration space):
+
++--------------------+--------+-------------------------------------------+
+| Register           | Offset | Emulation                                 |
++====================+========+===========================================+
+| CXL Control        | 0x0c   | RWL semantics; IO_Enable forced to 1;     |
+|                    |        | locked after Lock register bit 0 is set.  |
++--------------------+--------+-------------------------------------------+
+| CXL Status         | 0x0e   | Bit 14 (Viral_Status) is RW1CS.           |
++--------------------+--------+-------------------------------------------+
+| CXL Control2       | 0x10   | Bits 0, 3 forwarded to hardware; bits     |
+|                    |        | 1 and 2 trigger subsystem actions.        |
++--------------------+--------+-------------------------------------------+
+| CXL Status2        | 0x12   | Bit 3 (RW1CS) forwarded to hardware when  |
+|                    |        | Capability3 bit 3 is set.                 |
++--------------------+--------+-------------------------------------------+
+| CXL Lock           | 0x14   | RWO; once set, Control becomes read-only  |
+|                    |        | until conventional reset.                 |
++--------------------+--------+-------------------------------------------+
+| Range Base High/Lo | varies | Stored in vconfig; Base Low [27:0]        |
+|                    |        | reserved bits cleared.                    |
++--------------------+--------+-------------------------------------------+
+
+Reads of these registers return the emulated vconfig values.  Read-only
+registers (Capability, Size registers, range Size High/Low) are also served
+from vconfig, which was seeded from hardware at device open time.
+
+FLR and Reset Behaviour
+-----------------------
+
+During Function Level Reset (FLR):
+
+1. ``vfio_cxl_zap_region_locked()`` is called under the write side of
+   ``memory_lock``.  It sets ``region_active = false`` and calls
+   ``unmap_mapping_range()`` to invalidate all DPA region PTEs.
+
+2. Any concurrent page fault or ``read()``/``write()`` on the DPA region
+   sees ``region_active = false`` and returns ``VM_FAULT_SIGBUS`` or ``-EIO``
+   respectively.
+
+3. After reset completes, ``vfio_cxl_reactivate_region()`` re-reads the HDM
+   decoder state from hardware into ``comp_reg_virt[]`` (typically all-zeros
+   after FLR).  For pre-committed decoders it sets ``region_active = true``
+   only if the COMMITTED bit is set in the freshly re-read hardware state.
+   The VMM may then fault into the DPA region again without issuing a new
+   ``mmap()`` call; each newly faulted page is scrubbed via ``memset_io()``
+   before its PFN is inserted.
+
+VMM Integration Notes
+---------------------
+
+A VMM integrating CXL Type-2 passthrough should:
+
+1. Issue ``VFIO_DEVICE_GET_INFO`` and check ``VFIO_DEVICE_FLAGS_CXL``.
+2. Walk the capability chain to find ``VFIO_DEVICE_INFO_CAP_CXL`` (id = 6).
+3. Record ``dpa_region_index``, ``comp_regs_region_index``, ``dpa_size``,
+   ``hdm_count``, ``hdm_regs_offset``, and ``hdm_regs_size``.
+4. Map the DPA region (``dpa_region_index``) with mmap() to a guest physical
+   address.  The region supports ``PROT_READ | PROT_WRITE``.
+5. Open the COMP_REGS region (``comp_regs_region_index``) and attach a
+   ``notify_change`` callback to detect COMMIT transitions.  When bit 10
+   (COMMITTED) transitions from 0 to 1 in a CTRL register read, the VMM
+   should expose the corresponding DPA range to the guest and map the
+   relevant slice of the DPA mmap.
+6. For pre-committed devices (``VFIO_CXL_CAP_PRECOMMITTED`` set) the entire
+   DPA is already mapped and the VMM need not wait for a guest COMMIT.
+7. Program the guest CXL DVSEC registers (via VFIO config space write) to
+   reflect the guest's view.  The kernel emulates all register semantics
+   including the CONFIG_LOCK one-shot latch.
+
+Kernel Configuration
+--------------------
+
+``CONFIG_VFIO_CXL_CORE`` (bool)
+    Enable CXL Type-2 passthrough support in ``vfio-pci-core``.
+    Depends on ``CONFIG_VFIO_PCI_CORE``, ``CONFIG_CXL_BUS``, and
+    ``CONFIG_CXL_MEM``.
+
+References
+----------
+
+* CXL Specification 3.1, §8.1.3 — DVSEC for CXL Devices
+* CXL Specification 3.1, §8.2.4.20 — CXL HDM Decoder Capability Structure
+* ``include/uapi/linux/vfio.h`` — ``VFIO_DEVICE_INFO_CAP_CXL``,
+  ``VFIO_REGION_SUBTYPE_CXL``, ``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``
-- 
2.25.1

Re: [PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device passthrough
Posted by Jonathan Cameron 3 weeks, 4 days ago
On Thu, 12 Mar 2026 02:04:38 +0530
mhonap@nvidia.com wrote:

> From: Manish Honap <mhonap@nvidia.com>
> 
> Add a driver-api document describing the architecture, interfaces, and
> operational constraints of CXL Type-2 device passthrough via vfio-pci-core.
> 
> CXL Type-2 devices (cache-coherent accelerators such as GPUs with attached
> device memory) present unique passthrough requirements not covered by the
> existing vfio-pci documentation:
> 
> - The host kernel retains ownership of the HDM decoder hardware through
>   the CXL subsystem, so the guest cannot program decoders directly.
> - Two additional VFIO device regions expose the emulated HDM register
>   state (COMP_REGS) and the DPA memory window (DPA region) to userspace.
> - DVSEC configuration space writes are intercepted and virtualized so
>   that the guest cannot alter host-owned CXL.io / CXL.mem enable bits.
> - Device reset (FLR) is coordinated through vfio_pci_ioctl_reset(): all
>   DPA PTEs are zapped before the reset and restored afterward.
> 
> Signed-off-by: Manish Honap <mhonap@nvidia.com>

Hi Manish.

Great to see this doc.

It provides a convenient place to talk about the restrictions of this
current patch set and how we resolve them.

My particular interest is in the region sizing, as I don't see using a
locked-down BIOS-configured range as a comprehensive solution.

Shall we say, there is some awareness that the CXL spec doesn't require
enough information from Type 2 devices, and it wasn't necessarily
understood that VFIO-type solutions can't rely on the
"it's an accelerator, so it has a custom driver, no need for standards"
approach.

It is a gap I'd like to close.  Given it's being discussed in public,
we can prepare a Code First proposal to either add stuff to the spec
or develop some external guidance on what a device needs to do, if we
want to avoid requiring either a variant driver or device-specific
handling in user space.



> ---
>  Documentation/driver-api/index.rst        |   1 +
>  Documentation/driver-api/vfio-pci-cxl.rst | 216 ++++++++++++++++++++++
>  2 files changed, 217 insertions(+)
>  create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
> 
> diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
> index 1833e6a0687e..7ec661846f6b 100644
> --- a/Documentation/driver-api/index.rst
> +++ b/Documentation/driver-api/index.rst

>  
>  Bus-level documentation
>  =======================
> diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
> new file mode 100644
> index 000000000000..f2cbe2fdb036
> --- /dev/null
> +++ b/Documentation/driver-api/vfio-pci-cxl.rst

> +Device Detection
> +----------------
> +
> +CXL Type-2 detection happens automatically when ``vfio-pci`` registers a
> +device that has:
> +
> +1. A CXL Device DVSEC capability (PCIe DVSEC Vendor ID 0x1E98, ID 0x0000).
> +2. Bit 2 (Mem_Capable) set in the CXL Capability register within that DVSEC.

FWIW to be type 2 as opposed to a type 3 non class code device (e.g. the
compressed memory devices Gregory Price and others are using) you need
Cache_capable as well.  Might be worth making this all about
CXL Type-2 and non class code Type-3.

> +3. A PCI class code that is **not** ``0x050210`` (CXL Type-3 memory device).
> +4. An HDM Decoder block discoverable via the Register Locator DVSEC.
> +5. A pre-committed HDM decoder (BIOS/firmware programmed) with non-zero size.

This is the bit that we need to make more general. Otherwise you'll have
to have a bios upgrade for every type 2 device (and no native hotplug).
Note native hotplug is quite likely if anyone is switch based device
pooling.

I assume that you are doing this today to get something upstream
and presume it works for the type 2 device you have on the host you
care about.  I'm not sure there are 'general' solutions but maybe
there are some heuristics or sufficient conditions for establishing the
size.

Type 2 might have any of:
- Conveniently preprogrammed HDM decoders (the case you use)
- Maximum of 2 HDM decoders + the same number of Range registers.
  In general the problem with range registers is they are a legacy feature
  and there are only 2 of them whereas a real device may have many more
  DPA ranges. In this corner case though, is it enough to give us the
  necessary sizes?  I think it might be but would like others familiar
  with the spec to confirm. (If needed I'll take this to the consortium
  for an 'official' view).
- A DOE and table access protocol.  CDAT should give us enough info to
  be fairly sure what is needed.
- A CXL mailbox (maybe the version in the PCI spec now) and the spec defined
  commands to query what is there.  Reading the intro to 8.2.10.9 Memory
  Device Command Sets, it's a little unclear on whether these are valid on
  non class code devices but I believe having the appropriate Mailbox
  type identifier is enough to say we expect to get them.

None of this is required though and the mailboxes are non trivial.
So personally I think we should propose a new DVSEC that provides any
info we need for generic passthrough.  Starting with what we need
to get the regions right.  Until something like that is in place we
will have to store this info somewhere.

There is (maybe) an alternative of doing the region allocation on demand.
That is emulate the HDM decoders in QEMU (on top of the emulation
here) and when settings corresponding to a region setup occur,
go request one from the CXL core. The problem is we can't guarantee
it will be available at that time. So we can 'guess' what to provide
to the VM in terms of CXL fixed memory windows, but short of heuristics
(either whole of the host offer, or divide it up based on devices present
 vs what is in the VM) that is going to be prone to it not being available
later.

Where do people think this should be?  We are going to end up with
a device list somewhere. Could be in kernel, or in QEMU or make it an
orchestrator problem (applying the 'someone else's problem' solution).

> +|                    |        | locked after Lock register bit 0 is set.  |
> +
> +VMM Integration Notes
> +---------------------
> +
> +A VMM integrating CXL Type-2 passthrough should:
> +
> +1. Issue ``VFIO_DEVICE_GET_INFO`` and check ``VFIO_DEVICE_FLAGS_CXL``.
> +2. Walk the capability chain to find ``VFIO_DEVICE_INFO_CAP_CXL`` (id = 6).
> +3. Record ``dpa_region_index``, ``comp_regs_region_index``, ``dpa_size``,
> +   ``hdm_count``, ``hdm_regs_offset``, and ``hdm_regs_size``.
> +4. Map the DPA region (``dpa_region_index``) with mmap() to a guest physical
> +   address.  The region supports ``PROT_READ | PROT_WRITE``.
> +5. Open the COMP_REGS region (``comp_regs_region_index``) and attach a
> +   ``notify_change`` callback to detect COMMIT transitions.  When bit 10
> +   (COMMITTED) transitions from 0 to 1 in a CTRL register read, the VMM
> +   should expose the corresponding DPA range to the guest and map the
> +   relevant slice of the DPA mmap.
> +6. For pre-committed devices (``VFIO_CXL_CAP_PRECOMMITTED`` set) the entire
> +   DPA is already mapped and the VMM need not wait for a guest COMMIT.
> +7. Program the guest CXL DVSEC registers (via VFIO config space write) to
> +   reflect the guest's view.  The kernel emulates all register semantics
> +   including the CONFIG_LOCK one-shot latch.
> +

Can you share an RFC for this flow in QEMU?  Ideally also a type 2 model
(there have been a few posted in the past) that would allow testing this with
emulated qemu as the host, then KVM / VFIO on top of that?
If not I can probably find some time to hack something together.

Thanks,

Jonathan
Re: [PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device passthrough
Posted by Alex Williamson 2 weeks, 6 days ago
On Fri, 13 Mar 2026 12:13:41 +0000
Jonathan Cameron <jonathan.cameron@huawei.com> wrote:

> On Thu, 12 Mar 2026 02:04:38 +0530
> mhonap@nvidia.com wrote:
> 
> > From: Manish Honap <mhonap@nvidia.com>
> > ---
> >  Documentation/driver-api/index.rst        |   1 +
> >  Documentation/driver-api/vfio-pci-cxl.rst | 216 ++++++++++++++++++++++
> >  2 files changed, 217 insertions(+)
> >  create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
> > 
> > diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
> > index 1833e6a0687e..7ec661846f6b 100644
> > --- a/Documentation/driver-api/index.rst
> > +++ b/Documentation/driver-api/index.rst  
> 
> >  
> >  Bus-level documentation
> >  =======================
> > diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
> > new file mode 100644
> > index 000000000000..f2cbe2fdb036
> > --- /dev/null
> > +++ b/Documentation/driver-api/vfio-pci-cxl.rst  
> 
> > +Device Detection
> > +----------------
> > +
> > +CXL Type-2 detection happens automatically when ``vfio-pci`` registers a
> > +device that has:
> > +
> > +1. A CXL Device DVSEC capability (PCIe DVSEC Vendor ID 0x1E98, ID 0x0000).
> > +2. Bit 2 (Mem_Capable) set in the CXL Capability register within that DVSEC.  
> 
> FWIW to be type 2 as opposed to a type 3 non class code device (e.g. the
> compressed memory devices Gregory Price and others are using) you need
> Cache_capable as well.  Might be worth making this all about
> CXL Type-2 and non class code Type-3.
> 
> > +3. A PCI class code that is **not** ``0x050210`` (CXL Type-3 memory device).
> > +4. An HDM Decoder block discoverable via the Register Locator DVSEC.
> > +5. A pre-committed HDM decoder (BIOS/firmware programmed) with non-zero size.  
> 
> This is the bit that we need to make more general. Otherwise you'll have
> to have a bios upgrade for every type 2 device (and no native hotplug).
> Note native hotplug is quite likely if anyone is switch based device
> pooling.
> 
> I assume that you are doing this today to get something upstream
> and presume it works for the type 2 device you have on the host you
> care about.  I'm not sure there are 'general' solutions but maybe
> there are some heuristics or sufficient conditions for establishing the
> size.
> 
> Type 2 might have any of:
> - Conveniently preprogrammed HDM decoders (the case you use)
> - Maximum of 2 HDM decoders + the same number of Range registers.
>   In general the problem with range registers is they are a legacy feature
>   and there are only 2 of them whereas a real device may have many more
>   DPA ranges. In this corner case though, is it enough to give us the
>   necessary sizes?  I think it might be but would like others familiar
>   with the spec to confirm. (If needed I'll take this to the consortium
>   for an 'official' view).
> - A DOE and table access protocol.  CDAT should give us enough info to
>   be fairly sure what is needed.
> - A CXL mailbox (maybe the version in the PCI spec now) and the spec defined
>   commands to query what is there.  Reading the intro to 8.2.10.9 Memory
>   Device Command Sets, it's a little unclear on whether these are valid on
>   non class code devices but I believe having the appropriate Mailbox
>   type identifier is enough to say we expect to get them.
> 
> None of this is required though and the mailboxes are non trivial.
> So personally I think we should propose a new DVSEC that provides any
> info we need for generic passthrough.  Starting with what we need
> to get the regions right.  Until something like that is in place we
> will have to store this info somewhere.
> 
> There is (maybe) an alternative of doing the region allocation on demand.
> That is emulate the HDM decoders in QEMU (on top of the emulation
> here) and when settings corresponding to a region setup occur,
> go request one from the CXL core. The problem is we can't guarantee
> it will be available at that time. So we can 'guess' what to provide
> to the VM in terms of CXL fixed memory windows, but short of heuristics
> (either whole of the host offer, or divide it up based on devices present
>  vs what is in the VM) that is going to be prone to it not being available
> later.
> 
> Where do people think this should be?  We are going to end up with
> a device list somewhere. Could be in kernel, or in QEMU or make it an
> orchestrator problem (applying the 'someone else's problem' solution).

That's the typical approach.  That's what we did with resizable BARs.
If we cannot guarantee allocation on demand, we need to push the policy
to the device, via something that indicates the size to use, or to the
orchestration, via something that allows the size to be committed
out-of-band.  As with REBAR, we then need to be able to restrict the
guest behavior to select only the configured option.

I imagine this means for the non-pre-allocated case, we need to develop
some sysfs attributes that allow that out-of-band sizing, which would
then appear as a fixed, pre-allocated configuration to the guest.
Thanks,

Alex
Re: [PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device passthrough
Posted by Jonathan Cameron 2 weeks, 5 days ago
On Tue, 17 Mar 2026 15:24:45 -0600
Alex Williamson <alex@shazbot.org> wrote:

> On Fri, 13 Mar 2026 12:13:41 +0000
> Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> 
> > On Thu, 12 Mar 2026 02:04:38 +0530
> > mhonap@nvidia.com wrote:
> >   
> > > From: Manish Honap <mhonap@nvidia.com>
> > > ---
> > >  Documentation/driver-api/index.rst        |   1 +
> > >  Documentation/driver-api/vfio-pci-cxl.rst | 216 ++++++++++++++++++++++
> > >  2 files changed, 217 insertions(+)
> > >  create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
> > > 
> > > diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
> > > index 1833e6a0687e..7ec661846f6b 100644
> > > --- a/Documentation/driver-api/index.rst
> > > +++ b/Documentation/driver-api/index.rst    
> >   
> > >  
> > >  Bus-level documentation
> > >  =======================
> > > diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
> > > new file mode 100644
> > > index 000000000000..f2cbe2fdb036
> > > --- /dev/null
> > > +++ b/Documentation/driver-api/vfio-pci-cxl.rst    
> >   
> > > +Device Detection
> > > +----------------
> > > +
> > > +CXL Type-2 detection happens automatically when ``vfio-pci`` registers a
> > > +device that has:
> > > +
> > > +1. A CXL Device DVSEC capability (PCIe DVSEC Vendor ID 0x1E98, ID 0x0000).
> > > +2. Bit 2 (Mem_Capable) set in the CXL Capability register within that DVSEC.    
> > 
> > FWIW to be type 2 as opposed to a type 3 non class code device (e.g. the
> > compressed memory devices Gregory Price and others are using) you need
> > Cache_capable as well.  Might be worth making this all about
> > CXL Type-2 and non class code Type-3.
> >   
> > > +3. A PCI class code that is **not** ``0x050210`` (CXL Type-3 memory device).
> > > +4. An HDM Decoder block discoverable via the Register Locator DVSEC.
> > > +5. A pre-committed HDM decoder (BIOS/firmware programmed) with non-zero size.    
> > 
> > This is the bit that we need to make more general. Otherwise you'll have
> > to have a bios upgrade for every type 2 device (and no native hotplug).
> > Note native hotplug is quite likely if anyone is switch based device
> > pooling.
> > 
> > I assume that you are doing this today to get something upstream
> > and presume it works for the type 2 device you have on the host you
> > care about.  I'm not sure there are 'general' solutions but maybe
> > there are some heuristics or sufficient conditions for establishing the
> > size.
> > 
> > Type 2 might have any of:
> > - Conveniently preprogrammed HDM decoders (the case you use)
> > - Maximum of 2 HDM decoders + the same number of Range registers.
> >   In general the problem with range registers is they are a legacy feature
> >   and there are only 2 of them whereas a real device may have many more
> >   DPA ranges. In this corner case though, is it enough to give us the
> >   necessary sizes?  I think it might be but would like others familiar
> >   with the spec to confirm. (If needed I'll take this to the consortium
> >   for an 'official' view).
> > - A DOE and table access protocol.  CDAT should give us enough info to
> >   be fairly sure what is needed.
> > - A CXL mailbox (maybe the version in the PCI spec now) and the spec defined
> >   commands to query what is there.  Reading the intro to 8.2.10.9 Memory
> >   Device Command Sets, it's a little unclear on whether these are valid on
> >   non class code devices but I believe having the appropriate Mailbox
> >   type identifier is enough to say we expect to get them.
> > 
> > None of this is required though and the mailboxes are non trivial.
> > So personally I think we should propose a new DVSEC that provides any
> > info we need for generic passthrough.  Starting with what we need
> > to get the regions right.  Until something like that is in place we
> > will have to store this info somewhere.
> > 
> > There is (maybe) an alternative of doing the region allocation on demand.
> > That is emulate the HDM decoders in QEMU (on top of the emulation
> > here) and when settings corresponding to a region setup occur,
> > go request one from the CXL core. The problem is we can't guarantee
> > it will be available at that time. So we can 'guess' what to provide
> > to the VM in terms of CXL fixed memory windows, but short of heuristics
> > (either whole of the host offer, or divide it up based on devices present
> >  vs what is in the VM) that is going to be prone to it not being available
> > later.
> > 
> > Where do people think this should be?  We are going to end up with
> > a device list somewhere. Could be in kernel, or in QEMU or make it an
> > orchestrator problem (applying the 'someone else's problem' solution).  
> 
> That's the typical approach.  That's what we did with resizable BARs.
> If we cannot guarantee allocation on demand, we need to push the policy
> to the device, via something that indicates the size to use, or to the
> orchestration, via something that allows the size to be committed
> out-of-band.  As with REBAR, we then need to be able to restrict the
> guest behavior to select only the configured option.
> 
> I imagine this means for the non-pre-allocated case, we need to develop
> some sysfs attributes that allows that out-of-band sizing, which would
> then appear as a fixed, pre-allocated configuration to the guest.
> Thanks,

I did some reading, as I was only vaguely familiar with how the resizable BAR
stuff was done. That approach should be fairly straightforward to adapt here.
Stash some config in struct pci_dev before binding vfio-pci/cxl via a sysfs
interface.  Given that the association with the CXL infrastructure only
happens later (unlike BAR config) it would then be the job of the
vfio-pci/cxl driver to see what was requested and attempt to set up the
CXL topology to deliver it at bind time.

Manish, would you mind hacking up a small PoC on top of your existing code to
see if this approach shows up any problems?  I don't have anything to test
against right now, though I could probably hack some emulation together
fairly fast.  I'm thinking you'll get there faster!  I'm mostly focused on
this cycle's stuff at the moment, and I suspect we'll be discussing this for
a while yet, plus it has dependencies on other series that aren't in yet.

I'm not sure the PCI folk will like us stashing random stuff in their
structures just because we haven't bound anything yet and so have no
CXL structures to use.  We should probably think about how VF CXL.mem
region/sub-region assignment might work as well.

Sticking to PF (well, actually just function 0) passthrough for now...
For the guest, we can constrain things so there is only one right option,
though it will limit what topologies we can build.  Basically each device
passed through has its own CXL fixed memory window, its own host bridge,
its own root port + no switches.  The sizing it sees for the CFMWS
matches what we configured in the host.  We could program that topology up
and lock it down, but that means VM BIOS nastiness, so I'd leave it to the
native Linux code to bring it up.  If anyone wants to do P2P it'll get
harder to do within the spec as we will have to prevent topologies that
contain footguns like the ability to configure interleave.

This constrained approach is what we plan for the CXL class code Type 3
device emulation used for DCD, so we've been exploring it already.
It's still possible to do annoying things like zero-size decoders +
skip. For now we can fail HDM decoder commits if they are particularly
nonsensical and we haven't handled them yet - ultimately we'll probably
want to minimize what we refuse to handle, as I'm sure 'other OS' may not
do things the same as Linux.

P2P and the fun of a single device on multiple PCI hierarchies are to be
solved later. As an FYI, for bandwidth, people will be building devices that
interleave memory addresses over multiple root ports. Dan reminded
me of that challenge last night.  See bundled ports in CXL 4.0, though
this particular part related to CXL.mem is actually possible prior to
that stuff for CXL.cache.  Oh, and don't get me started on TSP / CoCo
challenges.  I take the view they are Dan's problem for now ;)

Jonathan

