From: Manish Honap <mhonap@nvidia.com>

Add Documentation/driver-api/vfio-pci-cxl.rst describing the architecture,
VFIO interfaces, and operational constraints for CXL Type-2 (cache-coherent
accelerator) passthrough via vfio-pci-core, and link it from the driver-api
index.

The document covers:

- VFIO_DEVICE_FLAGS_CXL and VFIO_DEVICE_INFO_CAP_CXL: what the capability
  struct contains and what the FIRMWARE_COMMITTED and CACHE_CAPABLE flags mean
- How to derive hdm_decoder_offset and hdm_count from the COMP_REGS region
  by traversing the CXL Capability Array to find cap ID 0x5 and reading the
  HDM Decoder Capability register
- Topology-aware sparse mmap on the component BAR (topologies A, B, C
  covering comp block at end, start, or middle of the BAR)
- Two extra VFIO device regions: COMP_REGS for the emulated HDM register
  state and the DPA memory window
- DVSEC config write virtualization: what the guest sees vs. hardware
- FLR coordination: DPA PTEs zapped before reset, restored after

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
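Reviewer note (not part of the commit message): the COMP_REGS derivation
summarized above can be exercised without hardware. The following is a minimal
sketch, assuming the register layout quoted in the new document (array header
ID 1 in bits [15:0], entry count in bits [31:24], HDM capability ID 0x5,
pointer in bits [31:20]). The names find_hdm_offset() and hdm_decoder_count()
are hypothetical, and the code works on an in-memory dword snapshot rather
than on the VFIO fd.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values quoted in the document; the macro names here are illustrative. */
#define CAP_ARRAY_HDR_ID 0x1u /* CXL Capability Array header ID */
#define CAP_ID_HDM       0x5u /* HDM Decoder capability ID */

/*
 * Walk a snapshot of the CXL Capability Array (dword 0 = header, dwords
 * 1..N = entries) and return the byte offset of the HDM Decoder capability,
 * or -1 if the header is invalid or no HDM entry exists.
 */
static int find_hdm_offset(const uint32_t *regs, size_t ndwords)
{
    assert(regs != NULL);
    if (ndwords == 0 || (regs[0] & 0xffffu) != CAP_ARRAY_HDR_ID)
        return -1;

    uint32_t num_caps = (regs[0] >> 24) & 0xffu;
    for (uint32_t i = 1; i <= num_caps && i < ndwords; i++) {
        if ((regs[i] & 0xffffu) == CAP_ID_HDM)
            return (int)((regs[i] >> 20) & 0xfffu);
    }
    return -1;
}

/* Decode bits [3:0] of the HDM Decoder Capability register: 0 -> 1, N -> 2*N. */
static unsigned int hdm_decoder_count(uint32_t hdmc)
{
    unsigned int field = hdmc & 0xfu;
    return field ? field * 2 : 1;
}
```

For a made-up image whose header dword is 0x02010001 and whose two entries are
0x08010002 and 0x04010005, find_hdm_offset() returns 0x40; an HDM Decoder
Capability value of 0x1 read at that offset then decodes to two decoders.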
 Documentation/driver-api/index.rst        |   1 +
 Documentation/driver-api/vfio-pci-cxl.rst | 382 ++++++++++++++++++++++
 2 files changed, 383 insertions(+)
 create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
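
Reviewer note: the three BAR topologies described in the new document reduce
to one rule: every byte of the BAR outside
[comp_reg_offset, comp_reg_offset + comp_reg_size) is mmappable. A minimal
sketch of that rule follows. struct bar_area and sparse_areas() are
hypothetical names, this is not the kernel's implementation, and page-alignment
handling is left out on purpose.

```c
#include <assert.h>
#include <stdint.h>

/* One mmappable window inside the BAR (illustrative type). */
struct bar_area {
    uint64_t offset;
    uint64_t size;
};

/*
 * Fill out[] with the parts of [0, bar_len) that do not overlap the
 * component register block and return how many were written (0, 1 or 2).
 */
static int sparse_areas(uint64_t bar_len, uint64_t comp_off,
                        uint64_t comp_size, struct bar_area out[2])
{
    int n = 0;
    uint64_t comp_end;

    /* The component block must lie entirely inside the BAR. */
    assert(comp_off <= bar_len && comp_size <= bar_len - comp_off);
    comp_end = comp_off + comp_size;

    if (comp_off > 0)               /* window before the block (A, C) */
        out[n++] = (struct bar_area){ 0, comp_off };
    if (comp_end < bar_len)         /* window after the block (B, C) */
        out[n++] = (struct bar_area){ comp_end, bar_len - comp_end };
    return n;
}
```

A VMM could run the same computation as a cross-check against the areas
reported through the sparse-mmap capability, while still iterating every
reported entry.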
diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index 1833e6a0687e..7ec661846f6b 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -47,6 +47,7 @@ of interest to most developers working on device drivers.
    vfio-mediated-device
    vfio
    vfio-pci-device-specific-driver-acceptance
+   vfio-pci-cxl

 Bus-level documentation
 =======================
diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
new file mode 100644
index 000000000000..1256e4d33fc6
--- /dev/null
+++ b/Documentation/driver-api/vfio-pci-cxl.rst
@@ -0,0 +1,382 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+VFIO PCI CXL Type-2 device passthrough
+=======================================
+
+Overview
+--------
+
+Type-2 CXL devices are PCIe accelerators (GPUs, compute ASICs, and similar)
+with coherent device memory on CXL.mem. DPA is mapped into host physical
+address space through HDM decoders that the kernel's CXL subsystem owns. A
+guest cannot program that hardware directly.
+
+This ``vfio-pci`` mode hands a VMM:
+
+- A read/write VFIO device region (COMP_REGS) that emulates the HDM decoder
+  register block with CXL register rules enforced in kernel code.
+- An mmappable VFIO device region (DPA) backed by the kernel-chosen host
+  physical range for device memory.
+- DVSEC config-space emulation so the guest cannot change host-owned CXL.io /
+  CXL.mem enable bits.
+
+Build with ``CONFIG_VFIO_CXL_CORE=y``. At runtime you can turn it off with::
+
+    modprobe vfio-pci disable_cxl=1
+
+or, in a variant driver, set ``vdev->disable_cxl = true`` before registration.
+
+
+Device detection
+----------------
+
+At ``vfio_pci_core_register_device()`` the driver checks for a Type-2 style
+setup. All of the following must hold:
+
+1. CXL Device DVSEC present (PCIe DVSEC Vendor ID ``0x1E98``, DVSEC ID
+   ``0x0000``).
+2. ``Mem_Capable`` (bit 2) set in the CXL Capability register inside that DVSEC.
+3. PCI class code is **not** ``0x050210`` (CXL Type-3 memory expander).
+4. An HDM Decoder capability block reachable through the Register Locator DVSEC.
+5. At least one HDM decoder committed by firmware with non-zero size.
+
+The CXL specification defines Type-2 devices as those that set both
+``Mem_Capable`` and ``Cache_Capable``. This driver also accepts devices that
+set only ``Mem_Capable`` (``Cache_Capable=0``); they behave like Type-3 style
+accelerators that lack the Type-3 class code. ``VFIO_CXL_CAP_CACHE_CAPABLE``
+reports the cache bit to userspace so a VMM can handle FLR accordingly.
+
+When detection succeeds, ``VFIO_DEVICE_FLAGS_CXL`` is ORed into
+``vfio_device_info.flags`` together with ``VFIO_DEVICE_FLAGS_PCI``.
+
+.. note::
+
+   **Firmware must commit an HDM decoder before open.** The driver only
+   discovers DPA range and size from a decoder that firmware already committed.
+   Devices without that, or hot-plugged setups that never get it, are out of
+   scope for now.
+
+   Follow-up options under discussion include CXL range registers in the
+   Device DVSEC (often enough on single-decoder parts), CDAT over DOE, mailbox
+   Get Partition Info, or a future DVSEC field from the consortium for
+   base/size/NUMA without extra side channels. There is also talk of a sysfs
+   path, modeled on resizable BAR, where an orchestrator fixes the DPA window
+   before vfio-pci binds so the driver still sees a committed range.
+
+
+UAPI: VFIO_DEVICE_INFO_CAP_CXL
+------------------------------
+
+When ``VFIO_DEVICE_FLAGS_CXL`` is set, the device info capability chain
+includes a ``vfio_device_info_cap_cxl`` structure (cap ID 6, version 1)::
+
+    struct vfio_device_info_cap_cxl {
+        struct vfio_info_cap_header header; /* id=6, version=1 */
+        __u8  hdm_regs_bar_index;     /* BAR index containing component regs */
+        __u8  reserved[3];
+        __u32 flags;                  /* VFIO_CXL_CAP_* flags */
+        __u64 hdm_regs_offset;        /* byte offset within the BAR to the
+                                       * CXL.mem register area start. This
+                                       * equals comp_reg_offset + CXL_CM_OFFSET
+                                       * where CXL_CM_OFFSET = 0x1000. */
+        __u32 dpa_region_index;       /* VFIO region index for DPA memory */
+        __u32 comp_regs_region_index; /* VFIO region index for COMP_REGS */
+    };
+    /*
+     * hdm_count and hdm_decoder_offset are intentionally absent from this
+     * struct. Both are derivable from the COMP_REGS region. See the
+     * "Deriving HDM info from COMP_REGS" section below.
+     */
+
+    #define VFIO_CXL_CAP_FIRMWARE_COMMITTED (1 << 0)
+    #define VFIO_CXL_CAP_CACHE_CAPABLE      (1 << 1)
+
+``VFIO_CXL_CAP_FIRMWARE_COMMITTED``
+    At least one HDM decoder was pre-committed by firmware. The DPA region
+    is live at device open; the VMM can map it without waiting for a guest
+    COMMIT cycle.
+
+``VFIO_CXL_CAP_CACHE_CAPABLE``
+    The device has an HDM-DB decoder (CXL.mem + CXL.cache). This mirrors the
+    ``Cache_Capable`` bit from the CXL DVSEC Capability register. The kernel
+    does not run Write-Back Invalidation (WBI) before FLR; with this flag set
+    that stays the VMM's job.
+
+DPA region size comes from ``VFIO_DEVICE_GET_REGION_INFO`` on
+``dpa_region_index``, not from this struct.
+
+
+VFIO regions
+------------
+
+A CXL device adds two device regions on top of the usual BARs. Their indices
+are in ``dpa_region_index`` and ``comp_regs_region_index``.
+
+DPA region (``VFIO_REGION_SUBTYPE_CXL``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Flags: ``READ | WRITE | MMAP``.
+
+The backing store is the host physical range the kernel assigned for DPA. The
+kernel maps it with ``memremap(MEMREMAP_WB)`` because CXL device memory on a
+coherent link sits in the CPU cache hierarchy. That mapping is normal cached
+memory, so ``copy_to/from_user`` works without extra barriers.
+
+The mapping is populated lazily: the fault handler installs one PFN per page
+on first touch via ``vmf_insert_pfn``; ``mmap()`` itself populates nothing.
+
+Region read/write through the fd uses the same ``MEMREMAP_WB`` mapping with
+``copy_to/from_user``. ``ioread``/``iowrite`` MMIO helpers are not used on
+this path.
+
+During FLR, ``unmap_mapping_range()`` drops user PTEs and ``region_active``
+clears before the reset runs. Ongoing faults or region I/O then error instead
+of touching a dead mapping. IOMMU ATC invalidation from the zap has to finish
+before the device resets; doing it the other way around can leave an SMMU
+waiting on a device that no longer responds.
+
+After reset, the region comes back once ``COMMITTED`` shows up again in fresh
+HDM hardware state. The VMM can fault pages in again without a new ``mmap()``.
+
+COMP_REGS region (``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Flags: ``READ | WRITE`` (no mmap).
+
+Emulated registers for the CXL.mem slice of the component register block: the
+CXL Capability Array header at offset 0, then the HDM Decoder capability
+starting at ``hdm_decoder_offset`` (the byte offset derived by traversing the
+CXL Capability Array — see "Deriving HDM info from COMP_REGS" below).
+Region size from ``VFIO_DEVICE_GET_REGION_INFO`` covers the full capability
+array prefix plus all HDM decoder blocks.
+
+Only 32-bit, 32-bit-aligned accesses are allowed. 8- and 16-bit attempts get
+``-EINVAL``.
+
+Offsets below ``hdm_decoder_offset`` return the snapshot from device open.
+Writes there are dropped (with a WARN); the capability array stays read-only.
+
+From ``hdm_decoder_offset`` upward the kernel keeps a shadow
+(``comp_reg_virt[]``) and applies field rules:
+
+- At open, hardware HDM state is snapshotted. For firmware-committed decoders
+  the LOCK bit is cleared and BASE_HI/BASE_LO are zeroed in the shadow so the
+  VMM can program guest GPA; the host HPA is not carried in the shadow after
+  that.
+- ``COMMIT`` (bit 9 of CTRL): writing 1 sets ``COMMITTED`` (bit 10) in the
+  shadow immediately. Real hardware stays committed; the shadow tracks what
+  the guest should see.
+- When LOCK is set, writes to BASE_HI and SIZE_HI are ignored so
+  firmware-committed values survive.
+
+Region type identifiers::
+
+    /* type = PCI_VENDOR_ID_CXL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE */
+    #define VFIO_REGION_SUBTYPE_CXL           1  /* DPA memory region */
+    #define VFIO_REGION_SUBTYPE_CXL_COMP_REGS 2  /* HDM register shadow */
+
+
+BAR access
+----------
+
+``VFIO_DEVICE_GET_REGION_INFO`` for ``hdm_regs_bar_index`` reports the full
+BAR size with ``READ | WRITE | MMAP`` flags and a
+``VFIO_REGION_INFO_CAP_SPARSE_MMAP`` capability listing the GPU or
+accelerator register windows — the mmappable parts of the BAR that do **not**
+contain CXL component registers.
+
+The number of sparse areas depends on where the CXL component register block
+``[comp_reg_offset, comp_reg_offset + comp_reg_size)`` sits within the BAR:
+
+* **Topology A** - component block at BAR end:
+  ``[gpu_regs | comp_regs]`` → 1 area: ``[0, comp_reg_offset)``
+
+* **Topology B** - component block at BAR start:
+  ``[comp_regs | gpu_regs]`` → 1 area: ``[comp_reg_size, bar_len)``
+
+* **Topology C** - component block in middle:
+  ``[gpu_regs | comp_regs | gpu_regs]`` → 2 areas:
+  ``[0, comp_reg_offset)`` and ``[comp_reg_offset + comp_reg_size, bar_len)``
+
+VMMs **must** iterate all ``nr_areas`` entries; do not assume a single area or
+that the first area starts at offset zero.
+
+The GPU/accelerator register windows listed in the sparse capability **are**
+physically mmappable: ``mmap()`` on the VFIO device fd at the corresponding
+BAR offset succeeds and yields a host-physical-backed mapping suitable for
+KVM stage-2 installation.
+
+The CXL component register block itself **is not** mmappable. Any ``mmap()``
+request whose range overlaps ``[comp_reg_offset, comp_reg_offset +
+comp_reg_size)`` returns ``-EINVAL``; those registers must be accessed through
+the ``COMP_REGS`` device region.
+
+
+DVSEC configuration space emulation
+-----------------------------------
+
+With ``CONFIG_VFIO_CXL_CORE=y``, vfio-pci installs a handler for
+``PCI_EXT_CAP_ID_DVSEC`` (``0x23``) in the config access table. Non-CXL
+devices fall through as before.
+
+On CXL devices, writes to these DVSEC registers are caught and reflected in
+``vdev->vconfig`` (shadow config space):
+
++--------------------+--------+--------------------------------------------------+
+| Register           | Offset | Emulation                                        |
++====================+========+==================================================+
+| CXL Control        | +0x0c  | RWL; IO_Enable held at 1; locked when Lock       |
+|                    |        | bit 0 is set.                                    |
++--------------------+--------+--------------------------------------------------+
+| CXL Status         | +0x0e  | Bit 14 (Viral_Status) is RW1CS.                  |
++--------------------+--------+--------------------------------------------------+
+| CXL Control2       | +0x10  | Bits 1 and 2 forwarded to hardware.              |
++--------------------+--------+--------------------------------------------------+
+| CXL Status2        | +0x12  | Bit 3 forwarded when Capability3 bit 3 is set.   |
++--------------------+--------+--------------------------------------------------+
+| CXL Lock           | +0x14  | RWO; once set, Control becomes read-only until   |
+|                    |        | conventional reset.                              |
++--------------------+--------+--------------------------------------------------+
+| Range Base Hi/Lo   | varies | Stored in vconfig; Base Low [27:0] reserved bits |
+|                    |        | cleared on write.                                |
++--------------------+--------+--------------------------------------------------+
+
+Reads return the shadow. Read-only registers (Capability, Size High/Low) are
+filled from hardware at open.
+
+
+FLR and reset
+-------------
+
+FLR goes through ``vfio_pci_ioctl_reset()``. The CXL-specific part is:
+
+1. ``vfio_cxl_zap_region_locked()`` runs under the write side of
+   ``memory_lock``. It clears ``region_active`` and calls
+   ``unmap_mapping_range()`` on the DPA inode mapping so user PTEs go away.
+   Concurrent faults or fd I/O hit the inactive flag and error. IOMMU ATC must
+   drain before reset (see the DPA region notes above).
+
+2. After FLR, ``vfio_cxl_reactivate_region()`` reads HDM hardware again into
+   ``comp_reg_virt[]``. If ``COMMITTED`` is set (common when firmware left the
+   decoder committed), ``region_active`` turns back on and the VMM can refault
+   without remapping.
+
+
+Known limitations
+-----------------
+
+**Pre-committed HDM decoder required**
+    See `Device detection`_ and the note there.
+
+**CXL hot-plug not supported**
+    Slots need to be present and programmed by firmware at boot.
+
+**CXL.cache Write-Back Invalidation not implemented**
+    For HDM-DB devices (``VFIO_CXL_CAP_CACHE_CAPABLE``), the kernel does not
+    run WBI before FLR. The VMM must do it and expose Back-Invalidation in the
+    guest topology where required.
+
+
+VMM integration notes
+---------------------
+
+For a ``VFIO_CXL_CAP_FIRMWARE_COMMITTED`` device (what works today)::
+
+    /* 1. Get device info and locate the CXL cap */
+    vfio_device_get_info(vfio_fd, &dinfo);
+    assert(dinfo.flags & VFIO_DEVICE_FLAGS_CXL);
+    cxl = find_cap(&dinfo, VFIO_DEVICE_INFO_CAP_CXL);
+
+    /* 2. Get DPA and COMP_REGS region sizes */
+    get_region_info(vfio_fd, cxl->dpa_region_index, &dpa_ri);
+    get_region_info(vfio_fd, cxl->comp_regs_region_index, &comp_ri);
+
+    /* 3. Map DPA region at a guest physical address */
+    gpa_base = allocate_guest_phys(dpa_ri.size);
+    mmap(gpa_base, dpa_ri.size, PROT_READ|PROT_WRITE,
+         MAP_SHARED|MAP_FIXED, vfio_fd,
+         (off_t)cxl->dpa_region_index << VFIO_PCI_OFFSET_SHIFT);
+
+    /* 4. Derive hdm_decoder_offset from COMP_REGS (see section below) */
+    uint64_t hdm_decoder_offset = derive_hdm_offset(vfio_fd, comp_ri);
+
+    /* 5. Write the GPA into HDM Decoder 0 BASE via COMP_REGS pwrite */
+    uint32_t base_hi = gpa_base >> 32;
+    comp_off = (off_t)cxl->comp_regs_region_index << VFIO_PCI_OFFSET_SHIFT;
+    pwrite(vfio_fd, &base_hi, 4,
+           comp_off + hdm_decoder_offset + CXL_HDM_DECODER0_BASE_HIGH_OFFSET);
+
+    /* 6. Build guest CXL topology using gpa_base and dpa_ri.size */
+    build_cfmws(gpa_base, dpa_ri.size);
+
+    /* 7. If CACHE_CAPABLE: issue WBI before any guest FLR */
+
+Extra detail:
+
+- DPA size is ``dpa_ri.size`` from region info.
+- ``CXL_HDM_DECODER0_BASE_HIGH_OFFSET`` lives in ``include/uapi/cxl/cxl_regs.h``.
+- On the BAR, the ``areas[]`` entries of the sparse-mmap cap on
+  ``hdm_regs_bar_index`` are the GPU MMIO windows (BAR fd); the gap they leave
+  is the CXL block (COMP_REGS region). Iterate all ``nr_areas`` entries.
+- If ``VFIO_CXL_CAP_CACHE_CAPABLE`` is set, the guest CXL topology should
+  advertise Back-Invalidation and the VMM should run WBI before FLR.
+
+
+Deriving HDM info from COMP_REGS
+---------------------------------
+
+``hdm_decoder_offset`` and ``hdm_count`` are not in ``vfio_device_info_cap_cxl``
+because both are directly readable from the ``COMP_REGS`` region.
+
+**Finding hdm_decoder_offset:**
+
+Read dwords from the COMP_REGS region starting at offset 0 (the CXL Capability
+Array). ``comp_off`` is the VFIO file offset for the COMP_REGS region:
+``(off_t)cxl->comp_regs_region_index << VFIO_PCI_OFFSET_SHIFT``::
+
+    /* Dword 0: CXL Capability Array Header */
+    pread(fd, &hdr, 4, comp_off + 0);
+    /* bits[15:0] must be 1 (CM_CAP_HDR_CAP_ID) */
+    /* bits[31:24] = number of capability entries */
+    num_caps = (hdr >> 24) & 0xff;  /* CXL_CM_CAP_HDR_ARRAY_SIZE_MASK */
+
+    /* Walk entries at dword 1..num_caps */
+    for (i = 1; i <= num_caps; i++) {
+        pread(fd, &entry, 4, comp_off + i * 4);
+        cap_id = entry & 0xffff;  /* CXL_CM_CAP_HDR_ID_MASK */
+        if (cap_id == 0x5) {      /* CXL_CM_CAP_CAP_ID_HDM */
+            hdm_decoder_offset = (entry >> 20) & 0xfff; /* CXL_CM_CAP_PTR_MASK */
+            break;
+        }
+    }
+
+**Finding hdm_count:**
+
+Read the HDM Decoder Capability register (HDMC) at ``hdm_decoder_offset + 0``::
+
+    pread(fd, &hdmc, 4, comp_off + hdm_decoder_offset);
+    field = hdmc & 0xf;                /* CXL_HDM_DECODER_COUNT_MASK bits[3:0] */
+    hdm_count = field ? field * 2 : 1; /* 0→1, N→N*2 decoders */
+
+All constants are in ``include/uapi/cxl/cxl_regs.h``.
+
+
+Kernel configuration
+--------------------
+
+``CONFIG_VFIO_CXL_CORE`` (bool)
+    CXL Type-2 passthrough in ``vfio-pci-core``. Needs ``CONFIG_VFIO_PCI_CORE``,
+    ``CONFIG_CXL_BUS``, and ``CONFIG_CXL_MEM``.
+
+References
+----------
+
+* CXL Specification 4.0, 8.1.3 - PCIe DVSEC for CXL Devices
+* CXL Specification 4.0, 8.2.4.20 - CXL HDM Decoder Capability Structure
+* ``include/uapi/linux/vfio.h`` - ``VFIO_DEVICE_INFO_CAP_CXL``,
+  ``VFIO_REGION_SUBTYPE_CXL``, ``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``
+* ``include/uapi/cxl/cxl_regs.h`` - ``CXL_CM_OFFSET``,
+  ``CXL_CM_CAP_HDR_ARRAY_SIZE_MASK``, ``CXL_CM_CAP_HDR_ID_MASK``,
+  ``CXL_CM_CAP_PTR_MASK``, ``CXL_HDM_DECODER_COUNT_MASK``,
+  ``CXL_HDM_DECODER0_BASE_HIGH_OFFSET``
--
2.25.1