From: Manish Honap <mhonap@nvidia.com>
CXL Type-2 accelerators (e.g. CXL.mem-capable GPUs) cannot be passed
through to virtual machines with stock vfio-pci because the driver has
no concept of HDM decoder management, DPA region exposure, or component
register emulation. This series wires all of that into vfio-pci-core
behind a new CONFIG_VFIO_CXL_CORE optional module, without requiring a
variant driver.
When a CXL Device DVSEC (Vendor ID 0x1E98, ID 0x0000) is detected at
device open time, the driver:
- Probes the HDM Decoder Capability block in the component registers
and allocates a DPA region through the CXL subsystem. On devices
where firmware has already committed a decoder, the kernel skips
allocation and re-uses the committed range.
- Builds a kernel-owned shadow of the HDM register block. The VMM
reads and writes this shadow through a dedicated COMP_REGS VFIO
region rather than touching the hardware directly. The kernel
enforces CXL 3.1 bit-field rules: reserved bits, read-only bits,
the COMMIT/COMMITTED latch, and the LOCK→0 reprogram path for
firmware-committed decoders.
- Exposes the DPA range as a second VFIO region (VFIO_REGION_SUBTYPE_CXL)
backed by the kernel-assigned HPA. PTEs are inserted lazily on first
page fault and torn down atomically under memory_lock during FLR.
- Intercepts writes to the CXL DVSEC configuration-space registers
(Control, Status, Control2, Status2, Lock, Range Base) and replays
them through a per-device vconfig shadow, enforcing RWL/RW1CS/RWO
access semantics and the CONFIG_LOCK one-shot latch.
- Returns a VFIO_DEVICE_INFO_CAP_CXL capability (id=6) carrying the
HDM register BAR index and offset, commit flags, and the indices of
the DPA and COMP_REGS regions. HDM decoder count and the HDM block
offset within COMP_REGS are derivable by the VMM from the CXL
Capability Array in the COMP_REGS region itself, so they are not
duplicated in the capability struct.
- Builds a sparse-mmap capability for the component register BAR so
VMMs can map GPU/accelerator register windows while the kernel
protects the CXL component register block. Three physical layouts
are handled: component block at the BAR end, at the start, and in
the middle.
- Provides a module parameter (disable_cxl=1) and a per-device flag
(vdev->disable_cxl) for suppressing the feature without recompiling.
- Includes selftests covering device detection, capability parsing,
region enumeration, HDM register emulation, DPA mmap with page-fault
insertion, FLR invalidation, and DVSEC register emulation.
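The access classes named in the DVSEC bullet above can be illustrated with a
small user-space model. This is only a sketch under stated assumptions: the
struct, the mask values, and the names vcfg_write16() and lock_latch_write()
are hypothetical and are not taken from the series. It assumes RWL bits become
read-only once the lock latch is set, RW1CS bits clear when 1 is written, and
RWO bits accept only the first write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shadow of one 16-bit DVSEC register; masks are illustrative. */
struct vcfg_reg {
	uint16_t val;       /* value the guest reads back */
	uint16_t rw_mask;   /* RW: always writable */
	uint16_t rwl_mask;  /* RWL: writable until the lock latch is set */
	uint16_t rw1c_mask; /* RW1CS: writing 1 clears the bit */
	uint16_t rwo_mask;  /* RWO: only the first write takes effect */
	bool rwo_done;      /* an RWO write has already been consumed */
};

/* One-shot latch: once set it stays set until a reset (not modeled here). */
static bool lock_latch_write(bool latched, uint16_t wr)
{
	return latched || (wr & 0x1);
}

/* Merge a guest write into the shadow according to each bit's class. */
static void vcfg_write16(struct vcfg_reg *r, uint16_t wr, bool locked)
{
	uint16_t writable = r->rw_mask;

	/* The classes are assumed to be disjoint. */
	assert(!(r->rw_mask & r->rwl_mask) && !(r->rw_mask & r->rw1c_mask));

	if (!locked)
		writable |= r->rwl_mask;
	if (r->rwo_mask && !r->rwo_done) {
		writable |= r->rwo_mask;
		r->rwo_done = true;
	}

	/* Plain, lockable and write-once bits take the written value. */
	r->val = (uint16_t)((r->val & ~writable) | (wr & writable));
	/* Write-1-to-clear bits drop wherever the guest wrote a 1. */
	r->val = (uint16_t)(r->val & ~(wr & r->rw1c_mask));
}
```

Keeping the lock latch as a separate input makes the "locked" state a property
of the device rather than of any single register, which matches the way the
bullet describes CONFIG_LOCK as gating the other fields.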
The series applies on top of the cxl/next branch at the base commit
given at the end of this cover letter, plus Alejandro's v23 Type-2
device support patches [1].
Series structure
================
Patches 1-5 extend the CXL subsystem with the APIs vfio-pci needs.
Patches 6-8 add the vfio-pci-core plumbing (UAPI, device state,
Kconfig/build).
Patches 9-15 implement the core device lifecycle: detection, HDM
emulation, media readiness, region management, DPA region, and DVSEC
emulation.
Patches 16-18 wire everything together at open/close time and
populate the VFIO ioctl paths.
Patches 19-20 add documentation and selftests.
Changes since v1
================
UAPI struct minimization (patch 6)
v1 carried hdm_count, hdm_regs_size, hdm_decoder_offset, dpa_size,
and a pad byte in vfio_device_info_cap_cxl. The four named fields are
all derivable from data the VMM already has: hdm_count and the HDM block
offset come from the CXL Capability Array in the COMP_REGS region,
hdm_regs_size is implicit in the COMP_REGS region size, and dpa_size
is the DPA region size. v2 drops them and replaces pad with
reserved[3]. The VFIO_CXL_CAP_PRECOMMITTED flag is gone; the single
VFIO_CXL_CAP_FIRMWARE_COMMITTED flag covers both the committed and
precommitted cases. VFIO_CXL_CAP_CACHE_CAPABLE is added to expose
the HDM-DB (CXL.cache) capability bit.
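As an illustration of how a VMM might derive the dropped fields, the sketch
below walks a CXL Capability Array and decodes the decoder count. It is not
code from the series. The bit layouts and the count encoding follow my reading
of the CXL specification and should be checked against uapi/cxl/cxl_regs.h;
the macro and function names are made up for this example, and the buffer is
assumed to start at the CXL.cache/mem register range, which may not match the
actual COMP_REGS region layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative names; layouts per the author's reading of the CXL spec. */
#define CM_CAP_ID(h)          ((h) & 0xffffu)         /* bits 15:0 */
#define CM_CAP_ARRAY_SIZE(h)  (((h) >> 24) & 0xffu)   /* bits 31:24 */
#define CM_CAP_PTR(h)         (((h) >> 20) & 0xfffu)  /* bits 31:20 */
#define CM_CAP_ID_ARRAY_HDR   1u
#define CM_CAP_ID_HDM         5u

/* Read a 32-bit register image from a byte buffer in host byte order. */
static uint32_t rd32(const uint8_t *base, size_t off)
{
	uint32_t v;

	memcpy(&v, base + off, sizeof(v));
	return v;
}

/* Return the HDM capability block offset within cm, or -1 if not found. */
static long find_hdm_block(const uint8_t *cm, size_t len)
{
	uint32_t hdr, ent;
	size_t i, n, off;

	assert(cm != NULL);
	if (len < 4)
		return -1;

	hdr = rd32(cm, 0);
	if (CM_CAP_ID(hdr) != CM_CAP_ID_ARRAY_HDR)
		return -1;

	/* Entry i lives at byte offset 4 * i, right after the array header. */
	n = CM_CAP_ARRAY_SIZE(hdr);
	for (i = 1; i <= n && i * 4 + 4 <= len; i++) {
		ent = rd32(cm, i * 4);
		if (CM_CAP_ID(ent) != CM_CAP_ID_HDM)
			continue;
		off = CM_CAP_PTR(ent);
		return off + 4 <= len ? (long)off : -1;
	}
	return -1;
}

/* Decode the 4-bit decoder count field of the HDM capability register. */
static int hdm_decoder_count(uint32_t hdm_cap)
{
	unsigned int f = hdm_cap & 0xfu;

	if (f == 0)
		return 1;
	if (f <= 8)
		return (int)(f * 2);
	if (f <= 12)
		return (int)((f - 4) * 4);
	return -1;	/* encoding not known to this sketch */
}
```

A caller would read the region into a buffer, call find_hdm_block(), and pass
the first dword at the returned offset to hdm_decoder_count().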
Component BAR access: sparse mmap instead of blanket rejection (patch 17)
v1 returned size=0 for the component BAR and rejected all mmap and
r/w access to it. That broke GPU passthrough scenarios where the
device puts accelerator register windows in the same BAR as the CXL
component registers. v2 replaces the blanket rejection with a
sparse-mmap capability that advertises only the GPU register windows,
carving out the component register block. vfio_cxl_mmap_overlaps_comp_regs()
rejects only the sub-range covering [comp_reg_offset, comp_reg_offset
+ comp_reg_size); everything else in the BAR remains mappable.
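The carve-out arithmetic can be sketched as follows. This is a user-space
illustration only: struct area, sparse_areas() and overlaps_comp_regs() are
hypothetical names, not the helpers in the series, and page-alignment
requirements that a real sparse-mmap capability must respect are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One mappable window inside the BAR (hypothetical type for this sketch). */
struct area {
	uint64_t offset;
	uint64_t size;
};

/*
 * Compute the mappable windows left after removing the component register
 * block [comp_off, comp_off + comp_size) from a BAR of bar_size bytes.
 * Returns the number of windows written to out (0, 1 or 2).
 */
static int sparse_areas(uint64_t bar_size, uint64_t comp_off,
			uint64_t comp_size, struct area out[2])
{
	uint64_t comp_end = comp_off + comp_size;
	int n = 0;

	/* The block must lie entirely inside the BAR. */
	assert(comp_size && comp_off <= bar_size &&
	       comp_size <= bar_size - comp_off);

	if (comp_off > 0)		/* window below the block */
		out[n++] = (struct area){ 0, comp_off };
	if (comp_end < bar_size)	/* window above the block */
		out[n++] = (struct area){ comp_end, bar_size - comp_end };
	return n;
}

/* True if [start, start + len) intersects the component register block. */
static bool overlaps_comp_regs(uint64_t start, uint64_t len,
			       uint64_t comp_off, uint64_t comp_size)
{
	return len && start < comp_off + comp_size && comp_off < start + len;
}
```

With the block at the end or the start of the BAR the function yields one
window; with the block in the middle it yields two, one on each side.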
CXL register defines moved to uapi/cxl/cxl_regs.h (patch 3)
v1 placed the component register defines in a private header
(include/cxl/cxl_regs.h). v2 moves them to include/uapi/cxl/cxl_regs.h
so VMMs can include them directly without duplicating definitions.
HDM API simplification (patch 1)
v1 exported cxl_get_hdm_reg_info() which returned a raw struct with
offset and size fields. v2 replaces it with cxl_get_hdm_info() which
uses the cached count already populated by cxl_probe_component_regs()
and returns a single struct with all HDM metadata, removing the need
for callers to re-read the hardware.
cxl_await_range_active() split (patch 4)
cxl_await_media_ready() requires a CXLMDEV mailbox register, which
Type-2 accelerators may not have. v2 splits out cxl_await_range_active()
so the HDM range-active poll can be used independently of the media
ready path.
LOCK→0 transition in HDM ctrl write emulation (patch 11)
v1 did not handle the case where a guest tries to clear the LOCK bit
to reprogram a firmware-committed decoder. v2 allows this transition
and re-programs the hardware accordingly.
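One plausible shape for the control-write emulation is sketched below. It is a
guess at the policy, not the code in patch 11: the bit positions follow my
reading of the HDM Decoder Control register in the CXL specification, the
guest-writable mask is illustrative, hdm_ctrl_write() is a hypothetical name,
and the model assumes a commit always succeeds and does not touch hardware.

```c
#include <assert.h>
#include <stdint.h>

#define HDM_CTRL_LOCK        (1u << 8)   /* Lock On Commit */
#define HDM_CTRL_COMMIT      (1u << 9)
#define HDM_CTRL_COMMITTED   (1u << 10)  /* read-only status */
#define HDM_CTRL_COMMIT_ERR  (1u << 11)  /* read-only status */
#define HDM_CTRL_RO_MASK     (HDM_CTRL_COMMITTED | HDM_CTRL_COMMIT_ERR)
/* Illustrative guest-writable set: bits 9:0 and bit 12; rest is reserved. */
#define HDM_CTRL_GUEST_RW    0x000013ffu

/* Return the new shadow value after the guest writes wr over old. */
static uint32_t hdm_ctrl_write(uint32_t old, uint32_t wr)
{
	uint32_t val;

	/* Status bits must never be part of the writable set. */
	assert((HDM_CTRL_GUEST_RW & HDM_CTRL_RO_MASK) == 0);

	/*
	 * A locked decoder ignores writes that keep LOCK set. Only a write
	 * that clears LOCK opens the reprogram path.
	 */
	if ((old & HDM_CTRL_LOCK) && (wr & HDM_CTRL_LOCK))
		return old;

	/* Reserved and read-only bits keep their shadow value. */
	val = (old & ~HDM_CTRL_GUEST_RW) | (wr & HDM_CTRL_GUEST_RW);

	/* COMMITTED follows COMMIT; this model never reports an error. */
	if (val & HDM_CTRL_COMMIT)
		val |= HDM_CTRL_COMMITTED;
	else
		val &= ~HDM_CTRL_COMMITTED;
	val &= ~HDM_CTRL_COMMIT_ERR;

	return val;
}
```

In this model a firmware-committed, locked decoder (LOCK, COMMIT and COMMITTED
all set) stays untouched until the guest writes a value with LOCK clear, after
which a fresh COMMIT write latches COMMITTED again.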
Component register buffer allocation (patch 11)
v1 allocated only the HDM register sub-range in the COMP_REGS buffer.
v2 allocates the full CXL_COMPONENT_REG_BLOCK_SIZE so future patches
can expose other capability blocks (e.g. RAS, CXL.cache) without a
structural change.
Register region setup split (patch 16)
v1 tied region registration to the detection/init path. v2 splits it
into explicit vfio_cxl_register_cxl_region() and
vfio_cxl_register_comp_regs_region() functions called from
vfio_pci_open_device(), which is the correct point since vconfig and
pci_config_map are valid there.
VLA fix merged into selftest (patch 20)
v1 had a separate patch 20 fixing a VLA initialisation in
vfio_pci_irq_set(). v2 folds that fix into the selftest patch to
keep the standalone CXL change count at 19 functional patches.
Reviewer feedback addressed
===========================
Dave Jiang:
- Replace open-coded bit shifts with FIELD_GET() / FIELD_PREP()
throughout the HDM emulation code.
- Rename flag from VFIO_CXL_CAP_COMMITTED / VFIO_CXL_CAP_PRECOMMITTED
to VFIO_CXL_CAP_FIRMWARE_COMMITTED; the old names were ambiguous.
- Use memremap(MEMREMAP_WB) for the DPA kernel mapping instead of
ioremap_cache(), which selects the wrong memory-type descriptor on
ARM64.
- Use __free() / DEFINE_FREE() scope helpers for CXL resource cleanup
in the region management path, replacing the open-coded error
unwind.
- Remove the unused abs_off parameter from the HDM accessor.
- Rename cxl_dvsec_control_write() to better reflect its role.
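To show what the FIELD_GET() / FIELD_PREP() conversion buys, here is a
user-space comparison. The kernel macros live in <linux/bitfield.h>; the
GENMASK32, FIELD_GET32 and FIELD_PREP32 macros below are simplified stand-ins
written for this example (they rely on the GCC/Clang __builtin_ctz builtin and
a non-zero mask), and the field positions are assumed to be the interleave
granularity and ways fields of the HDM decoder control register.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified user-space stand-ins for the kernel helpers. */
#define GENMASK32(h, l)          ((~0u << (l)) & (~0u >> (31 - (h))))
#define FIELD_GET32(mask, reg)   (((reg) & (mask)) >> __builtin_ctz(mask))
#define FIELD_PREP32(mask, val)  \
	(((uint32_t)(val) << __builtin_ctz(mask)) & (mask))

#define CTRL_IG_MASK  GENMASK32(3, 0)   /* interleave granularity */
#define CTRL_IW_MASK  GENMASK32(7, 4)   /* interleave ways */

/* Open-coded version: the shift and width are repeated by hand. */
static uint32_t set_ways_open_coded(uint32_t ctrl, uint32_t iw)
{
	return (ctrl & ~(0xfu << 4)) | ((iw & 0xfu) << 4);
}

/* Mask-driven version: the field is described once, by its mask. */
static uint32_t set_ways(uint32_t ctrl, uint32_t iw)
{
	return (ctrl & ~CTRL_IW_MASK) | FIELD_PREP32(CTRL_IW_MASK, iw);
}

static uint32_t get_ways(uint32_t ctrl)
{
	return FIELD_GET32(CTRL_IW_MASK, ctrl);
}
```

Both setters produce the same value, but the mask-driven form cannot drift out
of sync between the shift and the width, which is the usual argument for the
conversion.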
Jonathan Cameron:
- Move CXL register defines to uapi/cxl/cxl_regs.h so VMMs can
consume them without a kernel header dependency.
- Use local variables with __free() rather than struct members for
intermediate ERR_PTR returns in the region management code; avoids
ambiguity about ownership on error paths.
- The assumption that a pre-committed decoder always exists at probe
time is too restrictive for hotplug scenarios; v2 makes the
precommitted path a fast-track that falls back to dynamic allocation
when no committed decoder is found.
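The local-variable ownership pattern from the second point can be
approximated in user space with the compiler's cleanup attribute, which is
what the kernel's __free() builds on. Everything named here is hypothetical
(struct region, setup_region(), the SCOPED_REGION macro), NULL stands in for
ERR_PTR returns, and the block needs GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int live_regions;	/* counts outstanding allocations for the demo */

struct region { int id; };
struct dev_state { struct region *region; };

static struct region *region_create(int id)
{
	struct region *r = malloc(sizeof(*r));

	if (r) {
		r->id = id;
		live_regions++;
	}
	return r;
}

static void region_destroy(struct region *r)
{
	if (r) {
		live_regions--;
		free(r);
	}
}

/* Cleanup hook: receives a pointer to the annotated local variable. */
static void region_cleanup(struct region **rp)
{
	region_destroy(*rp);
}

#define SCOPED_REGION __attribute__((cleanup(region_cleanup)))

/* fail_late simulates an error after the region has been created. */
static int setup_region(struct dev_state *st, int id, int fail_late)
{
	struct region *r SCOPED_REGION = region_create(id);

	assert(st != NULL);
	if (!r)
		return -1;
	if (fail_late)
		return -2;	/* r is released automatically on this path */

	/* Publish only on success, then disarm the cleanup. */
	st->region = r;
	r = NULL;
	return 0;
}
```

Because the struct member is written only on the success path, an error can
never leave st->region pointing at an object that has already been released,
which is the ownership ambiguity the review comment refers to.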
Alex Williamson:
- The blanket size=0 / mmap-reject approach for the component BAR
prevents VMMs from accessing GPU register windows in the same BAR.
v2 implements the sparse-mmap capability described above.
Limitations and future work
===========================
Switched topologies with more than one caching agent are not yet
supported; that is planned for a follow-on series.
RAS/ECC handling and CXL core reset integration (cxl_reset support
from Srirangan [2]) will be added in subsequent patches.
Dependencies
============
[1] CXL Type-2 device basic support (Alejandro Lucero-Palau, v23):
https://lore.kernel.org/linux-cxl/20260201155438.2664640-1-alejandro.lucero-palau@amd.com/
[2] CXL reset support for Type-2 devices (Srirangan Madhavan):
https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/
Cc: Alex Williamson <alex@shazbot.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alejandro Lucero <alejandro.lucero-palau@amd.com>
Cc: linux-cxl@vger.kernel.org
Cc: kvm@vger.kernel.org
Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
base-commit: 3f7938b1aec7f06d5b23adca83e4542fcf027001
--
Manish Honap (20):
  cxl: Add cxl_get_hdm_info() for HDM decoder metadata
  cxl: Declare cxl_find_regblock and cxl_probe_component_regs in public
    header
  cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
  cxl: Split cxl_await_range_active() from media-ready wait
  cxl: Record BIR and BAR offset in cxl_register_map
  vfio: UAPI for CXL-capable PCI device assignment
  vfio/pci: Add CXL state to vfio_pci_core_device
  vfio/pci: Add CONFIG_VFIO_CXL_CORE and stub CXL hooks
  vfio/cxl: Detect CXL DVSEC and probe HDM block
  vfio/pci: Export config access helpers
  vfio/cxl: Introduce HDM decoder register emulation framework
  vfio/cxl: Wait for HDM ranges and create memdev
  vfio/cxl: CXL region management support
  vfio/cxl: DPA VFIO region with demand fault mmap and reset zap
  vfio/cxl: Virtualize CXL DVSEC config writes
  vfio/cxl: Register regions with VFIO layer
  vfio/pci: Advertise CXL cap and sparse component BAR to userspace
  vfio/cxl: Provide opt-out for CXL feature
  docs: vfio-pci: Document CXL Type-2 device passthrough
  selftests/vfio: Add CXL Type-2 VFIO assignment test
 Documentation/driver-api/index.rst        |    1 +
 Documentation/driver-api/vfio-pci-cxl.rst |  382 +++
 drivers/cxl/core/pci.c                    |   64 +-
 drivers/cxl/core/regs.c                   |   30 +
 drivers/cxl/cxl.h                         |   46 -
 drivers/vfio/pci/Kconfig                  |    2 +
 drivers/vfio/pci/Makefile                 |    1 +
 drivers/vfio/pci/cxl/Kconfig              |    9 +
 drivers/vfio/pci/cxl/vfio_cxl_config.c    |  306 ++
 drivers/vfio/pci/cxl/vfio_cxl_core.c      |  880 ++++++
 drivers/vfio/pci/cxl/vfio_cxl_emu.c       |  509 ++++
 drivers/vfio/pci/cxl/vfio_cxl_priv.h      |  133 +
 drivers/vfio/pci/vfio_pci.c               |   32 +
 drivers/vfio/pci/vfio_pci_config.c        |   58 +-
 drivers/vfio/pci/vfio_pci_core.c          |   46 +-
 drivers/vfio/pci/vfio_pci_priv.h          |   66 +
 drivers/vfio/pci/vfio_pci_rdwr.c          |   16 +-
 include/cxl/cxl.h                         |   51 +
 include/linux/vfio_pci_core.h             |   10 +
 include/uapi/cxl/cxl_regs.h               |  160 +
 include/uapi/linux/vfio.h                 |   86 +
 tools/testing/selftests/vfio/Makefile     |    1 +
 .../selftests/vfio/lib/vfio_pci_device.c  |    3 +-
 .../selftests/vfio/vfio_cxl_type2_test.c  | 2631 +++++++++++++++++
 24 files changed, 5459 insertions(+), 64 deletions(-)
 create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
 create mode 100644 drivers/vfio/pci/cxl/Kconfig
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_config.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_emu.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h
 create mode 100644 include/uapi/cxl/cxl_regs.h
 create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c
--
2.25.1