[PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support

From: Manish Honap <mhonap@nvidia.com>

CXL Type-2 accelerators (e.g. CXL.mem-capable GPUs) cannot be passed
through to virtual machines with stock vfio-pci because the driver has
no concept of HDM decoder management, DPA region exposure, or component
register emulation.  This series wires all of that into vfio-pci-core
behind a new CONFIG_VFIO_CXL_CORE optional module, without requiring a
variant driver.

When a CXL Device DVSEC (Vendor ID 0x1E98, DVSEC ID 0x0000) is
detected at device open time, the driver:

  - Probes the HDM Decoder Capability block in the component registers
    and allocates a DPA region through the CXL subsystem.  On devices
    where firmware has already committed a decoder, the kernel skips
    allocation and re-uses the committed range.

  - Builds a kernel-owned shadow of the HDM register block.  The VMM
    reads and writes this shadow through a dedicated COMP_REGS VFIO
    region rather than touching the hardware directly.  The kernel
    enforces CXL 3.1 bit-field rules: reserved bits, read-only bits,
    the COMMIT/COMMITTED latch, and the LOCK→0 reprogram path for
    firmware-committed decoders.

  - Exposes the DPA range as a second VFIO region (VFIO_REGION_SUBTYPE_CXL)
    backed by the kernel-assigned HPA.  PTEs are inserted lazily on first
    page fault and torn down atomically under memory_lock during FLR.

  - Intercepts writes to the CXL DVSEC configuration-space registers
    (Control, Status, Control2, Status2, Lock, Range Base) and replays
    them through a per-device vconfig shadow, enforcing RWL/RW1CS/RWO
    access semantics and the CONFIG_LOCK one-shot latch.

  - Returns a VFIO_DEVICE_INFO_CAP_CXL capability (id=6) carrying the
    HDM register BAR index and offset, commit flags, and the indices of
    the DPA and COMP_REGS regions.  HDM decoder count and the HDM block
    offset within COMP_REGS are derivable by the VMM from the CXL
    Capability Array in the COMP_REGS region itself, so they are not
    duplicated in the capability struct.

  - Builds a sparse-mmap capability for the component register BAR so
    VMMs can map GPU/accelerator register windows while the kernel
    protects the CXL component register block.  Three physical layouts
    are handled: component block at the BAR end, at the start, and in
    the middle.

  - Provides a module parameter (disable_cxl=1) and a per-device flag
    (vdev->disable_cxl) for suppressing the feature without recompiling.

  - Includes selftests covering device detection, capability parsing,
    region enumeration, HDM register emulation, DPA mmap with page-fault
    insertion, FLR invalidation, and DVSEC register emulation.
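
As an illustration of the DVSEC write interception listed above, the
following is a minimal sketch of merging a guest write into a shadow
register according to per-bit access classes.  The struct, the masks,
and the treatment of RWO bits as a set-once latch are assumptions made
for this example; they are not the layout or the code used by the
series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shadow of one 16-bit DVSEC register; masks are illustrative. */
struct shadow_reg {
	uint16_t val;       /* value the guest reads back */
	uint16_t rw_mask;   /* always writable */
	uint16_t rwl_mask;  /* writable only while the lock latch is clear */
	uint16_t rw1c_mask; /* writing 1 clears the bit */
	uint16_t rwo_mask;  /* one-shot: can be set, never cleared again */
};

/* Merge a guest write into the shadow according to each bit's access class. */
void shadow_reg_write(struct shadow_reg *r, uint16_t wval, bool locked)
{
	uint16_t writable = r->rw_mask;
	uint16_t next;

	if (!locked)
		writable |= r->rwl_mask;

	/* Plain RW (and unlocked RWL) bits take the written value. */
	next = (r->val & ~writable) | (wval & writable);
	/* RW1C bits are cleared where the guest wrote 1. */
	next &= ~(wval & r->rw1c_mask);
	/* One-shot bits latch to 1 and ignore later writes of 0. */
	next |= wval & r->rwo_mask;
	/* Bits outside every mask are read-only and keep their old value. */
	r->val = next;
}
```

A caller would derive `locked` from the shadow of the lock register, so
that once the guest sets the latch, later writes to RWL fields are
silently dropped while RW fields stay writable.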

The series applies on top of the cxl/next branch at the base commit
listed at the end of this cover letter, plus Alejandro's v23 Type-2
device support patches [1].

Series structure
================

  Patches 1-5 extend the CXL subsystem with the APIs vfio-pci needs.

  Patches 6-8 add the vfio-pci-core plumbing (UAPI, device state,
  Kconfig/build).

  Patches 9-15 implement the core device lifecycle: detection, HDM
  emulation, media readiness, region management, DPA region, and DVSEC
  emulation.

  Patches 16-18 wire everything together at open/close time and
  populate the VFIO ioctl paths.

  Patches 19-20 add documentation and selftests.

Changes since v1
================

UAPI struct minimization (patch 6)

  v1 carried hdm_count, hdm_regs_size, hdm_decoder_offset, and dpa_size
  in vfio_device_info_cap_cxl, followed by a pad byte.  All four fields
  are derivable from data the VMM already has: hdm_count and
  hdm_decoder_offset come from the CXL Capability Array in the COMP_REGS
  region, hdm_regs_size is implicit in the COMP_REGS region size, and
  dpa_size is the DPA region size.  v2 drops them and replaces pad with
  reserved[3].  The VFIO_CXL_CAP_PRECOMMITTED flag is gone; the single
  VFIO_CXL_CAP_FIRMWARE_COMMITTED flag covers both the committed and
  precommitted cases.  VFIO_CXL_CAP_CACHE_CAPABLE is added to expose
  the HDM-DB (CXL.cache) capability bit.
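
To show how a VMM might derive the dropped fields, here is a rough
sketch that walks a CXL Capability Array held in a buffer and decodes
the decoder count.  The field positions (ID in bits 15:0, array size in
bits 31:24, pointer in bits 31:20, HDM capability ID 0x5) and the count
encoding reflect the CXL specification as understood here and should be
checked against include/uapi/cxl/cxl_regs.h from this series.  The
function names are hypothetical, and the sketch assumes `cm` points at
the start of the CXL.cache/mem register area; where that area sits
inside the COMP_REGS region is not specified in this cover letter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CM_CAP_ID(hdr)         ((hdr) & 0xffffu)         /* bits 15:0  */
#define CM_CAP_ARRAY_SIZE(hdr) (((hdr) >> 24) & 0xffu)   /* bits 31:24 */
#define CM_CAP_PTR(hdr)        (((hdr) >> 20) & 0xfffu)  /* bits 31:20 */
#define CM_CAP_ID_ARRAY        0x1u
#define CM_CAP_ID_HDM          0x5u

/*
 * Return the byte offset of the HDM Decoder Capability block relative
 * to cm, or -1 if the array is malformed or has no HDM entry.
 */
long find_hdm_block(const uint32_t *cm, size_t nwords)
{
	uint32_t n, i, off;

	if (nwords == 0 || CM_CAP_ID(cm[0]) != CM_CAP_ID_ARRAY)
		return -1;

	n = CM_CAP_ARRAY_SIZE(cm[0]);
	for (i = 1; i <= n && i < nwords; i++) {
		if (CM_CAP_ID(cm[i]) != CM_CAP_ID_HDM)
			continue;
		off = CM_CAP_PTR(cm[i]);
		return (off / 4 < nwords) ? (long)off : -1;
	}
	return -1;
}

/* Decode the decoder-count field (bits 3:0) of the HDM capability register. */
int hdm_decoder_count(uint32_t hdm_cap)
{
	uint32_t enc = hdm_cap & 0xfu;

	if (enc == 0)
		return 1;
	if (enc <= 8)
		return (int)enc * 2;
	if (enc <= 12)
		return (int)(enc - 4) * 4;
	return -1; /* encoding not recognised by this sketch */
}
```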

Component BAR access: sparse mmap instead of blanket rejection (patch 17)

  v1 returned size=0 for the component BAR and rejected all mmap and
  r/w access to it. That broke GPU passthrough scenarios where the
  device puts accelerator register windows in the same BAR as the CXL
  component registers. v2 replaces the blanket rejection with a
  sparse-mmap capability that advertises only the GPU register windows,
  carving out the component register block.  vfio_cxl_mmap_overlaps_comp_regs()
  rejects only the sub-range covering [comp_reg_offset, comp_reg_offset
  + comp_reg_size); everything else in the BAR remains mappable.
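
The carve-out can be pictured with the following sketch, which covers
the three layouts (component block at the end, at the start, or in the
middle of the BAR) and the half-open overlap test.  The function and
struct names are hypothetical stand-ins, not the helpers added by the
series, and the sketch assumes the component block is page-aligned so
that the resulting areas are valid mmap ranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct bar_area {
	uint64_t offset;
	uint64_t size;
};

/* True if [start, start + len) intersects [comp_off, comp_off + comp_size). */
bool range_overlaps_comp_regs(uint64_t start, uint64_t len,
			      uint64_t comp_off, uint64_t comp_size)
{
	return len && start < comp_off + comp_size && comp_off < start + len;
}

/*
 * Fill out[] with the mappable areas of a BAR once the component
 * register block is carved out.  Returns the number of areas (0..2).
 */
int build_sparse_areas(uint64_t bar_size, uint64_t comp_off,
		       uint64_t comp_size, struct bar_area out[2])
{
	uint64_t comp_end = comp_off + comp_size;
	int n = 0;

	if (comp_off > 0) {
		/* Window below the component block. */
		out[n].offset = 0;
		out[n].size = comp_off;
		n++;
	}
	if (comp_end < bar_size) {
		/* Window above the component block. */
		out[n].offset = comp_end;
		out[n].size = bar_size - comp_end;
		n++;
	}
	return n;
}
```

A component block at either edge of the BAR yields one area; a block in
the middle yields two, one on each side.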

CXL register defines moved to uapi/cxl/cxl_regs.h (patch 3)

  v1 placed the component register defines in a private header
  (include/cxl/cxl_regs.h). v2 moves them to include/uapi/cxl/cxl_regs.h
  so VMMs can include them directly without duplicating definitions.

HDM API simplification (patch 1)

  v1 exported cxl_get_hdm_reg_info() which returned a raw struct with
  offset and size fields. v2 replaces it with cxl_get_hdm_info() which
  uses the cached count already populated by cxl_probe_component_regs()
  and returns a single struct with all HDM metadata, removing the need
  for callers to re-read the hardware.

cxl_await_range_active() split (patch 4)

  cxl_await_media_ready() requires a CXLMDEV mailbox register, which
  Type-2 accelerators may not have.  v2 splits out cxl_await_range_active()
  so the HDM range-active poll can be used independently of the media
  ready path.
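
The shape of a standalone range-active poll, with no dependency on a
memdev status register, might look like the sketch below.  It is a
userspace model only: the bit position, the callback type, and the
bounded retry count are assumptions for illustration, and the kernel
helper would sleep between reads and use a time-based limit.

```c
#include <assert.h>
#include <stdint.h>

#define RANGE_MEM_ACTIVE (1u << 1) /* assumed position of Memory_Active */

typedef uint32_t (*reg_read_fn)(void *ctx);

/* Poll a range register until Memory_Active is set or polls run out. */
int await_range_active(reg_read_fn read, void *ctx, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (read(ctx) & RANGE_MEM_ACTIVE)
			return 0;
		/* A real implementation would sleep here between reads. */
	}
	return -1; /* timed out */
}

/* Fake register for illustration: reports active after *ctx more reads. */
uint32_t fake_range_read(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return 0;
	}
	return RANGE_MEM_ACTIVE;
}
```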

LOCK→0 transition in HDM ctrl write emulation (patch 11)

  v1 did not handle the case where a guest tries to clear the LOCK bit
  to reprogram a firmware-committed decoder. v2 allows this transition
  and re-programs the hardware accordingly.
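
One possible reading of the control-register emulation is sketched
below: COMMITTED follows COMMIT in the shadow, a locked decoder ignores
writes that keep LOCK set, and a write that clears LOCK is accepted so
the decoder can be reprogrammed.  The bit positions match the HDM
decoder control layout as understood here; the writable mask and the
function name are assumptions, and the sketch models only the shadow
value, not the hardware reprogramming that the series performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HDM_CTRL_LOCK       (1u << 8)  /* lock on commit */
#define HDM_CTRL_COMMIT     (1u << 9)
#define HDM_CTRL_COMMITTED  (1u << 10) /* read-only to the guest */
#define HDM_CTRL_COMMIT_ERR (1u << 11) /* read-only to the guest */
/* Illustrative writable mask: IG/IW (7:0), LOCK, COMMIT, type (12). */
#define HDM_CTRL_WRITABLE   0x000013ffu

/* Return the new shadow value after a guest write of wval. */
uint32_t hdm_ctrl_write(uint32_t cur, uint32_t wval)
{
	bool locked = (cur & HDM_CTRL_LOCK) && (cur & HDM_CTRL_COMMITTED);
	uint32_t next;

	/* A locked decoder only accepts a write that clears LOCK. */
	if (locked && (wval & HDM_CTRL_LOCK))
		return cur;

	/* Reserved and read-only bits never come from the guest. */
	next = (cur & ~HDM_CTRL_WRITABLE) | (wval & HDM_CTRL_WRITABLE);

	/* Emulated commit always succeeds: COMMITTED mirrors COMMIT. */
	if (next & HDM_CTRL_COMMIT)
		next |= HDM_CTRL_COMMITTED;
	else
		next &= ~HDM_CTRL_COMMITTED;
	next &= ~HDM_CTRL_COMMIT_ERR;

	return next;
}
```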

Component register buffer allocation (patch 11)

  v1 allocated only the HDM register sub-range in the COMP_REGS buffer.
  v2 allocates the full CXL_COMPONENT_REG_BLOCK_SIZE so future patches
  can expose other capability blocks (e.g. RAS, CXL.cache) without a
  structural change.

Register region setup split (patch 16)

  v1 tied region registration to the detection/init path.  v2 splits it
  into explicit vfio_cxl_register_cxl_region() and
  vfio_cxl_register_comp_regs_region() functions called from
  vfio_pci_open_device(), which is the correct point since vconfig and
  pci_config_map are valid there.

VLA fix merged into selftest (patch 20)

  v1 had a separate patch 20 fixing a VLA initialisation in
  vfio_pci_irq_set().  v2 folds that fix into the selftest patch to
  keep the standalone CXL change count at 19 functional patches.

Reviewer feedback addressed
===========================

Dave Jiang:
  - Replace open-coded bit shifts with FIELD_GET() / FIELD_PREP()
    throughout the HDM emulation code.
  - Rename flag from VFIO_CXL_CAP_COMMITTED / VFIO_CXL_CAP_PRECOMMITTED
    to VFIO_CXL_CAP_FIRMWARE_COMMITTED; the old names were ambiguous.
  - Use memremap(MEMREMAP_WB) for the DPA kernel mapping instead of
    ioremap_cache(), which selects the wrong memory-type descriptor on
    ARM64.
  - Use __free() / DEFINE_FREE() scope helpers for CXL resource cleanup
    in the region management path, replacing the open-coded error
    unwind.
  - Remove the unused abs_off parameter from the HDM accessor.
  - Rename cxl_dvsec_control_write() to better reflect its role.
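
For readers unfamiliar with the first item, the sketch below contrasts
open-coded shifts with mask-driven accessors.  The macros are userspace
stand-ins written for this example (the kernel versions live in
<linux/bitfield.h> and add compile-time checks); the IG and IW masks
follow the HDM decoder control layout as understood here.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for GENMASK(), FIELD_GET() and FIELD_PREP(). */
#define GENMASK32(h, l)         (((~0u) >> (31 - (h))) & ((~0u) << (l)))
#define FIELD_LSB(mask)         ((mask) & (~(mask) + 1u))
#define FIELD_GET32(mask, reg)  (((reg) & (mask)) / FIELD_LSB(mask))
#define FIELD_PREP32(mask, val) (((uint32_t)(val) * FIELD_LSB(mask)) & (mask))

#define HDM_CTRL_IG_MASK GENMASK32(3, 0) /* interleave granularity */
#define HDM_CTRL_IW_MASK GENMASK32(7, 4) /* interleave ways */

/* Open-coded form: the shift and width are repeated at every call site. */
uint32_t get_iw_open_coded(uint32_t ctrl)
{
	return (ctrl >> 4) & 0xfu;
}

/* Mask-driven form: the field is described once, by its mask. */
uint32_t get_iw(uint32_t ctrl)
{
	return FIELD_GET32(HDM_CTRL_IW_MASK, ctrl);
}

uint32_t make_ctrl(uint32_t ig, uint32_t iw)
{
	return FIELD_PREP32(HDM_CTRL_IG_MASK, ig) |
	       FIELD_PREP32(HDM_CTRL_IW_MASK, iw);
}
```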

Jonathan Cameron:
  - Move CXL register defines to uapi/cxl/cxl_regs.h so VMMs can
    consume them without a kernel header dependency.
  - Use local variables with __free() rather than struct members for
    intermediate ERR_PTR returns in the region management code; avoids
    ambiguity about ownership on error paths.
  - The assumption that a pre-committed decoder always exists at probe
    time is too restrictive for hotplug scenarios; v2 makes the
    precommitted path a fast-track that falls back to dynamic allocation
    when no committed decoder is found.
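
The scope-based cleanup pattern mentioned in both reviews can be
modelled in userspace roughly as follows.  The macros are simplified
stand-ins for the kernel's DEFINE_FREE(), __free() and no_free_ptr()
and rely on the same GCC/Clang extensions (the cleanup attribute and
statement expressions); the resource type and function names are
hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for DEFINE_FREE(), __free() and no_free_ptr(). */
#define DEFINE_CLEANUP(name, type, stmt) \
	static inline void cleanup_##name(type *p) { type obj = *p; stmt; }
#define CLEANUP(name) __attribute__((cleanup(cleanup_##name)))
#define TAKE_PTR(p) ({ __typeof__(p) tmp_ = (p); (p) = NULL; tmp_; })

struct res {
	int id;
};

int res_freed; /* counts releases, for demonstration only */

struct res *res_alloc(int id)
{
	struct res *r = malloc(sizeof(*r));

	if (r)
		r->id = id;
	return r;
}

void res_put(struct res *r)
{
	res_freed++;
	free(r);
}

DEFINE_CLEANUP(res, struct res *, if (obj) res_put(obj))

/* Local variable owns the resource until it is explicitly handed over. */
struct res *region_setup(int fail_late)
{
	struct res *r CLEANUP(res) = res_alloc(1);

	if (!r)
		return NULL;
	if (fail_late)
		return NULL;    /* r is released automatically here */
	return TAKE_PTR(r);     /* ownership passes to the caller */
}
```

Keeping the pointer in a local until the final hand-over means every
early return releases it, so no struct member is ever left holding a
half-initialised or already-freed pointer.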

Alex Williamson:
  - The blanket size=0 / mmap-reject approach for the component BAR
    prevents VMMs from accessing GPU register windows in the same BAR.
    v2 implements the sparse-mmap capability described above.

Limitations and future work
===========================

  Switched topologies with more than one caching agent are not yet
  supported; that is planned for a follow-on series.

  RAS/ECC handling and CXL core reset integration (cxl_reset support
  from Srirangan [2]) will be added in subsequent patches.

Dependencies
============

[1] CXL Type-2 device basic support (Alejandro Lucero-Palau, v23):
    https://lore.kernel.org/linux-cxl/20260201155438.2664640-1-alejandro.lucero-palau@amd.com/

[2] CXL reset support for Type-2 devices (Srirangan Madhavan):
    https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/

Cc: Alex Williamson <alex@shazbot.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alejandro Lucero <alejandro.lucero-palau@amd.com>
Cc: linux-cxl@vger.kernel.org
Cc: kvm@vger.kernel.org

Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>

base-commit: 3f7938b1aec7f06d5b23adca83e4542fcf027001
--

Manish Honap (20):
  cxl: Add cxl_get_hdm_info() for HDM decoder metadata
  cxl: Declare cxl_find_regblock and cxl_probe_component_regs in public
    header
  cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
  cxl: Split cxl_await_range_active() from media-ready wait
  cxl: Record BIR and BAR offset in cxl_register_map
  vfio: UAPI for CXL-capable PCI device assignment
  vfio/pci: Add CXL state to vfio_pci_core_device
  vfio/pci: Add CONFIG_VFIO_CXL_CORE and stub CXL hooks
  vfio/cxl: Detect CXL DVSEC and probe HDM block
  vfio/pci: Export config access helpers
  vfio/cxl: Introduce HDM decoder register emulation framework
  vfio/cxl: Wait for HDM ranges and create memdev
  vfio/cxl: CXL region management support
  vfio/cxl: DPA VFIO region with demand fault mmap and reset zap
  vfio/cxl: Virtualize CXL DVSEC config writes
  vfio/cxl: Register regions with VFIO layer
  vfio/pci: Advertise CXL cap and sparse component BAR to userspace
  vfio/cxl: Provide opt-out for CXL feature
  docs: vfio-pci: Document CXL Type-2 device passthrough
  selftests/vfio: Add CXL Type-2 VFIO assignment test

 Documentation/driver-api/index.rst            |    1 +
 Documentation/driver-api/vfio-pci-cxl.rst     |  382 +++
 drivers/cxl/core/pci.c                        |   64 +-
 drivers/cxl/core/regs.c                       |   30 +
 drivers/cxl/cxl.h                             |   46 -
 drivers/vfio/pci/Kconfig                      |    2 +
 drivers/vfio/pci/Makefile                     |    1 +
 drivers/vfio/pci/cxl/Kconfig                  |    9 +
 drivers/vfio/pci/cxl/vfio_cxl_config.c        |  306 ++
 drivers/vfio/pci/cxl/vfio_cxl_core.c          |  880 ++++++
 drivers/vfio/pci/cxl/vfio_cxl_emu.c           |  509 ++++
 drivers/vfio/pci/cxl/vfio_cxl_priv.h          |  133 +
 drivers/vfio/pci/vfio_pci.c                   |   32 +
 drivers/vfio/pci/vfio_pci_config.c            |   58 +-
 drivers/vfio/pci/vfio_pci_core.c              |   46 +-
 drivers/vfio/pci/vfio_pci_priv.h              |   66 +
 drivers/vfio/pci/vfio_pci_rdwr.c              |   16 +-
 include/cxl/cxl.h                             |   51 +
 include/linux/vfio_pci_core.h                 |   10 +
 include/uapi/cxl/cxl_regs.h                   |  160 +
 include/uapi/linux/vfio.h                     |   86 +
 tools/testing/selftests/vfio/Makefile         |    1 +
 .../selftests/vfio/lib/vfio_pci_device.c      |    3 +-
 .../selftests/vfio/vfio_cxl_type2_test.c      | 2631 +++++++++++++++++
 24 files changed, 5459 insertions(+), 64 deletions(-)
 create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
 create mode 100644 drivers/vfio/pci/cxl/Kconfig
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_config.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_emu.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h
 create mode 100644 include/uapi/cxl/cxl_regs.h
 create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c

--
2.25.1