Performant CXL type 3 non-interleaved regions

[PATCH v2 0/2] Performant CXL type 3 non-interleaved regions

Posted by Alireza Sanaee via qemu development 2 days, 18 hours ago

The CXL address to device decoding logic is complex because of the need
to correctly decode fine grained interleave. The current implementation
prevents use with KVM where executed instructions may reside in that
memory and gives very slow performance even in TCG.

In many real cases non interleaved memory configurations are useful and
for those we can use a more conventional memory region alias allowing
similar performance to other memory in the system.

Whether this fast path is applicable can be established once the full
set of HDM decoders has been committed (in whatever order the guest
decides to commit them). As such a check is performed on each commit /
uncommit of HDM decoder to establish if the alias should be added or
removed.

Tested Topologies and Region Layouts
====================================

This series was validated across multiple CXL topology configurations,
covering single-device, multi-device, multi-host-bridge, and switched
fabrics. Region creation was exercised using the `cxl` userspace tool
with both non-interleaved and interleaved setups.

Decoder and memdev identifiers were discovered using:

  cxl list
  cxl list -D

Decoder IDs (e.g. decoder0.0) and memdev names (mem0, mem1) are
environment-specific. Commands below use placeholders such as
<decoder_span_both> which should be replaced with IDs from `cxl list -D`.

---------------------------------------------------------------------

Region Layout Notation
----------------------

CFMW (CXL Fixed Memory Window) is shown as a linear address space
containing regions:

  CFMW: [ R0 | R1 | R2 ]

R0, R1, R2 are regions created by `cxl create-region`.

Non-interleaved region:

  R0 (ways=1) -> entirely on one device (mem0 or mem1)
  Fast path: APPLICABLE

2-way interleaved region (g=256):

  R1 (ways=2, g=256) striped across devices:

    |mem0|mem1|mem0|mem1|mem0|mem1| ...
     256  256  256  256  256  256  bytes

  Fast path: NOT APPLICABLE

---------------------------------------------------------------------

1) One device, one host bridge, one fixed window
------------------------------------------------

QEMU:

  -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
  -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
  -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
  -object memory-backend-ram,id=mem0,size=512M,share=on
  -device cxl-type3,id=dev0,bus=rp0,memdev=mem0

Topology:

  Host
    |
    +-- CXL Host Bridge (cxl.0)
         |
         +-- Root Port (rp0)
              |
              +-- Type-3 (dev0, mem0)

Regions created:

  cxl create-region ... -w 1 ... mem0   (Fast path: YES)
  cxl create-region ... -w 1 ... mem0   (Fast path: YES)

Layout:

  CFMW: [ R0 | R1 ]

  R0 -> mem0  (Fast path: YES)
  R1 -> mem0  (Fast path: YES)

---------------------------------------------------------------------

2) One host bridge, two Type-3 devices (via two root ports)
------------------------------------------------------------

QEMU:

  -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
  -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
  -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
  -device cxl-rp,id=rp1,bus=cxl.0,port=1,chassis=0,slot=3
  -object memory-backend-ram,id=mem0,size=512M,share=on
  -object memory-backend-ram,id=mem1,size=512M,share=on
  -device cxl-type3,id=dev0,bus=rp0,memdev=mem0
  -device cxl-type3,id=dev1,bus=rp1,memdev=mem1

Topology:

  Host
    |
    +-- CXL Host Bridge (cxl.0)
         |
         +-- Root Port (rp0) -- Type-3 (dev0, mem0)
         |
         +-- Root Port (rp1) -- Type-3 (dev1, mem1)

Region patterns exercised:

2.1 All non-interleaved:
  R0 -> mem0  (Fast path: YES)
  R1 -> mem0  (Fast path: YES)
  R2 -> mem1  (Fast path: YES)
  R3 -> mem1  (Fast path: YES)

2.2 Interleaved + local:
  R0 -> mem0/mem1 interleaved  (Fast path: NO)
  R1 -> mem0                   (Fast path: YES)

2.3 Local + interleaved + local:
  R0 -> mem0                   (Fast path: YES)
  R1 -> mem0/mem1 interleaved  (Fast path: NO)
  R2 -> mem1                   (Fast path: YES)

---------------------------------------------------------------------

3) Two host bridges, one device per host bridge
------------------------------------------------

QEMU:

  -M q35,cxl=on,
     cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,
     cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G,
     cxl-fmw.2.targets.0=cxl.0,cxl-fmw.2.size=4G
  -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
  -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
  -object memory-backend-ram,id=mem0,size=512M,share=on
  -device cxl-type3,id=dev0,bus=rp0,memdev=mem0
  -device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=13
  -device cxl-rp,id=rp1,bus=cxl.1,port=0,chassis=1,slot=2
  -object memory-backend-ram,id=mem1,size=512M,share=on
  -device cxl-type3,id=dev1,bus=rp1,memdev=mem1

Region patterns identical to section 2, and fast-path applicability is
identical per region mapping (non-interleaved: YES, interleaved: NO).

---------------------------------------------------------------------

4) Switch topology
------------------

QEMU:

  -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
  -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
  -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
  -device cxl-rp,id=rp1,bus=cxl.0,port=0,chassis=0,slot=3
  -device cxl-upstream,id=us0,bus=rp0
  -device cxl-downstream,id=ds0,bus=us0,port=0,chassis=0,slot=4
  -object memory-backend-ram,id=mem0,size=512M,share=on
  -device cxl-type3,id=dev0,bus=ds0,memdev=mem0

Topology (detailed):

  Host
    |
    +-- CXL Host Bridge (cxl.0)
         |
         +-- Root Port (rp0)
         |     |
         |     +-- CXL Switch (upstream us0)
         |           |
         |           +-- Downstream Port (ds0) -- Type-3 (mem0)
         |           |
         |           +-- Downstream Port (ds1) -- Type-3 (mem1) [optional]
         +-- Root Port (rp1)
               |
               +-- More devices/switches.

Fast-path interpretation in this topology:

  If only mem0 exists:
    All regions -> Fast path: YES

  If mem0 and mem1 exist:
    Non-interleaved regions -> Fast path: YES
    Interleaved regions     -> Fast path: NO

---------------------------------------------------------------------

Summary
-------

Across all topologies, region creation, enablement, and HDM decoder
commit/uncommit flows were exercised. The fast path is enabled only when
all decoders describe a non-interleaved mapping and is removed when any
interleave configuration is introduced.

Alireza Sanaee (2):
  hw/cxl: Use HPA in cxl_cfmws_find_device() rather than offset in
    window.
  hw/cxl: Add a performant (and correct) path for the non interleaved
    cases

 hw/cxl/cxl-component-utils.c |   8 ++
 hw/cxl/cxl-host.c            | 203 +++++++++++++++++++++++++++++++++--
 hw/mem/cxl_type3.c           |   4 +
 include/hw/cxl/cxl.h         |   1 +
 include/hw/cxl/cxl_device.h  |   1 +
 5 files changed, 208 insertions(+), 9 deletions(-)

-- 
2.43.0

Re: [PATCH v2 0/2] Performant CXL type 3 non-interleaved regions

Posted by Gregory Price 1 day, 15 hours ago

On Wed, Feb 04, 2026 at 02:59:59PM +0000, Alireza Sanaee via qemu development wrote:
> The CXL address to device decoding logic is complex because of the need
> to correctly decode fine grained interleave. The current implementation
> prevents use with KVM where executed instructions may reside in that
> memory and gives very slow performance even in TCG.
> 

what was the base-commit of this set? I'm having trouble trying to get
it applied.

~Gregory

Re: [PATCH v2 0/2] Performant CXL type 3 non-interleaved regions

Posted by Alireza Sanaee via qemu development 1 day ago

On Thu, 5 Feb 2026 13:07:03 -0500
Gregory Price <gourry@gourry.net> wrote:

> On Wed, Feb 04, 2026 at 02:59:59PM +0000, Alireza Sanaee via qemu development wrote:
> > The CXL address to device decoding logic is complex because of the need
> > to correctly decode fine grained interleave. The current implementation
> > prevents use with KVM where executed instructions may reside in that
> > memory and gives very slow performance even in TCG.
> >   
> 
> what was the base-commit of this set? I'm having trouble trying to get
> it applied.
> 
> ~Gregory

Hi Gregory,

It is on top of bunch of patches Jonathan's sending to be picked up. Not sure which one is picked up but here is my tree here including Jonathan's patches.

https://github.com/sarsanaee/qemu-lab/commits/performant-v2-E/

Thanks,
Ali