RE: [PATCH 0/5] PCI/CXL: Save and restore CXL DVSEC and HDM state across resets

Dan Williams posted 5 patches 2 weeks, 6 days ago
RE: [PATCH 0/5] PCI/CXL: Save and restore CXL DVSEC and HDM state across resets
Posted by Dan Williams 2 weeks, 6 days ago
Manish Honap wrote:
[..]
> > The CXL accelerator series is currently contending with being able to
> > restore device configuration after reset. I expect vfio-cxl to build on
> > that, not push CXL flows into the PCI core.
> 
> Hello Dan,
> 
> My VFIO CXL Type-2 passthrough series [1] takes a position on this that I
> would like to explain because I expect you will have similar concerns about
> it and I'd rather have this conversation now.
> 
> Type-2 passthrough series takes the opposite structural approach as you are
> suggesting here: CXL Type-2 support is an optional extension compiled into
> vfio-pci-core (CONFIG_VFIO_CXL_CORE), not a separate driver.
> 
> Here is the reasoning:
> 
> 1. Device enumeration
> =====================
> 
> CXL Type-2 devices (GPU + accelerator class) are enumerated as struct pci_dev
> objects.  The kernel discovers them through PCI config space scan, not through
> the CXL bus. The CXL capability is advertised via the DVSEC (PCI_EXT_CAP_ID
> 0x23, Vendor ID 0x1E98), which is PCI config space. There is no CXL bus
> device to bind to.
> 
> A standalone vfio-cxl driver would therefore need to match on the PCI device
> just like vfio-pci does, and then call into vfio-pci-core for every PCI
> concern: config space emulation, BAR region handling, MSI/MSI-X, INTx, DMA
> mapping, FLR, and migration callbacks. That is the variant driver pattern
> we rejected in favour of generic CXL passthrough. We have seen this exact

Lore link for this "rejection" discussion?

> outcome with the prior iterations of this series before we moved to the
> enlightened vfio-pci model.

I still do not understand the argument. CXL functionality is a library
that PCI drivers can use. If vfio-pci functionality is also a library
then vfio-cxl is a driver that uses services from both libraries. Where
the module and driver name boundaries are drawn is more an organizational
decision than a functional one.

The argument for vfio-cxl organizational independence is more about
being able to tell at a diffstat level the relative PCI vs CXL
maintenance impact / regression risk.

> 2. CXL-CORE involvement
> =======================
> 
> CXL type-2 passthrough series does not bypass CXL core. At vfio_pci_probe()
> time the CXL enlightenment layer:
> 
>   - calls cxl_get_hdm_info() to probe the HDM Decoder Capability block,
>   - calls cxl_get_committed_decoder() to locate pre-committed firmware regions,
>   - calls cxl_create_region() / cxl_request_dpa() for dynamic allocation,
>   - creates a struct cxl_memdev via the CXL core (via cxl_probe_component_regs,
>     the same path Alejandro's v23 series uses).
> 
> The CXL core is fully involved.  The difference is that the binding to
> userspace is still through vfio-pci, which already manages the pci_dev
> lifecycle, reset sequencing, and VFIO region/irq API.

Sure, every CXL driver in the system will do the same.

> 3. Standalone vfio-cxl
> ======================
> 
> To match the model you are suggesting, vfio-cxl would need to:
> 
>   (a) Register a new driver on the CXL bus (struct cxl_driver), probing
>       struct cxl_memdev or a new struct cxl_endpoint,

What, why? Just as this patch series was proposing extending the
PCI core with additional common functionality, the proposal is to extend
the CXL core object drivers with the same.

>   (b) Re-implement or delegate everything vfio-pci-core provides — config
>       space, BAR regions, IRQs, DMA, FLR, and VFIO container management —
>       either by calling vfio-pci-core as a library or by duplicating it, and

What is the argument against a library?

>   (c) present to userspace through a new device model distinct from
>       vfio-pci.

CXL is a distinct operational model. What breaks if userspace is
required to explicitly account for CXL passthrough?

> This is a significant new surface. QEMU's CXL passthrough support already
> builds on vfio-pci: it receives the PCI device via VFIO, reads the
> VFIO_DEVICE_INFO_CAP_CXL capability chain, and exposes the CXL topology.
> A vfio-cxl object model would require non-trivial QEMU changes for something
> that already works in the enlightened vfio-pci model.

What specifically about a kernel code organization choice affects the
QEMU implementation? A uAPI is kernel code organization agnostic.

The concern is designing ourselves into a PCI corner when long-term QEMU
benefits from understanding CXL objects. For example, CXL error handling
/ recovery is already well on its way to being performed in terms of CXL
port objects.

> 4. Module dependency
> ====================
> 
> Current solution: CONFIG_VFIO_CXL_CORE depends on CONFIG_CXL_BUS. We do not
> add CXL knowledge to the PCI core;

drivers/pci/cxl.c

> we add it to the VFIO layer that is already CXL_BUS-dependent.

Yes, VFIO layer needs CXL enlightenment and VFIO's requirements imply
wider benefits to other CXL capable devices.

> I would very much appreciate your thoughts on [1] considering the above. I want
> to understand your thoughts on whether vfio-pci-core can remain the single
> entry point from userspace, or whether you envision a new VFIO device type.
> 
> Jonathan has indicated he has thoughts on this as well; hopefully, we
> can converge on a direction that doesn't require duplicating vfio-pci-core.

No one is suggesting "require duplicating vfio-pci-core"; please do not
argue with strawman caricatures like this.

> [1] https://lore.kernel.org/linux-cxl/20260311203440.752648-1-mhonap@nvidia.com/

Will take a look...
Re: [PATCH 0/5] PCI/CXL: Save and restore CXL DVSEC and HDM state across resets
Posted by Alex Williamson 2 weeks, 6 days ago
On Tue, 17 Mar 2026 10:03:28 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Manish Honap wrote:
> [..]
> > > The CXL accelerator series is currently contending with being able to
> > > restore device configuration after reset. I expect vfio-cxl to build on
> > > that, not push CXL flows into the PCI core.  
> > 
> > Hello Dan,
> > 
> > My VFIO CXL Type-2 passthrough series [1] takes a position on this that I
> > would like to explain because I expect you will have similar concerns about
> > it and I'd rather have this conversation now.
> > 
> > Type-2 passthrough series takes the opposite structural approach as you are
> > suggesting here: CXL Type-2 support is an optional extension compiled into
> > vfio-pci-core (CONFIG_VFIO_CXL_CORE), not a separate driver.
> > 
> > Here is the reasoning:
> > 
> > 1. Device enumeration
> > =====================
> > 
> > CXL Type-2 devices (GPU + accelerator class) are enumerated as struct pci_dev
> > objects.  The kernel discovers them through PCI config space scan, not through
> > the CXL bus. The CXL capability is advertised via the DVSEC (PCI_EXT_CAP_ID
> > 0x23, Vendor ID 0x1E98), which is PCI config space. There is no CXL bus
> > device to bind to.
> > 
> > A standalone vfio-cxl driver would therefore need to match on the PCI device
> > just like vfio-pci does, and then call into vfio-pci-core for every PCI
> > concern: config space emulation, BAR region handling, MSI/MSI-X, INTx, DMA
> > mapping, FLR, and migration callbacks. That is the variant driver pattern
> > we rejected in favour of generic CXL passthrough. We have seen this exact  
> 
> Lore link for this "rejection" discussion?
> 
> > outcome with the prior iterations of this series before we moved to the
> > enlightened vfio-pci model.  
> 
> I still do not understand the argument. CXL functionality is a library
> that PCI drivers can use.

This is a key aspect of the decision to "enlighten" vfio-pci to know
about CXL.  Ultimately the vfio driver for CXL devices is a PCI driver;
it binds to a PCI device.  We've developed macros for PCI devices to
identify in their ID table that they provide a vfio-pci override
driver, see for example the output of the following on your own system:

$ grep vfio_pci /lib/modules/`uname -r`/modules.alias

The catch-all entry is vfio-pci itself:

alias vfio_pci:v*d*sv*sd*bc*sc*i* vfio_pci

A vfio-pci variant driver for a specific device will include vendor and
device ID matches:

alias vfio_pci:v000015B3d0000101Esv*sd*bc*sc*i* mlx5_vfio_pci

Tools like libvirt know to make use of this when assigning a PCI
hostdev device to a VM by matching the most appropriate driver based on
these aliases.  They know they'll get a vfio-pci interface for use with
things like QEMU with a vfio-pci driver option.

If we introduce vfio-cxl, which also binds to a PCI device, how do we
make this automatic for userspace?

If we were to make "vfio-cxl" a vfio-pci variant driver, we'd need
to expand the ID table for specific devices, which becomes a
maintenance issue.  Otherwise userspace would need to detect the CXL
capabilities and override the automatic driver aliases.  We can't match
drivers based on DVSEC capabilities and we don't have any protocol to
define a "2nd best" match for a device alias if probe fails.

> If vfio-pci functionality is also a library
> then vfio-cxl is a driver that uses services from both libraries. Where
> the module and driver name boundaries are drawn is more an organizational
> decision than a functional one.

But as above, it is functional.  Someone needs to define when to use
which driver, which leads to libvirt needing to specify whether a
device is being exposed as PCI or CXL, and the same understanding in
each VMM.  OTOH, using vfio-pci as the basis and layering CXL feature
detection, ie. enlightenment, gives us a more compatible, incremental
approach.

> The argument for vfio-cxl organizational independence is more about
> being able to tell at a diffstat level the relative PCI vs CXL
> maintenance impact / regression risk.

But we still have that.  CXL enlightenment for vfio-pci(-core) can
still be configured out and compartmentalized into separate helper
library code.

> > 2. CXL-CORE involvement
> > =======================
> > 
> > CXL type-2 passthrough series does not bypass CXL core. At vfio_pci_probe()
> > time the CXL enlightenment layer:
> > 
> >   - calls cxl_get_hdm_info() to probe the HDM Decoder Capability block,
> >   - calls cxl_get_committed_decoder() to locate pre-committed firmware regions,
> >   - calls cxl_create_region() / cxl_request_dpa() for dynamic allocation,
> >   - creates a struct cxl_memdev via the CXL core (via cxl_probe_component_regs,
> >     the same path Alejandro's v23 series uses).
> > 
> > The CXL core is fully involved.  The difference is that the binding to
> > userspace is still through vfio-pci, which already manages the pci_dev
> > lifecycle, reset sequencing, and VFIO region/irq API.  
> 
> Sure, every CXL driver in the system will do the same.
> 
> > 3. Standalone vfio-cxl
> > ======================
> > 
> > To match the model you are suggesting, vfio-cxl would need to:
> > 
> >   (a) Register a new driver on the CXL bus (struct cxl_driver), probing
> >       struct cxl_memdev or a new struct cxl_endpoint,  
> 
> What, why? Just as this patch series was proposing extending the
> PCI core with additional common functionality, the proposal is to extend
> the CXL core object drivers with the same.

I don't follow, what is the proposal?
 
> >   (b) Re-implement or delegate everything vfio-pci-core provides — config
> >       space, BAR regions, IRQs, DMA, FLR, and VFIO container management —
> >       either by calling vfio-pci-core as a library or by duplicating it, and  
> 
> What is the argument against a library?

vfio-pci-core is already a library, and the extensions to support CXL as
an enlightenment of vfio-pci are also a library.  The issue is that a
vfio-cxl PCI driver module presents more issues than simply code
organization.
 
> >   (c) present to userspace through a new device model distinct from
> >       vfio-pci.  
> 
> CXL is a distinct operational model. What breaks if userspace is
> required to explicitly account for CXL passthrough?

The entire virtualization stack needs to gain an understanding of the
intended use case of the device rather than simply push a PCI device
with CXL capabilities out to the guest.
 
> > This is a significant new surface. QEMU's CXL passthrough support already
> > builds on vfio-pci: it receives the PCI device via VFIO, reads the
> > VFIO_DEVICE_INFO_CAP_CXL capability chain, and exposes the CXL topology.
> > A vfio-cxl object model would require non-trivial QEMU changes for something
> > that already works in the enlightened vfio-pci model.  
> 
> What specifically about a kernel code organization choice affects the
> QEMU implementation? A uAPI is kernel code organization agnostic.
> 
> The concern is designing ourselves into a PCI corner when longterm QEMU
> benefits from understanding CXL objects. For example, CXL error handling
> / recovery is already well on its way to being performed in terms of CXL
> port objects.

Are you suggesting that rather than using the PCI device as the basis
for assignment to a userspace driver or VM that we make each port
object assignable and somehow collect them into a configuration on top of
a PCI device?  I don't think these port objects are isolated for such a
use case.  I'd like to better understand how you envision this to work.

The organization of the code in the kernel seems 90%+ the same whether
we enlighten vfio-pci to detect and expose CXL features or we create a
separate vfio-cxl PCI driver only for CXL devices, but the userspace
consequences are increased significantly.
 
> > 4. Module dependency
> > ====================
> > 
> > Current solution: CONFIG_VFIO_CXL_CORE depends on CONFIG_CXL_BUS. We do not
> > add CXL knowledge to the PCI core;  
> 
> drivers/pci/cxl.c

This is largely a consequence of CXL_BUS being a loadable module.
 
> > we add it to the VFIO layer that is already CXL_BUS-dependent.  
> 
> Yes, VFIO layer needs CXL enlightenment and VFIO's requirements imply
> wider benefits to other CXL capable devices.
> 
> > I would very much appreciate your thoughts on [1] considering the above. I want
> > to understand your thoughts on whether vfio-pci-core can remain the single
> > entry point from userspace, or whether you envision a new VFIO device type.
> > 
> > Jonathan has indicated he has thoughts on this as well; hopefully, we
> > can converge on a direction that doesn't require duplicating vfio-pci-core.  
> 
> No one is suggesting "require duplicating vfio-pci-core"; please do not
> argue with strawman caricatures like this.

I think it comes down to whether the enlightenment maps to the existing
granularity of the core module.  Reset is probably a good example, ie.
how does the device being CXL affect the emulation of FLR, initiated
through device config space, versus the device reset ioctl.  The former
should maintain the CXL.io scope while the latter has an expanded scope
with CXL.
 
> > [1] https://lore.kernel.org/linux-cxl/20260311203440.752648-1-mhonap@nvidia.com/  
> 
> Will take a look...

Thanks!

Alex
Re: [PATCH 0/5] PCI/CXL: Save and restore CXL DVSEC and HDM state across resets
Posted by Dan Williams 4 days, 19 hours ago
Alex Williamson wrote:

Hey Alex, sorry for the lag in responding here...

> On Tue, 17 Mar 2026 10:03:28 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Manish Honap wrote:
> > [..]
> > > > The CXL accelerator series is currently contending with being able to
> > > > restore device configuration after reset. I expect vfio-cxl to build on
> > > > that, not push CXL flows into the PCI core.  
> > > 
> > > Hello Dan,
> > > 
> > > My VFIO CXL Type-2 passthrough series [1] takes a position on this that I
> > > would like to explain because I expect you will have similar concerns about
> > > it and I'd rather have this conversation now.
> > > 
> > > Type-2 passthrough series takes the opposite structural approach as you are
> > > suggesting here: CXL Type-2 support is an optional extension compiled into
> > > vfio-pci-core (CONFIG_VFIO_CXL_CORE), not a separate driver.
> > > 
> > > Here is the reasoning:
> > > 
> > > 1. Device enumeration
> > > =====================
> > > 
> > > CXL Type-2 devices (GPU + accelerator class) are enumerated as struct pci_dev
> > > objects.  The kernel discovers them through PCI config space scan, not through
> > > the CXL bus. The CXL capability is advertised via the DVSEC (PCI_EXT_CAP_ID
> > > 0x23, Vendor ID 0x1E98), which is PCI config space. There is no CXL bus
> > > device to bind to.
> > > 
> > > A standalone vfio-cxl driver would therefore need to match on the PCI device
> > > just like vfio-pci does, and then call into vfio-pci-core for every PCI
> > > concern: config space emulation, BAR region handling, MSI/MSI-X, INTx, DMA
> > > mapping, FLR, and migration callbacks. That is the variant driver pattern
> > > we rejected in favour of generic CXL passthrough. We have seen this exact  
> > 
> > Lore link for this "rejection" discussion?
> > 
> > > outcome with the prior iterations of this series before we moved to the
> > > enlightened vfio-pci model.  
> > 
> > I still do not understand the argument. CXL functionality is a library
> > that PCI drivers can use.
> 
[..]
> If we were to make "vfio-cxl" as a vfio-pci variant driver, we'd need
> to expand the ID table for specific devices, which becomes a
> maintenance issue.  Otherwise userspace would need to detect the CXL
> capabilities and override the automatic driver aliases.  We can't match
> drivers based on DVSEC capabilities and we don't have any protocol to
> define a "2nd best" match for a device alias if probe fails.

I can see the argument, and why it makes sense to attempt this way
first. Point conceded.

Now a follow-on concern is how to manage the case of "PCI operation
is available, but CXL operation is not. Does the driver proceed?" Put
another way, I immediately see how to convey the policy of "continue
without CXL" when there is an explicit driver distinction, but it is
ambiguous with an enlightened vfio-pci driver.

> > If vfio-pci functionality is also a library
> > then vfio-cxl is a driver that uses services from both libraries. Where
> > the module and driver name boundaries are drawn is more an organizational
> > decision than a functional one.
> 
> But as above, it is functional.  Someone needs to define when to use
> which driver, which leads to libvirt needing to specify whether a
> device is being exposed as PCI or CXL, and the same understanding in
> each VMM.  OTOH, using vfio-pci as the basis and layering CXL feature
> detection, ie. enlightenment, gives us a more compatible, incremental
> approach.

Ok, to make sure I understand the proposal: userspace still needs to
end up with knowledge of CXL operation, but that need not be resolved by
module policy.

Userspace also just needs to be ok with the unsightliness of the CXL
modules autoloading on systems without CXL.

> > The argument for vfio-cxl organizational independence is more about
> > being able to tell at a diffstat level the relative PCI vs CXL
> > maintenance impact / regression risk.
> 
> But we still have that.  CXL enlightenment for vfio-pci(-core) can
> still be configured out and compartmentalized into separate helper
> library code.

Yes, modulo some of the proposal here to enlighten the PCI core with CXL
specifics that I want to give more scrutiny.

> > > 2. CXL-CORE involvement
> > > =======================
> > > 
> > > CXL type-2 passthrough series does not bypass CXL core. At vfio_pci_probe()
> > > time the CXL enlightenment layer:
> > > 
> > >   - calls cxl_get_hdm_info() to probe the HDM Decoder Capability block,
> > >   - calls cxl_get_committed_decoder() to locate pre-committed firmware regions,
> > >   - calls cxl_create_region() / cxl_request_dpa() for dynamic allocation,
> > >   - creates a struct cxl_memdev via the CXL core (via cxl_probe_component_regs,
> > >     the same path Alejandro's v23 series uses).
> > > 
> > > The CXL core is fully involved.  The difference is that the binding to
> > > userspace is still through vfio-pci, which already manages the pci_dev
> > > lifecycle, reset sequencing, and VFIO region/irq API.  
> > 
> > Sure, every CXL driver in the system will do the same.
> > 
> > > 3. Standalone vfio-cxl
> > > ======================
> > > 
> > > To match the model you are suggesting, vfio-cxl would need to:
> > > 
> > >   (a) Register a new driver on the CXL bus (struct cxl_driver), probing
> > >       struct cxl_memdev or a new struct cxl_endpoint,  
> > 
> > What, why? Just as this patch series was proposing extending the
> > PCI core with additional common functionality, the proposal is to extend
> > the CXL core object drivers with the same.
> 
> I don't follow, what is the proposal?

Implement features like CXL Reset as operations against CXL objects like
memdevs and regions. For example, PCI reset does not consider management
of cache coherent memory, and certainly not interleaved cache coherent
memory. Other CXL drivers also benefit if these capabilities are
centralized.

> > >   (b) Re-implement or delegate everything vfio-pci-core provides — config
> > >       space, BAR regions, IRQs, DMA, FLR, and VFIO container management —
> > 
> > What is the argument against a library?
> 
> vfio-pci-core is already a library, and the extensions to support CXL as
> an enlightenment of vfio-pci are also a library.  The issue is that a
> vfio-cxl PCI driver module presents more issues than simply code
> organization.

Understood. As I conceded above my concerns are complications that a
vfio-cxl module does not solve cleanly.

> > >   (c) present to userspace through a new device model distinct from
> > >       vfio-pci.  
> > 
> > CXL is a distinct operational model. What breaks if userspace is
> > required to explicitly account for CXL passthrough?
> 
> The entire virtualization stack needs to gain an understanding of the
> intended use case of the device rather than simply push a PCI device
> with CXL capabilities out to the guest.

Agree.

> > > This is a significant new surface. QEMU's CXL passthrough support already
> > > builds on vfio-pci: it receives the PCI device via VFIO, reads the
> > > VFIO_DEVICE_INFO_CAP_CXL capability chain, and exposes the CXL topology.
> > > A vfio-cxl object model would require non-trivial QEMU changes for something
> > > that already works in the enlightened vfio-pci model.  
> > 
> > What specifically about a kernel code organization choice affects the
> > QEMU implementation? A uAPI is kernel code organization agnostic.
> > 
> > The concern is designing ourselves into a PCI corner when longterm QEMU
> > benefits from understanding CXL objects. For example, CXL error handling
> > / recovery is already well on its way to being performed in terms of CXL
> > port objects.
> 
> Are you suggesting that rather than using the PCI device as the basis
> for assignment to a userspace driver or VM that we make each port
> object assignable and somehow collect them into a configuration on top of
> a PCI device?  I don't think these port objects are isolated for such a
> use case.  I'd like to better understand how you envision this to work.

No, simply that CXL operations relative to that assigned PCI device are
serviced by the CXL core. The object to manage over reset is subject to
CPU speculative reads and potentially interleaving; I think that breaks
the PCI expectations of local, device-scoped operations.

If CXL Reset in particular stays out of the PCI core it at least
requires something CXL enlightened to be loaded, and at a minimum I do
not think that "something CXL enlightened" should be the PCI core.

There is a reason the CXL specification decided to block secondary bus
reset by default.

> The organization of the code in the kernel seems 90%+ the same whether
> we enlighten vfio-pci to detect and expose CXL features or we create a
> separate vfio-cxl PCI driver only for CXL devices, but the userspace
> consequences are increased significantly.

Agree.

> > > 4. Module dependency
> > > ====================
> > > 
> > > Current solution: CONFIG_VFIO_CXL_CORE depends on CONFIG_CXL_BUS. We do not
> > > add CXL knowledge to the PCI core;  
> > 
> > drivers/pci/cxl.c
> 
> This is largely a consequence of CXL_BUS being a loadable module.

Yes, the question is why does that matter for CXL-enlightened operation?
Simply do not burden the PCI core with learning all the CXL concerns.
Re: [PATCH 0/5] PCI/CXL: Save and restore CXL DVSEC and HDM state across resets
Posted by Alex Williamson 3 days, 23 hours ago
Hey Dan,

On Wed, 1 Apr 2026 18:12:19 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Alex Williamson wrote:
> 
> Hey Alex, sorry for the lag in responding here...
> 
> > On Tue, 17 Mar 2026 10:03:28 -0700
> > Dan Williams <dan.j.williams@intel.com> wrote:
> >   
> > > Manish Honap wrote:
> > > [..]  
> > > > > The CXL accelerator series is currently contending with being able to
> > > > > restore device configuration after reset. I expect vfio-cxl to build on
> > > > > that, not push CXL flows into the PCI core.    
> > > > 
> > > > Hello Dan,
> > > > 
> > > > My VFIO CXL Type-2 passthrough series [1] takes a position on this that I
> > > > would like to explain because I expect you will have similar concerns about
> > > > it and I'd rather have this conversation now.
> > > > 
> > > > Type-2 passthrough series takes the opposite structural approach as you are
> > > > suggesting here: CXL Type-2 support is an optional extension compiled into
> > > > vfio-pci-core (CONFIG_VFIO_CXL_CORE), not a separate driver.
> > > > 
> > > > Here is the reasoning:
> > > > 
> > > > 1. Device enumeration
> > > > =====================
> > > > 
> > > > CXL Type-2 devices (GPU + accelerator class) are enumerated as struct pci_dev
> > > > objects.  The kernel discovers them through PCI config space scan, not through
> > > > the CXL bus. The CXL capability is advertised via the DVSEC (PCI_EXT_CAP_ID
> > > > 0x23, Vendor ID 0x1E98), which is PCI config space. There is no CXL bus
> > > > device to bind to.
> > > > 
> > > > A standalone vfio-cxl driver would therefore need to match on the PCI device
> > > > just like vfio-pci does, and then call into vfio-pci-core for every PCI
> > > > concern: config space emulation, BAR region handling, MSI/MSI-X, INTx, DMA
> > > > mapping, FLR, and migration callbacks. That is the variant driver pattern
> > > > we rejected in favour of generic CXL passthrough. We have seen this exact    
> > > 
> > > Lore link for this "rejection" discussion?
> > >   
> > > > outcome with the prior iterations of this series before we moved to the
> > > > enlightened vfio-pci model.    
> > > 
> > > I still do not understand the argument. CXL functionality is a library
> > > that PCI drivers can use.  
> >   
> [..]
> > If we were to make "vfio-cxl" as a vfio-pci variant driver, we'd need
> > to expand the ID table for specific devices, which becomes a
> > maintenance issue.  Otherwise userspace would need to detect the CXL
> > capabilities and override the automatic driver aliases.  We can't match
> > drivers based on DVSEC capabilities and we don't have any protocol to
> > define a "2nd best" match for a device alias if probe fails.  
> 
> I can see the argument, and why it makes sense to attempt this way
> first. Point conceded.
> 
> Now a follow-on concern is how to manage the case of "PCI operation
> is available, but CXL operation is not. Does the driver proceed?" Put
> another way, I immediately see how to convey the policy of "continue
> without CXL" when there is an explicit driver distinction, but it is
> ambiguous with an enlightened vfio-pci driver.

As an enlightenment to vfio-pci, CXL support must in all cases degrade
to PCI support.  Manish's series proposes a new flag bit in the
DEVICE_INFO ioctl for CXL (type2 specifically) that would be used in
combination with the existing PCI flag.  If both are set, it's a PCI
device with CXL.{mem,cache} capability, otherwise only PCI would be set.
 
> > > If vfio-pci functionality is also a library
> > > then vfio-cxl is a driver that uses services from both libraries. Where
> > > the module and driver name boundaries are drawn is more an organizational
> > > decision than a functional one.
> > 
> > But as above, it is functional.  Someone needs to define when to use
> > which driver, which leads to libvirt needing to specify whether a
> > device is being exposed as PCI or CXL, and the same understanding in
> > each VMM.  OTOH, using vfio-pci as the basis and layering CXL feature
> > detection, ie. enlightenment, gives us a more compatible, incremental
> > approach.  
> 
> Ok, to make sure I understand the proposal: userspace still needs to
> end up with knowledge of CXL operation, but that need not be resolved by
> module policy.

It's a single module as far as userspace is concerned, and the decision
lies with userspace whether to take advantage of the CXL features
indicated by the device flag.
 
> Userspace also just needs to be ok with the unsightliness of the CXL
> modules autoloading on systems without CXL.

I'm open to suggestions here.  The current proposal will pull in CXL
modules regardless of having a CXL device.

We could build vfio_cxl_core as a module with an automatic
MODULE_SOFTDEP in vfio_pci_core.  We could then do a symbol_get around
CXL code so that we never CXL enlighten a device if the module isn't
loaded, allowing userspace policy control via modprobe.d blacklists.
We could also use a registration mechanism from vfio-cxl-core to
vfio-pci-core to avoid symbol_gets.

> > > The argument for vfio-cxl organizational independence is more about
> > > being able to tell at a diffstat level the relative PCI vs CXL
> > > maintenance impact / regression risk.  
> > 
> > But we still have that.  CXL enlightenment for vfio-pci(-core) can
> > still be configured out and compartmentalized into separate helper
> > library code.  
> 
> Yes, modulo some of the proposal here to enlighten the PCI core with CXL
> specifics that I want to give more scrutiny.
> 
> > > > 2. CXL-CORE involvement
> > > > =======================
> > > > 
> > > > CXL type-2 passthrough series does not bypass CXL core. At vfio_pci_probe()
> > > > time the CXL enlightenment layer:
> > > > 
> > > >   - calls cxl_get_hdm_info() to probe the HDM Decoder Capability block,
> > > >   - calls cxl_get_committed_decoder() to locate pre-committed firmware regions,
> > > >   - calls cxl_create_region() / cxl_request_dpa() for dynamic allocation,
> > > >   - creates a struct cxl_memdev via the CXL core (via cxl_probe_component_regs,
> > > >     the same path Alejandro's v23 series uses).
> > > > 
> > > > The CXL core is fully involved.  The difference is that the binding to
> > > > userspace is still through vfio-pci, which already manages the pci_dev
> > > > lifecycle, reset sequencing, and VFIO region/irq API.    
> > > 
> > > Sure, every CXL driver in the system will do the same.
> > >   
> > > > 3. Standalone vfio-cxl
> > > > ======================
> > > > 
> > > > To match the model you are suggesting, vfio-cxl would need to:
> > > > 
> > > >   (a) Register a new driver on the CXL bus (struct cxl_driver), probing
> > > >       struct cxl_memdev or a new struct cxl_endpoint,    
> > > 
> > > What, why? Just as this patch series was proposing extending the
> > > PCI core with additional common functionality, the proposal is to extend
> > > the CXL core object drivers with the same.
> > 
> > I don't follow, what is the proposal?  
> 
> Implement features like CXL Reset as operations against CXL objects like
> memdevs and regions. For example, PCI reset does not consider management
> of cache coherent memory, and certainly not interleaved cache coherent
> memory. Other CXL drivers also benefit if these capabilities are
> centralized.

I think "CXL Reset as operations against CXL objects" is largely already
proposed as [1].  However, it's specifically for type2 devices, so we
can ignore some of the complications, such as interleaved cache
coherence, of a type3 use case.  

[1] https://lore.kernel.org/all/20260306092322.148765-1-smadhavan@nvidia.com/

> > > >   (b) Re-implement or delegate everything vfio-pci-core provides — config
> > > >       space, BAR regions, IRQs, DMA, FLR, and VFIO container management —  
> > > 
> > > What is the argument against a library?  
> > 
> > vfio-pci-core is already a library, and the extensions to support CXL
> > as an enlightenment of vfio-pci are also a library.  The issue is that
> > a vfio-cxl PCI driver module presents more problems than simply code
> > organization.  
> 
> Understood. As I conceded above my concerns are complications that a
> vfio-cxl module does not solve cleanly.
> 
> > > >   (c) present to userspace through a new device model distinct from
> > > >       vfio-pci.    
> > > 
> > > CXL is a distinct operational model. What breaks if userspace is
> > > required to explicitly account for CXL passthrough?  
> > 
> > The entire virtualization stack needs to gain an understanding of the
> > intended use case of the device rather than simply push a PCI device
> > with CXL capabilities out to the guest.  
> 
> Agree.
> 
> > > > This is a significant new surface. QEMU's CXL passthrough support already
> > > > builds on vfio-pci: it receives the PCI device via VFIO, reads the
> > > > VFIO_DEVICE_INFO_CAP_CXL capability chain, and exposes the CXL topology.
> > > > A vfio-cxl object model would require non-trivial QEMU changes for something
> > > > that already works in the enlightened vfio-pci model.    
> > > 
> > > What specifically about a kernel code organization choice affects the
> > > QEMU implementation? A uAPI is kernel code organization agnostic.
> > > 
> > > The concern is designing ourselves into a PCI corner when longterm QEMU
> > > benefits from understanding CXL objects. For example, CXL error handling
> > > / recovery is already well on its way to being performed in terms of CXL
> > > port objects.  
> > 
> > Are you suggesting that rather than using the PCI device as the basis
> > for assignment to a userspace driver or VM, we make each port
> > object assignable and somehow collect them into a configuration on top of
> > a PCI device?  I don't think these port objects are isolated for such a
> > use case.  I'd like to better understand how you envision this to work.  
> 
> No, simply that CXL operations relative to that assigned PCI device are
> serviced by the CXL core. The object to manage over reset is subject to
> CPU speculative reads and potentially interleaving; I think that breaks
> the PCI expectation of local, device-scoped operations.
> 
> If CXL Reset in particular stays out of the PCI core it at least
> requires something CXL enlightened to be loaded, and at a minimum I do
> not think that "something CXL enlightened" should be the PCI core.
> 
> There is a reason the CXL specification decided to block secondary bus
> reset by default.
> 
> > The organization of the code in the kernel seems 90%+ the same whether
> > we enlighten vfio-pci to detect and expose CXL features or we create a
> > separate vfio-cxl PCI driver only for CXL devices, but the userspace
> > consequences are increased significantly.  
> 
> Agree.
> 
> > > > 4. Module dependency
> > > > ====================
> > > > 
> > > > Current solution: CONFIG_VFIO_CXL_CORE depends on CONFIG_CXL_BUS. We do not
> > > > add CXL knowledge to the PCI core;    
> > > 
> > > drivers/pci/cxl.c  
> > 
> > This is largely a consequence of CXL_BUS being a loadable module.  
> 
> Yes, the question is why does that matter for CXL enlightened operation?
> Simply do not burden the PCI core to learn all the CXL concerns.

How do we then proceed relative to save/restore of CXL state based on a
PCI reset?  Should CXL core register a save/restore handler with PCI
core or does PCI core reach out for a symbol from CXL core to support
save/restore?

If CXL core is not loaded, are we ok with silently losing CXL state
across a PCI reset, ie. assume that state is unused currently and accept
the risk of losing preconfigured decoders?

Does PCI core need to be involved in suppressing SBR?

Thanks,
Alex
Re: [PATCH 0/5] PCI/CXL: Save and restore CXL DVSEC and HDM state across resets
Posted by Dan Williams 3 days, 22 hours ago
Alex Williamson wrote:
[..]
> > Now a follow on concern is the plan to manage a case of "PCI operation
> > is available, but CXL operation is not. Does the driver proceed?" Put
> > another way, I immediately see how to convey the policy of "continue
> > without CXL" when there is an explicit driver distinction, but it is
> > ambiguous with an enlightened vfio-pci driver.
> 
> As an enlightenment to vfio-pci, CXL support must in all cases degrade
> to PCI support.  Manish's series proposes a new flag bit in the
> DEVICE_INFO ioctl for CXL (type2 specifically) that would be used in
> combination with the existing PCI flag.  If both are set, it's a PCI
> device with CXL.{mem,cache} capability, otherwise only PCI would be set.

Ok.

>  
> > > > If vfio-pci functionality is also a library
> > > > then vfio-cxl is a driver that uses services from both libraries. Where
> > > > the module and driver name boundaries are drawn is more an organizational
> > > > decision than a functional one.  
> > > 
> > > But as above, it is functional.  Someone needs to define when to use
> > > which driver, which leads to libvirt needing to specify whether a
> > > device is being exposed as PCI or CXL, and the same understanding in
> > > each VMM.  OTOH, using vfio-pci as the basis and layering CXL feature
> > > detection, ie. enlightenment, gives us a more compatible, incremental
> > > approach.  
> > 
> > Ok, to make sure I understand the proposal: userspace still needs to
> > end up with knowledge of CXL operation, but that need not be resolved by
> > module policy.
> 
> It's a single module as far as userspace is concerned, and the decision
> lies with userspace whether to take advantage of the CXL features
> indicated by the device flag.
>  
> > Userspace also just needs to be ok with the unsightliness of the CXL
> > modules autoloading on systems without CXL.
> 
> I'm open to suggestions here.  The current proposal will pull in CXL
> modules regardless of having a CXL device.
> 
> We could build vfio_cxl_core as a module with an automatic
> MODULE_SOFTDEP in vfio_pci_core.  We could then do a symbol_get around
> CXL code so that we never CXL enlighten a device if the module isn't
> loaded, allowing userspace policy control via modprobe.d blacklists.
> We could also use a registration mechanism from vfio-cxl-core to
> vfio-pci-core to avoid symbol_gets.

Probably just wait until someone really needs that small bit of memory
back.

> > Implement features like CXL Reset as operations against CXL objects like
> > memdevs and regions. For example, PCI reset does not consider management
> > of cache coherent memory, and certainly not interleaved cache coherent
> > memory. Other CXL drivers also benefit if these capabilities are
> > centralized.
> 
> I think "CXL Reset as operations against CXL objects" is largely already
> proposed as [1].  However, it's specifically for type2 devices, so we
> can ignore some of the complications, such as interleaved cache
> coherence, of a type3 use case.  

Nothing type3 specific about interleaved cache coherence. Now,
interleaving for accelerators is not in near term scope, but CXL.mem is
coherent. I do not want to paint the design into a corner in case
host-bridge interleaving for bandwidth becomes a consideration.

> [1] https://lore.kernel.org/all/20260306092322.148765-1-smadhavan@nvidia.com/

Still playing catch up on review, but yes, that version looks
directionally ok and at least has a chance to be extended for
interleaving.

[..]
> > > This is largely a consequence of CXL_BUS being a loadable module.  
> > 
> > Yes, the question is why does that matter for CXL enlightened operation?
> > Simply do not burden the PCI core to learn all the CXL concerns.
> 
> How do we then proceed relative to save/restore of CXL state based on a
> PCI reset?  Should CXL core register a save/restore handler with PCI
> core or does PCI core reach out for a symbol from CXL core to support
> save/restore?

I am currently thinking a CXL-registered handler to enable enlightened
reset, or direct CXL uAPI.

Is it not safe to assume that new CXL awareness in user tooling is
prepared to move to new CXL-aware reset interfaces?

> If CXL core is not loaded, are we ok with silently losing CXL state
> across a PCI reset, ie. assume that state is unused currently and accept
> the risk of losing preconfigured decoders?

> Does PCI core need to be involved in suppressing SBR?

SBR is disabled by default. It can currently be destructively forced
with the PCI "cxl_bus" reset type.
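
For reference, the forcing path is the standard sysfs reset_method
interface (the BDF below is a placeholder):

```shell
# Placeholder BDF; substitute the endpoint under test.
BDF=0000:35:00.0
# Methods the device supports, e.g. "flr bus cxl_bus".
cat /sys/bus/pci/devices/$BDF/reset_method
# Select the cxl_bus method, then trigger the reset (destructive:
# committed HDM decoder state is lost).
echo cxl_bus > /sys/bus/pci/devices/$BDF/reset_method
echo 1 > /sys/bus/pci/devices/$BDF/reset
```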

Maybe what this wants is a new "SBR iff single device CXL.mem and mem
not kernel mapped", to set aside the "but coherent interleave" noise.
Just not sure it is worth introducing the concept of a "device-rejectable
PCI reset" vs requiring use of the CXL uAPI directly.