cxl: Multi-headed device design

[RFC] cxl: Multi-headed device design

Posted by Gregory Price 1 year, 1 month ago

Originally I was planning to kick this off with a patch set, but I've
decided my current prototype does not fit the extensibility requirements
to go from SLD to MH-SLD to MH-MLD.


So instead I'd like to kick off by just discussing the data structures
and laugh/cry a bit about some of the frustrating ambiguities for MH-SLDs
when it comes to the specification.

I apologize for the sheer length of this email, but it really is just
that complex.


=============================================================
 What does the specification say about Multi-headed Devices? 
=============================================================

Defining each relevant component according to the specification:

>
> VCS - Virtual CXL Switch
> * Includes entities within the physical switch belonging to a
>   single VH. It is identified using the VCS ID.
> 
> 
> VH - Virtual Hierarchy.
> * Everything from the CXL RP down.
> 
> 
> LD - Logical Device
> * Entity that represents a CXL Endpoint that is bound to a VCS.
>   An SLD device contains one LD.  An MLD contains multiple LDs.
> 
> 
> SLD - Single Logical Device
> * That's it, that's the definition.
> 
> 
> MLD - Multi Logical Device
> * Multi-Logical Device. CXL component that contains multiple LDs,
>   out of which one LD is reserved for configuration via the FM API,
>   and each remaining LD is suitable for assignment to a different
>   host. Currently MLDs are architected only for Type 3 LDs.
> 
> 
> MH-SLD - Multi-Headed SLD
> * CXL component that contains multiple CXL ports, each presenting an
>   SLD. The ports must correctly operate when connected to any
>   combination of common or different hosts.
> 
> 
> MH-MLD - Multi-Headed MLD
> * CXL component that contains multiple CXL ports, each presenting an MLD
>   or SLD. The ports must correctly operate when connected to any
>   combination of common or different hosts. The FM-API is used to
>   configure each LD as well as the overall MH-MLD.
> 
>   MH-MLDs are considered a specialized type of MLD and, as such, are
>   subject to all functional and behavioral requirements of MLDs.
> 

Ambiguity #1:

* An SLD contains 1 Logical Device.
* An MH-SLD presents multiple SLDs, one per head.

Ergo an MH-SLD contains multiple LDs which makes it an MLD according to the
definition of LD, but not according to the definition of MLD, or MH-MLD.

Now is the winter of my discontent.

The Specification says this about MH-SLDs in other sections:

> 2.4.3 Pooled and Shared FAM
> 
> LD-FAM includes several device variants.
> 
> A multi-headed Single Logical Device (MH-SLD) exposes multiple LDs, each with
> a dedicated link.
> 
>
> 2.5 Multi-Headed Device
> 
> There are two types of Multi-Headed Devices that are distinguished by how
> they present themselves on each head:
> *  MH-SLD, which present SLDs on all heads
> *  MH-MLD, which may present MLDs on any of their heads
>
>
> Management of heads in Multi-Headed Devices follows the model defined for
> the device presented by that head:
> *  Heads that present SLDs may support the port management and control
>     features that are available for SLDs
> *  Heads that present MLDs may support the port management and control
>    features that are available for MLDs
>

I want to make very close note of this.  SLDs are managed like SLDs,
MLDs are managed like MLDs.  MH-SLDs, according to this, should be
managed like SLDs from the perspective of each host.

That's pretty straightforward.

>
> Management of memory resources in Multi-Headed Devices follows the model
> defined for MLD components because both MH-SLDs and MH-MLDs must support
> the isolation of memory resources, state, context, and management on a
> per-LD basis.  LDs within the device are mapped to a single head.
> 
> *  In MH-SLDs, there is a 1:1 mapping between heads and LDs.
> *  In MH-MLDs, multiple LDs are mapped to at most one head.
> 
> 
> Multi-Headed Devices expose a dedicated Component Command Interface (CCI),
> the LD Pool CCI, for management of all LDs within the device. The LD Pool
> CCI may be exposed as an MCTP-based CCI or can be accessed via the Tunnel
> Management Command command through a head’s Mailbox CCI, as detailed in
> Section 7.6.7.3.1.

2.5.1 continues on to describe "LD Management in MH-MLDs" but just ignores
that MH-SLDs (may) exist.  That's frustrating to say the least, but I
suppose we can gather from context that MH-SLDs *MAY NOT* have LD
management controls.

Let's see if that assumption holds.

> 7.6.7.3 MLD Port Command Set
>
> 7.6.7.3.1 Tunnel Management Command (Opcode 5300h)

The referenced section at the end of 2.5 seems to also suggest that
MH-SLDs do not (or don't have to?) implement the tunnel management
command set.  It sends us to the MLD command set, and SLDs don't get
managed like MLDs - ergo it's not relevant?

The final mention of MH-SLDs is in section 9.13.3:

> 9.13.3 Dynamic Capacity Device
> ...
>  MH-SLD or MH-MLD based DCD shall forcefully release shared Dynamic
>  Capacity associated with all associated hosts upon a Conventional Reset
>  of a head.
>

From this we can gather that the specification foresaw someone making a
memory pool from an MH-SLD... but without LD management. We can fill in
some blanks and assume that if someone wanted to, they could make a
shared memory device and implement pooling via software controls.

That'd be a neat bodge/hack.  But that's not important right now.


Finally, we look at what the mailbox command-set requirements are for
multi-headed devices:

> 7.6.7.5 Multi-Headed Device Command Set
> The Multi-Headed device command set includes commands for querying the
> Head-to-LD mapping in a Multi-Headed device. Support for this command
> set is required on the LD Pool CCI of a Multi-Headed device.
>

Ambiguity #2: Ok, now we're not sure whether an MH-SLD is supposed to
expose an LD Pool CCI or not.  Also, is a MH-SLD supposed to show up
as something other than just an SLD?  This is really confusing.

Going back to the MLD Port Command set, we see

> Valid targets for the tunneled commands include switch MLD Ports,
> valid LDs within an MLD, and the LD Pool CCI in a Multi-Headed device.

Whatever the case, there's only a single command in the MHD command set:

> 7.6.7.5.1 Get Multi-Headed Info (Opcode 5500h)

This command is pretty straightforward, it just tells you what the head
to LD mapping is for each of the LDs in the device.  Presumably this is
what gets modified by the FM-APIs when LDs are attached to VCS ports.

For the simplest MH-SLD device, these fields would be immutable, and
there would be a single LD for each head, where head_id == ld_id.
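
To make that concrete, here is a minimal sketch of the static case.  All
names are hypothetical, and the layout is NOT the Get Multi-Headed Info
payload format from the spec; I'm assuming the map is indexed by LD id and
holds the head each LD is assigned to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MHD_MAX_LDS 16 /* hypothetical cap, chosen only for this sketch */

/* Hypothetical device-wide state: ldmap[ld_id] = head that LD is mapped to. */
struct mhd_info {
    uint8_t nr_heads;
    uint8_t nr_lds;
    uint8_t ldmap[MHD_MAX_LDS];
};

/* Fill in the fixed 1:1 mapping (head_id == ld_id). Returns 0, or -1 on a bad count. */
int mhd_info_init_static(struct mhd_info *info, uint8_t nr_heads)
{
    assert(info != NULL);
    if (nr_heads == 0 || nr_heads > MHD_MAX_LDS) {
        return -1;
    }
    info->nr_heads = nr_heads;
    info->nr_lds = nr_heads;
    for (uint8_t i = 0; i < nr_heads; i++) {
        info->ldmap[i] = i;
    }
    return 0;
}

/* Copy up to `limit` map entries, starting at `start_ld`, into `out`.
 * Returns the number of entries copied. */
size_t mhd_info_query(const struct mhd_info *info, uint8_t start_ld,
                      uint8_t limit, uint8_t *out)
{
    size_t n = 0;

    assert(info != NULL && out != NULL);
    for (unsigned ld = start_ld; ld < info->nr_lds && n < limit; ld++) {
        out[n++] = info->ldmap[ld];
    }
    return n;
}
```

Since nothing ever writes to the map after init, a device like this would
never need FM-API involvement to answer the query.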



So summarizing, what I took away from this was the following:

In the simplest form of MH-SLD, there is neither a switch nor any
LD management.  So, presumably, we don't HAVE to implement the
MHD commands to say we "have MH-SLD support".


========
 Design
========

Ok... that's a lot to break down.  Here's what I think the roadmap
toward multi-headed multi-logical device support should look like:

1. SLD - we have this.  This is struct CXLType3Dev

2. MH-SLD No Switch, No Pool CCI.

3. MH-SLD w/ Pool CCI  (Implementing only Get Multi-Headed Info)

4. MH-SLD w/ Switch (Implementing remap-ability of LD to Head)

5. MH-MLD - the whole kit and kaboodle.


Let's talk about what the first MH-SLD might look like.


=================================
2. MH-SLD No Switch, No Pool CCI.
=================================

1. The device has a "memory pool" that "backs" each Logical Device, and
   the specification does not limit whether this memory is discrete
   or may be shared between heads.

   In QEMU, we can represent this with a shared or file memory backend:

-object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=4G,share=true


2. Each QEMU instance has a discrete SLD that amounts to its own private
   CXLType3Dev.  However, each "Head" maps back to the same common
   memory backend:

-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0


And that's it.  In fact, you can do this now, no changes needed!


But it's also not very useful.  You can only use the memory in devdax
mode, since it's a shared memory region. You could already do this via
the /dev/shm interface, so it's not even new functionality.

In theory you could build a pooling service in software-only on top of
memory blocks. That's an exercise left to the reader.


================================================================
3. MH-SLD w/ Pool CCI  (Implementing only Get Multi-Headed Info)
================================================================

This is a little more complicated, since we have our first bit of shared
state.  Originally I had considered a shared memory region in
CXLType3Dev, but this is a backwards abstraction (an MH-SLD contains
multiple SLDs, an SLD does not contain an MHD State).

diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h
index 7b72345079..1a9f2708e1 100644
--- a/include/hw/cxl/cxl_device.h
+++ b/include/hw/cxl/cxl_device.h
@@ -356,16 +356,6 @@ typedef struct CXLPoison {
 typedef QLIST_HEAD(, CXLPoison) CXLPoisonList;
 #define CXL_POISON_LIST_LIMIT 256

+struct CXLMHDState {
+    uint8_t nr_heads;
+    uint8_t nr_lds;
+    uint8_t ldmap[];
+};
+
 struct CXLType3Dev {
     /* Private */
     PCIDevice parent_obj;
@@ -377,15 +367,6 @@ struct CXLType3Dev {
     HostMemoryBackend *lsa;
     uint64_t sn;

+
+    /* Multi-headed device settings */
+    struct {
+        bool active;
+        uint32_t headid;
+        uint32_t shmid;
+        struct CXLMHDState *state;
+    } mhd;
+


The way you would instantiate this would be via a separate process
that initializes the shared memory region:

shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`
./cxl_mhd_init 4 $shmid1
-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd=true,mhd_head=0,mhd_shmid=$shmid1

./cxl_mhd_init would simply set up the nr_heads/lds field and such
and set ldmap[0-3] to the values [0-3].  i.e. the head-to-ld mappings
are static (head_id==ld_id).
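
A rough sketch of what the core of such an init tool might look like
follows.  The function names are hypothetical; the struct mirrors the
CXLMHDState layout proposed above, and the segment is assumed to be the
SysV one created by ipcmk.  The command-line parsing wrapper is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Same layout as the CXLMHDState proposed above. */
struct CXLMHDState {
    uint8_t nr_heads;
    uint8_t nr_lds;
    uint8_t ldmap[];
};

/* Write the static head_id == ld_id mapping into a caller-provided region.
 * Returns 0 on success, -1 if the head count is zero or the region is too small. */
int mhd_state_init(void *region, size_t region_size, uint8_t nr_heads)
{
    struct CXLMHDState *st = region;

    assert(region != NULL);
    if (nr_heads == 0 || region_size < sizeof(*st) + nr_heads) {
        return -1;
    }
    st->nr_heads = nr_heads;
    st->nr_lds = nr_heads;
    for (uint8_t i = 0; i < nr_heads; i++) {
        st->ldmap[i] = i;
    }
    return 0;
}

/* Attach an existing SysV shared memory segment and initialize it. */
int mhd_state_init_shm(int shmid, uint8_t nr_heads)
{
    struct shmid_ds ds;
    void *region;
    int rc;

    if (shmctl(shmid, IPC_STAT, &ds) < 0) {
        return -1; /* no such segment, or no permission */
    }
    region = shmat(shmid, NULL, 0);
    if (region == (void *)-1) {
        return -1;
    }
    rc = mhd_state_init(region, ds.shm_segsz, nr_heads);
    shmdt(region);
    return rc;
}
```

Each QEMU instance would then attach the same segment by id and treat the
contents as read-only in this static configuration.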



But like I said, this is a backwards abstraction, so realistically we
should flip this around such that we have the following:

struct CXLMHD_SharedState {
	uint8_t nr_heads;
	uint8_t nr_lds;
	uint8_t ldmap[];
};

struct CXLMH_SLD {
	uint32_t headid;
	uint32_t shmid;
	struct CXLMHD_SharedState *state;
	struct CXLType3Dev sld;
};

The shared state would be instantiated the same way as above.

With this we'd basically just create a new memory device:

hw/mem/cxl_mh_sld.c


This is pretty straightforward - we just expose some of cxl_type3.c
functions in order to instantiate the device accordingly, the rest of it
just becomes passthrough because... it's just a cxl_type3.c device.
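
To illustrate the "contains an SLD" direction, here is a small standalone
sketch.  The CXLType3Dev below is a one-field stand-in (the real struct is
much larger), the helper names are hypothetical, and in QEMU proper this
would presumably go through QOM type inheritance rather than a hand-rolled
cast.  I'm also assuming ldmap is indexed by LD id.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for QEMU's CXLType3Dev; the real struct has many more fields. */
struct CXLType3Dev {
    uint64_t sn;
};

struct CXLMHD_SharedState {
    uint8_t nr_heads;
    uint8_t nr_lds;
    uint8_t ldmap[]; /* assumed: ldmap[ld_id] = head the LD is mapped to */
};

struct CXLMH_SLD {
    uint32_t headid;
    uint32_t shmid;
    struct CXLMHD_SharedState *state;
    struct CXLType3Dev sld;
};

/* Recover the MH-SLD wrapper from a pointer to its embedded SLD. */
struct CXLMH_SLD *mh_sld_from_type3(struct CXLType3Dev *ct3d)
{
    assert(ct3d != NULL);
    return (struct CXLMH_SLD *)((char *)ct3d - offsetof(struct CXLMH_SLD, sld));
}

/* MHD-only query: which LD is mapped to this head?  Returns -1 if none. */
int mh_sld_ld_id(struct CXLType3Dev *ct3d)
{
    struct CXLMH_SLD *mhd = mh_sld_from_type3(ct3d);

    if (!mhd->state) {
        return -1;
    }
    for (int ld = 0; ld < mhd->state->nr_lds; ld++) {
        if (mhd->state->ldmap[ld] == mhd->headid) {
            return ld;
        }
    }
    return -1;
}
```

Existing type3 code keeps operating on the embedded sld unchanged; only the
MHD-specific commands need to walk back up to the wrapper.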


This ultimately manifests as:

shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`

./cxl_mhd_init 4 $shmid1

-device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=$shmid1


Note: This is the patch set I'm working towards, but I presume there
might be some (strong) opinions, so I didn't want to get too far into
development before posting this.


==============================================================
4. MH-SLD w/ Switch (Implementing LD management in an SLD)
==============================================================

Is it even rational to try to build such a device?

MH-SLDs have a 1-to-1 mapping of Head:Logical Device.

Presumably each SLD maps the entirety of the "pooled" memory,
but the specification does not state that is true.  You could, for
example, set up each Logical Device to map to a particular portion of the
shared/pooled memory area:

-object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=4G,share=true

QEMU #1
-device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=shmid,dpa_base=0,dpa_limit=1G

QEMU #2
-device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=1,mhd_shmid=shmid,dpa_base=1G,dpa_limit=1G

... and so on.

At least in theory, this would involve implementing something that
changes which SLD is mapped to a QEMU instance - but functionally this
is just changing the base and limit of each SLD.

It's interesting from a functional testing perspective, there's a bunch
of CCI/Tunnel commands that could be implemented, and presumably this
would require a separate process to manage/serialize appropriately.

=======================================
5. MH-MLD - the whole kit and kaboodle.
=======================================

If we implemented MH-SLD w/ Switching, then presumably it's just one step
further to create an MLD:

struct CXLMH_MLD {
        uint32_t headid;
        uint32_t shmid;
        struct CXLMHD_SharedState *state;
        struct CXLType3Dev ldmap[];
};

But I'm greatly oversimplifying here.  It's much more expressive to
describe an MLD in terms of a multi-tiered switch in the QEMU topology,
similar to what can be done right now:

-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12 \
-device cxl-rp,id=rp0,port=0,bus=cxl.0,chassis=0,slot=0 \
-device cxl-rp,id=rp1,port=1,bus=cxl.0,chassis=0,slot=1 \
-device cxl-upstream,bus=rp0,id=us0 \
-device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
-device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
-device cxl-type3,bus=swport0,volatile-memdev=mem0,id=cxl-mem0 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k


But in order to make this multi-headed, some amount of this state would need
to be encapsulated in a shared memory region (or would it? I don't know, I
haven't finished this thought experiment yet).


=====
 FIN 
=====

I realize this was long.  If you made it to the end of this email,
thank you for reading my TED talk.  I greatly appreciate any comments,
even if it's just "You've gone too deep, Gregory." ;]

Regards,
~Gregory

Re: [RFC] cxl: Multi-headed device design

Posted by Jonathan Cameron 11 months, 2 weeks ago

On Tue, 21 Mar 2023 21:50:33 -0400
Gregory Price <gregory.price@memverge.com> wrote:

Hi Gregory,

Sorry I took so long to reply to this. Busy month...

Vince presented at LSF-MM so I feel it's fair game to CC him on kernel
patches and he may be able to point you in the right direction for a few
things in this mail.


> Originally I was planning to kick this off with a patch set, but I've
> decided my current prototype does not fit the extensibility requirements
> to go from SLD to MH-SLD to MH-MLD.
> 
> 
> So instead I'd like to kick off by just discussing the data structures
> and laugh/cry a bit about some of the frustrating ambiguities for MH-SLDs
> when it comes to the specification.
> 
> I apologize for the sheer length of this email, but it really is just
> that complex.

hehe.  I read this far when you first sent it and decided to put
it on the todo list rather than reading the rest ;)

> 
> 
> =============================================================
>  What does the specification say about Multi-headed Devices? 
> =============================================================
> 
> Defining each relevant component according to the specification:
> 
> >
> > VCS - Virtual CXL Switch
> > * Includes entities within the physical switch belonging to a
> >   single VH. It is identified using the VCS ID.
> > 
> > 
> > VH - Virtual Hierarchy.
> > * Everything from the CXL RP down.
> > 
> > 
> > LD - Logical Device
> > * Entity that represents a CXL Endpoint that is bound to a VCS.
> >   An SLD device contains one LD.  An MLD contains multiple LDs.
> > 
> > 
> > SLD - Single Logical Device
> > * That's it, that's the definition.
> > 
> > 
> > MLD - Multi Logical Device
> > * Multi-Logical Device. CXL component that contains multiple LDs,
> >   out of which one LD is reserved for configuration via the FM API,
> >   and each remaining LD is suitable for assignment to a different
> >   host. Currently MLDs are architected only for Type 3 LDs.
> > 
> > 
> > MH-SLD - Multi-Headed SLD
> > * CXL component that contains multiple CXL ports, each presenting an
> >   SLD. The ports must correctly operate when connected to any
> >   combination of common or different hosts.
> > 
> > 
> > MH-MLD - Multi-Headed MLD
> > * CXL component that contains multiple CXL ports, each presenting an MLD
> >   or SLD. The ports must correctly operate when connected to any
> >   combination of common or different hosts. The FM-API is used to
> >   configure each LD as well as the overall MH-MLD.
> > 
> >   MH-MLDs are considered a specialized type of MLD and, as such, are
> >   subject to all functional and behavioral requirements of MLDs.
> >   
> 
> Ambiguity #1:
> 
> * An SLD contains 1 Logical Device.
> * An MH-SLD presents multiple SLDs, one per head.
> 
> Ergo an MH-SLD contains multiple LDs which makes it an MLD according to the
> definition of LD, but not according to the definition of MLD, or MH-MLD.

I'd go with 'sort of'.  SLD is a presentation of a device to a host.
It can be a normal single headed MLD that has been plugged directly into a host.

So for extra fun points you can have one MH-MLD that has some ports connected
to switches and others directly to hosts. Thus it can present as SLD on some
upstream ports and as MLD on others.

> 
> Now is the winter of my discontent.
> 
> The Specification says this about MH-SLDs in other sections:
> 
> > 2.4.3 Pooled and Shared FAM
> > 
> > LD-FAM includes several device variants.
> > 
> > A multi-headed Single Logical Device (MH-SLD) exposes multiple LDs, each with
> > a dedicated link.
> > 
> >
> > 2.5 Multi-Headed Device
> > 
> > There are two types of Multi-Headed Devices that are distinguished by how
> > they present themselves on each head:
> > *  MH-SLD, which present SLDs on all heads
> > *  MH-MLD, which may present MLDs on any of their heads

Yup. MH-SLD is the cheap device - not capable of MLD support to any upstream
port - so it can skip some functionality.

> >
> >
> > Management of heads in Multi-Headed Devices follows the model defined for
> > the device presented by that head:
> > *  Heads that present SLDs may support the port management and control
> >     features that are available for SLDs
> > *  Heads that present MLDs may support the port management and control
> >    features that are available for MLDs
> >  
> 
> I want to make very close note of this.  SLDs are managed like SLDs,
> MLDs are managed like MLDs.  MH-SLDs, according to this, should be
> managed like SLDs from the perspective of each host.

True, but an MH-MLD device connected directly to a host will also 
be managed (at some level anyway) as an SLD on that particular port.

> 
> That's pretty straightforward.
> 
> >
> > Management of memory resources in Multi-Headed Devices follows the model
> > defined for MLD components because both MH-SLDs and MH-MLDs must support
> > the isolation of memory resources, state, context, and management on a
> > per-LD basis.  LDs within the device are mapped to a single head.
> > 
> > *  In MH-SLDs, there is a 1:1 mapping between heads and LDs.
> > *  In MH-MLDs, multiple LDs are mapped to at most one head.
> > 
> > 
> > Multi-Headed Devices expose a dedicated Component Command Interface (CCI),
> > the LD Pool CCI, for management of all LDs within the device. The LD Pool
> > CCI may be exposed as an MCTP-based CCI or can be accessed via the Tunnel
> > Management Command command through a head’s Mailbox CCI, as detailed in
> > Section 7.6.7.3.1.  
> 
> 2.5.1 continues on to describe "LD Management in MH-MLDs" but just ignores
> that MH-SLDs (may) exist.  That's frustrating to say the least, but I
> suppose we can gather from context that MH-SLDs *MAY NOT* have LD
> management controls.

Hmm. In theory you could have an MH-SLD that used a config from flash or similar
but that would be odd.  We need some level of dynamic control to make these
devices useful.  Doesn't mean the spec should exclude dumb devices, but
we shouldn't concentrate on them for emulation.

One possible use case would be a device that always shares all its memory on
all ports. Yuk.


> 
> Let's see if that assumption holds.
> 
> > 7.6.7.3 MLD Port Command Set
> >
> > 7.6.7.3.1 Tunnel Management Command (Opcode 5300h)  
> 
> The referenced section at the end of 2.5 seems to also suggest that
> MH-SLDs do not (or don't have to?) implement the tunnel management
> command set.  It sends us to the MLD command set, and SLDs don't get
> managed like MLDs - ergo it's not relevant?
> 
> The final mention of MH-SLDs is in section 9.13.3:
> 
> > 9.13.3 Dynamic Capacity Device
> > ...
> >  MH-SLD or MH-MLD based DCD shall forcefully release shared Dynamic
> >  Capacity associated with all associated hosts upon a Conventional Reset
> >  of a head.
> >  
> 
> From this we can gather that the specification foresaw someone making a
> memory pool from an MH-SLD... but without LD management. We can fill in
> some blanks and assume that if someone wanted to, they could make a
> shared memory device and implement pooling via software controls.

When you say software controls?  I'm not sure I follow. 
> 
> That'd be a neat bodge/hack.  But that's not important right now.
> 
Fair enough. Moving on.

> 
> Finally, we look at what the mailbox command-set requirements are for
> multi-headed devices:
> 
> > 7.6.7.5 Multi-Headed Device Command Set
> > The Multi-Headed device command set includes commands for querying the
> > Head-to-LD mapping in a Multi-Headed device. Support for this command
> > set is required on the LD Pool CCI of a Multi-Headed device.
> >  
> 
> Ambiguity #2: Ok, now we're not sure whether an MH-SLD is supposed to
> expose an LD Pool CCI or not.  Also, is a MH-SLD supposed to show up
> as something other than just an SLD?  This is really confusing.
> 
> Going back to the MLD Port Command set, we see
> 
> > Valid targets for the tunneled commands include switch MLD Ports,
> > valid LDs within an MLD, and the LD Pool CCI in a Multi-Headed device.  
> 
> Whatever the case, there's only a single command in the MHD command set:
> 
> > 7.6.7.5.1 Get Multi-Headed Info (Opcode 5500h)  
> 
> This command is pretty straightforward, it just tells you what the head
> to LD mapping is for each of the LDs in the device.  Presumably this is
> what gets modified by the FM-APIs when LDs are attached to VCS ports.
> 
> For the simplest MH-SLD device, these fields would be immutable, and
> there would be a single LD for each head, where head_id == ld_id.

Agreed.

> 
> 
> 
> So summarizing, what I took away from this was the following:
> 
> In the simplest form of MH-SLD, there is neither a switch nor any
> LD management.  So, presumably, we don't HAVE to implement the
> MHD commands to say we "have MH-SLD support".

Whilst theoretically possible - I don't think such a device is interesting.
Minimum I'd want to see is something with multiple upstream SLD ports
and a management LD with appropriate interface to poke it.

The MLD side of things is interesting only once we support MLDs in general
in QEMU CXL emulation and even then they are near invisible to a host
and are more interesting for emulating fabric management.

What you may want to do is take Fan's work on DCD and look at doing
a simple MH-SLD device that uses the same cheat of just using QMP commands
to do the configuration.  That's an intermediate step to us getting
the FM-API and similar commands implemented.

> 
> 
> ========
>  Design
> ========
> 
> Ok... that's a lot to break down.  Here's what I think the roadmap
> toward multi-headed multi-logical device support should look like:
> 
> 1. SLD - we have this.  This is struct CXLType3Dev

We could look at Switch + MLD after this, but lots of work to
get the FM-API stuff in place that makes that interesting.
The advantage being we'd have the ability to move LDs around that I
think you are interested in.

> 
> 2. MH-SLD No Switch, No Pool CCI.

I'd fiddle that a little.  To be useful it needs the functionality
that a pool CCI provides - something to change the configuration, but
that can be impdef - (QMP stuff like Fan Ni did for DCD).
I'm not sure we want to upstream the QMP side of things but it gives
a path to start messing around with this quicker.

> 
> 3. MH-SLD w/ Pool CCI  (Implementing only Get Multi-Headed Info)

I'd do this + DCD.

> 
> 4. MH-SLD w/ Switch (Implementing remap-ability of LD to Head)

Hmm. You want this for migration I guess.  I'd be tempted to jump
directly to DCD.  I'm not even sure if the spec really allows this
sort of remapping without a switch / MHD because DCD covers that gap.

> 
> 5. MH-MLD - the whole kit and kaboodle.
> 
> 
> Let's talk about what the first MH-SLD might look like.
> 
> 
> =================================
> 2. MH-SLD No Switch, No Pool CCI.
> =================================
> 
> 1. The device has a "memory pool" that "backs" each Logical Device, and
>    the specification does not limit whether this memory is discrete
>    or may be shared between heads.
> 
>    In QEMU, we can represent this with a shared or file memory backend:
> 
> -object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=4G,share=true
> 
> 
> 2. Each QEMU instance has a discrete SLD that amounts to its own private
>    CXLType3Dev.  However, each "Head" maps back to the same common
>    memory backend:
> 
> -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0
> 
> 
> And that's it.  In fact, you can do this now, no changes needed!
> 
> 
> But it's also not very useful.  You can only use the memory in devdax
> mode, since it's a shared memory region. You could already do this via
> the /dev/shm interface, so it's not even new functionality.
> 
> In theory you could build a pooling service in software-only on top of
> memory blocks. That's an exercise left to the reader.

Yeah. Let's not do this step.

> 
> 
> ================================================================
> 3. MH-SLD w/ Pool CCI  (Implementing only Get Multi-Headed Info)
> ================================================================
> 
> This is a little more complicated, since we have our first bit of shared
> state.  Originally I had considered a shared memory region in
> CXLType3Dev, but this is a backwards abstraction (an MH-SLD contains
> multiple SLDs, an SLD does not contain an MHD State).


> 
> diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h
> index 7b72345079..1a9f2708e1 100644
> --- a/include/hw/cxl/cxl_device.h
> +++ b/include/hw/cxl/cxl_device.h
> @@ -356,16 +356,6 @@ typedef struct CXLPoison {
>  typedef QLIST_HEAD(, CXLPoison) CXLPoisonList;
>  #define CXL_POISON_LIST_LIMIT 256
> 
> +struct CXLMHDState {
> +    uint8_t nr_heads;
> +    uint8_t nr_lds;
> +    uint8_t ldmap[];
> +};
> +
>  struct CXLType3Dev {
>      /* Private */
>      PCIDevice parent_obj;
> @@ -377,15 +367,6 @@ struct CXLType3Dev {
>      HostMemoryBackend *lsa;
>      uint64_t sn;
> 
> +
> +    /* Multi-headed device settings */
> +    struct {
> +        bool active;
> +        uint32_t headid;
> +        uint32_t shmid;
> +        struct CXLMHDState *state;
> +    } mhd;
> +
> 
> 
> The way you would instantiate this would be via a separate process
> that initializes the shared memory region:
> 
> shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`
> ./cxl_mhd_init 4 $shmid1
> -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd=true,mhd_head=0,mhd_shmid=$shmid1
> 
> ./cxl_mhd_init would simply set up the nr_heads/lds field and such
> and set ldmap[0-3] to the values [0-3].  i.e. the head-to-ld mappings
> are static (head_id==ld_id).
> 
> 
> 
> But like I said, this is a backwards abstraction, so realistically we
> should flip this around such that we have the following:
> 
> struct CXLMHD_SharedState {
> 	uint8_t nr_heads;
> 	uint8_t nr_lds;
> 	uint8_t ldmap[];
> };
> 
> struct CXLMH_SLD {
> 	uint32_t headid;
> 	uint32_t shmid;
> 	struct CXLMHD_SharedState *state;
> 	struct CXLType3Dev sld;
> };
> 
> The shared state would be instantiated the same way as above.
> 
> With this we'd basically just create a new memory device:
> 
> hw/mem/cxl_mh_sld.c
> 
> 
> This is pretty straightforward - we just expose some of cxl_type3.c
> functions in order to instantiate the device accordingly, the rest of it
> just becomes passthrough because... it's just a cxl_type3.c device.
> 
> 
> This ultimately manifests as:
> 
> shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`
> 
> ./cxl_mhd_init 4 $shmid1
> 
> -device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=$shmid1
> 
> 
> Note: This is the patch set I'm working towards, but I presume there
> might be some (strong) opinions, so I didn't want to get too far into
> development before posting this.

Key here is that what is actually interesting is MH-SLD with Dynamic Capacity,
not just sharing the whole mapped memory.  That gives us the flexibility to
move memory between heads.

A few different moving parts are needed and I think we'd end up with something that
looks like

-device cxl-mhd,volatile-memdev=mem0,id=backend
-device cxl-mhd-sld,mhd=backend,bus=rp0,mhd-head=0,id=dev1,tunnel=true
-device cxl-mhd-sld,mhd=backend,bus=rp1,mhd-head=1,id=dev2

dev1 provides the tunneling interface, but the actual implementation of
the pool CCI and actual memory mappings is in the backend. Note that the backend
might be a proxy to an external process, or a client/server approach between multiple
QEMU instances.

The Pool CCI is accessed via tunnel from dev1 and can both query everything about
the two heads and also perform DCD capacity add / release on the LDs. That can
potentially include shared capacity and all the other bells and whistles we get
doing DCD on an MLD device.

Or squish some parts together and make a more extensible type3 device, giving:

-device cxl-type3,volatile-memdev=mem0,bus=rp0,mhd-head=0,id=dev1,mhd-main=true
-device cxl-type3,mhd=dev1,bus=rp1,mhd-head=1,id=dev2

Possibly adding socket numbers as options if we are doing multi qemu support
(can do that later I think as long as we've thought about how to do the command
 line). 
> 
> 
> ==============================================================
> 4. MH-SLD w/ Switch (Implementing LD management in an SLD)
> ==============================================================
> 
> Is it even rational to try to build such a device?
> 
> MH-SLDs have a 1-to-1 mapping of Head:Logical Device.
> 
> Presumably each SLD maps the entirety of the "pooled" memory,
> but the specification does not state that is true.  You could, for
> example, set up each Logical Device to map to a particular portion of the
> shared/pooled memory area:

DCD is again key here.
You can't move LDs around on an MH-SLD, but you can move capacity around
between them using DCD.

> 
> -object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=4G,share=true
> 
> QEMU #1
> -device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=shmid,dpa_base=0,dpa_limit=1G
> 
> QEMU #2
> -device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=1,mhd_shmid=shmid,dpa_base=1G,dpa_limit=1G
> 
> ... and so on.
> 
> At least in theory, this would involve implementing something that
> changes which SLD is mapped to a QEMU instance - but functionally this
> is just changing the base and limit of each SLD.
> 
> It's interesting from a functional testing perspective, there's a bunch
> of CCI/Tunnel commands that could be implemented, and presumably this
> would require a separate process to manage/serialize appropriately.
> 

If this is interesting, do a normal MLD and switch first. The MHD case is
something to stack on top of that.


> =======================================
> 5. MH-MLD - the whole kit and kaboodle.
> =======================================
> 
> If we implemented MH-SLD w/ Switching, then presumably it's just one step
> further to create an MLD:
> 
> struct CXLMH_MLD {
>         uint32_t headid;
>         uint32_t shmid;
>         struct CXLMHD_SharedState *state;
>         struct CXLType3Dev ldmap[];
> };
> 
> But I'm greatly oversimplifying here.  It's much more expressive to
> describe an MLD in terms of a multi-tiered switch in the QEMU topology,
> similar to what can be done right now:
> 
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12 \
> -device cxl-rp,id=rp0,port=0,bus=cxl.0,chassis=0,slot=0 \
> -device cxl-rp,id=rp1,port=1,bus=cxl.0,chassis=0,slot=1 \
> -device cxl-upstream,bus=rp0,id=us0 \
> -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
> -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
> -device cxl-type3,bus=swport0,volatile-memdev=mem0,id=cxl-mem0 \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k
> 
> 
> But in order to make this multi-headed, some amount of this state would need
> to be encapsulated in a shared memory region (or would it? I don't know, I
> haven't finished this thought experiment yet).

Someone (wherever the LD pool CCI is) needs to hold shared state.
Lots of options for that. 

> 
> 
> =====
>  FIN 
> =====
> 
> I realize this was long.  If you made it to the end of this email,
> thank you for reading my TED talk.  I greatly appreciate any comments,
> even if it's just "You've gone too deep, Gregory." ;]

:) You've only just got started.  This goes much deeper!

> 
> Regards,
> ~Gregory

To my mind there are a series of steps and questions here.

Which 'hotplug model'.
1) LD model for moving capacity
  - If doing LD model, do MLDs and configurable switches first. Needed as a step along the
    path anyway.  Deal with all the mess that brings and come back to MHD - as you note it
    only makes sense with a switch in the path, so MLDs are a subset of the functionality anyway.

2) DCD model for moving capacity
  - MH-SLD with a pool CCI used to do DCD operations on the LDs.  Extension of
    what Fan Ni is looking at.  He's making an SLD pretend to be a device
    where DCD makes sense - whilst still using the CXL type 3 device. We probably shouldn't
    do that without figuring out how to do an MHD-SLD - or at least a head that we intend
    to hang this new stuff off - potentially just using the existing type 3 device with
    more parameters as one of the MH-SLD heads that doesn't have the control interface and
    different parameters if it does have the tunnel to the Pool CCI.

Implementing MCTP CCI.  Probably a later step, but need to think what that looks like.
I'm thinking we proxy it through to wherever the pool CCI ends up.  Should be easy enough
if a little ugly.

So question is whether it's worth a highly modular design, or we just keep tacking
flexibility onto existing Type 3 device emulation.  These are all type 3 devices
after all ;)

Lots of fun details to hammer out.

Jonathan

Re: [RFC] cxl: Multi-headed device design

Posted by Gregory Price 11 months, 2 weeks ago

On Mon, May 15, 2023 at 05:18:07PM +0100, Jonathan Cameron wrote:
> On Tue, 21 Mar 2023 21:50:33 -0400
> Gregory Price <gregory.price@memverge.com> wrote:
> 
> > 
> > Ambiguity #1:
> > 
> > * An SLD contains 1 Logical Device.
> > * An MH-SLD presents multiple SLDs, one per head.
> > 
> > Ergo an MH-SLD contains multiple LDs which makes it an MLD according to the
> > definition of LD, but not according to the definition of MLD, or MH-MLD.
> 
> I'd go with 'sort of'.  SLD is a presentation of a device to a host.
> It can be a normal single headed MLD that has been plugged directly into a host.
> 
> So for extra fun points you can have one MH-MLD that has some ports connected
> to switches and other directly to hosts. Thus it can present as SLD on some
> upstream ports and as MLD on others.
>

I suppose this section of the email was really to just point out that
what constitutes a "multi-headed", "logical", and "multi-logical"
device is rather confusing from just reading the spec.  Since writing
this, I've kind of settled on:

MH-* - anything with multiple heads, regardless of how it works
SLD - one LD per head, but LD does not imply any particular command set
MLD - multiple LD's per head, but those LD's may only attach to one head
DCD - anything can technically be a DCD if it implements the commands

Trying to figure out, from the spec, what commands an MH-SLD should
implement to be "Spec Compliant" was my frustration.  It's somewhat
clear now that the answer is "Technically nothing... unless it's an MLD".
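
Written down as code, the taxonomy I settled on is just two independent axes
plus an orthogonal DCD flag. This is only a sketch of my own reading; the
struct, enum and function names are invented here and none of it comes from
the spec or from QEMU.

```c
#include <assert.h>
#include <stdbool.h>

enum cxl_kind { KIND_SLD, KIND_MLD, KIND_MH_SLD, KIND_MH_MLD };

/* Hypothetical description of a device's shape. */
struct dev_shape {
    unsigned nr_heads;          /* upstream ports ("heads") */
    unsigned max_lds_per_head;  /* 1 for an SLD presentation, >1 for MLD */
    bool     has_dcd_cmds;      /* implements the DCD command set */
};

/* MH-* is decided by the head count alone, SLD vs MLD by LDs per head. */
static enum cxl_kind classify(const struct dev_shape *d)
{
    assert(d->nr_heads >= 1 && d->max_lds_per_head >= 1);

    bool multi_head = d->nr_heads > 1;
    bool multi_ld = d->max_lds_per_head > 1;

    if (multi_head) {
        return multi_ld ? KIND_MH_MLD : KIND_MH_SLD;
    }
    return multi_ld ? KIND_MLD : KIND_SLD;
}

/* DCD is orthogonal: any of the four kinds may or may not be a DCD. */
static bool is_dcd(const struct dev_shape *d)
{
    return d->has_dcd_cmds;
}

static const char *kind_name(enum cxl_kind k)
{
    switch (k) {
    case KIND_SLD:    return "SLD";
    case KIND_MLD:    return "MLD";
    case KIND_MH_SLD: return "MH-SLD";
    case KIND_MH_MLD: return "MH-MLD";
    }
    return "?";
}
```

Note that nothing in classify() looks at the command set, which is the point:
the name tells you the shape of the device, not what it must implement.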

> > I want to make very close note of this.  SLDs are managed like SLDs,
> > MLDs are managed like MLDs.  MH-SLDs, according to this, should be
> > managed like SLDs from the perspective of each host.
> 
> True, but an MH-MLD device connected directly to a host will also 
> be managed (at some level anyway) as an SLD on that particular port.
>

The ambiguous part is... what commands relate specifically to an SLD?
The spec isn't really written that way, and the answer is that an SLD is
more of a lack of other functionality (specifically MLD functionality),
rather than its own set of functionality.

i.e. an SLD does not require an FM-Owned LD for management, but an MHD,
MLD, and DCD all do (at least in theory).

> > 
> > 2.5.1 continues on to describe "LD Management in MH-MLDs" but just ignores
> > that MH-SLDs (may) exist.  That's frustrating to say the least, but I
> > suppose we can gather from context that MH-SLD's *MAY NOT* have LD
> > management controls.
> 
> Hmm. In theory you could have an MH-SLD that used a config from flash or similar
> but that would be odd.  We need some level of dynamic control to make these
> devices useful.  Doesn't mean the spec should exclude dumb devices, but
> we shouldn't concentrate on them for emulation.
> 
> One possible usecase would be a device that always shares all its memory on
> all ports. Yuk.
> 

I can say that the earliest forms of MH-SLD, and certainly pre-DCD ones, are
likely to present all memory on all ports, and potentially provide some
custom commands to help hosts enforce exclusivity.

It's beyond the spec, but this can actually be emulated today with the
MH-SLD setup I describe below.  Certainly I expected a yuk factor to
proposing it, but I think the reality is on the path to 3.0 and DCD
devices we should at least entertain that someone will probably do this
with real hardware.

> > For the simplest MH-SLD device, these fields would be immutable, and
> > there would be a single LD for each head, where head_id == ld_id.
> 
> Agreed.
> 
> > 
> > So summarizing, what I took away from this was the following:
> > 
> > In the simplest form of MH-SLD, there is neither a switch, nor is
> > there LD management.  So, presumably, we don't HAVE to implement the
> > MHD commands to say we "have MH-SLD support".
> 
> Whilst theoretically possible - I don't think such a device is interesting.
> Minimum I'd want to see is something with multiple upstream SLD ports
> and a management LD with appropriate interface to poke it.
> 
>
> The MLD side of things is interesting only once we support MLDs in general
> in QEMU CXL emulation and even then they are near invisible to a host
> and are more interesting for emulating fabric management.
> 
> What you may want to do is take Fan's work on DCD and look at doing
> a simple MH-SLD device that uses same cheat of just using QMP commands
> to do the configuration.  That's an intermediate step to us getting
> the FM-API and similar commands implemented.
> 

I actually think it's a good step to go from MH-SLD to MH-SLD+DCD while
not having to worry about the complexity of MLD and switches.

(I have not gotten the chance to review the DCD patch set yet, it's on
my list for after ISC'23, I presume this is what has been done).

My thoughts would be that you would have something like the following:

-device ct3d,... etc etc
-device cxl-dcd,type3-backend=mem0,manager=true

the manager would be the owner of the FM-Owned LD, and therefore the
system responsible for managing requests for memory.

How we pass those messages between instances is then an exercise for the
reader.

What I have been doing is just creating a shared memory region with
ipcmk and using a separate program to initialize that shared state before
launching the guests.  I'll talk about this a little further down.

> > 
> > ... snip ...
> > 
> > 3. MH-SLD w/ Pool CCI  (Implementing only Get Multi-Headed Info)
> 
> I'd do this + DCD.
> 

I concur, and it's what I was looking into next.

I think your other notes on MH-* with switches is really where I was
left scratching my head.

When I look at Switch/MLD functionality vs DCD, I have a gut feeling the
vast majority of early device vendors are going to skip right over
switches and MLD setups and go directly to MH-SLD+DCD.

> > =================================
> > 2. MH-SLD No Switch, No Pool CCI.
> > =================================
> > 
> > But it's also not very useful.  You can only use the memory in devdax
> > mode, since it's a shared memory region. You could already do this via
> > the /dev/shm interface, so it's not even new functionality.
> > 
> > In theory you could build a pooling service in software-only on top of
> > memory blocks. That's an exercise left to the reader.
> 
> Yeah. Let's not do this step.
> 

Too late :].  It was useful as a learning exercise, but it's definitely
not upstream quality.  I may post it for the sake of the playground, but
I too would recommend against this method of pooling in the long term.

I made a proto-DCD command set that was reachable from each memdev
character device, and exposed it to every qemu instance as part of ct3d
(I'm still learning the QEMU ecosystem, so was easier to bodge it in
than make a new device and link it up).

Then I created a shared memory region with ipcmk, and implemented a
simple mutex in the space, as well as all the record keeping needed to
manage sections/extents.

> > shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`
> > ./cxl_mhd_init 4 $shmid1
> > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd=true,mhd_head=0,mhd_shmid=$1
> > 
> > ./cxl_mhd_init would simply setup the nr_heads/lds field and such
> > and set ldmap[0-3] to the values [0-3].  i.e. the head-to-ld mappings
> > are static (head_id==ld_id).
> > ... snip ...
> >
> > shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`
> > ./cxl_mhd_init 4 $shmid1
> > -device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=shmid
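
For anyone who wants to picture the init step, something along these lines
is all it needs to do. This is a sketch, not the code from my prototype:
struct mhd_state, MHD_MAX_HEADS and both function names are placeholders
invented for this mail. Only the System V shmat()/shmdt() calls are real
APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define MHD_MAX_HEADS 8   /* assumption: small fixed upper bound */

/* Hypothetical layout of the state kept at the start of the segment. */
struct mhd_state {
    uint32_t nr_heads;
    uint32_t nr_lds;
    uint32_t ldmap[MHD_MAX_HEADS];   /* head id -> LD id */
};

/* Static MH-SLD mapping: one LD per head, head_id == ld_id. */
static void mhd_state_init(struct mhd_state *s, uint32_t nr_heads)
{
    assert(nr_heads >= 1 && nr_heads <= MHD_MAX_HEADS);

    memset(s, 0, sizeof(*s));
    s->nr_heads = nr_heads;
    s->nr_lds = nr_heads;
    for (uint32_t i = 0; i < nr_heads; i++) {
        s->ldmap[i] = i;
    }
}

/*
 * Attach to a segment previously created with `ipcmk -M <size>` and
 * initialize it.  Returns 0 on success, -1 if the attach or detach fails.
 */
static int mhd_init_shm(int shmid, uint32_t nr_heads)
{
    void *p = shmat(shmid, NULL, 0);

    if (p == (void *)-1) {
        return -1;
    }
    mhd_state_init(p, nr_heads);
    return shmdt(p);
}
```

Each QEMU instance would then attach to the same shmid and find the
head-to-LD map already populated before any guest boots.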

The last step was a few extra lines in the read/write functions to
ensure accesses to "Valid addresses" that "Aren't allocated" produce
errors.

At this point, each guest is basically capable of using the device to do
the coordination for you by simply calling the allocate/deallocate
functions.
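
Roughly, the record keeping and the access check look like the sketch below.
It is a simplified stand-in for what I actually wrote: the names, the
section count and the section size are illustrative, and a C11 atomic_flag
spinlock stands in for the mutex that lives in the shared region.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_SECTIONS  16                /* assumption */
#define SECTION_SIZE (128ULL << 20)    /* assumption: 128 MiB sections */
#define SECTION_FREE UINT32_MAX

/* Hypothetical shared record: one owner (head id) per section. */
struct mhd_sections {
    atomic_flag lock;
    uint32_t owner[NR_SECTIONS];
};

static void sections_init(struct mhd_sections *s)
{
    atomic_flag_clear(&s->lock);
    for (int i = 0; i < NR_SECTIONS; i++) {
        s->owner[i] = SECTION_FREE;
    }
}

static void sections_lock(struct mhd_sections *s)
{
    while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire)) {
        /* spin until the other instance drops the lock */
    }
}

static void sections_unlock(struct mhd_sections *s)
{
    atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Allocate: succeeds only if the section is currently free. */
static int section_alloc(struct mhd_sections *s, uint32_t head, uint32_t sec)
{
    int ret = -1;

    assert(head != SECTION_FREE);
    sections_lock(s);
    if (sec < NR_SECTIONS && s->owner[sec] == SECTION_FREE) {
        s->owner[sec] = head;
        ret = 0;
    }
    sections_unlock(s);
    return ret;
}

/* Deallocate: succeeds only for the head that owns the section. */
static int section_release(struct mhd_sections *s, uint32_t head, uint32_t sec)
{
    int ret = -1;

    sections_lock(s);
    if (sec < NR_SECTIONS && s->owner[sec] == head) {
        s->owner[sec] = SECTION_FREE;
        ret = 0;
    }
    sections_unlock(s);
    return ret;
}

/* Read/write path check: a valid DPA that this head has not allocated fails. */
static int access_ok(struct mhd_sections *s, uint32_t head, uint64_t dpa)
{
    uint64_t sec = dpa / SECTION_SIZE;
    int ok;

    if (sec >= NR_SECTIONS) {
        return 0;
    }
    sections_lock(s);
    ok = (s->owner[sec] == head);
    sections_unlock(s);
    return ok;
}
```

The device read/write functions call the equivalent of access_ok() and
return an error when it fails.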

And that's it, you've got pooling.  Each guest sees the full extent of
the entire device, but must ask the device for access to a given
section, and the section can be translated into a memory block number
under the given numa node.
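
The section-to-memory-block translation is plain arithmetic. A sketch,
assuming the region base and the section size are both multiples of the
kernel's memory block size (the value Linux reports in
/sys/devices/system/memory/block_size_bytes); the function names are mine:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Index of the first memory block backing a device section.
 * region_base is the host physical address where the device is mapped.
 */
static uint64_t section_to_memory_block(uint64_t region_base,
                                        uint64_t section_size,
                                        uint64_t block_size,
                                        uint64_t section)
{
    assert(block_size != 0);
    assert(region_base % block_size == 0);
    assert(section_size % block_size == 0);

    return (region_base + section * section_size) / block_size;
}

/* How many consecutive memory blocks one section covers. */
static uint64_t blocks_per_section(uint64_t section_size, uint64_t block_size)
{
    assert(block_size != 0 && section_size % block_size == 0);
    return section_size / block_size;
}
```

With a 4 GiB region base, 256 MiB sections and 128 MiB memory blocks,
section 2 starts at block 36 and spans two blocks, so the guest would
online those two blocks after the allocation succeeds.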

Ok, now let's talk about why this is bad and why you shouldn't do it
this way:

* Technically a number of BIOS/hardware interleave features can
  bite you pretty hard when making the assumption that memory blocks are
  physically contiguous hardware addresses. However, that assumption
  holds if you simply don't turn those options on, so it might be useful
  as an early-adopter platform.

* The security posture of a device like this is bad.  It requires each
  attached host to clear the memory before releasing it.  There isn't
  really a good way to do this in numa-mode, so you would have to
  implement custom firmware commands to ensure it happens, and that
  means custom drivers blah blah blah - not great.

  Basically you're trusting each host to play nice.  Not great.
  But potentially useful for early adopters regardless.

* General compatibility and being in-spec - this design requires a
  number of non-spec extensions, so just generally not recommended,
  certainly not here in QEMU.

> 
> A few different moving parts are needed and I think we'd end up with something that
> looks like
> 
> -device cxl-mhd,volatile-memdev=mem0,id=backend
> -device cxl-mhd-sld,mhd=backend,bus=rp0,mhd-head=0,id=dev1,tunnel=true
> -device cxl-mhd-sld,mhd=backend,bus=rp1,mhd-head=1,id=dev2
> 
> dev1 provides the tunneling interface, but the actual implementation of
> the pool CCI and actual memory mappings is in the backend. Note that backend
> might be proxy to an external process, or a client/server approach between multiple
> QEMU instances.

I've hummed and hawed over external process vs another QEMU instance and I
still haven't come to a satisfying answer here.  It feels extremely
heavy-handed to use an entirely separate QEMU instance just for this,
but there's nothing to say you can't just host it in one of the
head-attached instances.

I basically skipped this and allowed each instance to send the command
itself, but serialized it with a mutex.  That way each instance can
operate cleanly without directly coordinating with each other.  I could
see a vendor implementing it this way on early devices.

I don't have a good answer for this yet, but maybe once I review the DCD
patch set I'll have more opinions.

> 
> or squish some parts and make a more extensible type3 device and have.
> 
> -device cxl-type3,volatile-memdev=mem0,bus=rp0,mhd-head=0,id=dev1,mhd-main=true
> -device cxl-type3,mhd=dev1,bus=rp1,mhd-head=1,id=dev2
> 

I originally went this route, but the downside of this is "What happens
when the main dies and has to restart".  There's all kinds of
badness associated with that.  It's why I moved the shared state into a
separately created ipcmk region.

> 
> To my mind there are a series of steps and questions here.
> 
> Which 'hotplug model'.
> 1) LD model for moving capacity
>   - If doing LD model, do MLDs and configurable switches first. Needed as a step along the
>     path anyway.  Deal with all the mess that brings and come back to MHD - as you note it
>     only makes sense with a switch in the path, so MLDs are a subset of the functionality anyway.
> 
> 2) DCD model for moving capacity
>   - MH-SLD with a pool CCI used to do DCD operations on the LDs.  Extension of
>     what Fan Ni is looking at.  He's making an SLD pretend to be a device
>     where DCD makes sense - whilst still using the CXL type 3 device. We probably shouldn't
>     do that without figuring out how to do an MHD-SLD - or at least a head that we intend
>     to hang this new stuff off - potentially just using the existing type 3 device with
>     more parameters as one of the MH-SLD heads that doesn't have the control interface and
>     different parameters if it does have the tunnel to the Pool CCI.
> 

Personally I think we should focus on the DCD model.  In fact, I think
we're already very close to this, as my personal prototype showed this
can work fairly cleanly, and I imagine I'll have a quick MHD patch set
once I get the chance to review the DCD patch set.

If I'm being honest, the more I look at the LD model, the less I
like it, but I understand that's how scale is going to be achieved.  I
don't know if focusing on that design right now is going to produce
adoption in the short term, since we're not likely to see those devices
for a few years.

MH-SLD+DCD is likely to show up much sooner, so I will target that.

~Gregory

Re: [RFC] cxl: Multi-headed device design

Posted by Jonathan Cameron 11 months, 2 weeks ago

On Tue, 16 May 2023 02:20:07 -0400
Gregory Price <gregory.price@memverge.com> wrote:

> On Mon, May 15, 2023 at 05:18:07PM +0100, Jonathan Cameron wrote:
> > On Tue, 21 Mar 2023 21:50:33 -0400
> > Gregory Price <gregory.price@memverge.com> wrote:
> >   
> > > 
> > > Ambiguity #1:
> > > 
> > > * An SLD contains 1 Logical Device.
> > > * An MH-SLD presents multiple SLDs, one per head.
> > > 
> > > Ergo an MH-SLD contains multiple LDs which makes it an MLD according to the
> > > definition of LD, but not according to the definition of MLD, or MH-MLD.  
> > 
> > I'd go with 'sort of'.  SLD is a presentation of a device to a host.
> > It can be a normal single headed MLD that has been plugged directly into a host.
> > 
> > So for extra fun points you can have one MH-MLD that has some ports connected
> > to switches and other directly to hosts. Thus it can present as SLD on some
> > upstream ports and as MLD on others.
> >  
> 
> I suppose this section of the email was really to just point out that
> what constitutes a "multi-headed", "logical", and "multi-logical"
> device is rather confusing from just reading the spec.  Since writing
> this, I've kind of settled on:
> 
> MH-* - anything with multiple heads, regardless of how it works
> SLD - one LD per head, but LD does not imply any particular command set
> MLD - multiple LD's per head, but those LD's may only attach to one head
> DCD - anything can technically be a DCD if it implements the commands
> 
> Trying to figure out, from the spec, what commands an MH-SLD should
> implement to be "Spec Compliant" was my frustration.  It's somewhat
> clear now that the answer is "Technically nothing... unless it's an MLD".

Sounds about right :)  Some of this is intentional - it's a grab bag
of features and options not a nice clean definition of 'the right set
to implement'. Market should probably drive that. I think expectation is
de facto feature set standards will happen - but outside of the CXL spec.

> 
> > > I want to make very close note of this.  SLDs are managed like SLDs,
> > > MLDs are managed like MLDs.  MH-SLDs, according to this, should be
> > > managed like SLDs from the perspective of each host.  
> > 
> > True, but an MH-MLD device connected directly to a host will also 
> > be managed (at some level anyway) as an SLD on that particular port.
> >  
> 
> The ambiguous part is... what commands relate specifically to an SLD?
> The spec isn't really written that way, and the answer is that an SLD is
> more of a lack of other functionality (specifically MLD functionality),
> rather than its own set of functionality.

Yup.

> 
> i.e. an SLD does not require an FM-Owned LD for management, but an MHD,
> MLD, and DCD all do (at least in theory).

DCD 'might' though I don't think anything in the spec rules that you 'must'
control the SLD/MLD via the FM-API, it's just a spec provided option.
From our point of view we don't want to get more creative so let's assume
it does.

I can't immediately think of a reason for a single head SLD to have an FM owned
LD, though it may well have an MCTP CCI for querying stuff about it from an FM.

> 
> > > 
> > > 2.5.1 continues on to describe "LD Management in MH-MLDs" but just ignores
> > > that MH-SLDs (may) exist.  That's frustrating to say the least, but I
> > > suppose we can gather from context that MH-SLD's *MAY NOT* have LD
> > > management controls.  
> > 
> > Hmm. In theory you could have an MH-SLD that used a config from flash or similar
> > but that would be odd.  We need some level of dynamic control to make these
> > devices useful.  Doesn't mean the spec should exclude dumb devices, but
> > we shouldn't concentrate on them for emulation.
> > 
> > One possible usecase would be a device that always shares all its memory on
> > all ports. Yuk.
> >   
> 
> I can say that the earliest forms of MH-SLD, and certainly pre-DCD ones, are
> likely to present all memory on all ports, and potentially provide some
> custom commands to help hosts enforce exclusivity.
> 
> It's beyond the spec, but this can actually be emulated today with the
> MH-SLD setup I describe below.  Certainly I expected a yuk factor to
> proposing it, but I think the reality is on the path to 3.0 and DCD
> devices we should at least entertain that someone will probably do this
> with real hardware.

From point of view of the Spec what you describe is an MH-SLD in which
all the memory is shared - non coherent.  That's a valid choice - albeit
a much nastier option than either DCD based or sharing with coherency.

It might fall out as an option in a flexibly defined MLD, but I'm not
particularly interested in that case (don't mind if you are though!)

> 
> > > For the simplest MH-SLD device, these fields would be immutable, and
> > > there would be a single LD for each head, where head_id == ld_id.  
> > 
> > Agreed.
> >   
> > > 
> > > So summarizing, what I took away from this was the following:
> > > 
> > > In the simplest form of MH-SLD, there is neither a switch, nor is
> > > there LD management.  So, presumably, we don't HAVE to implement the
> > > MHD commands to say we "have MH-SLD support".  
> > 
> > Whilst theoretically possible - I don't think such a device is interesting.
> > Minimum I'd want to see is something with multiple upstream SLD ports
> > and a management LD with appropriate interface to poke it.
> > 
> >
> > The MLD side of things is interesting only once we support MLDs in general
> > in QEMU CXL emulation and even then they are near invisible to a host
> > and are more interesting for emulating fabric management.
> > 
> > What you may want to do is take Fan's work on DCD and look at doing
> > a simple MH-SLD device that uses same cheat of just using QMP commands
> > to do the configuration.  That's an intermediate step to us getting
> > the FM-API and similar commands implemented.
> >   
> 
> I actually think it's a good step to go from MH-SLD to MH-SLD+DCD while
> not having to worry about the complexity of MLD and switches.

Maybe, there are other flows that only work for MLD and switches that aren't
MHD related (hotplug basically) so those might get explored in parallel.

> 
> (I have not gotten the chance to review the DCD patch set yet, it's on
> my list for after ISC'23, I presume this is what has been done).

At the moment it's just an SLD with DCD presentation to host.  Nothing on the control
side.


> 
> My thoughts would be that you would have something like the following:
> 
> -device ct3d,... etc etc
> -device cxl-dcd,type3-backend=mem0,manager=true

DCD is just an aspect of a type 3 device.  I'm fine with a manager
element, but don't call it cxl-dcd.

> 
> the manager would be the owner of the FM-Owned LD, and therefore the
> system responsible for managing requests for memory.
> 
> How we pass those messages between instances is then an exercise for the
> reader.
> 
> 
> What I have been doing is just creating a shared memory region with
> ipcmk and using a separate program to initialize that shared state before
> launching the guests.  I'll talk about this a little further down.
> 
> 
> > > 
> > > ... snip ...
> > > 
> > > 3. MH-SLD w/ Pool CCI  (Implementing only Get Multi-Headed Info)  
> > 
> > I'd do this + DCD.
> >   
> 
> I concur, and it's what I was looking into next.
> 
> I think your other notes on MH-* with switches is really where I was
> left scratching my head.
> 
> When I look at Switch/MLD functionality vs DCD, I have a gut feeling the
> vast majority of early device vendors are going to skip right over
> switches and MLD setups and go directly to MH-SLD+DCD.

Yup. That's likely  - though probably more driven by switch latency concerns
than by complexity.  MLDs aren't too bad, and the DCD parts etc are the
same as for SLD.

> 
> > > =================================
> > > 2. MH-SLD No Switch, No Pool CCI.
> > > =================================
> > > 
> > > But it's also not very useful.  You can only use the memory in devdax
> > > mode, since it's a shared memory region. You could already do this via
> > > the /dev/shm interface, so it's not even new functionality.
> > > 
> > > In theory you could build a pooling service in software-only on top of
> > > memory blocks. That's an exercise left to the reader.  
> > 
> > Yeah. Let's not do this step.
> >   
> 
> Too late :].  It was useful as a learning exercise, but it's definitely
> not upstream quality.  I may post it for the sake of the playground, but
> I too would recommend against this method of pooling in the long term.
> 
> I made a proto-DCD command set that was reachable from each memdev
> character device, and exposed it to every qemu instance as part of ct3d
> (I'm still learning the QEMU ecosystem, so was easier to bodge it in
> than make a new device and link it up).
> 
> Then I created a shared memory region with ipcmk, and implemented a
> simple mutex in the space, as well as all the record keeping needed to
> manage sections/extents.
> 
> > > shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`
> > > ./cxl_mhd_init 4 $shmid1
> > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd=true,mhd_head=0,mhd_shmid=$1
> > > 
> > > ./cxl_mhd_init would simply setup the nr_heads/lds field and such
> > > and set ldmap[0-3] to the values [0-3].  i.e. the head-to-ld mappings
> > > are static (head_id==ld_id).
> > > ... snip ...
> > >
> > > shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`
> > > ./cxl_mhd_init 4 $shmid1
> > > -device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=shmid  
> 
> The last step was a few extra lines in the read/write functions to
> ensure accesses to "Valid addresses" that "Aren't allocated" produce
> errors.
> 
> At this point, each guest is basically capable of using the device to do
> the coordination for you by simply calling the allocate/deallocate
> functions.
> 
> And that's it, you've got pooling.  Each guest sees the full extent of
> the entire device, but must ask the device for access to a given
> section, and the section can be translated into a memory block number
> under the given numa node.
> 

This is a valid model, but I'd do the management out of band. 
We could add BI support though if that's really useful (I don't think
I can go into why non coherency is a problem in some real hardware for
this usecase... watch this space)

> 
> Ok, now let's talk about why this is bad and why you shouldn't do it
> this way:
> 
> * Technically a number of BIOS/hardware interleave features can
>   bite you pretty hard when making the assumption that memory blocks are
>   physically contiguous hardware addresses. However, that assumption
>   holds if you simply don't turn those options on, so it might be useful
>   as an early-adopter platform.

I'd never enumerate these from BIOS - doesn't make sense for something
there for dynamic runtime allocation.
Interleave is indeed hard - don't do it (yet)

> 
> 
> * The security posture of a device like this is bad.  It requires each
>   attached host to clear the memory before releasing it.  There isn't
>   really a good way to do this in numa-mode, so you would have to
>   implement custom firmware commands to ensure it happens, and that
>   means custom drivers blah blah blah - not great.
> 
>   Basically you're trusting each host to play nice.  Not great.
>   But potentially useful for early adopters regardless.

Agreed. I'd put it down as horrible - no one should build this ;)

> 
> 
> * General compatibility and being in-spec - this design requires a
>   number of non-spec extensions, so just generally not recommended,
>   certainly not here in QEMU.

Hmm. Does it?  Looks to me like it would be presented to hosts as multiple
SLDs with volatile regions and a CDAT that presents DSMAS with
flags for Shareable / !Hardware managed coherency (or wire up
the missing bits of BI enablement in QEMU - doesn't actually do anything
but we should provide the various registers and correctly enable it
in the kernel)

The fact it's an MHD isn't visible to the hosts, so I think this
is spec compliant if odd.

If you meant the control path.  Also fine as long as you just do it through
memory and a 'convention' on software side for where that shared set
of info is + some fun algorithms to deal with mutex etc. Needs BI support
though.

> 
> > 
> > A few different moving parts are needed and I think we'd end up with something that
> > looks like
> > 
> > -device cxl-mhd,volatile-memdev=mem0,id=backend
> > -device cxl-mhd-sld,mhd=backend,bus=rp0,mhd-head=0,id=dev1,tunnel=true
> > -device cxl-mhd-sld,mhd=backend,bus=rp1,mhd-head=1,id=dev2
> > 
> > dev1 provides the tunneling interface, but the actual implementation of
> > the pool CCI and actual memory mappings is in the backend. Note that backend
> > might be proxy to an external process, or a client/server approach between multiple
> > QEMU instances.  
> 
> I've hummed and hawed over external process vs another QEMU instance and I
> still haven't come to a satisfying answer here.  It feels extremely
> heavy-handed to use an entirely separate QEMU instance just for this,
> but there's nothing to say you can't just host it in one of the
> head-attached instances.

MHD is only really interesting (for hardware coherent sharing anyway)
if you have multiple host OS so that's multiple QEMU instances.

If there is a 'main' instance of QEMU then everything should still work
though so can leave the subordinate instances for future work.

> 
> I basically skipped this and allowed each instance to send the command
> itself, but serialized it with a mutex.  That way each instance can
> operate cleanly without directly coordinating with each other.  I could
> see a vendor implementing it this way on early devices.

That control would have to be out of band if using memory, or require BI.
BI requires CXL 3.0 host, whereas a DCD based MHD (sure defined in CXL 3.0) would
work with a CXL 2.0 host - well probably even a CXL 1.1 host, but who wants
to bother with those..

I suspect we'll see impdef versions.  DCD is at heart pretty simple though
so I'd expect it to turn up fairly fast after first memory pool devices.


> 
> I don't have a good answer for this yet, but maybe once I review the DCD
> patch set I'll have more opinions.
> 
> > 
> > or squish some parts and make a more extensible type3 device and have.
> > 
> > -device cxl-type3,volatile-memdev=mem0,bus=rp0,mhd-head=0,id=dev1,mhd-main=true
> > -device cxl-type3,mhd=dev1,bus=rp1,mhd-head=1,id=dev2
> >   
> 
> I originally went this route, but the downside of this is "What happens
> when the main dies and has to restart".  There's all kinds of
> badness associated with that.  It's why I moved the shared state into a
> separately created ipcmk region.

Modelling. I don't care if that happens :)  You are right however that
it would need more care. External process probably makes sense - can
be pretty light weight.

> 
> > 
> > To my mind there are a series of steps and questions here.
> > 
> > Which 'hotplug model'.
> > 1) LD model for moving capacity
> >   - If doing LD model, do MLDs and configurable switches first. Needed as a step along the
> >     path anyway.  Deal with all the mess that brings and come back to MHD - as you note it
> >     only makes sense with a switch in the path, so MLDs are a subset of the functionality anyway.
> > 
> > 2) DCD model for moving capacity
> >   - MH-SLD with a pool CCI used to do DCD operations on the LDs.  Extension of
> >     what Fan Ni is looking at.  He's making an SLD pretend to be a device
> >     where DCD makes sense - whilst still using the CXL type 3 device. We probably shouldn't
> >     do that without figuring out how to do an MHD-SLD - or at least a head that we intend
> >     to hang this new stuff off - potentially just using the existing type 3 device with
> >     more parameters as one of the MH-SLD heads that doesn't have the control interface and
> >     different parameters if it does have the tunnel to the Pool CCI.
> >   
> 
> Personally I think we should focus on the DCD model.  In fact, I think
> we're already very close to this, as my personal prototype showed this
> can work fairly cleanly, and I imagine I'll have a quick MHD patch set
> once I get the chance to review the DCD patch set.

Agreed. It's easier.

> 
> If I'm being honest, the more I look at the LD model, the less I
> like it, but I understand that's how scale is going to be achieved.  I
> don't know if focusing on that design right now is going to produce
> adoption in the short term, since we're not likely to see those devices
> for a few years.
> 
> MH-SLD+DCD is likely to show up much sooner, so I will target that.

Yup.  The switch case is interesting for driving Fabric Manager architecture
so I'd like to enable it at some point, but device wise MH-SLD+DCD is probably
going to come first.

I might extend the switch CCI or the MCTP CCI PoC enough to get comms up and
query stuff, but focus will be on a mailbox on the MHD - note have to ensure
only one of those for the FM API controls driving DCD. 

Jonathan

> 
> ~Gregory

Re: [RFC] cxl: Multi-headed device design

Posted by Gregory Price 11 months ago

On Wed, May 17, 2023 at 03:18:59PM +0100, Jonathan Cameron wrote:
> > 
> > i.e. an SLD does not require an FM-Owned LD for management, but an MHD,
> > MLD, and DCD all do (at least in theory).
> 
> DCD 'might' though I don't think anything in the spec rules that you 'must'
> control the SLD/MLD via the FM-API, it's just a spec provided option.
> From our point of view we don't want to get more creative so lets assume
> it does.
> 
> I can't immediately think of reason for a single head SLD to have an FM owned
> LD, though it may well have an MCTP CCI for querying stuff about it from an FM.
> 

Before I go running off into the woods, it seems like it would be simple
enough to simply make an FM-LD "device" which simply links a mhXXX device
and implements its own Mailbox CCI.

Maybe not "realistic", but to my mind this appears as a separate
character device in /dev/cxl/*. Maybe the realism here doesn't matter,
since we're just implementing for the sake of testing.  This is just a
straightforward way to pipe a DCD request into the device and trigger
DCD event log entries.

As commented earlier, this is done as a QEMU fed event.  If that's
sufficient, a hack like this feels like it would be at least mildly
cleaner and easier to test against.

Example: consider a user wanting to issue a DCD command to add capacity.

Real world: this would be some out of band communication, and eventually
this results in a DCD command to the device that results in a
capacity-event showing up in the log. Maybe it happens over TCP and
drills down to a Redfish event that talks to the BMC that issues a
command over etc etc MCTP emulations, etc.

With a simplistic /dev/cxl/memX-fmld device a user can simply issue these
commands without all that, and the effect is the same.

On the QEMU side you get something like:

-device cxl-type3,volatile-memdev=mem0,bus=rp0,mhd-head=0,id=mem0,mhd-main=true
-device cxl-mhsld,type3=mem0,bus=rp0,headid=0,id=mhsld1,shmid=XXXXX
-device cxl-fmld,mhsld=mhsld1,bus=rp1,id=mem0-fmld,shmid=YYYYY

on the Linux side you get:
/dev/cxl/mem0
/dev/cxl/mem0-fmld

in this example, the shmid for mhsld is a shared memory region created
with ipcmk that implements the shared state (basically section bitmap
tracking and the actual plumbing for DCD, etc). This limits the emulation
of the mhsld to a single host for now, but that seems sufficient.
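
To make the "section bitmap tracking" idea concrete, here is a minimal
sketch of what the shared state could look like.  It is not taken from any
prototype: the names (SectionTable, claim, release, may_access) are made up
for illustration, it uses one byte per section rather than a true bitmap,
and it uses a plain bytearray where an emulator would map the shared memory
segment.  Cross-process locking is deliberately left out.

```python
# Hypothetical sketch: per-section ownership table for an MH-SLD.
# Every name below is illustrative, not part of QEMU or the CXL spec.

FREE = 0xFF  # marker meaning "no head owns this section"


class SectionTable:
    """One byte per memory section, holding the owning head id or FREE."""

    def __init__(self, buf, num_sections):
        # buf is any writable buffer; an emulator would pass the mapped
        # shared memory segment instead of a private bytearray.
        self.buf = memoryview(buf)[:num_sections]
        self.num_sections = num_sections

    def reset(self):
        """Mark every section as unowned."""
        for i in range(self.num_sections):
            self.buf[i] = FREE

    def _in_range(self, first, count):
        return first >= 0 and count > 0 and first + count <= self.num_sections

    def claim(self, head, first, count):
        """Give sections [first, first+count) to head, all or nothing."""
        if not self._in_range(first, count):
            return False
        span = range(first, first + count)
        if any(self.buf[i] != FREE for i in span):
            return False
        for i in span:
            self.buf[i] = head
        return True

    def release(self, head, first, count):
        """Return sections to the free pool if head owns all of them."""
        if not self._in_range(first, count):
            return False
        span = range(first, first + count)
        if any(self.buf[i] != head for i in span):
            return False
        for i in span:
            self.buf[i] = FREE
        return True

    def may_access(self, head, section):
        """True if head currently owns the given section."""
        return 0 <= section < self.num_sections and self.buf[section] == head
```

A real implementation would need a mutex (or atomics) around claim and
release, since several QEMU instances would touch the same region.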

The shmid for cxl-fmld implements any shared state for the fmld,
including a mutex, that allows all hosts attached to the mhsld to have
access to the fmld.  This may or may not be realistic, but it would
allow all head-attached hosts to send DCD commands over its own local
fabric, rather than going out of band.

This gets us to the point where, at a minimum, each host can issue its
own DCD commands to add capacity to itself.  That's step 1.

Step 2 is allow Host A to issue a DCD command to add capacity to Host B.

I suppose this could be done via a background thread that waits on a
message to show up in the shared memory region?

Being somewhat unfamiliar with QEMU, is it kosher to start background
threads that just wait on events like this, or is that generally frowned
upon?  If done this way, it would simplify the creation and startup
sequence at least.
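
Roughly what I have in mind for the waiting thread, as a sketch only.  The
queue here is a stand-in for whatever message slot would live in the shared
memory region, and the message layout (op, head, first, count) and the
function names are assumptions for illustration, not anything QEMU provides.

```python
import queue
import threading

# Hypothetical sketch: a worker thread that blocks until a DCD request
# arrives, then records it.  queue.Queue stands in for a message slot in
# the shared memory region.


def dcd_worker(inbox, event_log):
    """Handle requests until a None sentinel arrives."""
    while True:
        msg = inbox.get()  # blocks without busy polling
        if msg is None:
            break
        op, head, first, count = msg
        # Stand-in for appending a record to that head's DCD event log.
        event_log.append((head, op, first, count))


def run_worker_demo():
    inbox = queue.Queue()
    event_log = []
    worker = threading.Thread(target=dcd_worker, args=(inbox, event_log))
    worker.start()
    inbox.put(("add", 1, 0, 4))      # Host A asks for capacity on head 1
    inbox.put(("release", 1, 0, 4))  # and later gives it back
    inbox.put(None)                  # shut the worker down
    worker.join()
    return event_log
```

The point is only that the waiter blocks on a notification instead of
polling the region.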

~Gregory

Re: [RFC] cxl: Multi-headed device design

Posted by Jonathan Cameron 10 months, 4 weeks ago

On Mon, 29 May 2023 14:13:07 -0400
Gregory Price <gregory.price@memverge.com> wrote:

> On Wed, May 17, 2023 at 03:18:59PM +0100, Jonathan Cameron wrote:
> > > 
> > > i.e. an SLD does not require an FM-Owned LD for management, but an MHD,
> > > MLD, and DCD all do (at least in theory).  
> > 
> > DCD 'might' though I don't think anything in the spec rules that you 'must'
> > control the SLD/MLD via the FM-API, it's just a spec provided option.
> > From our point of view we don't want to get more creative so lets assume
> > it does.
> > 
> > I can't immediately think of reason for a single head SLD to have an FM owned
> > LD, though it may well have an MCTP CCI for querying stuff about it from an FM.
> >   
Sorry for slow reply - got distracted and forgot to cycle back to this.

> 
> Before I go running off into the woods, it seems like it would be simple
> enough to simply make an FM-LD "device" which simply links a mhXXX device
> and implements its own Mailbox CCI.
> 
> Maybe not "realistic", but to my mind this appears as a separate
> character device in /dev/cxl/*. Maybe the realism here doesn't matter,
> since we're just implementing for the sake of testing.  This is just a
> straightforward way to pipe a DCD request into the device and trigger
> DCD event log entries.
> 
> As commented earlier, this is done as a QEMU fed event.  If that's
> sufficient, a hack like this feels like it would be at least mildly
> cleaner and easier to test against.

Or MCTP over I2C which works today, but needs more commands for this :)

I plan to look at the tunneling stuff shortly.  Initially I'll punt the
guest using this to userspace, but potentially the eventual model might well be to
make it look like a bunch of direct attached CCIs from userspace point of
view. I'm not 100% keen on pushing the management of hotplug into the
kernel though as particular CCIs we are tunneling to in a wider fabric
may come and go.  For an MHD this would be easy, not so much if
a switch CCI with tunneling to MLDs and MH-MLDs below it.

> 
> 
> Example: consider a user wanting to issue a DCD command to add capacity.
> 
> Real world: this would be some out of band communication, and eventually
> this results in a DCD command to the device that results in a
> capacity-event showing up in the log. Maybe it happens over TCP and
> drills down to a Redfish event that talks to the BMC that issues a
> command over etc etc MCTP emulations, etc.
> 
> With a simplistic /dev/cxl/memX-fmld device a user can simply issue these
> commands without all that, and the effect is the same.

Yup - something along those lines makes sense.

> 
> On the QEMU side you get something like:
> 
> -device cxl-type3,volatile-memdev=mem0,bus=rp0,mhd-head=0,id=mem0,mhd-main=true

I'd expect this device to present the mailbox commands for tunneling to
the FM-LD - as such I'd want a reference from here to your cxl-fmld below.

> -device cxl-mhsld,type3=mem0,bus=rp0,headid=0,id=mhsld1,shmid=XXXXX

Not sure why this is on the bus rp0. 

> -device cxl-fmld,mhsld=mhsld1,bus=rp1,id=mem0-fmld,shmid=YYYYY

To be spec compliant that cxl-fmld still has to support normal use as well
as tunnelling to the fm owned LD - so it's a superset of a type3 device.

My gut feeling is keep it simple for a PoC / supporting enablement.
1 device on the host that is also serving as the FM.
  Probably just an extended type 3 with some more options to turn this
  feature on.
1 device on each other host that connects via socket
All devices share the same underlying memory.
Access bitmap is fiddly - either a push model over socket, or a shared
bitmap like you suggest.  Either works, not sure which ends up cleaner.

It may well become more devices over time, but that should be driven
by the different types of CCI sharing common infrastructure rather than
trying to figure out that model at the start.

> 
> on the Linux side you get:
> /dev/cxl/mem0
> /dev/cxl/mem0-fmld
> 
> in this example, the shmid for mhsld is a shared memory region created
> with ipcmk that implements the shared state (basically section bitmap
> tracking and the actual plumbing for DCD, etc). This limits the emulation
> of the mhsld to a single host for now, but that seems sufficient.
> 
> The shmid for cxl-fmld implements any shared state for the fmld,
> including a mutex, that allows all hosts attached to the mhsld to have
> access to the fmld.  This may or may not be realistic, but it would
> allow all head-attached hosts to send DCD commands over its own local
> fabric, rather than going out of band.

Not keen on that part.  I'd like to keep close to the spec intent and only
allow one host to access the FM-LD.

> 
> This gets us to the point where, at a minimum, each host can issue its
> own DCD commands to add capacity to itself.  That's step 1.

I don't agree with this one.  I really don't want hosts to be able to do that.
They need to talk to one host that is acting as fabric manager - that can then
talk to the MHD to do the allocations.
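
As a rough illustration of the policy (a sketch under my own assumptions,
not a proposal for the actual device code), the pool would simply refuse
allocation requests that do not come from the FM head.  MhdPool,
PermissionDenied and add_capacity are hypothetical names.

```python
# Hypothetical sketch: only the head acting as fabric manager may allocate.
# All names are illustrative; nothing here is an existing QEMU interface.


class PermissionDenied(Exception):
    """Raised when a non-FM head tries to drive an allocation."""


class MhdPool:
    def __init__(self, num_extents, fm_head):
        self.fm_head = fm_head
        # Extent index -> owning head id, or None when unallocated.
        self.owner = [None] * num_extents

    def add_capacity(self, requester, target_head, count):
        """Assign `count` free extents to target_head on behalf of the FM."""
        if requester != self.fm_head:
            raise PermissionDenied(f"head {requester} is not the FM")
        free = [i for i, o in enumerate(self.owner) if o is None]
        if len(free) < count:
            return []  # not enough capacity, nothing is changed
        granted = free[:count]
        for i in granted:
            self.owner[i] = target_head
        return granted
```

With fm_head=0, a call such as add_capacity(0, 1, 2) succeeds while
add_capacity(1, 1, 1) raises, which is the behaviour I'd want to enforce.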

> 
> Step 2 is allow Host A to issue a DCD command to add capacity to Host B.
> 
> I suppose this could be done via a background thread that waits on a
> message to show up in the shared memory region?

The actual setup should be done via the single host with the FM, but there
is still a need to notify the other hosts.  I'd be tempted to do that
via a socket rather than shared memory.  Just keep the shared memory for
the access bitmap. Or drop that access bitmap entirely and rely on each
host keeping track of its own access permissions.
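
A minimal sketch of the socket variant, assuming a newline-delimited JSON
message format that I have invented purely for illustration (fm_notify,
HeadState and the "op"/"first"/"count" fields are all hypothetical).  Each
non-FM head just updates its own record of what it may access.

```python
import json
import socket

# Hypothetical sketch: the FM pushes extent events over a socket and each
# head tracks its own permissions locally.  The wire format is invented.


def fm_notify(sock, op, first, count):
    """FM side: push one newline-delimited JSON event to a head."""
    msg = json.dumps({"op": op, "first": first, "count": count}) + "\n"
    sock.sendall(msg.encode())


class HeadState:
    """Non-FM side: remembers which extents this head was told it owns."""

    def __init__(self, sock):
        self.reader = sock.makefile("r")
        self.owned = set()

    def handle_one(self):
        """Read and apply a single event from the FM."""
        ev = json.loads(self.reader.readline())
        extents = range(ev["first"], ev["first"] + ev["count"])
        if ev["op"] == "add":
            self.owned.update(extents)
        elif ev["op"] == "release":
            self.owned.difference_update(extents)
        return ev["op"]


def run_notify_demo():
    # socketpair keeps the demo in one process; separate emulator
    # instances would use a UNIX or TCP socket instead.
    fm_sock, head_sock = socket.socketpair()
    head = HeadState(head_sock)
    fm_notify(fm_sock, "add", 0, 4)
    head.handle_one()
    fm_notify(fm_sock, "release", 2, 2)
    head.handle_one()
    fm_sock.close()
    head.reader.close()
    head_sock.close()
    return sorted(head.owned)
```

With this there is no shared bitmap at all; the cost is that a head which
restarts has to be told its current allocation again.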

For testing purposes I don't have a problem with insisting the owner
of the FM-LD must be started first and closed last.  That ties lifetime
of that host with that of the device, but that isn't too much of a problem
given the lifetime differences we may want to test probably sit at the
FM software layer, not the emulation of the hardware.

> 
> Being somewhat unfamiliar with QEMU, is it kosher to start background
> threads that just wait on events like this, or is that generally frowned
> upon?  If done this way, it would simplify the creation and startup
> sequence at least.
> 
> ~Gregory