RE: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
Posted by Vikram Sethi 7 months, 1 week ago
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, September 27, 2023 9:25 AM
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> 
> 
> On Wed, 27 Sep 2023 10:53:36 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> >
> > > CXL accelerators / GPUs etc are a different question but who has one
> > > of those anyway? :)
> >
> > That's exactly what I mean when I say CXL will need it too. I keep
> > describing this current Grace & Hopper as pre-CXL HW. You can easily
> > imagine draping CXL around it. CXL doesn't solve the problem that
> > motivates all this hacking - Linux can't dynamically create NUMA
> > nodes.
> 
> Why is that and why aren't we pushing towards a solution of removing that
> barrier so that we don't require the machine topology to be configured to
> support this use case and guest OS limitations?  Thanks,
>

Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL-type devices,
there is additional information FW knows that the kernel doesn't: for example, what the
distance/latency between the CPU and the device NUMA node is. While CXL devices present
a CDAT table, which gives latency-type attributes within the device, the kernel would
still have to guess what the end-to-end latency/distance is.
It's probably not the best outcome to just treat this generically as "far memory",
because whether it is further away than an inter-socket memory access matters.
Pre-CXL devices such as the one in this patchset don't even have a CDAT, so the kernel
by itself has no idea whether this latency/distance is less than or more than the
inter-socket memory access latency.
So, especially for devices present at boot, FW knows this information and should provide it.
Similarly, QEMU should pass this information along to VMs for the best outcomes.
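To make that concrete, here is a minimal guest-side sketch (illustrative only; the
function name and node ids are made up) of why the distance matters to a driver
placing buffers. node_distance() just reads back whatever SLIT firmware, or QEMU
on behalf of the VM, provided:

#include <linux/topology.h>

/*
 * Illustrative only: is the device's coherent memory farther from this
 * CPU's node than the remote socket's DRAM?  Without FW-provided
 * distances the answer is unknowable.
 */
static bool dev_mem_is_farther(int cpu_node, int dev_mem_node, int remote_node)
{
	return node_distance(cpu_node, dev_mem_node) >
	       node_distance(cpu_node, remote_node);
}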

Thanks
> Alex
Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
Posted by Jonathan Cameron 7 months, 1 week ago
On Wed, 27 Sep 2023 15:03:09 +0000
Vikram Sethi <vsethi@nvidia.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, September 27, 2023 9:25 AM
> > To: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > 
> > 
> > On Wed, 27 Sep 2023 10:53:36 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > >  
> > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > of those anyway? :)  
> > >
> > > That's exactly what I mean when I say CXL will need it too. I keep
> > > describing this current Grace & Hopper as pre-CXL HW. You can easily
> > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > motivates all this hacking - Linux can't dynamically create NUMA
> > > nodes.  
> > 
> > Why is that and why aren't we pushing towards a solution of removing that
> > barrier so that we don't require the machine topology to be configured to
> > support this use case and guest OS limitations?  Thanks,
> >  
> 
> Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL-type devices,
> there is additional information FW knows that the kernel doesn't: for example, what the
> distance/latency between the CPU and the device NUMA node is. While CXL devices present
> a CDAT table, which gives latency-type attributes within the device, the kernel would
> still have to guess what the end-to-end latency/distance is.
FWIW

There shouldn't be any guesswork needed (for the light-load case anyway, which is
what HMAT would describe). That's what Generic Ports were added to SRAT for. Dave
Jiang has a patch set
https://lore.kernel.org/all/168695160531.3031571.4875512229068707023.stgit@djiang5-mobl3/
to do the maths...  For CXL there is no problem fully describing the access characteristics.
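Roughly speaking the sum is as follows (an illustrative sketch of the idea only, not
code from that series; the struct and function names are invented):

/*
 * End-to-end = CPU -> Generic Port (host HMAT/SRAT) + link + Generic
 * Port -> device memory (device CDAT).
 */
struct perf_coord {
	unsigned int read_latency_ps;
	unsigned int write_latency_ps;
};

static void combine_perf(const struct perf_coord *cpu_to_gp,
			 const struct perf_coord *cdat_dev,
			 unsigned int link_latency_ps,
			 struct perf_coord *end_to_end)
{
	end_to_end->read_latency_ps = cpu_to_gp->read_latency_ps +
				      link_latency_ps +
				      cdat_dev->read_latency_ps;
	end_to_end->write_latency_ps = cpu_to_gp->write_latency_ps +
				       link_latency_ps +
				       cdat_dev->write_latency_ps;
}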

> It's probably not the best outcome to just treat this generically as "far memory",
> because whether it is further away than an inter-socket memory access matters.
> Pre-CXL devices such as the one in this patchset don't even have a CDAT, so the kernel
> by itself has no idea whether this latency/distance is less than or more than the
> inter-socket memory access latency.

Just because I'm feeling cheeky - you could emulate a DOE and CDAT :)?
Though I suppose you don't want to teach the guest driver about it.

> So, especially for devices present at boot, FW knows this information and should provide it.
> Similarly, QEMU should pass this information along to VMs for the best outcomes.
No problem with the argument that FW has the info and should provide it,
just on the 'how' part.

Jonathan

> 
> Thanks
> > Alex  
> 
Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
Posted by Alex Williamson 7 months, 1 week ago
On Wed, 27 Sep 2023 15:03:09 +0000
Vikram Sethi <vsethi@nvidia.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, September 27, 2023 9:25 AM
> > To: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > 
> > 
> > On Wed, 27 Sep 2023 10:53:36 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > >  
> > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > of those anyway? :)  
> > >
> > > That's exactly what I mean when I say CXL will need it too. I keep
> > > describing this current Grace & Hopper as pre-CXL HW. You can easily
> > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > motivates all this hacking - Linux can't dynamically create NUMA
> > > nodes.  
> > 
> > Why is that and why aren't we pushing towards a solution of removing that
> > barrier so that we don't require the machine topology to be configured to
> > support this use case and guest OS limitations?  Thanks,
> >  
> 
> Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL-type devices,
> there is additional information FW knows that the kernel doesn't: for example, what the
> distance/latency between the CPU and the device NUMA node is. While CXL devices present
> a CDAT table, which gives latency-type attributes within the device, the kernel would
> still have to guess what the end-to-end latency/distance is.
> It's probably not the best outcome to just treat this generically as "far memory",
> because whether it is further away than an inter-socket memory access matters.
> Pre-CXL devices such as the one in this patchset don't even have a CDAT, so the kernel
> by itself has no idea whether this latency/distance is less than or more than the
> inter-socket memory access latency.
> So, especially for devices present at boot, FW knows this information and should provide it.
> Similarly, QEMU should pass this information along to VMs for the best outcomes.

Yeah, AFAICT we're not doing any of that in this series.  We're only
creating some number of nodes for the guest driver to make use of and
describing in the generated _DSD the set of nodes associated for use by
the device.  How many nodes are required, how the guest driver
partitions coherent memory among the nodes, and how the guest assigns a
distance to the nodes are all unspecified.

From a glance, the CDAT spec seems brilliant in this regard; is there such
a table for this device, or could/should the vfio-pci variant driver
provide one?  I imagine this is how the VM admin or orchestration tool
would learn to configure nodes and the VMM would further virtualize
this table for the guest OS and drivers.  Thanks,

Alex
Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
Posted by Jonathan Cameron 7 months, 1 week ago
On Wed, 27 Sep 2023 10:37:37 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 27 Sep 2023 15:03:09 +0000
> Vikram Sethi <vsethi@nvidia.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, September 27, 2023 9:25 AM
> > > To: Jason Gunthorpe <jgg@nvidia.com>
> > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > > 
> > > 
> > > On Wed, 27 Sep 2023 10:53:36 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >     
> > > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > > >    
> > > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > > of those anyway? :)    
> > > >
> > > > That's exactly what I mean when I say CXL will need it too. I keep
> > > > describing this current Grace & Hopper as pre-CXL HW. You can easily
> > > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > > motivates all this hacking - Linux can't dynamically create NUMA
> > > > nodes.    
> > > 
> > > Why is that and why aren't we pushing towards a solution of removing that
> > > barrier so that we don't require the machine topology to be configured to
> > > support this use case and guest OS limitations?  Thanks,
> > >    
> > 
> > Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL-type devices,
> > there is additional information FW knows that the kernel doesn't: for example, what the
> > distance/latency between the CPU and the device NUMA node is. While CXL devices present
> > a CDAT table, which gives latency-type attributes within the device, the kernel would
> > still have to guess what the end-to-end latency/distance is.
> > It's probably not the best outcome to just treat this generically as "far memory",
> > because whether it is further away than an inter-socket memory access matters.
> > Pre-CXL devices such as the one in this patchset don't even have a CDAT, so the kernel
> > by itself has no idea whether this latency/distance is less than or more than the
> > inter-socket memory access latency.
> > So, especially for devices present at boot, FW knows this information and should provide it.
> > Similarly, QEMU should pass this information along to VMs for the best outcomes.
> 
> Yeah, AFAICT we're not doing any of that in this series.  We're only
> creating some number of nodes for the guest driver to make use of and
> describing in the generated _DSD the set of nodes associated for use by
> the device.  How many nodes are required, how the guest driver
> partitions coherent memory among the nodes, and how the guest assigns a
> distance to the nodes are all unspecified.
> 
> From a glance, the CDAT spec seems brilliant in this regard; is there such
> a table for this device, or could/should the vfio-pci variant driver
> provide one?  I imagine this is how the VM admin or orchestration tool
> would learn to configure nodes and the VMM would further virtualize
> this table for the guest OS and drivers.  Thanks,

I like this proposal. The host bit seems doable.  However, I'm not sure
virtualizing it would help for device drivers that have no awareness of it,
unless we slap some generic layer into the Linux driver core that can use this
information if it is available.  It would certainly be easy enough to add the
emulation to the guest's PCI function config space (as long as we aren't out
of room!)

For CXL I think we will have to do this anyway. For a complex topology we
could do one of the following:
1) Present a matching topology in the guest with all the components involved,
   including switches with the right numbers of ports etc., so the CDAT tables
   can be read directly from the underlying hardware, or...
2) Present a somewhat matching topology in the guest, having just messed with
   the switches so that we can present their existence but not all the ports etc.
   So emulate a cut-down CDAT for the switch.
3) Allow flattening everything and present a single CDAT with aggregate numbers
   at the EP. That will get a bit messy, as we may need the kernel to ignore some
   parts it would normally account for (link widths on CXL buses etc.)

So I'm thinking 1 is too restrictive, hence we need to emulate the table anyway
(not to mention the CDAT might not be on the function being assigned anyway, as
they are normally only on Function 0 IIRC).
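For anyone skimming, the information a (real or emulated) CDAT ends up conveying
per device memory range boils down to something like the sketch below - a
simplified, illustrative layout, not the actual structure definitions from the
CDAT spec:

#include <stdint.h>

/*
 * Simplified sketch only.  A real CDAT splits this into DSMAS entries
 * (the DPA ranges) and DSLBIS entries (latency/bandwidth per range);
 * option 3 above would aggregate the per-hop numbers into one such set
 * of values at the EP.
 */
struct cdat_range_sketch {
	uint64_t dpa_base;		/* device PA base of the range */
	uint64_t dpa_length;
	uint32_t read_latency_ps;
	uint32_t write_latency_ps;
	uint32_t read_bw_mbps;
	uint32_t write_bw_mbps;
};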

Gah. Too much of CXL emulation still to do in QEMU already....

Jonathan




> 
> Alex
> 
Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
Posted by Jason Gunthorpe 7 months, 1 week ago
On Wed, Sep 27, 2023 at 03:03:09PM +0000, Vikram Sethi wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, September 27, 2023 9:25 AM
> > To: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > 
> > 
> > On Wed, 27 Sep 2023 10:53:36 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > >
> > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > of those anyway? :)
> > >
> > > That's exactly what I mean when I say CXL will need it too. I keep
> > > describing this current Grace & Hopper as pre-CXL HW. You can easily
> > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > motivates all this hacking - Linux can't dynamically create NUMA
> > > nodes.
> > 
> > Why is that and why aren't we pushing towards a solution of removing that
> > barrier so that we don't require the machine topology to be configured to
> > support this use case and guest OS limitations?  Thanks,

I haven't looked myself, but I've been told this is very
hard. Probably the NUMA concept needs to be split a bit so that the
memory allocator pool handle is not tied to the ACPI.

> Even if Linux could create NUMA nodes dynamically for coherent CXL
> or CXL-type devices, there is additional information FW knows that
> the kernel doesn't: for example, what the distance/latency between
> the CPU and the device NUMA node is.

But that should just be the existing normal PCIy stuff to define
affinity of the PCI function. Since the memory is logically linked to
the PCI function we have no issue from a topology perspective.

> While CXL devices present a CDAT table, which gives latency-type
> attributes within the device, there would still be some guesswork
> needed in the kernel as to what the end-to-end latency/distance is.

Is it non-uniform per CXL function?

> knows this information and should provide it.  Similarly, QEMU
> should pass along this information to VMs for the best outcomes.

Well yes, ideally QEMU would pass vCPU affinity for the vPCI functions
so that existing NUMA-aware allocations in PCI drivers are optimized,
e.g. we already put queues in memory close to the PCI function.
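For illustration, the usual pattern looks like this (not code from any particular
driver, just the common idiom):

#include <linux/pci.h>
#include <linux/slab.h>

/*
 * Typical NUMA-aware allocation in a PCI driver: put the queue on
 * whatever node FW (or QEMU, for a vPCI function) reported for the
 * device.
 */
static void *alloc_queue_near_dev(struct pci_dev *pdev, size_t size)
{
	int node = dev_to_node(&pdev->dev);

	return kzalloc_node(size, GFP_KERNEL, node);
}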

I think it is important to keep separate the need to know the
topology/affinity/etc. and the need for the driver to have a Linux NUMA
node handle to operate the memory allocator pool APIs.

Regardless, it is what it is for the foreseeable future :(

Jason

Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
Posted by Jonathan Cameron 7 months, 1 week ago
On Wed, 27 Sep 2023 12:42:40 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Sep 27, 2023 at 03:03:09PM +0000, Vikram Sethi wrote:
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, September 27, 2023 9:25 AM
> > > To: Jason Gunthorpe <jgg@nvidia.com>
> > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > > 
> > > 
> > > On Wed, 27 Sep 2023 10:53:36 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >   
> > > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > > >  
> > > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > > of those anyway? :)  
> > > >
> > > > That's exactly what I mean when I say CXL will need it too. I keep
> > > > describing this current Grace & Hopper as pre-CXL HW. You can easily
> > > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > > motivates all this hacking - Linux can't dynamically create NUMA
> > > > nodes.  
> > > 
> > > Why is that and why aren't we pushing towards a solution of removing that
> > > barrier so that we don't require the machine topology to be configured to
> > > support this use case and guest OS limitations?  Thanks,  
> 
> I haven't looked myself, but I've been told this is very
> hard. Probably the NUMA concept needs to be split a bit so that the
> memory allocator pool handle is not tied to the ACPI.
> 
> > Even if Linux could create NUMA nodes dynamically for coherent CXL
> > or CXL-type devices, there is additional information FW knows that
> > the kernel doesn't: for example, what the distance/latency between
> > the CPU and the device NUMA node is.  
> 
> But that should just be the existing normal PCIy stuff to define
> affinity of the PCI function. Since the memory is logically linked to
> the PCI function we have no issue from a topology perspective.

Agreed - if there is a node to associate it with, this is easy to cover.
If the aim is just to make the info available, we could just use Generic
Initiator nodes instead.  Those were added for memory-free accelerators
described by the BIOS, to fill the gap for things that aren't processors
or memory but for which bandwidth and latency numbers are useful.
This falls into your comment below about the need not being the topology
as such, but rather the node number needed for the memory allocator.

> 
> > While CXL devices present a CDAT table, which gives latency-type
> > attributes within the device, there would still be some guesswork
> > needed in the kernel as to what the end-to-end latency/distance is.  
> 
> Is it non-uniform per CXL function?

Could be, but if so it is all discoverable.  Though the mapping is to a
device PA range rather than to a 'function' as such.  You are certainly
allowed lots of different PA ranges with different characteristics.

> 
> > knows this information and should provide it.  Similarly, QEMU
> > should pass along this information to VMs for the best outcomes.  
> 
> Well yes, ideally QEMU would pass vCPU affinity for the vPCI functions
> so that existing NUMA-aware allocations in PCI drivers are optimized,
> e.g. we already put queues in memory close to the PCI function.

I thought we did that via PXBs?

> 
> I think it is important to keep separate the need to know the
> topology/affinity/etc. and the need for the driver to have a Linux NUMA
> node handle to operate the memory allocator pool APIs.
> 
> Regardless, it is what it is for the foreseeable future :(
:(

> 
> Jason