[RFC 14/14] vfio/nvgrace-gpu: Add link from pci to EGM

ankita@nvidia.com posted 14 patches 4 weeks, 1 day ago
Posted by ankita@nvidia.com 4 weeks, 1 day ago
From: Ankit Agrawal <ankita@nvidia.com>

To replicate the host EGM topology in the VM in terms of
GPU affinity, userspace needs to know which GPUs belong
to the same socket as the EGM region.

Expose the list of GPUs associated with an EGM region
through sysfs. The list can be queried from the auxiliary
device path.

On a 2-socket, 4-GPU Grace Blackwell setup, the links show up as follows:
/sys/devices/pci0008:00/0008:00:00.0/0008:01:00.0/nvgrace_gpu_vfio_pci.egm.4
/sys/devices/pci0009:00/0009:00:00.0/0009:01:00.0/nvgrace_gpu_vfio_pci.egm.4
pointing to egm4.

/sys/devices/pci0018:00/0018:00:00.0/0018:01:00.0/nvgrace_gpu_vfio_pci.egm.5
/sys/devices/pci0019:00/0019:00:00.0/0019:01:00.0/nvgrace_gpu_vfio_pci.egm.5
pointing to egm5.

In addition,
/sys/devices/pci0008:00/0008:00:00.0/0008:01:00.0/nvgrace_gpu_vfio_pci.egm.4
/sys/devices/pci0009:00/0009:00:00.0/0009:01:00.0/nvgrace_gpu_vfio_pci.egm.4
list links to both the 0008:01:00.0 and 0009:01:00.0 GPU devices,

and
/sys/devices/pci0018:00/0018:00:00.0/0018:01:00.0/nvgrace_gpu_vfio_pci.egm.5
/sys/devices/pci0019:00/0019:00:00.0/0019:01:00.0/nvgrace_gpu_vfio_pci.egm.5
list links to both the 0018:01:00.0 and 0019:01:00.0 GPU devices.

Suggested-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 42 +++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
index b8e143542bce..20e9213aa0ac 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
@@ -56,6 +56,36 @@ int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
 	return ret;
 }
 
+static int create_egm_symlinks(struct nvgrace_egm_dev *egm_dev,
+			       struct pci_dev *pdev)
+{
+	int ret_l1, ret_l2;
+
+	ret_l1 = sysfs_create_link_nowarn(&pdev->dev.kobj,
+					  &egm_dev->aux_dev.dev.kobj,
+					  dev_name(&egm_dev->aux_dev.dev));
+
+	/*
+	 * Allow if Link already exists - created since GPU is the auxiliary
+	 * device's parent; flag the error otherwise.
+	 */
+	if (ret_l1 && ret_l1 != -EEXIST)
+		return ret_l1;
+
+	ret_l2 = sysfs_create_link(&egm_dev->aux_dev.dev.kobj,
+				   &pdev->dev.kobj,
+				   dev_name(&pdev->dev));
+
+	/*
+	 * Remove the aux dev link only if it wasn't already present.
+	 */
+	if (ret_l2 && !ret_l1)
+		sysfs_remove_link(&pdev->dev.kobj,
+				  dev_name(&egm_dev->aux_dev.dev));
+
+	return ret_l2;
+}
+
 int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
 {
 	struct gpu_node *node;
@@ -68,7 +98,16 @@ int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
 
 	list_add_tail(&node->list, &egm_dev->gpus);
 
-	return 0;
+	return create_egm_symlinks(egm_dev, pdev);
+}
+
+static void remove_egm_symlinks(struct nvgrace_egm_dev *egm_dev,
+				struct pci_dev *pdev)
+{
+	sysfs_remove_link(&pdev->dev.kobj,
+			  dev_name(&egm_dev->aux_dev.dev));
+	sysfs_remove_link(&egm_dev->aux_dev.dev.kobj,
+			  dev_name(&pdev->dev));
 }
 
 void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
@@ -77,6 +116,7 @@ void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
 
 	list_for_each_entry_safe(node, tmp, &egm_dev->gpus, list) {
 		if (node->pdev == pdev) {
+			remove_egm_symlinks(egm_dev, pdev);
 			list_del(&node->list);
 			kvfree(node);
 		}
-- 
2.34.1
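
As a rough illustration of how userspace might consume the links described in the commit message, the sketch below scans an EGM auxiliary device directory and prints the entries that are symlinks named like PCI addresses. This is only a sketch under assumptions: the helper names is_pci_bdf() and list_egm_gpus() are hypothetical and not part of the patch, the name check assumes the 4-hex-digit domain form seen in the examples above (dddd:bb:dd.f), and the sysfs layout is the one posted in this RFC, which may change in later revisions.

```c
#define _POSIX_C_SOURCE 200809L

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: return 1 if name is shaped like dddd:bb:dd.f, else 0. */
static int is_pci_bdf(const char *name)
{
	/* 'x' marks a hex digit; every other character must match literally. */
	static const char pattern[] = "xxxx:xx:xx.x";
	size_t i;

	if (strlen(name) != sizeof(pattern) - 1)
		return 0;

	for (i = 0; i < sizeof(pattern) - 1; i++) {
		if (pattern[i] == 'x') {
			if (!isxdigit((unsigned char)name[i]))
				return 0;
		} else if (name[i] != pattern[i]) {
			return 0;
		}
	}
	return 1;
}

/*
 * Hypothetical helper: print every symlink in egm_dir whose name looks
 * like a PCI address. Returns the number of such links, or -1 if the
 * directory cannot be opened.
 */
static int list_egm_gpus(const char *egm_dir)
{
	struct dirent *ent;
	struct stat st;
	char path[4096];
	int count = 0;
	DIR *dir = opendir(egm_dir);

	if (!dir)
		return -1;

	while ((ent = readdir(dir)) != NULL) {
		if (!is_pci_bdf(ent->d_name))
			continue;

		snprintf(path, sizeof(path), "%s/%s", egm_dir, ent->d_name);

		/* lstat() inspects the link itself rather than its target. */
		if (lstat(path, &st) == 0 && S_ISLNK(st.st_mode)) {
			printf("%s\n", ent->d_name);
			count++;
		}
	}

	closedir(dir);
	return count;
}
```

A caller would pass one of the auxiliary device paths from the commit message, for example the nvgrace_gpu_vfio_pci.egm.4 directory under 0008:01:00.0, and could use the printed addresses to group GPUs by EGM region when building the VM topology.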
Re: [RFC 14/14] vfio/nvgrace-gpu: Add link from pci to EGM
Posted by Jason Gunthorpe 3 weeks, 6 days ago
On Thu, Sep 04, 2025 at 04:08:28AM +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> To replicate the host EGM topology in the VM in terms of
> GPU affinity, userspace needs to know which GPUs belong
> to the same socket as the EGM region.
> 
> Expose the list of GPUs associated with an EGM region
> through sysfs. The list can be queried from the auxiliary
> device path.
> 
> On a 2-socket, 4-GPU Grace Blackwell setup, the links show up as follows:
> /sys/devices/pci0008:00/0008:00:00.0/0008:01:00.0/nvgrace_gpu_vfio_pci.egm.4
> /sys/devices/pci0009:00/0009:00:00.0/0009:01:00.0/nvgrace_gpu_vfio_pci.egm.4
> pointing to egm4.
> 
> /sys/devices/pci0018:00/0018:00:00.0/0018:01:00.0/nvgrace_gpu_vfio_pci.egm.5
> /sys/devices/pci0019:00/0019:00:00.0/0019:01:00.0/nvgrace_gpu_vfio_pci.egm.5
> pointing to egm5.
> 
> In addition,
> /sys/devices/pci0008:00/0008:00:00.0/0008:01:00.0/nvgrace_gpu_vfio_pci.egm.4
> /sys/devices/pci0009:00/0009:00:00.0/0009:01:00.0/nvgrace_gpu_vfio_pci.egm.4
> list links to both the 0008:01:00.0 and 0009:01:00.0 GPU devices,
> 
> and
> /sys/devices/pci0018:00/0018:00:00.0/0018:01:00.0/nvgrace_gpu_vfio_pci.egm.5
> /sys/devices/pci0019:00/0019:00:00.0/0019:01:00.0/nvgrace_gpu_vfio_pci.egm.5
> list links to both the 0018:01:00.0 and 0019:01:00.0 GPU devices.

This seems backwards. I would rather the EGM chardev itself have a
directory of links to the PCI devices than have EGM manipulate the
sysfs belonging to some other driver and subsystem.

Jason