[PATCH RFC v2 00/15] Add virtualization support for EGM
Posted by ankita@nvidia.com 1 month, 1 week ago
From: Ankit Agrawal <ankita@nvidia.com>

Background
----------
Grace Hopper/Blackwell systems support the Extended GPU Memory (EGM)
feature, which enables the GPU to access system memory allocations
within and across nodes through a high-bandwidth path. This access path
goes as: GPU <--> NVswitch <--> GPU <--> CPU. The GPU can utilize system
memory located on the same socket, on a different socket, or even on a
different node in a multi-node system [1]. This feature is being
extended to virtualization.


Design Details
--------------
When EGM is enabled in the virtualization stack, the host memory is
partitioned into two parts: one partition for host OS usage, called the
Hypervisor region, and a second Hypervisor-Invisible (HI) region for
the VM. Only the Hypervisor region is part of the host EFI map and is
thus visible to the host OS on bootup. Since the entire VM sysmem is
eligible for EGM allocations within the VM, the HI partition is
interchangeably called the EGM region in this series. The base SPA and
size of this HI/EGM region range are exposed through ACPI DSDT properties.

Whilst the EGM region is accessible on the host, it is not added to
the kernel. The HI region is assigned to a VM by mapping the QEMU VMA
to the SPA using remap_pfn_range().
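For illustration only, a minimal sketch of what such a mapping path could
look like in the chardev mmap handler is below (including the range checks
added in v2); the egm_region structure, its fields and egm_mmap() are
hypothetical placeholders, not the actual code in this series:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pfn.h>
#include <linux/types.h>

/* Hypothetical per-region state; the actual driver's layout differs. */
struct egm_region {
        u64 base_spa;   /* base SPA of the HI/EGM region */
        u64 size;       /* region length in bytes */
};

static int egm_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct egm_region *region = file->private_data;
        unsigned long len = vma->vm_end - vma->vm_start;
        unsigned long pgoff = vma->vm_pgoff;

        /* Reject mappings that fall outside the EGM region. */
        if (pgoff > (region->size >> PAGE_SHIFT) ||
            len > region->size - (pgoff << PAGE_SHIFT))
                return -EINVAL;

        return remap_pfn_range(vma, vma->vm_start,
                               PHYS_PFN(region->base_spa) + pgoff,
                               len, vma->vm_page_prot);
}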

The following figure shows the memory map in the virtualization
environment.

|---- Sysmem ----|                  |--- GPU mem ---|  VM Memory
|                |                  |               |
|IPA <-> SPA map |                  |IPA <-> SPA map|
|                |                  |               |
|--- HI / EGM ---|-- Host Mem --|   |--- GPU mem ---|  Host Memory

The patch series introduces a new nvgrace-egm auxiliary driver module
to manage and map the HI/EGM region on Grace Blackwell systems.
It binds to the auxiliary device created by the parent
nvgrace-gpu (in-tree module for device assignment) / nvidia-vgpu-vfio
(out-of-tree open source module for SRIOV vGPU) to manage the
EGM region for the VM. Note that there is a unique EGM region per
socket and an auxiliary device gets created for every region. The
parent module fetches the EGM region information from the ACPI
tables and populates the data structures shared with the auxiliary
nvgrace-egm module.
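
As a rough sketch of that flow, the parent could publish one auxiliary
device per EGM region along the following lines; the egm_region_info
layout and the function names are assumptions made here for illustration,
only the auxiliary bus calls are the actual in-kernel API:

#include <linux/auxiliary_bus.h>
#include <linux/slab.h>

/* Hypothetical per-region info shared between the parent and nvgrace-egm. */
struct egm_region_info {
        struct auxiliary_device adev;
        u64 base_spa;           /* base SPA from the ACPI DSDT properties */
        u64 size;               /* region length */
        int pxm;                /* proximity domain of the region */
};

static void egm_region_release(struct device *dev)
{
        struct auxiliary_device *adev = to_auxiliary_dev(dev);

        kfree(container_of(adev, struct egm_region_info, adev));
}

static int egm_publish_region(struct device *parent, u64 spa, u64 size, int pxm)
{
        struct egm_region_info *info;
        int ret;

        info = kzalloc(sizeof(*info), GFP_KERNEL);
        if (!info)
                return -ENOMEM;

        info->base_spa = spa;
        info->size = size;
        info->pxm = pxm;

        info->adev.name = "egm";        /* matched as "<parent-module>.egm" */
        info->adev.id = pxm;            /* one device per EGM region/socket */
        info->adev.dev.parent = parent;
        info->adev.dev.release = egm_region_release;

        ret = auxiliary_device_init(&info->adev);
        if (ret) {
                kfree(info);
                return ret;
        }

        ret = auxiliary_device_add(&info->adev);
        if (ret)
                auxiliary_device_uninit(&info->adev);

        return ret;
}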

The nvgrace-egm module handles the following:
1. Fetch the EGM memory properties (base HPA, length, proximity domain)
from the parent device's shared EGM region structure.
2. Create a char device that can be used as a memory-backend-file by QEMU
for the VM and implement its file operations. The char device is /dev/egmX,
where X is the PXM node ID of the EGM region fetched in step 1.
3. Zero the EGM memory on first device open() (see the sketch after this
list).
4. Map the QEMU VMA to the EGM region using remap_pfn_range().
5. Clean up state and destroy the chardev on device unbind.
6. Handle the presence of retired (poisoned) pages in the EGM region.
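
As referenced in item 3, zeroing a large region in one go can trigger soft
lockup warnings, so v2 splits the clearing into 1G chunks. A minimal sketch
of that pattern is below; the function and parameter names are illustrative,
not the series' actual code:

#include <linux/io.h>
#include <linux/minmax.h>
#include <linux/sched.h>
#include <linux/sizes.h>
#include <linux/string.h>

/*
 * Illustrative only: clear the EGM region in 1G chunks so the scheduler
 * gets a chance to run between chunks and soft lockups are avoided.
 */
static int egm_clear_region(u64 base_spa, u64 size)
{
        u64 offset;

        for (offset = 0; offset < size; offset += SZ_1G) {
                u64 chunk = min_t(u64, SZ_1G, size - offset);
                void *va = memremap(base_spa + offset, chunk, MEMREMAP_WB);

                if (!va)
                        return -ENOMEM;

                memset(va, 0, chunk);
                memunmap(va);
                cond_resched();
        }

        return 0;
}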

Since nvgrace-egm is an auxiliary module to nvgrace-gpu, it is kept
in the same directory.


Implementation
--------------
Patches 1-4 make changes to the nvgrace-gpu module to fetch the
EGM information, create the auxiliary device and save the EGM region
information in the shared structures.
Patches 5-10 introduce the new nvgrace-egm module to manage the EGM
region. The module implements a char device to expose the EGM to
usermode apps such as QEMU. The module does the mapping of the
QEMU VMA to the EGM SPA using remap_pfn_range().
Patches 11-12 fetch the list of pages on EGM with known poison errors.
Patches 13-14 expose the EGM topology and size through sysfs (a rough
sketch of such an attribute follows).
Patch 15 registers EGM memory with memory_failure and tracks runtime
poison errors.
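
For reference, a read-only sysfs size attribute typically takes the
following shape; the attribute name and the egm_region_info structure
below are illustrative assumptions, not necessarily what patches 13-14
implement:

#include <linux/device.h>
#include <linux/sysfs.h>
#include <linux/types.h>

/* Hypothetical: the per-region info published by the parent module. */
struct egm_region_info {
        u64 size;
};

static ssize_t size_show(struct device *dev, struct device_attribute *attr,
                         char *buf)
{
        struct egm_region_info *info = dev_get_drvdata(dev);

        return sysfs_emit(buf, "%llu\n", info->size);
}
static DEVICE_ATTR_RO(size);

/* Registered on the egm device, e.g. via device_create_file() or an
 * attribute group, and read from sysfs by management tooling. */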


Enablement
----------
The EGM mode is enabled through a flag in the SBIOS. The size of
the Hypervisor region is modifiable through a second parameter in
the SBIOS. All the remaining system memory on the host will be
invisible to the Hypervisor.


Verification
------------
Applied over v6.19-rc4 and tested using the QEMU repository [3]. Tested on
the Grace Blackwell platform by booting up a VM, loading the NVIDIA
module [2] and running nvidia-smi in the VM to check for the presence of
the EGM capability.

There is a dependency on IOMMU support for generic dmabuf exports being
worked on by Jason Gunthorpe (jgg@nvidia.com). The patch [4] needs to be
used until then.


Changelog
---------
v2:
* Replaced vmalloc calls with kmalloc for small structures in multiple
  files (Shameer Kolothum)
* Updated sysfs representation of the egm nodes in 14/15 (Jason Gunthorpe)
* Split EGM memory clearing into 1G chunks to avoid soft lockup logs in 10/15.
* Added EGM memory registration with memory_failure in 15/15.
* Updated aux device cleanup path to fix improper sequence in 8/15
  (Shameer Kolothum)
* Range checks for remap_pfn_range in 9/15 (Jason Gunthorpe)
* Miscellaneous cleanup (Shameer Kolothum, Jason Gunthorpe)

Link: https://lore.kernel.org/all/20250904040828.319452-1-ankita@nvidia.com/ [v1]


Recognitions
------------
Many thanks to Jason Gunthorpe, Vikram Sethi and Aniket Agashe for design
suggestions, and to Matt Ochs, Neo Jia and Kirti Wankhede, among others,
for the review feedback.


Links
-----
Link: https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/#extended_gpu_memory [1]
Link: https://github.com/NVIDIA/open-gpu-kernel-modules [2]
Link: https://github.com/NVIDIA/QEMU/tree/nvidia_stable-10.1 [3]
Link: https://github.com/ankita-nv/linux/commit/6f92e3ca1995d17c3dd45f3e0a074b0b5806f681 [4]


Github Branch
-------------
Link: https://github.com/ankita-nv/linux/tree/v6.19-egm-180226

Ankit Agrawal (15):
  vfio/nvgrace-gpu: Expand module_pci_driver to allow custom module init
  vfio/nvgrace-gpu: Create auxiliary device for EGM
  vfio/nvgrace-gpu: track GPUs associated with the EGM regions
  vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info
  vfio/nvgrace-egm: Introduce module to manage EGM
  vfio/nvgrace-egm: Introduce egm class and register char device numbers
  vfio/nvgrace-egm: Register auxiliary driver ops
  vfio/nvgrace-egm: Expose EGM region as char device
  vfio/nvgrace-egm: Add chardev ops for EGM management
  vfio/nvgrace-egm: Clear Memory before handing out to VM
  vfio/nvgrace-egm: Fetch EGM region retired pages list
  vfio/nvgrace-egm: Introduce ioctl to share retired pages
  vfio/nvgrace-egm: expose the egm size through sysfs
  vfio/nvgrace-gpu: Add link from pci to EGM
  vfio/nvgrace-egm: register EGM PFNMAP range with memory_failure

 MAINTAINERS                            |  12 +-
 drivers/vfio/pci/nvgrace-gpu/Kconfig   |  12 +
 drivers/vfio/pci/nvgrace-gpu/Makefile  |   5 +-
 drivers/vfio/pci/nvgrace-gpu/egm.c     | 540 +++++++++++++++++++++++++
 drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 179 ++++++++
 drivers/vfio/pci/nvgrace-gpu/egm_dev.h |  24 ++
 drivers/vfio/pci/nvgrace-gpu/main.c    | 124 +++++-
 include/linux/nvgrace-egm.h            |  34 ++
 include/uapi/linux/egm.h               |  28 ++
 9 files changed, 954 insertions(+), 4 deletions(-)
 create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm.c
 create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.c
 create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.h
 create mode 100644 include/linux/nvgrace-egm.h
 create mode 100644 include/uapi/linux/egm.h

-- 
2.34.1
Re: [PATCH RFC v2 00/15] Add virtualization support for EGM
Posted by Alex Williamson 1 month ago
On Mon, 23 Feb 2026 15:54:59 +0000
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Background
> ----------
> Grace Hopper/Blackwell systems support the Extended GPU Memory (EGM)
> feature, which enables the GPU to access system memory allocations
> within and across nodes through a high-bandwidth path. This access path
> goes as: GPU <--> NVswitch <--> GPU <--> CPU. The GPU can utilize system
> memory located on the same socket, on a different socket, or even on a
> different node in a multi-node system [1]. This feature is being
> extended to virtualization.
> 
> 
> Design Details
> --------------
> When EGM is enabled in the virtualization stack, the host memory is
> partitioned into two parts: one partition for host OS usage, called the
> Hypervisor region, and a second Hypervisor-Invisible (HI) region for
> the VM. Only the Hypervisor region is part of the host EFI map and is
> thus visible to the host OS on bootup. Since the entire VM sysmem is
> eligible for EGM allocations within the VM, the HI partition is
> interchangeably called the EGM region in this series. The base SPA and
> size of this HI/EGM region range are exposed through ACPI DSDT properties.
> 
> Whilst the EGM region is accessible on the host, it is not added to
> the kernel. The HI region is assigned to a VM by mapping the QEMU VMA
> to the SPA using remap_pfn_range().
> 
> The following figure shows the memory map in the virtualization
> environment.
> 
> |---- Sysmem ----|                  |--- GPU mem ---|  VM Memory
> |                |                  |               |
> |IPA <-> SPA map |                  |IPA <-> SPA map|
> |                |                  |               |
> |--- HI / EGM ---|-- Host Mem --|   |--- GPU mem ---|  Host Memory
> 
> The patch series introduces a new nvgrace-egm auxiliary driver module
> to manage and map the HI/EGM region on Grace Blackwell systems.
> It binds to the auxiliary device created by the parent
> nvgrace-gpu (in-tree module for device assignment) / nvidia-vgpu-vfio
> (out-of-tree open source module for SRIOV vGPU) to manage the
> EGM region for the VM. Note that there is a unique EGM region per
> socket and an auxiliary device gets created for every region. The
> parent module fetches the EGM region information from the ACPI
> tables and populates the data structures shared with the auxiliary
> nvgrace-egm module.
> 
> The nvgrace-egm module handles the following:
> 1. Fetch the EGM memory properties (base HPA, length, proximity domain)
> from the parent device's shared EGM region structure.
> 2. Create a char device that can be used as a memory-backend-file by QEMU
> for the VM and implement its file operations. The char device is /dev/egmX,
> where X is the PXM node ID of the EGM region fetched in step 1.
> 3. Zero the EGM memory on first device open().
> 4. Map the QEMU VMA to the EGM region using remap_pfn_range().
> 5. Clean up state and destroy the chardev on device unbind.
> 6. Handle the presence of retired (poisoned) pages in the EGM region.
> 
> Since nvgrace-egm is an auxiliary module to nvgrace-gpu, it is kept
> in the same directory.

Pondering this series for a bit, is this auxiliary chardev approach
really the model we should be pursuing?

I know we're trying to disassociate the EGM region from the GPU, and
de-duplicate it between GPUs on the same socket, but is there actually a
use case of the EGM chardev separate from the GPU?

The independent lifecycle of this aux device is troubling and it hasn't
been confirmed whether or not access to the EGM region has some
dependency on the state of the GPU.  nvgrace-gpu is manipulating sysfs
on devices owned by nvgrace-egm, we don't have mechanisms to manage the
aux device relative to the state of the GPU, we're trying to add a
driver that can bind to a device created by an out-of-tree driver, and
we're inventing new uAPIs on the chardev for things that already exist
for vfio regions.

Therefore, does it actually make more sense to expose EGM as a device
specific region on the vfio device fd?

For example, nvgrace-gpu might manage the de-duplication by only
exposing this device specific region on the lowest BDF GPU per socket.
The existing REGION_INFO ioctl handles reporting the size to the user.
The direct association to the GPU device handles reporting the node
locality.  If necessary, a capability on the region could report the
associated PXM, and maybe even the retired page list.
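
As a rough userspace illustration of what that could look like (the
EGM-specific sub-type and any PXM capability would be new definitions;
only VFIO_DEVICE_GET_REGION_INFO and the generic capability chain are
existing uAPI, and get_region_info() below is just a sketch):

#include <linux/vfio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/*
 * Illustrative only: query a device specific region by index and size the
 * buffer for its capability chain. A vendor sub-type (and, if needed, a
 * capability reporting the PXM or retired pages) would be defined for EGM.
 */
static struct vfio_region_info *get_region_info(int device_fd, __u32 index)
{
        struct vfio_region_info hdr = {
                .argsz = sizeof(hdr),
                .index = index,
        };
        struct vfio_region_info *info;

        if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &hdr))
                return NULL;

        /* Re-issue with a buffer large enough for any capability chain. */
        info = calloc(1, hdr.argsz);
        if (!info)
                return NULL;

        info->argsz = hdr.argsz;
        info->index = index;
        if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info)) {
                free(info);
                return NULL;
        }

        /* info->size reports the region size; capabilities start at
         * info->cap_offset when VFIO_REGION_INFO_FLAG_CAPS is set. */
        return info;
}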

All of the lifecycle issues are automatically handled, there's no
separate aux device.  If necessary, zapping and faulting across reset
is handled just like a BAR mapping.

If we need to expose the EGM size and GPU association via sysfs for
management tooling, nvgrace-gpu could add an "egm_size" attribute to the
PCI device's sysfs node.  This could also avoid the implicit
implementation knowledge about which GPU exposes the EGM device
specific region.

Was such a design considered?  It seems much, much simpler and could be
implemented by either nvgrace-gpu or identically by an out-of-tree
driver without references in an in-kernel ID table.

I'd like to understand the pros and cons of such an approach vs the one
presented here.  Thanks,

Alex