The Grace SoC introduces Extended GPU Memory (EGM) [3], a feature that enables
GPUs to efficiently access system memory within and across nodes. This patch
series adds support for virtualizing EGM (vEGM) in libvirt, allowing VMs to
use dedicated EGM regions exposed through ACPI.
This series is a follow-up to the first EGM RFC series [0] and is posted to
gather feedback from the libvirt community on the overall approach and the
implementation details. Kernel EGM driver support and the QEMU
acpi-egm-memory device are not yet upstream, but reference implementations
are available [1][2] for testing and validating the libvirt integration.
Any community feedback is appreciated.
Background and Use Cases
========================
EGM allows host memory to be partitioned into two regions:
1. Standard memory for Host OS usage
2. EGM region assigned to VMs as their system memory
This partitioning supports several use cases [3]:
- Large memory pools for AI/ML workloads
- High-performance computing applications
- Memory extension for systems with limited main memory
- GPU-accelerated workloads requiring large addressable memory
Implementation Overview
=======================
This series adds a new memory device model, VIR_DOMAIN_MEMORY_MODEL_EGM.
Its 'path' source element names the host EGM device that backs the memory,
and its 'pciDev' target element names the alias of the PCI device that the
vEGM is associated with.
For instance, given the XML stanzas below:
  <memory model='egm' access='shared'>
    <source>
      <path>/dev/egm4</path>
    </source>
    <target>
      <size unit='KiB'>8388608</size>
      <node>0</node>
      <pciDev>ua-hostdev0</pciDev>
    </target>
  </memory>
  <memory model='egm' access='shared'>
    <source>
      <path>/dev/egm5</path>
    </source>
    <target>
      <size unit='KiB'>8388608</size>
      <node>1</node>
      <pciDev>ua-hostdev1</pciDev>
    </target>
  </memory>
The corresponding qemu command line will include the following arguments:
  -object '{"qom-type":"memory-backend-file","id":"memegm4","mem-path":"/dev/egm4","share":true,"prealloc":true,"size":8589934592}' \
  -object acpi-egm-memory,id=egm4,pci-dev=ua-hostdev0,node=0 \
  -object '{"qom-type":"memory-backend-file","id":"memegm5","mem-path":"/dev/egm5","share":true,"prealloc":true,"size":8589934592}' \
  -object acpi-egm-memory,id=egm5,pci-dev=ua-hostdev1,node=1 \
  -numa node,nodeid=0,cpus=0-1,memdev=memegm4 \
  -numa node,nodeid=1,cpus=2-3,memdev=memegm5 \
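
To make the mapping from the XML to the -object arguments easier to review,
here is a rough Python sketch that reproduces it for the example above. It is
not the code in this series (that lives in src/qemu/qemu_command.c) and only
mirrors what the example shows. The function name egm_args, the <devices>
wrapper, deriving the object ids from the basename of the device path, and
mapping access='shared' to "share":true are assumptions made for illustration.
Note that 8388608 KiB is 8388608 * 1024 = 8589934592 bytes, the "size" value
passed to QEMU.

```python
import json
import os
import xml.etree.ElementTree as ET

# Byte multipliers for the <size unit='...'> values this sketch understands.
UNIT_BYTES = {"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}

# The two stanzas from the example, wrapped in a <devices> element.
DOMAIN_XML = """
<devices>
  <memory model='egm' access='shared'>
    <source><path>/dev/egm4</path></source>
    <target>
      <size unit='KiB'>8388608</size>
      <node>0</node>
      <pciDev>ua-hostdev0</pciDev>
    </target>
  </memory>
  <memory model='egm' access='shared'>
    <source><path>/dev/egm5</path></source>
    <target>
      <size unit='KiB'>8388608</size>
      <node>1</node>
      <pciDev>ua-hostdev1</pciDev>
    </target>
  </memory>
</devices>
"""


def egm_args(xml_text):
    """Return the -object values for each <memory model='egm'> element."""
    args = []
    for mem in ET.fromstring(xml_text).iter("memory"):
        if mem.get("model") != "egm":
            continue
        path = mem.findtext("source/path")
        size = mem.find("target/size")
        node = mem.findtext("target/node")
        pci_dev = mem.findtext("target/pciDev")
        # Assumption: ids are derived from the device node name (egm4, egm5).
        name = os.path.basename(path)
        backend = {
            "qom-type": "memory-backend-file",
            "id": "mem" + name,
            "mem-path": path,
            # Assumption: access='shared' is what sets "share" to true.
            "share": mem.get("access") == "shared",
            "prealloc": True,
            # Convert the XML size to bytes, as QEMU expects.
            "size": int(size.text) * UNIT_BYTES[size.get("unit", "KiB")],
        }
        args.append(json.dumps(backend, separators=(",", ":")))
        args.append(f"acpi-egm-memory,id={name},pci-dev={pci_dev},node={node}")
    return args


if __name__ == "__main__":
    for arg in egm_args(DOMAIN_XML):
        print("-object", arg)
```

The sketch leaves out the -numa arguments, since their cpus= values come from
the domain's NUMA topology rather than from the memory device stanzas.
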
Changes from RFCv1:
- Use the existing memory device infrastructure to represent EGM configuration
- Add support for multiple EGM devices
This series is also available on GitHub:
https://github.com/NathanChenNVIDIA/libvirt/tree/egm-11-06-25
Thanks,
Nathan
[0] https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/4ZFLQSNSO4BGKIRH7QAIROXYD7E4ST4Z/
[1] https://github.com/ianm-nv/qemu/tree/6.8_ghvirt_egm_may2025
[2] https://github.com/NVIDIA/QEMU/commit/32db1b74fb99c0571724c7e69485e89098c14874
[3] https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/#extended_gpu_memory
Ian May (1):
tests: Add qemuxmlconftest for ACPI EGM memory device
Nathan Chen (3):
conf: Support EGM memory device model
qemu: Add cgroup, namespace, and seclabel setup for EGM memory device
model
qemu: Add qemu CLI support for EGM
docs/formatdomain.rst | 18 ++++-
src/conf/domain_conf.c | 33 ++++++++-
src/conf/domain_conf.h | 7 ++
src/conf/domain_postparse.c | 6 +-
src/conf/domain_validate.c | 15 +++++
src/conf/schemas/domaincommon.rng | 6 ++
src/qemu/qemu_alias.c | 17 +++++
src/qemu/qemu_capabilities.c | 2 +
src/qemu/qemu_capabilities.h | 1 +
src/qemu/qemu_cgroup.c | 10 +++
src/qemu/qemu_command.c | 66 +++++++++++++++++-
src/qemu/qemu_domain.c | 15 ++++-
src/qemu/qemu_domain_address.c | 6 ++
src/qemu/qemu_driver.c | 1 +
src/qemu/qemu_hotplug.c | 1 +
src/qemu/qemu_monitor_json.c | 1 +
src/qemu/qemu_namespace.c | 3 +
src/qemu/qemu_postparse.c | 1 +
src/qemu/qemu_process.c | 2 +
src/qemu/qemu_validate.c | 6 ++
src/security/security_apparmor.c | 2 +
src/security/security_dac.c | 8 +++
src/security/security_selinux.c | 6 ++
src/security/virt-aa-helper.c | 4 ++
src/util/virfile.h | 2 +-
tests/meson.build | 1 +
tests/qemuegmmock.c | 67 +++++++++++++++++++
.../acpi-egm-memory.aarch64-latest.args | 36 ++++++++++
.../acpi-egm-memory.aarch64-latest.xml | 57 ++++++++++++++++
tests/qemuxmlconfdata/acpi-egm-memory.xml | 33 +++++++++
tests/qemuxmlconftest.c | 8 ++-
31 files changed, 433 insertions(+), 8 deletions(-)
create mode 100644 tests/qemuegmmock.c
create mode 100644 tests/qemuxmlconfdata/acpi-egm-memory.aarch64-latest.args
create mode 100644 tests/qemuxmlconfdata/acpi-egm-memory.aarch64-latest.xml
create mode 100644 tests/qemuxmlconfdata/acpi-egm-memory.xml
--
2.43.0