When hibernating with data center dGPUs, a huge number of VRAM BOs are
evicted to GTT and take up too much system memory. This causes
hibernation to fail due to insufficient memory for creating the
hibernation image.

Move GTT BOs to shmem in KMD, then from shmem to the swap disk in the
kernel hibernation code, to make room for the hibernation image.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 27ab4e754b2a..a0b0682236e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -2414,6 +2414,7 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
 int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
 {
 	struct ttm_resource_manager *man;
+	int r;

 	switch (mem_type) {
 	case TTM_PL_VRAM:
@@ -2428,7 +2429,17 @@ int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
 		return -EINVAL;
 	}

-	return ttm_resource_manager_evict_all(&adev->mman.bdev, man);
+	r = ttm_resource_manager_evict_all(&adev->mman.bdev, man);
+	if (r) {
+		DRM_ERROR("Failed to evict memory type %d\n", mem_type);
+		return r;
+	}
+	if (adev->in_s4 && mem_type == TTM_PL_VRAM) {
+		r = ttm_device_prepare_hibernation();
+		if (r)
+			DRM_ERROR("Failed to swap out, %d\n", r);
+	}
+	return r;
 }

 #if defined(CONFIG_DEBUG_FS)
--
2.43.5

On 04.07.25 12:12, Samuel Zhang wrote:
> When hibernating with data center dGPUs, a huge number of VRAM BOs are
> evicted to GTT and take up too much system memory. This causes
> hibernation to fail due to insufficient memory for creating the
> hibernation image.
>
> Move GTT BOs to shmem in KMD, then from shmem to the swap disk in the
> kernel hibernation code, to make room for the hibernation image.
>
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 27ab4e754b2a..a0b0682236e3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -2414,6 +2414,7 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>  int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>  {
>  	struct ttm_resource_manager *man;
> +	int r;
>
>  	switch (mem_type) {
>  	case TTM_PL_VRAM:
> @@ -2428,7 +2429,17 @@ int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>  		return -EINVAL;
>  	}
>
> -	return ttm_resource_manager_evict_all(&adev->mman.bdev, man);
> +	r = ttm_resource_manager_evict_all(&adev->mman.bdev, man);
> +	if (r) {
> +		DRM_ERROR("Failed to evict memory type %d\n", mem_type);
> +		return r;
> +	}
> +	if (adev->in_s4 && mem_type == TTM_PL_VRAM) {
> +		r = ttm_device_prepare_hibernation();
> +		if (r)
> +			DRM_ERROR("Failed to swap out, %d\n", r);
> +	}
> +	return r;

That call needs to go into a separate amdgpu_ttm_* function and should
only be called from amdgpu_device_evict_resources().

Otherwise the debugfs tests will trigger it as well, which is undesirable.

Regards,
Christian.

>  }
>
>  #if defined(CONFIG_DEBUG_FS)

On 7/4/2025 6:12 AM, Samuel Zhang wrote:
> When hibernating with data center dGPUs, a huge number of VRAM BOs are
> evicted to GTT and take up too much system memory. This causes
> hibernation to fail due to insufficient memory for creating the
> hibernation image.
>
> Move GTT BOs to shmem in KMD, then from shmem to the swap disk in the
> kernel hibernation code, to make room for the hibernation image.
>
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 27ab4e754b2a..a0b0682236e3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -2414,6 +2414,7 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>  int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>  {
>  	struct ttm_resource_manager *man;
> +	int r;
>
>  	switch (mem_type) {
>  	case TTM_PL_VRAM:
> @@ -2428,7 +2429,17 @@ int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type)
>  		return -EINVAL;
>  	}
>
> -	return ttm_resource_manager_evict_all(&adev->mman.bdev, man);
> +	r = ttm_resource_manager_evict_all(&adev->mman.bdev, man);
> +	if (r) {
> +		DRM_ERROR("Failed to evict memory type %d\n", mem_type);

For new code, can you please use the drm_err() macro instead? This will
help show which GPU had the problem with eviction.

> +		return r;
> +	}
> +	if (adev->in_s4 && mem_type == TTM_PL_VRAM) {
> +		r = ttm_device_prepare_hibernation();
> +		if (r)
> +			DRM_ERROR("Failed to swap out, %d\n", r);

For new code, can you please use the drm_err() macro instead?

> +	}
> +	return r;
>  }
>
>  #if defined(CONFIG_DEBUG_FS)
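
The two review comments in this thread can be illustrated with a small user-space sketch. This is not kernel code and not the follow-up patch: every mock_* name is hypothetical, the helpers only count calls instead of touching memory, and mock_dev_err() merely imitates the idea of tagging each message with the device it came from, which is the reason given above for preferring drm_err(). The sketch shows the shape being asked for, assuming the swap-out step moves into its own helper that only the device-level eviction path calls.

```c
#include <stdio.h>

/* Hypothetical stand-ins; none of these are real amdgpu or TTM symbols. */
enum mock_mem_type { MOCK_PL_VRAM, MOCK_PL_GTT };

struct mock_device {
	const char *name;	/* identity printed with every error message */
	int in_s4;		/* non-zero while a hibernation is in progress */
	int evict_calls;	/* how often eviction ran */
	int swapout_calls;	/* how often the hibernation swap-out ran */
};

/* Per-device error logging, imitating the intent of drm_err(). */
#define mock_dev_err(dev, ...)						\
	do {								\
		fprintf(stderr, "[%s] *ERROR* ", (dev)->name);		\
		fprintf(stderr, __VA_ARGS__);				\
	} while (0)

/* Eviction only. A debugfs-style test can call this without side effects
 * on the hibernation path. */
static int mock_ttm_evict_resources(struct mock_device *dev,
				    enum mock_mem_type type)
{
	(void)type;
	dev->evict_calls++;
	return 0;
}

/* Separate helper for the hibernation-only swap-out step. */
static int mock_ttm_prepare_hibernation(struct mock_device *dev)
{
	dev->swapout_calls++;
	return 0;
}

/* Device-level path: the only caller of the hibernation helper. */
static int mock_device_evict_resources(struct mock_device *dev)
{
	int r;

	r = mock_ttm_evict_resources(dev, MOCK_PL_VRAM);
	if (r) {
		mock_dev_err(dev, "Failed to evict VRAM, %d\n", r);
		return r;
	}

	if (dev->in_s4) {
		r = mock_ttm_prepare_hibernation(dev);
		if (r)
			mock_dev_err(dev, "Failed to swap out, %d\n", r);
	}
	return r;
}
```

With this split, calling mock_ttm_evict_resources() directly (as a debugfs test would) never reaches the swap-out helper, while mock_device_evict_resources() reaches it only when in_s4 is set.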