[PATCH 3/3] drm/amdgpu: skip kfd resume_process for dev_pm_ops.thaw()

Samuel Zhang posted 3 patches 3 months, 1 week ago
There is a newer version of this series
Posted by Samuel Zhang 3 months, 1 week ago
The workflow of a successful hibernation:
- prepare: evict VRAM and swap out GTT BOs
- freeze
- create the hibernation image in system memory
- thaw: swap in and restore BOs
- complete
- write hibernation image to disk
- amdgpu_pci_shutdown
- enter S5 and power off the system

During the prepare stage of hibernation, VRAM and GTT BOs are swapped
out to shmem. Then, in the thaw stage, all BOs are swapped in and
restored.

On a server with 8 dGPUs (192GB of VRAM each) and 1.7TB of system
memory, swapping in and restoring the BOs takes too long (50 minutes),
and it is unnecessary because the follow-up stages do not use the GPU.

Skip the BO restore during thaw to reduce the hibernation time.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a8f4697deb1b..b550d07190a2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5328,7 +5328,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
 		amdgpu_virt_init_data_exchange(adev);
 		amdgpu_virt_release_full_gpu(adev, true);
 
-		if (!adev->in_s0ix && !r && !adev->in_runpm)
+		if (!adev->in_s0ix && !r && !adev->in_runpm && !adev->in_s4)
 			r = amdgpu_amdkfd_resume_process(adev);
 	}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 571b70da4562..23b76e8ac2fd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2734,7 +2734,9 @@ static int amdgpu_pmops_poweroff(struct device *dev)
 static int amdgpu_pmops_restore(struct device *dev)
 {
 	struct drm_device *drm_dev = dev_get_drvdata(dev);
+	struct amdgpu_device *adev = drm_to_adev(drm_dev);
 
+	adev->in_s4 = false;
 	return amdgpu_device_resume(drm_dev, true);
 }
 
-- 
2.43.5
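
The effect of the two hunks above can be illustrated with a small
user-space model. This is only a sketch, not driver code: struct
model_dev and every model_* function are hypothetical stand-ins, the
KFD resume is a stub that always succeeds, and the model assumes that
in_s4 is still set when thaw runs (that is set up elsewhere and is not
shown in this patch).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few amdgpu_device flags the patch reads. */
struct model_dev {
	bool in_s0ix;
	bool in_runpm;
	bool in_s4;
	int kfd_resume_calls;	/* how often the stubbed KFD resume ran */
};

/* Stub standing in for amdgpu_amdkfd_resume_process(); always succeeds. */
static int model_kfd_resume_process(struct model_dev *d)
{
	d->kfd_resume_calls++;
	return 0;
}

/* Models only the patched condition from amdgpu_device_resume(). */
static int model_device_resume(struct model_dev *d)
{
	int r = 0;	/* earlier resume steps are assumed to have succeeded */

	if (!d->in_s0ix && !r && !d->in_runpm && !d->in_s4)
		r = model_kfd_resume_process(d);
	return r;
}

/* thaw: ASSUMED to run with in_s4 still set, so the KFD resume is skipped. */
static int model_pmops_thaw(struct model_dev *d)
{
	return model_device_resume(d);
}

/* restore: clears in_s4 first, as the patch does, so the KFD resume runs. */
static int model_pmops_restore(struct model_dev *d)
{
	d->in_s4 = false;
	return model_device_resume(d);
}
```

Under these assumptions, calling model_pmops_thaw() on a device with
in_s4 set leaves kfd_resume_calls at 0, while a later
model_pmops_restore() clears in_s4 and raises the counter to 1.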
Re: [PATCH 3/3] drm/amdgpu: skip kfd resume_process for dev_pm_ops.thaw()
Posted by Christian König 3 months, 1 week ago
On 30.06.25 12:41, Samuel Zhang wrote:
> The hibernation successful workflow:
> - prepare: evict VRAM and swapout GTT BOs
> - freeze
> - create the hibernation image in system memory
> - thaw: swapin and restore BOs

Why should a thaw happen here in between?

> - complete
> - write hibernation image to disk
> - amdgpu_pci_shutdown
> - goto S5, turn off the system.
> 
> During prepare stage of hibernation, VRAM and GTT BOs will be swapout to
> shmem. Then in thaw stage, all BOs will be swapin and restored.

That's not correct. This is done by the application starting again and not during thaw.

> 
> On server with 192GB VRAM * 8 dGPUs and 1.7TB system memory,
> the swapin and restore BOs takes too long (50 minutes) and it is not
> necessary since the follow-up stages does not use GPU.
> 
> This patch is to skip BOs restore during thaw to reduce the hibernation
> time.

As far as I can see that doesn't make sense. The KFD processes need to be resumed here and that can't be skipped.

Regards,
Christian.

> 
> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 ++
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index a8f4697deb1b..b550d07190a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5328,7 +5328,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>  		amdgpu_virt_init_data_exchange(adev);
>  		amdgpu_virt_release_full_gpu(adev, true);
>  
> -		if (!adev->in_s0ix && !r && !adev->in_runpm)
> +		if (!adev->in_s0ix && !r && !adev->in_runpm && !adev->in_s4)
>  			r = amdgpu_amdkfd_resume_process(adev);
>  	}
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 571b70da4562..23b76e8ac2fd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2734,7 +2734,9 @@ static int amdgpu_pmops_poweroff(struct device *dev)
>  static int amdgpu_pmops_restore(struct device *dev)
>  {
>  	struct drm_device *drm_dev = dev_get_drvdata(dev);
> +	struct amdgpu_device *adev = drm_to_adev(drm_dev);
>  
> +	adev->in_s4 = false;
>  	return amdgpu_device_resume(drm_dev, true);
>  }
>