[PATCH v1] drm/amdgpu: fix sync handling in amdgpu_dma_buf_move_notify

Pierre-Eric Pelloux-Prayer posted 1 patch 1 month, 3 weeks ago
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
[PATCH v1] drm/amdgpu: fix sync handling in amdgpu_dma_buf_move_notify
Posted by Pierre-Eric Pelloux-Prayer 1 month, 3 weeks ago
Invalidating a dmabuf will impact other users of the shared BO.
In the scenario where process A moves the BO, it needs to inform
process B about the move and process B will need to update its
page table.

This commit fixes a synchronisation bug caused by the use of the
ticket: passing it made amdgpu_vm_handle_moved behave as if updating
the page table immediately were safe, which is not the case here.

An example is the following scenario, with 2 GPUs and glxgears
running on GPU0 and Xorg running on GPU1, on a system where P2P
PCI isn't supported:

glxgears:
  export linear buffer from GPU0 and import using GPU1
  submit frame rendering to GPU0
  submit tiled->linear blit
Xorg:
  copy of linear buffer

The sequence of jobs would be:
  drm_sched_job_run                       # GPU0, frame rendering
  drm_sched_job_queue                     # GPU0, blit
  drm_sched_job_done                      # GPU0, frame rendering
  drm_sched_job_run                       # GPU0, blit
  move linear buffer for GPU1 access      #
  amdgpu_dma_buf_move_notify -> update pt # GPU0

At this point the blit job on GPU0 is still running and would
likely produce a page fault.

Fixes: a448cb003edc ("drm/amdgpu: implement amdgpu_gem_prime_move_notify v2")
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index b9c38a4fe546..656c267dbe58 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -514,8 +514,15 @@ amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
 		r = dma_resv_reserve_fences(resv, 2);
 		if (!r)
 			r = amdgpu_vm_clear_freed(adev, vm, NULL);
+
+		/* Don't pass 'ticket' to amdgpu_vm_handle_moved: we want the clear=true
+		 * path to be used otherwise we might update the PT of another process
+		 * while it's using the BO.
+		 * With clear=true, amdgpu_vm_bo_update will sync to command submission
+		 * from the same VM.
+		 */
 		if (!r)
-			r = amdgpu_vm_handle_moved(adev, vm, ticket);
+			r = amdgpu_vm_handle_moved(adev, vm, NULL);
 
 		if (r && r != -EBUSY)
 			DRM_ERROR("Failed to invalidate VM page tables (%d))\n",
-- 
2.43.0
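
The ordering problem described in the commit message can be illustrated with a small user-space toy model. This is only a sketch under stated assumptions: it is not amdgpu code, every name in it (toy_vm, toy_update_pt, toy_job_done, toy_replay) is hypothetical, and it models nothing but the question of whether a page-table update waits for a job that is still in flight. The "immediate" policy stands in for passing the ticket, the "wait" policy for the synchronised path taken when NULL is passed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of one VM's view of a shared BO (not amdgpu code). */
struct toy_vm {
	int pt_entry;       /* address currently mapped for the shared BO */
	bool job_running;   /* a submitted job is still using the mapping */
	int pending_entry;  /* update deferred until the running job is done */
	bool has_pending;
	int faults;         /* number of simulated page faults */
};

/*
 * Update the page-table entry after the BO has moved.
 * wait_for_jobs == false: apply at once, even under a running job.
 * wait_for_jobs == true:  defer the update until the job has completed.
 */
static void toy_update_pt(struct toy_vm *vm, int new_addr, bool wait_for_jobs)
{
	if (vm->job_running && wait_for_jobs) {
		vm->pending_entry = new_addr;
		vm->has_pending = true;
		return;
	}
	if (vm->job_running)
		vm->faults++;   /* the running job loses its mapping */
	vm->pt_entry = new_addr;
}

/* Job completion: apply any update that was waiting for it. */
static void toy_job_done(struct toy_vm *vm)
{
	vm->job_running = false;
	if (vm->has_pending) {
		vm->pt_entry = vm->pending_entry;
		vm->has_pending = false;
	}
}

/* Replay the tail of the sequence above; return the number of faults. */
static int toy_replay(bool wait_for_jobs)
{
	struct toy_vm vm = { .pt_entry = 0x1000 };

	vm.job_running = true;                     /* blit starts on GPU0 */
	toy_update_pt(&vm, 0x2000, wait_for_jobs); /* move notify arrives */
	toy_job_done(&vm);                         /* blit completes */

	/* Both policies end with the new mapping; only the timing differs. */
	assert(vm.pt_entry == 0x2000);
	return vm.faults;
}
```

Calling toy_replay(false) counts one simulated fault because the mapping changes under the running blit, while toy_replay(true) counts none because the update is applied only once the job has completed. The real driver expresses this ordering through fences on the reservation object rather than a flag, so the model should be read as an analogy only.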
Re: [PATCH v1] drm/amdgpu: fix sync handling in amdgpu_dma_buf_move_notify
Posted by Christian König 1 month, 3 weeks ago
On 2/10/26 10:14, Pierre-Eric Pelloux-Prayer wrote:
> Invalidating a dmabuf will impact other users of the shared BO.
> In the scenario where process A moves the BO, it needs to inform
> process B about the move and process B will need to update its
> page table.
> 
> This commit fixes a synchronisation bug caused by the use of the
> ticket: passing it made amdgpu_vm_handle_moved behave as if updating
> the page table immediately were safe, which is not the case here.
> 
> An example is the following scenario, with 2 GPUs and glxgears
> running on GPU0 and Xorg running on GPU1, on a system where P2P
> PCI isn't supported:
> 
> glxgears:
>   export linear buffer from GPU0 and import using GPU1
>   submit frame rendering to GPU0
>   submit tiled->linear blit
> Xorg:
>   copy of linear buffer
> 
> The sequence of jobs would be:
>   drm_sched_job_run                       # GPU0, frame rendering
>   drm_sched_job_queue                     # GPU0, blit
>   drm_sched_job_done                      # GPU0, frame rendering
>   drm_sched_job_run                       # GPU0, blit
>   move linear buffer for GPU1 access      #
>   amdgpu_dma_buf_move_notify -> update pt # GPU0
> 
> At this point the blit job on GPU0 is still running and would
> likely produce a page fault.
> 
> Fixes: a448cb003edc ("drm/amdgpu: implement amdgpu_gem_prime_move_notify v2")

CC: stable?

> Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> index b9c38a4fe546..656c267dbe58 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> @@ -514,8 +514,15 @@ amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
>  		r = dma_resv_reserve_fences(resv, 2);
>  		if (!r)
>  			r = amdgpu_vm_clear_freed(adev, vm, NULL);
> +
> +		/* Don't pass 'ticket' to amdgpu_vm_handle_moved: we want the clear=true
> +		 * path to be used otherwise we might update the PT of another process
> +		 * while it's using the BO.
> +		 * With clear=true, amdgpu_vm_bo_update will sync to command submission
> +		 * from the same VM.
> +		 */
>  		if (!r)
> -			r = amdgpu_vm_handle_moved(adev, vm, ticket);
> +			r = amdgpu_vm_handle_moved(adev, vm, NULL);
>  
>  		if (r && r != -EBUSY)
>  			DRM_ERROR("Failed to invalidate VM page tables (%d))\n",