[PATCH] drm/sched: Clarify docu concerning drm_sched_job_arm()

Philipp Stanner posted 1 patch 9 months, 1 week ago
drivers/gpu/drm/scheduler/sched_main.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
[PATCH] drm/sched: Clarify docu concerning drm_sched_job_arm()
Posted by Philipp Stanner 9 months, 1 week ago
The documentation for drm_sched_job_arm() and especially
drm_sched_job_cleanup() does not make it very clear why
drm_sched_job_arm() is a point of no return, which it indeed is.

Make the nature of drm_sched_job_arm() in the docu as clear as possible.

Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 4d4219fbe49d..829579c41c6b 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -828,11 +828,15 @@ EXPORT_SYMBOL(drm_sched_job_init);
  *
  * This arms a scheduler job for execution. Specifically it initializes the
  * &drm_sched_job.s_fence of @job, so that it can be attached to struct dma_resv
- * or other places that need to track the completion of this job.
+ * or other places that need to track the completion of this job. It also
+ * initializes sequence numbers, which are fundamental for fence ordering.
  *
  * Refer to drm_sched_entity_push_job() documentation for locking
  * considerations.
  *
+ * Once this function was called, you *must* submit @job with
+ * drm_sched_entity_push_job().
+ *
  * This can only be called if drm_sched_job_init() succeeded.
  */
 void drm_sched_job_arm(struct drm_sched_job *job)
@@ -1017,9 +1021,12 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
  * Drivers should call this from their error unwind code if @job is aborted
  * before drm_sched_job_arm() is called.
  *
- * After that point of no return @job is committed to be executed by the
- * scheduler, and this function should be called from the
- * &drm_sched_backend_ops.free_job callback.
+ * drm_sched_job_arm() is a point of no return since it initializes the fences
+ * and their sequence number etc. Once that function has been called, you *must*
+ * submit it with drm_sched_entity_push_job() and cannot simply abort it by
+ * calling drm_sched_job_cleanup().
+ *
+ * This function should be called in the &drm_sched_backend_ops.free_job callback.
  */
 void drm_sched_job_cleanup(struct drm_sched_job *job)
 {
@@ -1027,10 +1034,15 @@ void drm_sched_job_cleanup(struct drm_sched_job *job)
 	unsigned long index;
 
 	if (kref_read(&job->s_fence->finished.refcount)) {
-		/* drm_sched_job_arm() has been called */
+		/* The job has been processed by the scheduler, i.e.,
+		 * drm_sched_job_arm() and drm_sched_entity_push_job() have
+		 * been called.
+		 */
 		dma_fence_put(&job->s_fence->finished);
 	} else {
-		/* aborted job before committing to run it */
+		/* The job was aborted before it has been committed to be run;
+		 * notably, drm_sched_job_arm() has not been called.
+		 */
 		drm_sched_fence_free(job->s_fence);
 	}
 
-- 
2.48.1

Re: [PATCH] drm/sched: Clarify docu concerning drm_sched_job_arm()
Posted by Christian König 9 months, 1 week ago
Am 13.03.25 um 10:30 schrieb Philipp Stanner:
> The documentation for drm_sched_job_arm() and especially
> drm_sched_job_cleanup() does not make it very clear why
> drm_sched_job_arm() is a point of no return, which it indeed is.
>
> Make the nature of drm_sched_job_arm() in the docu as clear as possible.
>
> Suggested-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Philipp Stanner <phasta@kernel.org>

Reviewed-by: Christian König <christian.koenig@amd.com>

I'm currently looking into how to fix the amdgpu CS path for gang submission regarding this.

Any objections that I add a preload function to allocate the memory for the XA outside of the critical section?

Regards,
Christian.

> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 4d4219fbe49d..829579c41c6b 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -828,11 +828,15 @@ EXPORT_SYMBOL(drm_sched_job_init);
>   *
>   * This arms a scheduler job for execution. Specifically it initializes the
>   * &drm_sched_job.s_fence of @job, so that it can be attached to struct dma_resv
> - * or other places that need to track the completion of this job.
> + * or other places that need to track the completion of this job. It also
> + * initializes sequence numbers, which are fundamental for fence ordering.
>   *
>   * Refer to drm_sched_entity_push_job() documentation for locking
>   * considerations.
>   *
> + * Once this function was called, you *must* submit @job with
> + * drm_sched_entity_push_job().
> + *
>   * This can only be called if drm_sched_job_init() succeeded.
>   */
>  void drm_sched_job_arm(struct drm_sched_job *job)
> @@ -1017,9 +1021,12 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>   * Drivers should call this from their error unwind code if @job is aborted
>   * before drm_sched_job_arm() is called.
>   *
> - * After that point of no return @job is committed to be executed by the
> - * scheduler, and this function should be called from the
> - * &drm_sched_backend_ops.free_job callback.
> + * drm_sched_job_arm() is a point of no return since it initializes the fences
> + * and their sequence number etc. Once that function has been called, you *must*
> + * submit it with drm_sched_entity_push_job() and cannot simply abort it by
> + * calling drm_sched_job_cleanup().
> + *
> + * This function should be called in the &drm_sched_backend_ops.free_job callback.
>   */
>  void drm_sched_job_cleanup(struct drm_sched_job *job)
>  {
> @@ -1027,10 +1034,15 @@ void drm_sched_job_cleanup(struct drm_sched_job *job)
>  	unsigned long index;
>  
>  	if (kref_read(&job->s_fence->finished.refcount)) {
> -		/* drm_sched_job_arm() has been called */
> +		/* The job has been processed by the scheduler, i.e.,
> +		 * drm_sched_job_arm() and drm_sched_entity_push_job() have
> +		 * been called.
> +		 */
>  		dma_fence_put(&job->s_fence->finished);
>  	} else {
> -		/* aborted job before committing to run it */
> +		/* The job was aborted before it has been committed to be run;
> +		 * notably, drm_sched_job_arm() has not been called.
> +		 */
>  		drm_sched_fence_free(job->s_fence);
>  	}
>  

Re: [PATCH] drm/sched: Clarify docu concerning drm_sched_job_arm()
Posted by Philipp Stanner 9 months, 1 week ago
On Thu, 2025-03-13 at 11:07 +0100, Christian König wrote:
> Am 13.03.25 um 10:30 schrieb Philipp Stanner:
> > The documentation for drm_sched_job_arm() and especially
> > drm_sched_job_cleanup() does not make it very clear why
> > drm_sched_job_arm() is a point of no return, which it indeed is.
> > 
> > Make the nature of drm_sched_job_arm() in the docu as clear as
> > possible.
> > 
> > Suggested-by: Christian König <christian.koenig@amd.com>
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> 
> Reviewed-by: Christian König <christian.koenig@amd.com>

Applied to drm-misc-next

P.

> 
> I'm currently looking into how to fix the amdgpu CS path for gang
> submission regarding this.
> 
> Any objections that I add a preload function to allocate the memory
> for the XA outside of the critical section?
> 
> Regards,
> Christian.
> 
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 24 ++++++++++++++++++----
> > --
> >  1 file changed, 18 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index 4d4219fbe49d..829579c41c6b 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -828,11 +828,15 @@ EXPORT_SYMBOL(drm_sched_job_init);
> >   *
> >   * This arms a scheduler job for execution. Specifically it
> > initializes the
> >   * &drm_sched_job.s_fence of @job, so that it can be attached to
> > struct dma_resv
> > - * or other places that need to track the completion of this job.
> > + * or other places that need to track the completion of this job.
> > It also
> > + * initializes sequence numbers, which are fundamental for fence
> > ordering.
> >   *
> >   * Refer to drm_sched_entity_push_job() documentation for locking
> >   * considerations.
> >   *
> > + * Once this function was called, you *must* submit @job with
> > + * drm_sched_entity_push_job().
> > + *
> >   * This can only be called if drm_sched_job_init() succeeded.
> >   */
> >  void drm_sched_job_arm(struct drm_sched_job *job)
> > @@ -1017,9 +1021,12 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
> >   * Drivers should call this from their error unwind code if @job
> > is aborted
> >   * before drm_sched_job_arm() is called.
> >   *
> > - * After that point of no return @job is committed to be executed
> > by the
> > - * scheduler, and this function should be called from the
> > - * &drm_sched_backend_ops.free_job callback.
> > + * drm_sched_job_arm() is a point of no return since it
> > initializes the fences
> > + * and their sequence number etc. Once that function has been
> > called, you *must*
> > + * submit it with drm_sched_entity_push_job() and cannot simply
> > abort it by
> > + * calling drm_sched_job_cleanup().
> > + *
> > + * This function should be called in the
> > &drm_sched_backend_ops.free_job callback.
> >   */
> >  void drm_sched_job_cleanup(struct drm_sched_job *job)
> >  {
> > @@ -1027,10 +1034,15 @@ void drm_sched_job_cleanup(struct
> > drm_sched_job *job)
> >  	unsigned long index;
> >  
> >  	if (kref_read(&job->s_fence->finished.refcount)) {
> > -		/* drm_sched_job_arm() has been called */
> > +		/* The job has been processed by the scheduler,
> > i.e.,
> > +		 * drm_sched_job_arm() and
> > drm_sched_entity_push_job() have
> > +		 * been called.
> > +		 */
> >  		dma_fence_put(&job->s_fence->finished);
> >  	} else {
> > -		/* aborted job before committing to run it */
> > +		/* The job was aborted before it has been
> > committed to be run;
> > +		 * notably, drm_sched_job_arm() has not been
> > called.
> > +		 */
> >  		drm_sched_fence_free(job->s_fence);
> >  	}
> >  
> 
Re: [PATCH] drm/sched: Clarify docu concerning drm_sched_job_arm()
Posted by Philipp Stanner 9 months, 1 week ago
On Thu, 2025-03-13 at 11:07 +0100, Christian König wrote:
> Am 13.03.25 um 10:30 schrieb Philipp Stanner:
> > The documentation for drm_sched_job_arm() and especially
> > drm_sched_job_cleanup() does not make it very clear why
> > drm_sched_job_arm() is a point of no return, which it indeed is.
> > 
> > Make the nature of drm_sched_job_arm() in the docu as clear as
> > possible.
> > 
> > Suggested-by: Christian König <christian.koenig@amd.com>
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> 
> Reviewed-by: Christian König <christian.koenig@amd.com>

Thx

> 
> I'm currently looking into how to fix the amdgpu CS path for gang
> submission regarding this.
> 
> Any objections that I add a preload function to allocate the memory
> for the XA outside of the critical section?

I can't fully follow, you mean when creating the dependencies xarray?

I think in a perfect world we wouldn't have any more functions that
only have 1 user to work around driver-internal problems. That said, I
realize that the world isn't perfect. Would still be cool if every
function we add would be useful to >1 driver.

I think it's probably best if you just propose sth in form of a patch
and we discuss it there

Grüße
P.


> 
> Regards,
> Christian.
> 
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 24 ++++++++++++++++++----
> > --
> >  1 file changed, 18 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index 4d4219fbe49d..829579c41c6b 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -828,11 +828,15 @@ EXPORT_SYMBOL(drm_sched_job_init);
> >   *
> >   * This arms a scheduler job for execution. Specifically it
> > initializes the
> >   * &drm_sched_job.s_fence of @job, so that it can be attached to
> > struct dma_resv
> > - * or other places that need to track the completion of this job.
> > + * or other places that need to track the completion of this job.
> > It also
> > + * initializes sequence numbers, which are fundamental for fence
> > ordering.
> >   *
> >   * Refer to drm_sched_entity_push_job() documentation for locking
> >   * considerations.
> >   *
> > + * Once this function was called, you *must* submit @job with
> > + * drm_sched_entity_push_job().
> > + *
> >   * This can only be called if drm_sched_job_init() succeeded.
> >   */
> >  void drm_sched_job_arm(struct drm_sched_job *job)
> > @@ -1017,9 +1021,12 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
> >   * Drivers should call this from their error unwind code if @job
> > is aborted
> >   * before drm_sched_job_arm() is called.
> >   *
> > - * After that point of no return @job is committed to be executed
> > by the
> > - * scheduler, and this function should be called from the
> > - * &drm_sched_backend_ops.free_job callback.
> > + * drm_sched_job_arm() is a point of no return since it
> > initializes the fences
> > + * and their sequence number etc. Once that function has been
> > called, you *must*
> > + * submit it with drm_sched_entity_push_job() and cannot simply
> > abort it by
> > + * calling drm_sched_job_cleanup().
> > + *
> > + * This function should be called in the
> > &drm_sched_backend_ops.free_job callback.
> >   */
> >  void drm_sched_job_cleanup(struct drm_sched_job *job)
> >  {
> > @@ -1027,10 +1034,15 @@ void drm_sched_job_cleanup(struct
> > drm_sched_job *job)
> >  	unsigned long index;
> >  
> >  	if (kref_read(&job->s_fence->finished.refcount)) {
> > -		/* drm_sched_job_arm() has been called */
> > +		/* The job has been processed by the scheduler,
> > i.e.,
> > +		 * drm_sched_job_arm() and
> > drm_sched_entity_push_job() have
> > +		 * been called.
> > +		 */
> >  		dma_fence_put(&job->s_fence->finished);
> >  	} else {
> > -		/* aborted job before committing to run it */
> > +		/* The job was aborted before it has been
> > committed to be run;
> > +		 * notably, drm_sched_job_arm() has not been
> > called.
> > +		 */
> >  		drm_sched_fence_free(job->s_fence);
> >  	}
> >  
> 
Re: [PATCH] drm/sched: Clarify docu concerning drm_sched_job_arm()
Posted by Christian König 9 months, 1 week ago
Am 13.03.25 um 11:26 schrieb Philipp Stanner:
> On Thu, 2025-03-13 at 11:07 +0100, Christian König wrote:
>> Am 13.03.25 um 10:30 schrieb Philipp Stanner:
>>> The documentation for drm_sched_job_arm() and especially
>>> drm_sched_job_cleanup() does not make it very clear why
>>> drm_sched_job_arm() is a point of no return, which it indeed is.
>>>
>>> Make the nature of drm_sched_job_arm() in the docu as clear as
>>> possible.
>>>
>>> Suggested-by: Christian König <christian.koenig@amd.com>
>>> Signed-off-by: Philipp Stanner <phasta@kernel.org>
>> Reviewed-by: Christian König <christian.koenig@amd.com>
> Thx
>
>> I'm currently looking into how to fix the amdgpu CS path for gang
>> submission regarding this.
>>
>> Any objections that I add a preload function to allocate the memory
>> for the XA outside of the critical section?
> I can't fully follow, you mean when creating the dependencies xarray?
>
> I think in a perfect world we wouldn't have any more functions that
> only have 1 user to work around driver-internal problems. That said, I
> realize that the world isn't perfect. Would still be cool if every
> function we add would be useful to >1 driver.
>
> I think it's probably best if you just propose sth in form of a patch
> and we discuss it there

Ok I will give that a try.

It's tricky at best and I'm just now throwing away the second try to solve it.

Let's see if approach number three works.

Thanks,
Christian.

>
> Grüße
> P.
>
>
>> Regards,
>> Christian.
>>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 24 ++++++++++++++++++----
>>> --
>>>  1 file changed, 18 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 4d4219fbe49d..829579c41c6b 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -828,11 +828,15 @@ EXPORT_SYMBOL(drm_sched_job_init);
>>>   *
>>>   * This arms a scheduler job for execution. Specifically it
>>> initializes the
>>>   * &drm_sched_job.s_fence of @job, so that it can be attached to
>>> struct dma_resv
>>> - * or other places that need to track the completion of this job.
>>> + * or other places that need to track the completion of this job.
>>> It also
>>> + * initializes sequence numbers, which are fundamental for fence
>>> ordering.
>>>   *
>>>   * Refer to drm_sched_entity_push_job() documentation for locking
>>>   * considerations.
>>>   *
>>> + * Once this function was called, you *must* submit @job with
>>> + * drm_sched_entity_push_job().
>>> + *
>>>   * This can only be called if drm_sched_job_init() succeeded.
>>>   */
>>>  void drm_sched_job_arm(struct drm_sched_job *job)
>>> @@ -1017,9 +1021,12 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>>>   * Drivers should call this from their error unwind code if @job
>>> is aborted
>>>   * before drm_sched_job_arm() is called.
>>>   *
>>> - * After that point of no return @job is committed to be executed
>>> by the
>>> - * scheduler, and this function should be called from the
>>> - * &drm_sched_backend_ops.free_job callback.
>>> + * drm_sched_job_arm() is a point of no return since it
>>> initializes the fences
>>> + * and their sequence number etc. Once that function has been
>>> called, you *must*
>>> + * submit it with drm_sched_entity_push_job() and cannot simply
>>> abort it by
>>> + * calling drm_sched_job_cleanup().
>>> + *
>>> + * This function should be called in the
>>> &drm_sched_backend_ops.free_job callback.
>>>   */
>>>  void drm_sched_job_cleanup(struct drm_sched_job *job)
>>>  {
>>> @@ -1027,10 +1034,15 @@ void drm_sched_job_cleanup(struct
>>> drm_sched_job *job)
>>>  	unsigned long index;
>>>  
>>>  	if (kref_read(&job->s_fence->finished.refcount)) {
>>> -		/* drm_sched_job_arm() has been called */
>>> +		/* The job has been processed by the scheduler,
>>> i.e.,
>>> +		 * drm_sched_job_arm() and
>>> drm_sched_entity_push_job() have
>>> +		 * been called.
>>> +		 */
>>>  		dma_fence_put(&job->s_fence->finished);
>>>  	} else {
>>> -		/* aborted job before committing to run it */
>>> +		/* The job was aborted before it has been
>>> committed to be run;
>>> +		 * notably, drm_sched_job_arm() has not been
>>> called.
>>> +		 */
>>>  		drm_sched_fence_free(job->s_fence);
>>>  	}
>>>