drm/sched: Mandate usage of drm_sched_job_cleanup()

[PATCH] drm/sched: Mandate usage of drm_sched_job_cleanup()

Posted by Philipp Stanner 4 months, 2 weeks ago

drm_sched_job_cleanup()'s documentation so far uses relatively soft
language, only "recommending" usage of the function. To avoid memory
leaks and, potentiall, other bugs, however, the function has to be used.

Demand usage of the function explicitly.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 46119aacb809..0a9df9e61963 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1030,7 +1030,7 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
  *
  * Cleans up the resources allocated with drm_sched_job_init().
  *
- * Drivers should call this from their error unwind code if @job is aborted
+ * Drivers need to call this from their error unwind code if @job is aborted
  * before drm_sched_job_arm() is called.
  *
  * drm_sched_job_arm() is a point of no return since it initializes the fences
@@ -1038,7 +1038,7 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
  * submit it with drm_sched_entity_push_job() and cannot simply abort it by
  * calling drm_sched_job_cleanup().
  *
- * This function should be called in the &drm_sched_backend_ops.free_job callback.
+ * This function must be called in the &drm_sched_backend_ops.free_job callback.
  */
 void drm_sched_job_cleanup(struct drm_sched_job *job)
 {
-- 
2.49.0

Re: [PATCH] drm/sched: Mandate usage of drm_sched_job_cleanup()

Posted by Tvrtko Ursulin 3 months, 3 weeks ago

On 26/09/2025 13:36, Philipp Stanner wrote:
> drm_sched_job_cleanup()'s documentation so far uses relatively soft
> language, only "recommending" usage of the function. To avoid memory
> leaks and, potentiall, other bugs, however, the function has to be used.
> 
> Demand usage of the function explicitly.
> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 46119aacb809..0a9df9e61963 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1030,7 +1030,7 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>    *
>    * Cleans up the resources allocated with drm_sched_job_init().
>    *
> - * Drivers should call this from their error unwind code if @job is aborted
> + * Drivers need to call this from their error unwind code if @job is aborted

Should is indeed wrong. I think it could be better to go with "shall" or 
"must" though. Since those two are more standard language for this.

>    * before drm_sched_job_arm() is called.
>    *
>    * drm_sched_job_arm() is a point of no return since it initializes the fences
> @@ -1038,7 +1038,7 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>    * submit it with drm_sched_entity_push_job() and cannot simply abort it by
>    * calling drm_sched_job_cleanup().
>    *
> - * This function should be called in the &drm_sched_backend_ops.free_job callback.
> + * This function must be called in the &drm_sched_backend_ops.free_job callback.
>    */
>   void drm_sched_job_cleanup(struct drm_sched_job *job)
>   {

Hm, having read the thread so far the situation seems not so easy to 
untangle.

On one hand I see Matt's point that free_job callback is a bit 
misleadingly named and that there isn't anything really requiring 
drm_sched_job_cleanup() to be called exactly from there. Free_job 
semantics seem to be "job is done, *scheduler* is done with it".

Drm_sched_job_cleanup() already says it has to be called after the point 
of no return (arm) so it could be left at drivers discretion (de facto 
state today) to choose how and when to do it.

What I did not get is what is actually the perceived problem with 
letting this stay? (Ie. "should" indicating it is recommended, but not a 
must/shall.)

Regards,

Tvrtko

Re: [PATCH] drm/sched: Mandate usage of drm_sched_job_cleanup()

Posted by Philipp Stanner 3 months, 3 weeks ago

On Mon, 2025-10-13 at 13:20 +0100, Tvrtko Ursulin wrote:
> 
> On 26/09/2025 13:36, Philipp Stanner wrote:
> > drm_sched_job_cleanup()'s documentation so far uses relatively soft
> > language, only "recommending" usage of the function. To avoid memory
> > leaks and, potentiall, other bugs, however, the function has to be used.
> > 
> > Demand usage of the function explicitly.
> > 
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> >   drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 46119aacb809..0a9df9e61963 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1030,7 +1030,7 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
> >    *
> >    * Cleans up the resources allocated with drm_sched_job_init().
> >    *
> > - * Drivers should call this from their error unwind code if @job is aborted
> > + * Drivers need to call this from their error unwind code if @job is aborted
> 
> Should is indeed wrong. I think it could be better to go with "shall" or 
> "must" though. Since those two are more standard language for this.

"need to" is standard UK(?) English for "must to" I think.

But I'm fine with "must"

> 
> >    * before drm_sched_job_arm() is called.
> >    *
> >    * drm_sched_job_arm() is a point of no return since it initializes the fences
> > @@ -1038,7 +1038,7 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
> >    * submit it with drm_sched_entity_push_job() and cannot simply abort it by
> >    * calling drm_sched_job_cleanup().
> >    *
> > - * This function should be called in the &drm_sched_backend_ops.free_job callback.
> > + * This function must be called in the &drm_sched_backend_ops.free_job callback.
> >    */
> >   void drm_sched_job_cleanup(struct drm_sched_job *job)
> >   {
> 
> Hm, having read the thread so far the situation seems not so easy to 
> untangle.
> 
> On one hand I see Matt's point that free_job callback is a bit 
> misleadingly named and that there isn't anything really requiring 
> drm_sched_job_cleanup() to be called exactly from there. Free_job 
> semantics seem to be "job is done, *scheduler* is done with it".
> 
> Drm_sched_job_cleanup() already says it has to be called after the point 
> of no return (arm) so it could be left at drivers discretion (de facto 
> state today) to choose how and when to do it.
> 
> What I did not get is what is actually the perceived problem with 
> letting this stay? (Ie. "should" indicating it is recommended, but not a 
> must/shall.)

I think the most correct formulation with our current mess would be
"cleanup() must be called earliest in free_job(), or later (if there is
a very good reason for that)."

Question would be how to document that.

I still have that big life time docu patch pending. Let me dig into how
I could combine that..


P.

> 
> Regards,
> 
> Tvrtko
>

Re: [PATCH] drm/sched: Mandate usage of drm_sched_job_cleanup()

Posted by Tvrtko Ursulin 3 months, 3 weeks ago

On 14/10/2025 12:56, Philipp Stanner wrote:
> On Mon, 2025-10-13 at 13:20 +0100, Tvrtko Ursulin wrote:
>>
>> On 26/09/2025 13:36, Philipp Stanner wrote:
>>> drm_sched_job_cleanup()'s documentation so far uses relatively soft
>>> language, only "recommending" usage of the function. To avoid memory
>>> leaks and, potentiall, other bugs, however, the function has to be used.
>>>
>>> Demand usage of the function explicitly.
>>>
>>> Signed-off-by: Philipp Stanner <phasta@kernel.org>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 46119aacb809..0a9df9e61963 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1030,7 +1030,7 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>>>     *
>>>     * Cleans up the resources allocated with drm_sched_job_init().
>>>     *
>>> - * Drivers should call this from their error unwind code if @job is aborted
>>> + * Drivers need to call this from their error unwind code if @job is aborted
>>
>> Should is indeed wrong. I think it could be better to go with "shall" or
>> "must" though. Since those two are more standard language for this.
> 
> "need to" is standard UK(?) English for "must to" I think.
> 
> But I'm fine with "must"

Yeah will is standard English and IMO understandable. I was just 
thinking whether it would best to stick to the established "requirements 
language". For example:

https://www.nasa.gov/reference/appendix-c-how-to-write-a-good-requirement/

https://argondigital.com/blog/product-management/using-the-correct-terms-shall-will-should/

https://reqi.io/articles/shall-vs-should-vs-will-vs-must-whats-the-difference

But I am okay with need, your call.
>>>     * before drm_sched_job_arm() is called.
>>>     *
>>>     * drm_sched_job_arm() is a point of no return since it initializes the fences
>>> @@ -1038,7 +1038,7 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>>>     * submit it with drm_sched_entity_push_job() and cannot simply abort it by
>>>     * calling drm_sched_job_cleanup().
>>>     *
>>> - * This function should be called in the &drm_sched_backend_ops.free_job callback.
>>> + * This function must be called in the &drm_sched_backend_ops.free_job callback.
>>>     */
>>>    void drm_sched_job_cleanup(struct drm_sched_job *job)
>>>    {
>>
>> Hm, having read the thread so far the situation seems not so easy to
>> untangle.
>>
>> On one hand I see Matt's point that free_job callback is a bit
>> misleadingly named and that there isn't anything really requiring
>> drm_sched_job_cleanup() to be called exactly from there. Free_job
>> semantics seem to be "job is done, *scheduler* is done with it".
>>
>> Drm_sched_job_cleanup() already says it has to be called after the point
>> of no return (arm) so it could be left at drivers discretion (de facto
>> state today) to choose how and when to do it.
>>
>> What I did not get is what is actually the perceived problem with
>> letting this stay? (Ie. "should" indicating it is recommended, but not a
>> must/shall.)
> 
> I think the most correct formulation with our current mess would be
> "cleanup() must be called earliest in free_job(), or later (if there is
> a very good reason for that)."
> 
> Question would be how to document that.
> 
> I still have that big life time docu patch pending. Let me dig into how
> I could combine that..

Ok.

Regards,

Tvrtko

Re: [PATCH] drm/sched: Mandate usage of drm_sched_job_cleanup()

Posted by Matthew Brost 4 months, 1 week ago

On Fri, Sep 26, 2025 at 02:36:31PM +0200, Philipp Stanner wrote:
> drm_sched_job_cleanup()'s documentation so far uses relatively soft
> language, only "recommending" usage of the function. To avoid memory
> leaks and, potentiall, other bugs, however, the function has to be used.
> 
> Demand usage of the function explicitly.
> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 46119aacb809..0a9df9e61963 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1030,7 +1030,7 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>   *
>   * Cleans up the resources allocated with drm_sched_job_init().
>   *
> - * Drivers should call this from their error unwind code if @job is aborted
> + * Drivers need to call this from their error unwind code if @job is aborted
>   * before drm_sched_job_arm() is called.
>   *
>   * drm_sched_job_arm() is a point of no return since it initializes the fences
> @@ -1038,7 +1038,7 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>   * submit it with drm_sched_entity_push_job() and cannot simply abort it by
>   * calling drm_sched_job_cleanup().
>   *
> - * This function should be called in the &drm_sched_backend_ops.free_job callback.
> + * This function must be called in the &drm_sched_backend_ops.free_job callback.

I believe both the existing documentation and the new version are
incorrect.

Also, the free_job callback is probably misnamed. This ties into a
statement made during your XDC presentation today that confused me. The
DRM scheduler only holds a reference to the driver-allocated job from
the time run_job is called (i.e., once drm_sched_job_arm is invoked)
until the job’s fence signals. It does not own the responsibility of
freeing the job — the driver should manage the job’s reference count and
free it as appropriate.

For most drivers, the only reference is the creation reference, which is
transferred to the DRM scheduler after drm_sched_job_arm and released in
free_job.

So, I think what really needs to be documented is this reference
counting transfer. drm_sched_job_cleanup is actually associated with
drm_sched_job_init — meaning drm_sched_job_cleanup must be called if
drm_sched_job_init was called. This is typically handled by the driver
through reference counting of the job.

Finally, I think free_job should probably be renamed to something like
sched_job_put to better reflect its purpose.

We can discuss this more in person on Wednesday at the workshop or grab
in the hallway.

Matt

>   */
>  void drm_sched_job_cleanup(struct drm_sched_job *job)
>  {
> -- 
> 2.49.0
>

Re: [PATCH] drm/sched: Mandate usage of drm_sched_job_cleanup()

Posted by Philipp Stanner 4 months ago

On Mon, 2025-09-29 at 10:12 -0700, Matthew Brost wrote:
> On Fri, Sep 26, 2025 at 02:36:31PM +0200, Philipp Stanner wrote:
> > drm_sched_job_cleanup()'s documentation so far uses relatively soft
> > language, only "recommending" usage of the function. To avoid memory
> > leaks and, potentiall, other bugs, however, the function has to be used.
> > 
> > Demand usage of the function explicitly.
> > 
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 46119aacb809..0a9df9e61963 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1030,7 +1030,7 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
> >   *
> >   * Cleans up the resources allocated with drm_sched_job_init().
> >   *
> > - * Drivers should call this from their error unwind code if @job is aborted
> > + * Drivers need to call this from their error unwind code if @job is aborted
> >   * before drm_sched_job_arm() is called.
> >   *
> >   * drm_sched_job_arm() is a point of no return since it initializes the fences
> > @@ -1038,7 +1038,7 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
> >   * submit it with drm_sched_entity_push_job() and cannot simply abort it by
> >   * calling drm_sched_job_cleanup().
> >   *
> > - * This function should be called in the &drm_sched_backend_ops.free_job callback.
> > + * This function must be called in the &drm_sched_backend_ops.free_job callback.
> 
> I believe both the existing documentation and the new version are
> incorrect.

Feel free to submit a proposal.

What's important for this patch is: do you agree that cleanup_job()
must be called in free_job()?

> 
> Also, the free_job callback is probably misnamed. This ties into a
> statement made during your XDC presentation today that confused me. The
> DRM scheduler only holds a reference to the driver-allocated job from
> the time run_job is called (i.e., once drm_sched_job_arm is invoked)
> until the job’s fence signals. It does not own the responsibility of
> freeing the job — the driver should manage the job’s reference count and
> free it as appropriate.
> 
> For most drivers, the only reference is the creation reference, which is
> transferred to the DRM scheduler after drm_sched_job_arm and released in
> free_job.
> 
> So, I think what really needs to be documented is this reference
> counting transfer. drm_sched_job_cleanup is actually associated with
> drm_sched_job_init — meaning drm_sched_job_cleanup must be called if
> drm_sched_job_init was called. This is typically handled by the driver
> through reference counting of the job.
> 
> Finally, I think free_job should probably be renamed to something like
> sched_job_put to better reflect its purpose.

That's maybe how free_job() was intended once, but it's not how it's
implemented in some drivers currently. The implementation is fractured.
Some drivers maybe have refcounted jobs, some others don't.

That's exactly where our problems come from. free_job() and the
scheduler de facto manage job lifetime.

Moreover, I'm wondering what good it would even be to have everything
refcounted and then have the scheduler tell the driver to drop its
refcount.. that sounds fundamentally broken to me. It's not the purpose
of refcounting.
The idea is that everyone holds his reference until he doesn't need it
anymore, and then drops it.


I think we should leave free_job() as is, agree that having it in the
first place was a fundamental design mistake, and go on and do better
with future job-managing infrastructure.


P.

> 
> We can discuss this more in person on Wednesday at the workshop or grab
> in the hallway.
> 
> Matt
> 
> >   */
> >  void drm_sched_job_cleanup(struct drm_sched_job *job)
> >  {
> > -- 
> > 2.49.0
> >

Re: [PATCH] drm/sched: Mandate usage of drm_sched_job_cleanup()

Posted by Matthew Brost 4 months, 1 week ago

On Mon, Sep 29, 2025 at 10:12:31AM -0700, Matthew Brost wrote:
> On Fri, Sep 26, 2025 at 02:36:31PM +0200, Philipp Stanner wrote:
> > drm_sched_job_cleanup()'s documentation so far uses relatively soft
> > language, only "recommending" usage of the function. To avoid memory
> > leaks and, potentiall, other bugs, however, the function has to be used.
> > 
> > Demand usage of the function explicitly.
> > 
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 46119aacb809..0a9df9e61963 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1030,7 +1030,7 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
> >   *
> >   * Cleans up the resources allocated with drm_sched_job_init().
> >   *
> > - * Drivers should call this from their error unwind code if @job is aborted
> > + * Drivers need to call this from their error unwind code if @job is aborted
> >   * before drm_sched_job_arm() is called.
> >   *
> >   * drm_sched_job_arm() is a point of no return since it initializes the fences
> > @@ -1038,7 +1038,7 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
> >   * submit it with drm_sched_entity_push_job() and cannot simply abort it by
> >   * calling drm_sched_job_cleanup().
> >   *
> > - * This function should be called in the &drm_sched_backend_ops.free_job callback.
> > + * This function must be called in the &drm_sched_backend_ops.free_job callback.
> 
> I believe both the existing documentation and the new version are
> incorrect.
> 
> Also, the free_job callback is probably misnamed. This ties into a
> statement made during your XDC presentation today that confused me. The
> DRM scheduler only holds a reference to the driver-allocated job from
> the time run_job is called (i.e., once drm_sched_job_arm is invoked)
> until the job’s fence signals. It does not own the responsibility of
> freeing the job — the driver should manage the job’s reference count and
> free it as appropriate.
> 
> For most drivers, the only reference is the creation reference, which is
> transferred to the DRM scheduler after drm_sched_job_arm and released in
> free_job.
> 
> So, I think what really needs to be documented is this reference
> counting transfer. drm_sched_job_cleanup is actually associated with
> drm_sched_job_init — meaning drm_sched_job_cleanup must be called if
> drm_sched_job_init was called. This is typically handled by the driver
> through reference counting of the job.
> 
> Finally, I think free_job should probably be renamed to something like
> sched_job_put to better reflect its purpose.

Jetlagged - s/sched_job_put/put_job/

Matt

> 
> We can discuss this more in person on Wednesday at the workshop or grab
> in the hallway.
> 
> Matt
> 
> >   */
> >  void drm_sched_job_cleanup(struct drm_sched_job *job)
> >  {
> > -- 
> > 2.49.0
> >