[PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long

Philipp Stanner posted 6 patches 7 months, 3 weeks ago
Posted by Philipp Stanner 7 months, 3 weeks ago
The waitqueue that ensures that drm_sched_fini() blocks until the
pending_list has become empty could theoretically cause that function to
block for a very long time. That, ultimately, could block userspace
processes and prevent them from being killable through SIGKILL.

When a driver calls drm_sched_fini(), it is safe to assume that all
still pending jobs are not needed anymore anyways. Thus, they can be
cancelled and thereby it can be ensured that drm_sched_fini() will
return relatively quickly.

Implement a new helper to stop all work items / submission except for
the drm_sched_backend_ops.run_job().

Implement a driver callback, kill_fence_context(), that instructs the
driver to kill the fence context associated with this scheduler, thereby
causing all pending hardware fences to be signalled.

Call those new routines in drm_sched_fini() and ensure backwards
compatibility if the new callback is not implemented.

Suggested-by: Danilo Krummrich <dakr@redhat.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 47 +++++++++++++++++---------
 include/drm/gpu_scheduler.h            | 11 ++++++
 2 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index ea82e69a72a8..c2ad6c70bfb6 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1390,31 +1390,46 @@ drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
 	return empty;
 }
 
+/**
+ * drm_sched_cancel_jobs_and_wait - trigger freeing of all pending jobs
+ * @sched: scheduler instance
+ *
+ * Must only be called if &struct drm_sched_backend_ops.kill_fence_context is
+ * implemented.
+ *
+ * Instructs the driver to kill the fence context associated with this scheduler,
+ * thereby signalling all pending fences. This, in turn, will trigger
+ * &struct drm_sched_backend_ops.free_job to be called for all pending jobs.
+ * The function then blocks until all pending jobs have been freed.
+ */
+static inline void
+drm_sched_cancel_jobs_and_wait(struct drm_gpu_scheduler *sched)
+{
+	sched->ops->kill_fence_context(sched);
+	wait_event(sched->pending_list_waitque, drm_sched_no_jobs_pending(sched));
+}
+
 /**
  * drm_sched_fini - Destroy a gpu scheduler
  *
  * @sched: scheduler instance
  *
- * Tears down and cleans up the scheduler.
- *
- * Note that this function blocks until all the fences returned by
- * &struct drm_sched_backend_ops.run_job have been signalled.
+ * Tears down and cleans up the scheduler. Might leak memory if
+ * &struct drm_sched_backend_ops.kill_fence_context is not implemented.
  */
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
 	int i;
 
-	/*
-	 * Jobs that have neither been scheduled or which have timed out are
-	 * gone by now, but jobs that have been submitted through
-	 * backend_ops.run_job() and have not yet terminated are still pending.
-	 *
-	 * Wait for the pending_list to become empty to avoid leaking those jobs.
-	 */
-	drm_sched_submission_and_timeout_stop(sched);
-	wait_event(sched->pending_list_waitque, drm_sched_no_jobs_pending(sched));
-	drm_sched_free_stop(sched);
+	if (sched->ops->kill_fence_context) {
+		drm_sched_submission_and_timeout_stop(sched);
+		drm_sched_cancel_jobs_and_wait(sched);
+		drm_sched_free_stop(sched);
+	} else {
+		/* We're in "legacy free-mode" and ignore potential mem leaks */
+		drm_sched_wqueue_stop(sched);
+	}
 
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 		struct drm_sched_rq *rq = sched->sched_rq[i];
@@ -1502,7 +1517,7 @@ bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched)
 EXPORT_SYMBOL(drm_sched_wqueue_ready);
 
 /**
- * drm_sched_wqueue_stop - stop scheduler submission
+ * drm_sched_wqueue_stop - stop scheduler submission and freeing
  * @sched: scheduler instance
  *
  * Stops the scheduler from pulling new jobs from entities. It also stops
@@ -1518,7 +1533,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched)
 EXPORT_SYMBOL(drm_sched_wqueue_stop);
 
 /**
- * drm_sched_wqueue_start - start scheduler submission
+ * drm_sched_wqueue_start - start scheduler submission and freeing
  * @sched: scheduler instance
  *
  * Restarts the scheduler after drm_sched_wqueue_stop() has stopped it.
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index d0b1f416b4d9..8630b4a26f10 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -509,6 +509,17 @@ struct drm_sched_backend_ops {
          * and it's time to clean it up.
 	 */
 	void (*free_job)(struct drm_sched_job *sched_job);
+
+	/**
+	 * @kill_fence_context: kill the fence context belonging to this scheduler
+	 *
+	 * Needed to cleanly tear the scheduler down in drm_sched_fini(). This
+	 * callback will cause all hardware fences to be signalled by the driver,
+	 * which, ultimately, ensures that all jobs get freed before teardown.
+	 *
+	 * This callback is optional, but it is highly recommended to implement it.
+	 */
+	void (*kill_fence_context)(struct drm_gpu_scheduler *sched);
 };
 
 /**
-- 
2.48.1
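
For illustration, a driver-side implementation of the proposed
kill_fence_context() callback could look roughly like the sketch below.
struct my_ring, struct my_hw_fence, the in_flight list and its lock are
hypothetical driver structures made up for this example; only the dma_fence,
list and spinlock helpers are existing kernel APIs.

/*
 * Hypothetical sketch of a driver-side kill_fence_context() implementation.
 * struct my_ring, struct my_hw_fence and the in_flight list are invented
 * for illustration and are not part of this patch.
 */
#include <linux/dma-fence.h>
#include <linux/errno.h>
#include <linux/list.h>
#include <linux/spinlock.h>

#include <drm/gpu_scheduler.h>

struct my_hw_fence {
	struct dma_fence base;
	struct list_head node;
};

struct my_ring {
	struct drm_gpu_scheduler sched;	/* scheduler embedded in the ring */
	spinlock_t fence_lock;		/* protects in_flight */
	struct list_head in_flight;	/* unsignalled hardware fences returned by run_job() */
};

static void my_ring_kill_fence_context(struct drm_gpu_scheduler *sched)
{
	struct my_ring *ring = container_of(sched, struct my_ring, sched);
	struct my_hw_fence *fence, *tmp;
	unsigned long flags;
	LIST_HEAD(cancelled);

	/* Driver-specific: stop the hardware from processing the ring here. */

	/* Take the pending hardware fences off the ring under the lock ... */
	spin_lock_irqsave(&ring->fence_lock, flags);
	list_splice_init(&ring->in_flight, &cancelled);
	spin_unlock_irqrestore(&ring->fence_lock, flags);

	/* ... and signal them with an error, without blocking. */
	list_for_each_entry_safe(fence, tmp, &cancelled, node) {
		list_del_init(&fence->node);
		dma_fence_set_error(&fence->base, -ECANCELED);
		dma_fence_signal(&fence->base);
	}
}

static const struct drm_sched_backend_ops my_ring_sched_ops = {
	/* .run_job, .timedout_job and .free_job as before ... */
	.kill_fence_context = my_ring_kill_fence_context,
};

How the hardware is halted is of course driver specific; the only point of
the sketch is that every fence the scheduler still waits on gets signalled
promptly and with an error, so drm_sched_fini() can return.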
Re: [PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long
Posted by Tvrtko Ursulin 7 months ago
On 24/04/2025 10:55, Philipp Stanner wrote:
> The waitqueue that ensures that drm_sched_fini() blocks until the
> pending_list has become empty could theoretically cause that function to
> block for a very long time. That, ultimately, could block userspace
> procesess and prevent them from being killable through SIGKILL.
> 
> When a driver calls drm_sched_fini(), it is safe to assume that all
> still pending jobs are not needed anymore anyways. Thus, they can be
> cancelled and thereby it can be ensured that drm_sched_fini() will
> return relatively quickly.
> 
> Implement a new helper to stop all work items / submission except for
> the drm_sched_backend_ops.run_job().
> 
> Implement a driver callback, kill_fence_context(), that instructs the
> driver to kill the fence context associated with this scheduler, thereby
> causing all pending hardware fences to be signalled.
> 
> Call those new routines in drm_sched_fini() and ensure backwards
> compatibility if the new callback is not implemented.
> 
> Suggested-by: Danilo Krummrich <dakr@redhat.com>
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 47 +++++++++++++++++---------
>   include/drm/gpu_scheduler.h            | 11 ++++++
>   2 files changed, 42 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index ea82e69a72a8..c2ad6c70bfb6 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1390,31 +1390,46 @@ drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
>   	return empty;
>   }
>   
> +/**
> + * drm_sched_cancel_jobs_and_wait - trigger freeing of all pending jobs
> + * @sched: scheduler instance
> + *
> + * Must only be called if &struct drm_sched_backend_ops.kill_fence_context is
> + * implemented.
> + *
> + * Instructs the driver to kill the fence context associated with this scheduler,
> + * thereby signalling all pending fences. This, in turn, will trigger
> + * &struct drm_sched_backend_ops.free_job to be called for all pending jobs.
> + * The function then blocks until all pending jobs have been freed.
> + */
> +static inline void
> +drm_sched_cancel_jobs_and_wait(struct drm_gpu_scheduler *sched)
> +{
> +	sched->ops->kill_fence_context(sched);
> +	wait_event(sched->pending_list_waitque, drm_sched_no_jobs_pending(sched));
> +}
> +
>   /**
>    * drm_sched_fini - Destroy a gpu scheduler
>    *
>    * @sched: scheduler instance
>    *
> - * Tears down and cleans up the scheduler.
> - *
> - * Note that this function blocks until all the fences returned by
> - * &struct drm_sched_backend_ops.run_job have been signalled.
> + * Tears down and cleans up the scheduler. Might leak memory if
> + * &struct drm_sched_backend_ops.kill_fence_context is not implemented.
>    */
>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>   {
>   	struct drm_sched_entity *s_entity;
>   	int i;
>   
> -	/*
> -	 * Jobs that have neither been scheduled or which have timed out are
> -	 * gone by now, but jobs that have been submitted through
> -	 * backend_ops.run_job() and have not yet terminated are still pending.
> -	 *
> -	 * Wait for the pending_list to become empty to avoid leaking those jobs.
> -	 */
> -	drm_sched_submission_and_timeout_stop(sched);
> -	wait_event(sched->pending_list_waitque, drm_sched_no_jobs_pending(sched));
> -	drm_sched_free_stop(sched);
> +	if (sched->ops->kill_fence_context) {
> +		drm_sched_submission_and_timeout_stop(sched);
> +		drm_sched_cancel_jobs_and_wait(sched);
> +		drm_sched_free_stop(sched);
> +	} else {
> +		/* We're in "legacy free-mode" and ignore potential mem leaks */
> +		drm_sched_wqueue_stop(sched);
> +	}
>   
>   	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>   		struct drm_sched_rq *rq = sched->sched_rq[i];
> @@ -1502,7 +1517,7 @@ bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched)
>   EXPORT_SYMBOL(drm_sched_wqueue_ready);
>   
>   /**
> - * drm_sched_wqueue_stop - stop scheduler submission
> + * drm_sched_wqueue_stop - stop scheduler submission and freeing

Looks like the kerneldoc corrections (below too) belong to the previous 
patch. Irrelevant if you decide to squash them though.

>    * @sched: scheduler instance
>    *
>    * Stops the scheduler from pulling new jobs from entities. It also stops
> @@ -1518,7 +1533,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched)
>   EXPORT_SYMBOL(drm_sched_wqueue_stop);
>   
>   /**
> - * drm_sched_wqueue_start - start scheduler submission
> + * drm_sched_wqueue_start - start scheduler submission and freeing
>    * @sched: scheduler instance
>    *
>    * Restarts the scheduler after drm_sched_wqueue_stop() has stopped it.
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index d0b1f416b4d9..8630b4a26f10 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -509,6 +509,17 @@ struct drm_sched_backend_ops {
>            * and it's time to clean it up.
>   	 */
>   	void (*free_job)(struct drm_sched_job *sched_job);
> +
> +	/**
> +	 * @kill_fence_context: kill the fence context belonging to this scheduler

Which fence context would that be? ;)

Also, "fence context" would be a new terminology in gpu_scheduler.h API 
level. You could call it ->sched_fini() or similar to signify at which 
point in the API it gets called and then the fact it takes sched as 
parameter would be natural.

We also probably want some commentary on the topic of indefinite (or 
very long at least) blocking a thread exit / SIGINT/TERM/KILL time. 
Cover letter touches upon that problem but I don't see you address it. 
Is the idea to let drivers shoot themselves in the foot or what?

Regards,

Tvrtko

> +	 *
> +	 * Needed to cleanly tear the scheduler down in drm_sched_fini(). This
> +	 * callback will cause all hardware fences to be signalled by the driver,
> +	 * which, ultimately, ensures that all jobs get freed before teardown.
> +	 *
> +	 * This callback is optional, but it is highly recommended to implement it.
> +	 */
> +	void (*kill_fence_context)(struct drm_gpu_scheduler *sched);
>   };
>   
>   /**
Re: [PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long
Posted by Danilo Krummrich 7 months ago
On Fri, May 16, 2025 at 10:33:30AM +0100, Tvrtko Ursulin wrote:
> On 24/04/2025 10:55, Philipp Stanner wrote:
> > +	 * @kill_fence_context: kill the fence context belonging to this scheduler
> 
> Which fence context would that be? ;)

There's one per ring and a scheduler instance represents a single ring. So,
what should be specified here?

> Also, "fence context" would be a new terminology in gpu_scheduler.h API
> level. You could call it ->sched_fini() or similar to signify at which point
> in the API it gets called and then the fact it takes sched as parameter
> would be natural.

The driver should tear down the fence context in this callback, not the while
scheduler. ->sched_fini() would hence be misleading.

> We also probably want some commentary on the topic of indefinite (or very
> long at least) blocking a thread exit / SIGINT/TERM/KILL time.

You mean in case the driver does implement the callback, but does *not* properly
tear down the fence context? So, you ask for describing potential consequences
of drivers having bugs in the implementation of the callback? Or something else?

> Is the idea to let drivers shoot themselves in the foot or what?

Please abstain from such rhetorical questions, that's not a good way of having
technical discussions.

- Danilo
Re: [PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long
Posted by Tvrtko Ursulin 7 months ago
On 16/05/2025 10:53, Danilo Krummrich wrote:
> On Fri, May 16, 2025 at 10:33:30AM +0100, Tvrtko Ursulin wrote:
>> On 24/04/2025 10:55, Philipp Stanner wrote:
>>> +	 * @kill_fence_context: kill the fence context belonging to this scheduler
>>
>> Which fence context would that be? ;)
> 
> There's one one per ring and a scheduler instance represents a single ring. So,
> what should be specified here?

I was pointing out the fact not all drivers are 1:1 sched:entity. So 
plural at least. Thought it would be obvious from the ";)".

>> Also, "fence context" would be a new terminology in gpu_scheduler.h API
>> level. You could call it ->sched_fini() or similar to signify at which point
>> in the API it gets called and then the fact it takes sched as parameter
>> would be natural.
> 
> The driver should tear down the fence context in this callback, not the while
> scheduler. ->sched_fini() would hence be misleading.

Not the while what? Not while drm_sched_fini()? Could call it 
sched_kill() or anything. My point is that we don't have "fence context" 
in the API but entities so adding a new term sounds sub-optimal.

>> We also probably want some commentary on the topic of indefinite (or very
>> long at least) blocking a thread exit / SIGINT/TERM/KILL time.
> 
> You mean in case the driver does implement the callback, but does *not* properly
> tear down the fence context? So, you ask for describing potential consequences
> of drivers having bugs in the implementation of the callback? Or something else?

I was proposing the kerneldoc for the vfunc should document the callback 
must not block, or if blocking is unavoidable, either document a 
guideline on how long is acceptable. Maybe even enforce a limit in the 
scheduler core itself.

Regards,

Tvrtko
Re: [PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long
Posted by Danilo Krummrich 7 months ago
On Fri, May 16, 2025 at 11:19:50AM +0100, Tvrtko Ursulin wrote:
> 
> On 16/05/2025 10:53, Danilo Krummrich wrote:
> > On Fri, May 16, 2025 at 10:33:30AM +0100, Tvrtko Ursulin wrote:
> > > On 24/04/2025 10:55, Philipp Stanner wrote:
> > > > +	 * @kill_fence_context: kill the fence context belonging to this scheduler
> > > 
> > > Which fence context would that be? ;)
> > 
> > There's one one per ring and a scheduler instance represents a single ring. So,
> > what should be specified here?
> 
> I was pointing out the fact not all drivers are 1:1 sched:entity.

I'm well aware, but how is that relevant? Entities don't have an associated
fence context, but a GPU Ring (either hardware or software) has, which a
scheduler instance represents.

> Thought it would be obvious from the ";)".

I should read from ";)" that you refer to a 1:N-sched:entity relationship (which
doesn't seem to be related)?

> > > Also, "fence context" would be a new terminology in gpu_scheduler.h API
> > > level. You could call it ->sched_fini() or similar to signify at which point
> > > in the API it gets called and then the fact it takes sched as parameter
> > > would be natural.
> > 
> > The driver should tear down the fence context in this callback, not the while
> > scheduler. ->sched_fini() would hence be misleading.
> 
> Not the while what? Not while drm_sched_fini()?

*whole

> Could call it sched_kill()
> or anything. My point is that we dont' have "fence context" in the API but
> entities so adding a new term sounds sub-optimal.

In the callback the driver should neither tear down an entity, nor the whole
scheduler, hence we shouldn't call it like that. sched_kill() is therefore
misleading as well.

It should be named after what it actually does (or should do). Feel free to
propose a different name that conforms with that.

> > > We also probably want some commentary on the topic of indefinite (or very
> > > long at least) blocking a thread exit / SIGINT/TERM/KILL time.
> > 
> > You mean in case the driver does implement the callback, but does *not* properly
> > tear down the fence context? So, you ask for describing potential consequences
> > of drivers having bugs in the implementation of the callback? Or something else?
> 
> I was proposing the kerneldoc for the vfunc should document the callback
> must not block, or if blocking is unavoidable, either document a guideline
> on how long is acceptable. Maybe even enforce a limit in the scheduler core
> itself.

Killing the fence context shouldn't block.
Re: [PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long
Posted by Tvrtko Ursulin 7 months ago
On 16/05/2025 11:54, Danilo Krummrich wrote:
> On Fri, May 16, 2025 at 11:19:50AM +0100, Tvrtko Ursulin wrote:
>>
>> On 16/05/2025 10:53, Danilo Krummrich wrote:
>>> On Fri, May 16, 2025 at 10:33:30AM +0100, Tvrtko Ursulin wrote:
>>>> On 24/04/2025 10:55, Philipp Stanner wrote:
>>>>> +	 * @kill_fence_context: kill the fence context belonging to this scheduler
>>>>
>>>> Which fence context would that be? ;)
>>>
>>> There's one one per ring and a scheduler instance represents a single ring. So,
>>> what should be specified here?
>>
>> I was pointing out the fact not all drivers are 1:1 sched:entity.
> 
> I'm well aware, but how is that relevant? Entities don't have an associated
> fence context, but a GPU Ring (either hardware or software) has, which a
> scheduler instance represents.

Aha! Well.. how it is relevant and do entities not have an associated 
fence context? Well, entity->fence_context.. that was my first 
association this whole time. It never crossed my mind that this is talking 
about the hardware fence context. Proof in the pudding that the naming 
should be improved.

But I also don't think there is a requirement for fences returned from 
->run_job() to have a single context. Which again makes it not the best 
naming.

>> Thought it would be obvious from the ";)".
> 
> I should read from ";)" that you refer to a 1:N-sched:entity relationship (which
> doesn't seem to be related)?
> 
>>>> Also, "fence context" would be a new terminology in gpu_scheduler.h API
>>>> level. You could call it ->sched_fini() or similar to signify at which point
>>>> in the API it gets called and then the fact it takes sched as parameter
>>>> would be natural.
>>>
>>> The driver should tear down the fence context in this callback, not the while
>>> scheduler. ->sched_fini() would hence be misleading.
>>
>> Not the while what? Not while drm_sched_fini()?
> 
> *whole
> 
>> Could call it sched_kill()
>> or anything. My point is that we dont' have "fence context" in the API but
>> entities so adding a new term sounds sub-optimal.
> 
> In the callback the driver should neither tear down an entity, nor the whole
> scheduler, hence we shouldn't call it like that. sched_kill() is therefore
> misleading as well.

  ->sched_exit()? ->sched_stop()? ->sched_cleanup()?

> It should be named after what it actually does (or should do). Feel free to
> propose a different name that conforms with that.
> 
>>>> We also probably want some commentary on the topic of indefinite (or very
>>>> long at least) blocking a thread exit / SIGINT/TERM/KILL time.
>>>
>>> You mean in case the driver does implement the callback, but does *not* properly
>>> tear down the fence context? So, you ask for describing potential consequences
>>> of drivers having bugs in the implementation of the callback? Or something else?
>>
>> I was proposing the kerneldoc for the vfunc should document the callback
>> must not block, or if blocking is unavoidable, either document a guideline
>> on how long is acceptable. Maybe even enforce a limit in the scheduler core
>> itself.
> 
> Killing the fence context shouldn't block.

Cool. And maybe convert the wait_event to wait_event_timeout with a 
warning to be robust.

Mind you, I still think a solution which does not add new state 
machinery should be preferred.

Regards,

Tvrtko
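
To make the wait_event_timeout() suggestion above concrete, the helper from
the patch could be changed along the following lines. This is only a sketch;
the 10 second limit and the dev_warn() message are arbitrary examples, not
something proposed in this series.

static inline void
drm_sched_cancel_jobs_and_wait(struct drm_gpu_scheduler *sched)
{
	sched->ops->kill_fence_context(sched);

	/* Example timeout only; an acceptable limit would need to be agreed on. */
	if (!wait_event_timeout(sched->pending_list_waitque,
				drm_sched_no_jobs_pending(sched),
				10 * HZ))
		dev_warn(sched->dev,
			 "%s: timed out waiting for pending jobs to be freed\n",
			 __func__);
}

wait_event_timeout() returns 0 only if the condition is still false once the
timeout has elapsed, so the warning fires exactly in the case described above
while the common case stays unchanged.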
Re: [PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long
Posted by Danilo Krummrich 7 months ago
On Fri, May 16, 2025 at 12:35:52PM +0100, Tvrtko Ursulin wrote:
> 
> On 16/05/2025 11:54, Danilo Krummrich wrote:
> > On Fri, May 16, 2025 at 11:19:50AM +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 16/05/2025 10:53, Danilo Krummrich wrote:
> > > > On Fri, May 16, 2025 at 10:33:30AM +0100, Tvrtko Ursulin wrote:
> > > > > On 24/04/2025 10:55, Philipp Stanner wrote:
> > > > > > +	 * @kill_fence_context: kill the fence context belonging to this scheduler
> > > > > 
> > > > > Which fence context would that be? ;)
> > > > 
> > > > There's one one per ring and a scheduler instance represents a single ring. So,
> > > > what should be specified here?
> > > 
> > > I was pointing out the fact not all drivers are 1:1 sched:entity.
> > 
> > I'm well aware, but how is that relevant? Entities don't have an associated
> > fence context, but a GPU Ring (either hardware or software) has, which a
> > scheduler instance represents.
> 
> Aha! Well.. how it is relevant and do entities not have an associated fence
> context? Well, entity->fence_context.. that was my first association this
> whole time. Never it crossed my mind this is talking about the hardware
> fence context. Proof in the pudding naming should be improved.

It says "fence context belonging to this scheduler", which should be
unambiguous, however I agree that we could mark out the difference even more.

> But I also don't think there is a requirement for fences returned from
> ->run_job() to have a single context. Which again makes it not the best
> naming.

It's implied by the fact that a scheduler instance represents a ring. Having
multiple fence contexts per ring doesn't make any sense.

But it's indeed not written down -- we should do that then.

> > In the callback the driver should neither tear down an entity, nor the whole
> > scheduler, hence we shouldn't call it like that. sched_kill() is therefore
> > misleading as well.
> 
>  ->sched_exit()? ->sched_stop()? ->sched_cleanup()?

I think this all would throw up questions like "What does {exit,stop,cleanup}
mean in this context?". And the answer would be "kill the fence context of the
ring represented by the scheduler".

I think we want a name that represents that without an indirection that we have
to define.

> > It should be named after what it actually does (or should do). Feel free to
> > propose a different name that conforms with that.
> > 
> > > > > We also probably want some commentary on the topic of indefinite (or very
> > > > > long at least) blocking a thread exit / SIGINT/TERM/KILL time.
> > > > 
> > > > You mean in case the driver does implement the callback, but does *not* properly
> > > > tear down the fence context? So, you ask for describing potential consequences
> > > > of drivers having bugs in the implementation of the callback? Or something else?
> > > 
> > > I was proposing the kerneldoc for the vfunc should document the callback
> > > must not block, or if blocking is unavoidable, either document a guideline
> > > on how long is acceptable. Maybe even enforce a limit in the scheduler core
> > > itself.
> > 
> > Killing the fence context shouldn't block.
> 
> Cool. And maybe convert the wait_event to wait_event_timeout with a warning
> to be robust.

That would make sense if it could deadlock, but even if the driver does nothing
it should terminate eventually. The rule that we always rely on is that we
guarantee throughout the kernel that fences are signalled eventually.
Re: [PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long
Posted by Tvrtko Ursulin 7 months ago
On 16/05/2025 13:00, Danilo Krummrich wrote:
> On Fri, May 16, 2025 at 12:35:52PM +0100, Tvrtko Ursulin wrote:
>>
>> On 16/05/2025 11:54, Danilo Krummrich wrote:
>>> On Fri, May 16, 2025 at 11:19:50AM +0100, Tvrtko Ursulin wrote:
>>>>
>>>> On 16/05/2025 10:53, Danilo Krummrich wrote:
>>>>> On Fri, May 16, 2025 at 10:33:30AM +0100, Tvrtko Ursulin wrote:
>>>>>> On 24/04/2025 10:55, Philipp Stanner wrote:
>>>>>>> +	 * @kill_fence_context: kill the fence context belonging to this scheduler
>>>>>>
>>>>>> Which fence context would that be? ;)
>>>>>
>>>>> There's one one per ring and a scheduler instance represents a single ring. So,
>>>>> what should be specified here?
>>>>
>>>> I was pointing out the fact not all drivers are 1:1 sched:entity.
>>>
>>> I'm well aware, but how is that relevant? Entities don't have an associated
>>> fence context, but a GPU Ring (either hardware or software) has, which a
>>> scheduler instance represents.
>>
>> Aha! Well.. how it is relevant and do entities not have an associated fence
>> context? Well, entity->fence_context.. that was my first association this
>> whole time. Never it crossed my mind this is talking about the hardware
>> fence context. Proof in the pudding naming should be improved.
> 
> It says "fence context belonging to this scheduler", which should be
> unambiguous, however I agree that we could mark out the difference even more.

Cool, I had tunnel vision due to working with entity->fence_context a lot 
and this just had the misfortune to re-use the same name.

>> But I also don't think there is a requirement for fences returned from
>> ->run_job() to have a single context. Which again makes it not the best
>> naming.
> 
> It's implied by the fact that a scheduler instance represents a ring. Having
> multiple fence contexts per ring doesn't make any sense.
> 
> But it's indeed not written down -- we should do that then.

Would you do it in code or just in docs? I don't see a real benefit to 
it, to be honest, since nothing depends on anything apart from it being a 
fence which will signal when the job is done. But I am not aware of anything 
where it would be a problem either. One to run past the driver maintainers, I 
guess.

>>> In the callback the driver should neither tear down an entity, nor the whole
>>> scheduler, hence we shouldn't call it like that. sched_kill() is therefore
>>> misleading as well.
>>
>>   ->sched_exit()? ->sched_stop()? ->sched_cleanup()?
> 
> I think this all would throw up questions like "What does {exit,stop,cleanup}
> mean in this context?". And the answer would be "kill the fence context of the
> ring represented by the scheduler".
> 
> I think we want a name that represents that without an indirection that we have
> to define.

Well, kill_fence_context wasn't self-descriptive to me either, so there 
is that too. In other words some kerneldoc will be needed anyway and a 
callback to call while tearing something down is pretty standard stuff. 
So I don't think it is a big deal to explain what that callback should do.

>>> It should be named after what it actually does (or should do). Feel free to
>>> propose a different name that conforms with that.
>>>
>>>>>> We also probably want some commentary on the topic of indefinite (or very
>>>>>> long at least) blocking a thread exit / SIGINT/TERM/KILL time.
>>>>>
>>>>> You mean in case the driver does implement the callback, but does *not* properly
>>>>> tear down the fence context? So, you ask for describing potential consequences
>>>>> of drivers having bugs in the implementation of the callback? Or something else?
>>>>
>>>> I was proposing the kerneldoc for the vfunc should document the callback
>>>> must not block, or if blocking is unavoidable, either document a guideline
>>>> on how long is acceptable. Maybe even enforce a limit in the scheduler core
>>>> itself.
>>>
>>> Killing the fence context shouldn't block.
>>
>> Cool. And maybe convert the wait_event to wait_event_timeout with a warning
>> to be robust.
> 
> That would make sense if it could deadlock, but even if the driver does nothing
> it should terminate eventually. The rule that we always rely on is that we
> guarantee throughout the kernel that fences are signalled eventually.

Given it is an opt-in, fair enough. (Some drivers don't have automatic 
fence expiration, but then again they don't have this callback either. 
And once they start adding it, there will be kerneldoc to say callback 
must not block.)

Regards,

Tvrtko
Re: [PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long
Posted by Philipp Stanner 7 months ago
On Fri, 2025-05-16 at 10:33 +0100, Tvrtko Ursulin wrote:
> 
> On 24/04/2025 10:55, Philipp Stanner wrote:
> > The waitqueue that ensures that drm_sched_fini() blocks until the
> > pending_list has become empty could theoretically cause that
> > function to
> > block for a very long time. That, ultimately, could block userspace
> > procesess and prevent them from being killable through SIGKILL.
> > 
> > When a driver calls drm_sched_fini(), it is safe to assume that all
> > still pending jobs are not needed anymore anyways. Thus, they can
> > be
> > cancelled and thereby it can be ensured that drm_sched_fini() will
> > return relatively quickly.
> > 
> > Implement a new helper to stop all work items / submission except
> > for
> > the drm_sched_backend_ops.run_job().
> > 
> > Implement a driver callback, kill_fence_context(), that instructs
> > the
> > driver to kill the fence context associated with this scheduler,
> > thereby
> > causing all pending hardware fences to be signalled.
> > 
> > Call those new routines in drm_sched_fini() and ensure backwards
> > compatibility if the new callback is not implemented.
> > 
> > Suggested-by: Danilo Krummrich <dakr@redhat.com>
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> >   drivers/gpu/drm/scheduler/sched_main.c | 47 +++++++++++++++++----
> > -----
> >   include/drm/gpu_scheduler.h            | 11 ++++++
> >   2 files changed, 42 insertions(+), 16 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index ea82e69a72a8..c2ad6c70bfb6 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1390,31 +1390,46 @@ drm_sched_no_jobs_pending(struct
> > drm_gpu_scheduler *sched)
> >   	return empty;
> >   }
> >   
> > +/**
> > + * drm_sched_cancel_jobs_and_wait - trigger freeing of all pending
> > jobs
> > + * @sched: scheduler instance
> > + *
> > + * Must only be called if &struct
> > drm_sched_backend_ops.kill_fence_context is
> > + * implemented.
> > + *
> > + * Instructs the driver to kill the fence context associated with
> > this scheduler,
> > + * thereby signalling all pending fences. This, in turn, will
> > trigger
> > + * &struct drm_sched_backend_ops.free_job to be called for all
> > pending jobs.
> > + * The function then blocks until all pending jobs have been
> > freed.
> > + */
> > +static inline void
> > +drm_sched_cancel_jobs_and_wait(struct drm_gpu_scheduler *sched)
> > +{
> > +	sched->ops->kill_fence_context(sched);
> > +	wait_event(sched->pending_list_waitque,
> > drm_sched_no_jobs_pending(sched));
> > +}
> > +
> >   /**
> >    * drm_sched_fini - Destroy a gpu scheduler
> >    *
> >    * @sched: scheduler instance
> >    *
> > - * Tears down and cleans up the scheduler.
> > - *
> > - * Note that this function blocks until all the fences returned by
> > - * &struct drm_sched_backend_ops.run_job have been signalled.
> > + * Tears down and cleans up the scheduler. Might leak memory if
> > + * &struct drm_sched_backend_ops.kill_fence_context is not
> > implemented.
> >    */
> >   void drm_sched_fini(struct drm_gpu_scheduler *sched)
> >   {
> >   	struct drm_sched_entity *s_entity;
> >   	int i;
> >   
> > -	/*
> > -	 * Jobs that have neither been scheduled or which have
> > timed out are
> > -	 * gone by now, but jobs that have been submitted through
> > -	 * backend_ops.run_job() and have not yet terminated are
> > still pending.
> > -	 *
> > -	 * Wait for the pending_list to become empty to avoid
> > leaking those jobs.
> > -	 */
> > -	drm_sched_submission_and_timeout_stop(sched);
> > -	wait_event(sched->pending_list_waitque,
> > drm_sched_no_jobs_pending(sched));
> > -	drm_sched_free_stop(sched);
> > +	if (sched->ops->kill_fence_context) {
> > +		drm_sched_submission_and_timeout_stop(sched);
> > +		drm_sched_cancel_jobs_and_wait(sched);
> > +		drm_sched_free_stop(sched);
> > +	} else {
> > +		/* We're in "legacy free-mode" and ignore
> > potential mem leaks */
> > +		drm_sched_wqueue_stop(sched);
> > +	}
> >   
> >   	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs;
> > i++) {
> >   		struct drm_sched_rq *rq = sched->sched_rq[i];
> > @@ -1502,7 +1517,7 @@ bool drm_sched_wqueue_ready(struct
> > drm_gpu_scheduler *sched)
> >   EXPORT_SYMBOL(drm_sched_wqueue_ready);
> >   
> >   /**
> > - * drm_sched_wqueue_stop - stop scheduler submission
> > + * drm_sched_wqueue_stop - stop scheduler submission and freeing
> 
> Looks like the kerneldoc corrections (below too) belong to the
> previous 
> patch. Irrelevant if you decide to squash them though.
> 
> >    * @sched: scheduler instance
> >    *
> >    * Stops the scheduler from pulling new jobs from entities. It
> > also stops
> > @@ -1518,7 +1533,7 @@ void drm_sched_wqueue_stop(struct
> > drm_gpu_scheduler *sched)
> >   EXPORT_SYMBOL(drm_sched_wqueue_stop);
> >   
> >   /**
> > - * drm_sched_wqueue_start - start scheduler submission
> > + * drm_sched_wqueue_start - start scheduler submission and freeing
> >    * @sched: scheduler instance
> >    *
> >    * Restarts the scheduler after drm_sched_wqueue_stop() has
> > stopped it.
> > diff --git a/include/drm/gpu_scheduler.h
> > b/include/drm/gpu_scheduler.h
> > index d0b1f416b4d9..8630b4a26f10 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -509,6 +509,17 @@ struct drm_sched_backend_ops {
> >            * and it's time to clean it up.
> >   	 */
> >   	void (*free_job)(struct drm_sched_job *sched_job);
> > +
> > +	/**
> > +	 * @kill_fence_context: kill the fence context belonging
> > to this scheduler
> 
> Which fence context would that be? ;)
> 
> Also, "fence context" would be a new terminology in gpu_scheduler.h
> API 
> level. You could call it ->sched_fini() or similar to signify at
> which 
> point in the API it gets called and then the fact it takes sched as 
> parameter would be natural.
> 
> We also probably want some commentary on the topic of indefinite (or 
> very long at least) blocking a thread exit / SIGINT/TERM/KILL time. 
> Cover letter touches upon that problem but I don't see you address
> it. 
> Is the idea to let drivers shoot themselves in the foot or what?

You didn't seriously just write that, did you?

Yes, my intention clearly has been since September to make things
worse, more ambiguous and destroy drivers. That's why I'm probably the
only individual after Sima (who added some of the FIXMEs) who ever
bothered documenting all this mess here, and warning people about the
five hundred pitfalls, like three different start and stop functions,
implicit refcount rules and God knows which other insane hacks that
simply move driver problems into common infrastructure.

Reconsider your attitude.

P.

> 
> Regards,
> 
> Tvrtko
> 
> > +	 *
> > +	 * Needed to cleanly tear the scheduler down in
> > drm_sched_fini(). This
> > +	 * callback will cause all hardware fences to be signalled
> > by the driver,
> > +	 * which, ultimately, ensures that all jobs get freed
> > before teardown.
> > +	 *
> > +	 * This callback is optional, but it is highly recommended
> > to implement it.
> > +	 */
> > +	void (*kill_fence_context)(struct drm_gpu_scheduler
> > *sched);
> >   };
> >   
> >   /**
> 
Re: [PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long
Posted by Tvrtko Ursulin 7 months ago
On 16/05/2025 10:54, Philipp Stanner wrote:
> On Fri, 2025-05-16 at 10:33 +0100, Tvrtko Ursulin wrote:
>>
>> On 24/04/2025 10:55, Philipp Stanner wrote:
>>> The waitqueue that ensures that drm_sched_fini() blocks until the
>>> pending_list has become empty could theoretically cause that
>>> function to
>>> block for a very long time. That, ultimately, could block userspace
>>> procesess and prevent them from being killable through SIGKILL.
>>>
>>> When a driver calls drm_sched_fini(), it is safe to assume that all
>>> still pending jobs are not needed anymore anyways. Thus, they can
>>> be
>>> cancelled and thereby it can be ensured that drm_sched_fini() will
>>> return relatively quickly.
>>>
>>> Implement a new helper to stop all work items / submission except
>>> for
>>> the drm_sched_backend_ops.run_job().
>>>
>>> Implement a driver callback, kill_fence_context(), that instructs
>>> the
>>> driver to kill the fence context associated with this scheduler,
>>> thereby
>>> causing all pending hardware fences to be signalled.
>>>
>>> Call those new routines in drm_sched_fini() and ensure backwards
>>> compatibility if the new callback is not implemented.
>>>
>>> Suggested-by: Danilo Krummrich <dakr@redhat.com>
>>> Signed-off-by: Philipp Stanner <phasta@kernel.org>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_main.c | 47 +++++++++++++++++----
>>> -----
>>>    include/drm/gpu_scheduler.h            | 11 ++++++
>>>    2 files changed, 42 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index ea82e69a72a8..c2ad6c70bfb6 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1390,31 +1390,46 @@ drm_sched_no_jobs_pending(struct
>>> drm_gpu_scheduler *sched)
>>>    	return empty;
>>>    }
>>>    
>>> +/**
>>> + * drm_sched_cancel_jobs_and_wait - trigger freeing of all pending
>>> jobs
>>> + * @sched: scheduler instance
>>> + *
>>> + * Must only be called if &struct
>>> drm_sched_backend_ops.kill_fence_context is
>>> + * implemented.
>>> + *
>>> + * Instructs the driver to kill the fence context associated with
>>> this scheduler,
>>> + * thereby signalling all pending fences. This, in turn, will
>>> trigger
>>> + * &struct drm_sched_backend_ops.free_job to be called for all
>>> pending jobs.
>>> + * The function then blocks until all pending jobs have been
>>> freed.
>>> + */
>>> +static inline void
>>> +drm_sched_cancel_jobs_and_wait(struct drm_gpu_scheduler *sched)
>>> +{
>>> +	sched->ops->kill_fence_context(sched);
>>> +	wait_event(sched->pending_list_waitque,
>>> drm_sched_no_jobs_pending(sched));
>>> +}
>>> +
>>>    /**
>>>     * drm_sched_fini - Destroy a gpu scheduler
>>>     *
>>>     * @sched: scheduler instance
>>>     *
>>> - * Tears down and cleans up the scheduler.
>>> - *
>>> - * Note that this function blocks until all the fences returned by
>>> - * &struct drm_sched_backend_ops.run_job have been signalled.
>>> + * Tears down and cleans up the scheduler. Might leak memory if
>>> + * &struct drm_sched_backend_ops.kill_fence_context is not
>>> implemented.
>>>     */
>>>    void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>    {
>>>    	struct drm_sched_entity *s_entity;
>>>    	int i;
>>>    
>>> -	/*
>>> -	 * Jobs that have neither been scheduled or which have
>>> timed out are
>>> -	 * gone by now, but jobs that have been submitted through
>>> -	 * backend_ops.run_job() and have not yet terminated are
>>> still pending.
>>> -	 *
>>> -	 * Wait for the pending_list to become empty to avoid
>>> leaking those jobs.
>>> -	 */
>>> -	drm_sched_submission_and_timeout_stop(sched);
>>> -	wait_event(sched->pending_list_waitque,
>>> drm_sched_no_jobs_pending(sched));
>>> -	drm_sched_free_stop(sched);
>>> +	if (sched->ops->kill_fence_context) {
>>> +		drm_sched_submission_and_timeout_stop(sched);
>>> +		drm_sched_cancel_jobs_and_wait(sched);
>>> +		drm_sched_free_stop(sched);
>>> +	} else {
>>> +		/* We're in "legacy free-mode" and ignore
>>> potential mem leaks */
>>> +		drm_sched_wqueue_stop(sched);
>>> +	}
>>>    
>>>    	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs;
>>> i++) {
>>>    		struct drm_sched_rq *rq = sched->sched_rq[i];
>>> @@ -1502,7 +1517,7 @@ bool drm_sched_wqueue_ready(struct
>>> drm_gpu_scheduler *sched)
>>>    EXPORT_SYMBOL(drm_sched_wqueue_ready);
>>>    
>>>    /**
>>> - * drm_sched_wqueue_stop - stop scheduler submission
>>> + * drm_sched_wqueue_stop - stop scheduler submission and freeing
>>
>> Looks like the kerneldoc corrections (below too) belong to the
>> previous
>> patch. Irrelevant if you decide to squash them though.
>>
>>>     * @sched: scheduler instance
>>>     *
>>>     * Stops the scheduler from pulling new jobs from entities. It
>>> also stops
>>> @@ -1518,7 +1533,7 @@ void drm_sched_wqueue_stop(struct
>>> drm_gpu_scheduler *sched)
>>>    EXPORT_SYMBOL(drm_sched_wqueue_stop);
>>>    
>>>    /**
>>> - * drm_sched_wqueue_start - start scheduler submission
>>> + * drm_sched_wqueue_start - start scheduler submission and freeing
>>>     * @sched: scheduler instance
>>>     *
>>>     * Restarts the scheduler after drm_sched_wqueue_stop() has
>>> stopped it.
>>> diff --git a/include/drm/gpu_scheduler.h
>>> b/include/drm/gpu_scheduler.h
>>> index d0b1f416b4d9..8630b4a26f10 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -509,6 +509,17 @@ struct drm_sched_backend_ops {
>>>             * and it's time to clean it up.
>>>    	 */
>>>    	void (*free_job)(struct drm_sched_job *sched_job);
>>> +
>>> +	/**
>>> +	 * @kill_fence_context: kill the fence context belonging
>>> to this scheduler
>>
>> Which fence context would that be? ;)
>>
>> Also, "fence context" would be a new terminology in gpu_scheduler.h
>> API
>> level. You could call it ->sched_fini() or similar to signify at
>> which
>> point in the API it gets called and then the fact it takes sched as
>> parameter would be natural.
>>
>> We also probably want some commentary on the topic of indefinite (or
>> very long at least) blocking a thread exit / SIGINT/TERM/KILL time.
>> Cover letter touches upon that problem but I don't see you address
>> it.
>> Is the idea to let drivers shoot themselves in the foot or what?
> 
> You didn't seriously just write that, did you?
> 
> Yes, my intention clearly has been since September to make things
> worse, more ambiguous and destroy drivers. That's why I'm probably the
> only individual after Sima (who added some of the FIXMEs) who ever
> bothered documenting all this mess here, and warning people about the
> five hundred pitfalls, like three different start and stop functions,
> implicit refcount rules and God knows which other insane hacks that
> simply move driver problems into common infrastructure.
> 
> Reconsider your attitude.

I don't know what kind of attitude you think I was expressing. If the 
last sentence was a hyperbole too much for you then sorry. Point was in 
the paragraph ending with that - to document callback must not block, or 
if it has to to discuss for how long. And to perhaps discuss if 
scheduler code should impose a limit. Otherwise the indefinite block on 
thread exit from the cover letter can still happen. I don't see raising 
those point is a criticism of your overall work. (Fact that we don't see 
eye to eye regarding more state machine vs the ->cancel_job() 
alternative aside.)

Regards,

Tvrtko


> 
> P.
> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>> +	 *
>>> +	 * Needed to cleanly tear the scheduler down in
>>> drm_sched_fini(). This
>>> +	 * callback will cause all hardware fences to be signalled
>>> by the driver,
>>> +	 * which, ultimately, ensures that all jobs get freed
>>> before teardown.
>>> +	 *
>>> +	 * This callback is optional, but it is highly recommended
>>> to implement it.
>>> +	 */
>>> +	void (*kill_fence_context)(struct drm_gpu_scheduler
>>> *sched);
>>>    };
>>>    
>>>    /**
>>
>