[PATCH] drm/sched: Discourage usage of separate workqueues

Posted by Philipp Stanner 6 months, 2 weeks ago
struct drm_sched_init_args provides the possibility of letting the
scheduler use user-controlled workqueues, instead of the scheduler
creating its own workqueues. It's currently not documented who would
want to use that.

Not sharing the submit_wq between driver and scheduler has the advantage
that no negative interference between them can occur (e.g.,
MMU notifier callbacks waiting for fences to get signaled). A separate
timeout_wq should rarely be necessary, since using the system_wq could,
in the worst case, delay a timeout.

Discourage the usage of own workqueues in the documentation.

Suggested-by: Danilo Krummrich <dakr@kernel.org>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 include/drm/gpu_scheduler.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 81dcbfc8c223..11740d745223 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -590,14 +590,17 @@ struct drm_gpu_scheduler {
  *
  * @ops: backend operations provided by the driver
  * @submit_wq: workqueue to use for submission. If NULL, an ordered wq is
- *	       allocated and used.
+ *	       allocated and used. It is recommended to pass NULL unless there
+ *	       is a good reason not to.
  * @num_rqs: Number of run-queues. This may be at most DRM_SCHED_PRIORITY_COUNT,
  *	     as there's usually one run-queue per priority, but may be less.
  * @credit_limit: the number of credits this scheduler can hold from all jobs
  * @hang_limit: number of times to allow a job to hang before dropping it.
  *		This mechanism is DEPRECATED. Set it to 0.
  * @timeout: timeout value in jiffies for submitted jobs.
- * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is used.
+ * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
+ *		used. It is recommended to pass NULL unless there is a good
+ *		reason not to.
  * @score: score atomic shared with other schedulers. May be NULL.
  * @name: name (typically the driver's name). Used for debugging
  * @dev: associated device. Used for debugging
-- 
2.49.0
Re: [PATCH] drm/sched: Discourage usage of separate workqueues
Posted by Christian König 6 months, 2 weeks ago
On 6/4/25 10:16, Philipp Stanner wrote:
> struct drm_sched_init_args provides the possibility of letting the
> scheduler use user-controlled workqueues, instead of the scheduler
> creating its own workqueues. It's currently not documented who would
> want to use that.
> 
> Not sharing the submit_wq between driver and scheduler has the advantage
> that no negative interference between them can occur (e.g.,
> MMU notifier callbacks waiting for fences to get signaled). A separate
> timeout_wq should rarely be necessary, since using the system_wq could,
> in the worst case, delay a timeout.
> 
> Discourage the usage of own workqueues in the documentation.
> 
> Suggested-by: Danilo Krummrich <dakr@kernel.org>
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
>  include/drm/gpu_scheduler.h | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 81dcbfc8c223..11740d745223 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -590,14 +590,17 @@ struct drm_gpu_scheduler {
>   *
>   * @ops: backend operations provided by the driver
>   * @submit_wq: workqueue to use for submission. If NULL, an ordered wq is
> - *	       allocated and used.
> + *	       allocated and used. It is recommended to pass NULL unless there
> + *	       is a good reason not to.

Yeah, that's probably a good idea. I'm not sure why xe and nouveau actually wanted that.

>   * @num_rqs: Number of run-queues. This may be at most DRM_SCHED_PRIORITY_COUNT,
>   *	     as there's usually one run-queue per priority, but may be less.
>   * @credit_limit: the number of credits this scheduler can hold from all jobs
>   * @hang_limit: number of times to allow a job to hang before dropping it.
>   *		This mechanism is DEPRECATED. Set it to 0.
>   * @timeout: timeout value in jiffies for submitted jobs.
> - * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is used.
> + * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
> + *		used. It is recommended to pass NULL unless there is a good
> + *		reason not to.

Well, that's a rather bad idea.

Using the same single threaded work queue for the timeout of multiple scheduler instances has the major advantage of being able to handle their occurrence sequentially.

In other words multiple schedulers post their timeout work items on the same queue, the first one to run resets the specific HW block in question and cancels all timeouts and work items from other schedulers which use the same HW block.

It was Sima, I and a few other people who came up with this approach because both amdgpu and IIRC panthor implemented that in their own specific way, and as usual got it wrong.

If I'm not completely mistaken this approach is now used by amdgpu, panthor, xe and imagination and has proven to be rather flexible and reliable. It just looks like we never documented that you should do it this way.
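
As a rough sketch (made-up foo_* names, nothing lifted from a real driver),
the setup side of that pattern could look like this; the timedout_job
callback for the block can then reset the HW and cancel the pending timeout
work of the sibling schedulers without racing them:

struct workqueue_struct *tdr_wq = alloc_ordered_workqueue("foo-tdr", 0);
struct drm_sched_init_args args = {
        .ops = &foo_sched_ops,
        .num_rqs = DRM_SCHED_PRIORITY_COUNT,
        .credit_limit = 64,
        .timeout = msecs_to_jiffies(500),
        .timeout_wq = tdr_wq,   /* shared by all schedulers of the HW block */
        .name = "foo-ring",
        .dev = dev,
};
int i, ret;

if (!tdr_wq)
        return -ENOMEM;

/* All rings of the HW block share the same ordered timeout_wq. */
for (i = 0; i < FOO_NUM_RINGS; i++) {
        ret = drm_sched_init(&foo->ring[i].sched, &args);
        if (ret)
                return ret;
}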

Regards,
Christian.

>   * @score: score atomic shared with other schedulers. May be NULL.
>   * @name: name (typically the driver's name). Used for debugging
>   * @dev: associated device. Used for debugging
Re: [PATCH] drm/sched: Discourage usage of separate workqueues
Posted by Simona Vetter 6 months, 2 weeks ago
On Wed, Jun 04, 2025 at 11:41:25AM +0200, Christian König wrote:
> On 6/4/25 10:16, Philipp Stanner wrote:
> > struct drm_sched_init_args provides the possibility of letting the
> > scheduler use user-controlled workqueues, instead of the scheduler
> > creating its own workqueues. It's currently not documented who would
> > want to use that.
> > 
> > Not sharing the submit_wq between driver and scheduler has the advantage
> > that no negative interference between them can occur (e.g.,
> > MMU notifier callbacks waiting for fences to get signaled). A separate
> > timeout_wq should rarely be necessary, since using the system_wq could,
> > in the worst case, delay a timeout.
> > 
> > Discourage the usage of own workqueues in the documentation.
> > 
> > Suggested-by: Danilo Krummrich <dakr@kernel.org>
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> >  include/drm/gpu_scheduler.h | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 81dcbfc8c223..11740d745223 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -590,14 +590,17 @@ struct drm_gpu_scheduler {
> >   *
> >   * @ops: backend operations provided by the driver
> >   * @submit_wq: workqueue to use for submission. If NULL, an ordered wq is
> > - *	       allocated and used.
> > + *	       allocated and used. It is recommended to pass NULL unless there
> > + *	       is a good reason not to.
> 
> Yeah, that's probably a good idea. I'm not sure why xe and nouveau actually wanted that.

The idea of this trick is that you have a fw scheduler which only has one
queue, and a bunch of other things in your driver that also need to be
stuffed into this fw queue (or handled by talking with the fw through
these ringbuffers).

If you use one single-threaded wq for everything then you don't need
additional locking anymore, and a lot of things become easier.
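
Roughly (a hedged sketch only, hypothetical foo_* names): the driver
allocates one ordered wq, hands it to the scheduler as submit_wq and queues
its own fw-ring work on it, so everything that talks to the fw runs
strictly sequentially:

foo->fw_wq = alloc_ordered_workqueue("foo-fw", 0);
if (!foo->fw_wq)
        return -ENOMEM;

args.submit_wq = foo->fw_wq;    /* scheduler's run_job/free_job work goes here */
ret = drm_sched_init(&foo->sched, &args);
if (ret)
        return ret;

/* Driver-internal work on the same wq, serialized with the scheduler. */
INIT_WORK(&foo->ring_work, foo_ring_work);
queue_work(foo->fw_wq, &foo->ring_work);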

We should definitely document this trick better though, I didn't find any
place where that was documented.

Maybe a new overview section about "how to concurrency with drm/sched"?
That's also a good place to better highlight the existing documentation
for the 2nd part here.

> >   * @num_rqs: Number of run-queues. This may be at most DRM_SCHED_PRIORITY_COUNT,
> >   *	     as there's usually one run-queue per priority, but may be less.
> >   * @credit_limit: the number of credits this scheduler can hold from all jobs
> >   * @hang_limit: number of times to allow a job to hang before dropping it.
> >   *		This mechanism is DEPRECATED. Set it to 0.
> >   * @timeout: timeout value in jiffies for submitted jobs.
> > - * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is used.
> > + * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
> > + *		used. It is recommended to pass NULL unless there is a good
> > + *		reason not to.
> 
> Well, that's a rather bad idea.
> 
> Using the same single threaded work queue for the timeout of multiple
> scheduler instances has the major advantage of being able to handle
> their occurrence sequentially.
> 
> In other words multiple schedulers post their timeout work items on the
> same queue, the first one to run resets the specific HW block in
> question and cancels all timeouts and work items from other schedulers
> which use the same HW block.
> 
> It was Sima, I and a few other people who came up with this approach
> because both amdgpu and IIRC panthor implemented that in their own
> specific way, and as usual got it wrong.
> 
> If I'm not completely mistaken this approach is now used by amdgpu,
> panthor, xe and imagination and has proven to be rather flexible and
> reliable. It just looks like we never documented that you should do it
> this way.

It is documented, just not here. See the note in
drm_sched_backend_ops.timedout_job at the very bottom.

We should definitely have a lot more cross-links between the various
pieces of this puzzle though, that's for sure :-)

Cheers, Sima

> 
> Regards,
> Christian.
> 
> >   * @score: score atomic shared with other schedulers. May be NULL.
> >   * @name: name (typically the driver's name). Used for debugging
> >   * @dev: associated device. Used for debugging
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Re: [PATCH] drm/sched: Discourage usage of separate workqueues
Posted by Philipp Stanner 6 months, 2 weeks ago
On Wed, 2025-06-04 at 17:07 +0200, Simona Vetter wrote:
> On Wed, Jun 04, 2025 at 11:41:25AM +0200, Christian König wrote:
> > On 6/4/25 10:16, Philipp Stanner wrote:
> > > struct drm_sched_init_args provides the possibility of letting
> > > the
> > > scheduler use user-controlled workqueues, instead of the
> > > scheduler
> > > creating its own workqueues. It's currently not documented who
> > > would
> > > want to use that.
> > > 
> > > Not sharing the submit_wq between driver and scheduler has the
> > > advantage
> > > that no negative interference between them can occur
> > > (e.g.,
> > > MMU notifier callbacks waiting for fences to get signaled). A
> > > separate
> > > timeout_wq should rarely be necessary, since using the system_wq
> > > could,
> > > in the worst case, delay a timeout.
> > > 
> > > Discourage the usage of own workqueues in the documentation.
> > > 
> > > Suggested-by: Danilo Krummrich <dakr@kernel.org>
> > > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > > ---
> > >  include/drm/gpu_scheduler.h | 7 +++++--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/drm/gpu_scheduler.h
> > > b/include/drm/gpu_scheduler.h
> > > index 81dcbfc8c223..11740d745223 100644
> > > --- a/include/drm/gpu_scheduler.h
> > > +++ b/include/drm/gpu_scheduler.h
> > > @@ -590,14 +590,17 @@ struct drm_gpu_scheduler {
> > >   *
> > >   * @ops: backend operations provided by the driver
> > >   * @submit_wq: workqueue to use for submission. If NULL, an
> > > ordered wq is
> > > - *        allocated and used.
> > > + *        allocated and used. It is recommended to pass NULL
> > > unless there
> > > + *        is a good reason not to.
> > 
> > Yeah, that's probably a good idea. I'm not sure why xe and nouveau
> > actually wanted that.
> 
> The idea of this trick is that you have a fw scheduler which only has
> one
> queue, and a bunch of other things in your driver that also need to
> be
> stuffed into this fw queue (or handled by talking with the fw through
> these ringbuffers).
> 
> If you use one single-threaded wq for everything then you don't need
> additional locking anymore, and a lot of things become easier.
> 
> We should definitely document this trick better though, I didn't find
> any
> place where that was documented.
> 
> Maybe a new overview section about "how to concurrency with
> drm/sched"?
> That's also a good place to better highlight the existing
> documentation
> for the 2nd part here.


Let us first create consensus about when it makes sense for the driver to
provide its own workqueues. Please correct any of these statements that
are wrong:

   1. Using a single threaded workqueue for timeout_wq is useful for
      guaranteeing that timeout handlers are performed in the correct
      order. (Christian)
   2. Independently of the point above, using the very same single
      threaded wq for both timeout_wq and submit_wq allows for dropping
      locks from the driver's callbacks.
   3. Driver and scheduler sharing workqueues risks deadlocking, as
      described by Danilo.


I'm wondering about the following now:
   1. Why does the scheduler by default use the system_wq for timeouts?
      Could it make sense for us to always use a single threaded wq?
      After all, each scheduler only ever has 1 work item for timeouts
      in flight.
   2. Isn't point 2 from above purely a trick to avoid locking in the
      callbacks? If it is, I'm not convinced yet that we really want to
      actively recommend that. The callback implementations I've seen
      usually only need their fence context lock, and many drivers
      don't use their own workqueues to begin with.
      But I guess having a sentence somewhere teasing at the
      possibility wouldn't hurt.


My current feeling is that we should document why the mechanism exists
and discourage using it. It's optional, not really necessary, and risks
deadlocking; making that more waterproof would require more substantial
work, like maybe the lockdep assertions suggested by Matthew.

I'm for the more lightweight way.


> 
> > >   * @num_rqs: Number of run-queues. This may be at most
> > > DRM_SCHED_PRIORITY_COUNT,
> > >   *      as there's usually one run-queue per priority, but may
> > > be less.
> > >   * @credit_limit: the number of credits this scheduler can hold
> > > from all jobs
> > >   * @hang_limit: number of times to allow a job to hang before
> > > dropping it.
> > >   * This mechanism is DEPRECATED. Set it to 0.
> > >   * @timeout: timeout value in jiffies for submitted jobs.
> > > - * @timeout_wq: workqueue to use for timeout work. If NULL, the
> > > system_wq is used.
> > > + * @timeout_wq: workqueue to use for timeout work. If NULL, the
> > > system_wq is
> > > + * used. It is recommended to pass NULL unless there is a good
> > > + * reason not to.
> > 
> > Well, that's a rather bad idea.
> > 
> > Using the same single threaded work queue for the timeout of
> > multiple
> > scheduler instances has the major advantage of being able to
> > handle
> > their occurrence sequentially.
> > 
> > In other words multiple schedulers post their timeout work items on
> > the
> > same queue, the first one to run resets the specific HW block in
> > question and cancels all timeouts and work items from other
> > schedulers
> > which use the same HW block.
> > 
> > It was Sima, I and a few other people who came up with this
> > approach
> > because both amdgpu and IIRC panthor implemented that in their own
> > specific way, and as usual got it wrong.
> > 
> > If I'm not completely mistaken this approach is now used by amdgpu,
> > panthor, xe and imagination and has proven to be rather flexible
> > and
> > reliable. It just looks like we never documented that you should do
> > it
> > this way.
> 
> It is documented, just not here. See the note in
> drm_sched_backend_ops.timedout_job at the very bottom.
> 
> We should definitely have a lot more cross-links between the various
> pieces of this puzzle though, that's for sure :-)

Docu is in far better shape by now than it was last year, but there's
still much to do.

However, we probably also don't want to scatter it over too many places,
as it becomes harder to keep the docu consistent.


P.

> 
> Cheers, Sima
> 
> > 
> > Regards,
> > Christian.
> > 
> > >   * @score: score atomic shared with other schedulers. May be
> > > NULL.
> > >   * @name: name (typically the driver's name). Used for debugging
> > >   * @dev: associated device. Used for debugging
> > 
> 
Re: [PATCH] drm/sched: Discourage usage of separate workqueues
Posted by Matthew Brost 6 months, 2 weeks ago
On Wed, Jun 04, 2025 at 05:07:15PM +0200, Simona Vetter wrote:
> On Wed, Jun 04, 2025 at 11:41:25AM +0200, Christian König wrote:
> > On 6/4/25 10:16, Philipp Stanner wrote:
> > > struct drm_sched_init_args provides the possibility of letting the
> > > scheduler use user-controlled workqueues, instead of the scheduler
> > > creating its own workqueues. It's currently not documented who would
> > > want to use that.
> > > 
> > > Not sharing the submit_wq between driver and scheduler has the advantage
> > > that no negative interference between them can occur (e.g.,
> > > MMU notifier callbacks waiting for fences to get signaled). A separate
> > > timeout_wq should rarely be necessary, since using the system_wq could,
> > > in the worst case, delay a timeout.
> > > 
> > > Discourage the usage of own workqueues in the documentation.
> > > 
> > > Suggested-by: Danilo Krummrich <dakr@kernel.org>
> > > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > > ---
> > >  include/drm/gpu_scheduler.h | 7 +++++--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > index 81dcbfc8c223..11740d745223 100644
> > > --- a/include/drm/gpu_scheduler.h
> > > +++ b/include/drm/gpu_scheduler.h
> > > @@ -590,14 +590,17 @@ struct drm_gpu_scheduler {
> > >   *
> > >   * @ops: backend operations provided by the driver
> > >   * @submit_wq: workqueue to use for submission. If NULL, an ordered wq is
> > > - *	       allocated and used.
> > > + *	       allocated and used. It is recommended to pass NULL unless there
> > > + *	       is a good reason not to.
> > 
> > Yeah, that's probably a good idea. I'm not sure why xe and nouveau actually wanted that.

At one point before workqueues could share a lockdep map we had to pass
in a workqueue to not run out of lockdep keys. That restriction is now
gone, so Xe passes in NULL.

Part of the reasoning also was that there was already an interface to pass
in the TDR workqueue, so one was added for the submit workqueue as well.

Xe does have an upcoming use for this, though. We have a mode where
multiple queues share FW resources, so interaction with the FW between
those queues needs to be exclusive. We therefore use a single submit
workqueue for queues sharing FW resources to avoid using locks in the
scheduler ops.

queues == GPU scheduler / entity in this context.
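
A rough sketch of that mode (hypothetical names, not actual Xe code): all
queues sharing FW resources get the same ordered submit_wq, so their
run_job/free_job callbacks never execute concurrently and the shared FW
interface needs no extra lock:

group->submit_wq = alloc_ordered_workqueue("foo-fw-group", 0);
if (!group->submit_wq)
        return -ENOMEM;

for (i = 0; i < group->num_queues; i++) {
        args.submit_wq = group->submit_wq;      /* shared across the group */
        ret = drm_sched_init(&group->queue[i].sched, &args);
        if (ret)
                return ret;
}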

> 
> The idea of this trick is that you have a fw scheduler which only has one
> queue, and a bunch of other things in your driver that also need to be
> stuffed into this fw queue (or handled by talking with the fw through
> these ringbuffers).
> 
> If you use one single-threaded wq for everything then you don't need
> additional locking anymore, and a lot of things become easier.
> 

Yes, this is how Xe avoids locks in all scheduler ops. Same in the upcoming
use case above - multiple queues use the same single-threaded wq.

> We should definitely document this trick better though, I didn't find any
> place where that was documented.
> 

This is a good idea.

> Maybe a new overview section about "how to concurrency with drm/sched"?
> That's also a good place to better highlight the existing documentation
> for the 2nd part here.
> 
> > >   * @num_rqs: Number of run-queues. This may be at most DRM_SCHED_PRIORITY_COUNT,
> > >   *	     as there's usually one run-queue per priority, but may be less.
> > >   * @credit_limit: the number of credits this scheduler can hold from all jobs
> > >   * @hang_limit: number of times to allow a job to hang before dropping it.
> > >   *		This mechanism is DEPRECATED. Set it to 0.
> > >   * @timeout: timeout value in jiffies for submitted jobs.
> > > - * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is used.
> > > + * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
> > > + *		used. It is recommended to pass NULL unless there is a good
> > > + *		reason not to.
> > 
> > Well, that's a rather bad idea.
> >

Yea, I've found using system workqueues in driver code usually creates
problems. In Xe, we pass in a single ordered workqueue shared among all
queues for the TDR. GT (device) resets are also run on this ordered
workqueue to avoid jobs timing out in parallel. I think most drivers
would benefit from this type of design.
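
In sketch form (hypothetical names): the reset work is queued on the very
same ordered wq that every queue passes as timeout_wq, so a device reset
and the individual job timeouts serialize against each other:

/* tdr_wq is the ordered wq passed as timeout_wq for every queue. */
INIT_WORK(&foo_gt->reset_work, foo_gt_reset_work);
queue_work(foo->tdr_wq, &foo_gt->reset_work);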

Matt
 
> > Using the same single threaded work queue for the timeout of multiple
> > scheduler instances has the major advantage of being able to handle
> > their occurrence sequentially.
> > 
> > In other words multiple schedulers post their timeout work items on the
> > same queue, the first one to run resets the specific HW block in
> > question and cancels all timeouts and work items from other schedulers
> > which use the same HW block.
> > 
> > It was Sima, I and a few other people who came up with this approach
> > because both amdgpu and IIRC panthor implemented that in their own
> > specific way, and as usual got it wrong.
> > 
> > If I'm not completely mistaken this approach is now used by amdgpu,
> > panthor, xe and imagination and has proven to be rather flexible and
> > reliable. It just looks like we never documented that you should do it
> > this way.
> 
> It is documented, just not here. See the note in
> drm_sched_backend_ops.timedout_job at the very bottom.
> 
> We should definitely have a lot more cross-links between the various
> pieces of this puzzle though, that's for sure :-)
> 
> Cheers, Sima
> 
> > 
> > Regards,
> > Christian.
> > 
> > >   * @score: score atomic shared with other schedulers. May be NULL.
> > >   * @name: name (typically the driver's name). Used for debugging
> > >   * @dev: associated device. Used for debugging
> > 
> 
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Re: [PATCH] drm/sched: Discourage usage of separate workqueues
Posted by Danilo Krummrich 6 months, 2 weeks ago
On Wed, Jun 04, 2025 at 09:45:00AM -0700, Matthew Brost wrote:
> On Wed, Jun 04, 2025 at 05:07:15PM +0200, Simona Vetter wrote:
> > We should definitely document this trick better though, I didn't find any
> > place where that was documented.
> 
> This is a good idea.

I think - and I also mentioned this a few times in the patch series that added
the workqueue support - we should also really document the pitfalls of this.

If the scheduler shares a workqueue with the driver, the driver needs to take
special care that the work it submits cannot prevent the run_job and free_job
work items from running.

For instance, if it's a single threaded workqueue and the driver submits work
that allocates with GFP_KERNEL, this is a deadlock condition.

More generally, if the driver submits N work items that, for instance, allocate
with GFP_KERNEL, it's also a deadlock condition if N == max_active.
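
A hedged illustration of the pitfall (made-up foo_* names): this work item
is queued on the single-threaded wq that was also passed as submit_wq, so
while it runs, the scheduler's run_job/free_job work cannot run. If the
GFP_KERNEL allocation enters direct reclaim and reclaim ends up waiting on
a dma_fence that only signals once run_job has executed, nothing makes
progress anymore:

static void foo_driver_work(struct work_struct *work)
{
        struct foo_ctx *ctx = container_of(work, struct foo_ctx, work);

        ctx->buf = kmalloc(SZ_4K, GFP_KERNEL); /* may block in reclaim */
        if (!ctx->buf)
                return;
        foo_touch_hw(ctx);
}

/* ...and elsewhere in the driver: */
queue_work(foo->shared_wq, &ctx->work); /* same wq as submit_wq */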
Re: [PATCH] drm/sched: Discourage usage of separate workqueues
Posted by Matthew Brost 6 months, 2 weeks ago
On Wed, Jun 04, 2025 at 06:53:44PM +0200, Danilo Krummrich wrote:
> On Wed, Jun 04, 2025 at 09:45:00AM -0700, Matthew Brost wrote:
> > On Wed, Jun 04, 2025 at 05:07:15PM +0200, Simona Vetter wrote:
> > > We should definitely document this trick better though, I didn't find any
> > > place where that was documented.
> > 
> > This is a good idea.
> 
> I think - and I also mentioned this a few times in the patch series that added
> the workqueue support - we should also really document the pitfalls of this.
> 
> If the scheduler shares a workqueue with the driver, the driver needs to take
> special care that the work it submits cannot prevent the run_job and free_job
> work items from running.
> 
> For instance, if it's a single threaded workqueue and the driver submits work
> that allocates with GFP_KERNEL, this is a deadlock condition.
> 
> More generally, if the driver submits N work items that, for instance, allocate
> with GFP_KERNEL, it's also a deadlock condition if N == max_active.

Can we prime lockdep on scheduler init? e.g.

fs_reclaim_acquire(GFP_KERNEL);
workqueue_lockdep_acquire();
workqueue_lockdep_release();
fs_reclaim_release(GFP_KERNEL);

In addition to documentation, this would prevent workqueues from being
used that allocate with GFP_KERNEL.

Maybe we could use dma_fence signalling annotations instead of
fs_reclaim_acquire, but at one point those gave Xe false lockdep
positives, so we use fs_reclaim_acquire in similar cases. Maybe that has
been fixed though.

Matt
Re: [PATCH] drm/sched: Discourage usage of separate workqueues
Posted by Simona Vetter 6 months, 2 weeks ago
On Wed, Jun 04, 2025 at 10:10:21AM -0700, Matthew Brost wrote:
> On Wed, Jun 04, 2025 at 06:53:44PM +0200, Danilo Krummrich wrote:
> > On Wed, Jun 04, 2025 at 09:45:00AM -0700, Matthew Brost wrote:
> > > On Wed, Jun 04, 2025 at 05:07:15PM +0200, Simona Vetter wrote:
> > > > We should definitely document this trick better though, I didn't find any
> > > > place where that was documented.
> > > 
> > > This is a good idea.
> > 
> > I think - and I also mentioned this a few times in the patch series that added
> > the workqueue support - we should also really document the pitfalls of this.
> > 
> > If the scheduler shares a workqueue with the driver, the driver needs to take
> > special care that the work it submits cannot prevent the run_job and free_job
> > work items from running.
> > 
> > For instance, if it's a single threaded workqueue and the driver submits work
> > that allocates with GFP_KERNEL, this is a deadlock condition.
> > 
> > More generally, if the driver submits N work items that, for instance, allocate
> > with GFP_KERNEL, it's also a deadlock condition if N == max_active.
> 
> Can we prime lockdep on scheduler init? e.g.
> 
> fs_reclaim_acquire(GFP_KERNEL);
> workqueue_lockdep_acquire();
> workqueue_lockdep_release();
> fs_reclaim_release(GFP_KERNEL);

+1, this should do the trick. Well, strictly we'd need GFP_NORECLAIM, so
ideally the one below for dma_fence.

> In addition to documentation, this would prevent workqueues from being
> used that allocate with GFP_KERNEL.
> 
> Maybe we could use dma_fence signalling annotations instead of
> fs_reclaim_acquire, but at one point those gave Xe false lockdep
> positives, so we use fs_reclaim_acquire in similar cases. Maybe that has
> been fixed though.

Yeah, the annotation is busted because it doesn't use the right recursive
version. I thought Thomas Hellstrom had a patch once, but it hasn't landed
yet.
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch