[RFC PATCH 2/3] drm/sched: Taint workqueues with reclaim

Posted by Matthew Brost 3 months, 2 weeks ago
Multiple drivers seemingly do not understand the role of DMA fences in
the reclaim path. As a result, DRM scheduler workqueues, which are part
of the fence signaling path, must not allocate memory. This patch
teaches lockdep to recognize these rules in order to catch driver-side
bugs.

Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index c39f0245e3a9..676484dd3ea3 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1368,6 +1368,9 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 	atomic64_set(&sched->job_id_count, 0);
 	sched->pause_submit = false;
 
+	taint_reclaim_workqueue(sched->submit_wq, GFP_KERNEL);
+	taint_reclaim_workqueue(sched->timeout_wq, GFP_KERNEL);
+
 	sched->ready = true;
 	return 0;
 Out_unroll:
-- 
2.34.1

Re: [RFC PATCH 2/3] drm/sched: Taint workqueues with reclaim
Posted by Philipp Stanner 3 months, 2 weeks ago
On Tue, 2025-10-21 at 14:39 -0700, Matthew Brost wrote:
> Multiple drivers seemingly do not understand the role of DMA fences in
> the reclaim path. As a result, 
> 

result of what? The "role of DMA fences"?

> DRM scheduler workqueues, which are part
> of the fence signaling path, must not allocate memory.
> 

Should be phrased differently. The actual rule here is "The GPU
scheduler's workqueues can be used for memory reclaim. Because of that,
work items on these queues must not allocate memory."

--

In general, I often read in commits or discussions about this or that
"rule", especially "DMA fence rules", but they're often not detailed
very much.


P.

>  This patch
> teaches lockdep to recognize these rules in order to catch driver-side
> bugs.
> 
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Philipp Stanner <phasta@kernel.org>
> Cc: dri-devel@lists.freedesktop.org
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index c39f0245e3a9..676484dd3ea3 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1368,6 +1368,9 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>  	atomic64_set(&sched->job_id_count, 0);
>  	sched->pause_submit = false;
>  
> +	taint_reclaim_workqueue(sched->submit_wq, GFP_KERNEL);
> +	taint_reclaim_workqueue(sched->timeout_wq, GFP_KERNEL);
> +
>  	sched->ready = true;
>  	return 0;
>  Out_unroll:
Re: [RFC PATCH 2/3] drm/sched: Taint workqueues with reclaim
Posted by Matthew Brost 3 months, 2 weeks ago
On Mon, Oct 27, 2025 at 12:03:33PM +0100, Philipp Stanner wrote:
> On Tue, 2025-10-21 at 14:39 -0700, Matthew Brost wrote:
> > Multiple drivers seemingly do not understand the role of DMA fences in
> > the reclaim path. As a result, 
> > 
> 
> result of what? The "role of DMA fences"?
> 
> > DRM scheduler workqueues, which are part
> > of the fence signaling path, must not allocate memory.
> > 
> 
> Should be phrased differently. The actual rule here is "The GPU
> scheduler's workqueues can be used for memory reclaim. Because of that,
> work items on these queues must not allocate memory."
> 

Sure, will reword.

> --
> 
> In general, I often read in commits or discussions about this or that
> "rule", especially "DMA fence rules", but they're often not detailed
> very much.
>

Yes, I kinda assume the audience reviewing any dma-buf or drm-sched
patches really understands the "DMA fence rules", compared to driver
devs, who often do not really get this concept. Tainting the work
queues here will help driver devs avoid mistakes and hopefully, along
the way, get them to the point where they understand the "DMA fence
rules" - it took me a few years to really get these rules.

Matt
 
> 
> P.
> 
> >  This patch
> > teaches lockdep to recognize these rules in order to catch driver-side
> > bugs.
> > 
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: Danilo Krummrich <dakr@kernel.org>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Philipp Stanner <phasta@kernel.org>
> > Cc: dri-devel@lists.freedesktop.org
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index c39f0245e3a9..676484dd3ea3 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1368,6 +1368,9 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
> >  	atomic64_set(&sched->job_id_count, 0);
> >  	sched->pause_submit = false;
> >  
> > +	taint_reclaim_workqueue(sched->submit_wq, GFP_KERNEL);
> > +	taint_reclaim_workqueue(sched->timeout_wq, GFP_KERNEL);
> > +
> >  	sched->ready = true;
> >  	return 0;
> >  Out_unroll:
>