drivers/gpu/drm/scheduler/sched_main.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+)
In drm_sched_fini() all entities are marked as stopped - without taking
the appropriate lock, because that would deadlock. That means that
drm_sched_fini() and drm_sched_entity_push_job() can race against each
other.
This should most likely be fixed by establishing the rule that all
entities associated with a scheduler must be torn down first. Then,
however, the locking should be removed from drm_sched_fini() alltogether
with an appropriate comment.
Reported-by: James Flowers <bold.zone2373@fastmail.com>
Link: https://lore.kernel.org/dri-devel/20250720235748.2798-1-bold.zone2373@fastmail.com/
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
Changes in v2:
- Fix typo.
---
drivers/gpu/drm/scheduler/sched_main.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 5a550fd76bf0..46119aacb809 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1424,6 +1424,22 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
* Prevents reinsertion and marks job_queue as idle,
* it will be removed from the rq in drm_sched_entity_fini()
* eventually
+ *
+ * FIXME:
+ * This lacks the proper spin_lock(&s_entity->lock) and
+ * is, therefore, a race condition. Most notably, it
+ * can race with drm_sched_entity_push_job(). The lock
+ * cannot be taken here, however, because this would
+ * lead to lock inversion -> deadlock.
+ *
+ * The best solution probably is to enforce the life
+ * time rule of all entities having to be torn down
+ * before their scheduler. Then, however, locking could
+ * be dropped alltogether from this function.
+ *
+ * For now, this remains a potential race in all
+ * drivers that keep entities alive for longer than
+ * the scheduler.
*/
s_entity->stopped = true;
spin_unlock(&rq->lock);
--
2.49.0
On Wed Aug 13, 2025 at 10:56 AM CEST, Philipp Stanner wrote: > In drm_sched_fini() all entities are marked as stopped - without taking > the appropriate lock, because that would deadlock. That means that > drm_sched_fini() and drm_sched_entity_push_job() can race against each > other. > > This should most likely be fixed by establishing the rule that all > entities associated with a scheduler must be torn down first. Then, > however, the locking should be removed from drm_sched_fini() alltogether > with an appropriate comment. > > Reported-by: James Flowers <bold.zone2373@fastmail.com> > Link: https://lore.kernel.org/dri-devel/20250720235748.2798-1-bold.zone2373@fastmail.com/ > Signed-off-by: Philipp Stanner <phasta@kernel.org> > --- > Changes in v2: > - Fix typo. > --- > drivers/gpu/drm/scheduler/sched_main.c | 16 ++++++++++++++++ > 1 file changed, 16 insertions(+) > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > index 5a550fd76bf0..46119aacb809 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -1424,6 +1424,22 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched) > * Prevents reinsertion and marks job_queue as idle, > * it will be removed from the rq in drm_sched_entity_fini() > * eventually > + * > + * FIXME: > + * This lacks the proper spin_lock(&s_entity->lock) and > + * is, therefore, a race condition. Most notably, it > + * can race with drm_sched_entity_push_job(). The lock > + * cannot be taken here, however, because this would > + * lead to lock inversion -> deadlock. > + * > + * The best solution probably is to enforce the life > + * time rule of all entities having to be torn down > + * before their scheduler. Then, however, locking could > + * be dropped alltogether from this function. "Enforce the rule" is correct, since factually it's there, as a dependency in the code. Do we know which drivers violate this lifetime rule? @Christian: What about amdgpu (for which the below was added to begin with)? > + * For now, this remains a potential race in all > + * drivers that keep entities alive for longer than > + * the scheduler. > */ > s_entity->stopped = true; > spin_unlock(&rq->lock); > -- > 2.49.0
On Wed, 2025-08-13 at 14:58 +0200, Danilo Krummrich wrote: > On Wed Aug 13, 2025 at 10:56 AM CEST, Philipp Stanner wrote: > > In drm_sched_fini() all entities are marked as stopped - without taking > > the appropriate lock, because that would deadlock. That means that > > drm_sched_fini() and drm_sched_entity_push_job() can race against each > > other. > > > > This should most likely be fixed by establishing the rule that all > > entities associated with a scheduler must be torn down first. Then, > > however, the locking should be removed from drm_sched_fini() alltogether > > with an appropriate comment. > > > > Reported-by: James Flowers <bold.zone2373@fastmail.com> > > Link: https://lore.kernel.org/dri-devel/20250720235748.2798-1-bold.zone2373@fastmail.com/ > > Signed-off-by: Philipp Stanner <phasta@kernel.org> Applied to drm-misc-next > > --- > > Changes in v2: > > - Fix typo. > > --- > > drivers/gpu/drm/scheduler/sched_main.c | 16 ++++++++++++++++ > > 1 file changed, 16 insertions(+) > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > index 5a550fd76bf0..46119aacb809 100644 > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > @@ -1424,6 +1424,22 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched) > > * Prevents reinsertion and marks job_queue as idle, > > * it will be removed from the rq in drm_sched_entity_fini() > > * eventually > > + * > > + * FIXME: > > + * This lacks the proper spin_lock(&s_entity->lock) and > > + * is, therefore, a race condition. Most notably, it > > + * can race with drm_sched_entity_push_job(). The lock > > + * cannot be taken here, however, because this would > > + * lead to lock inversion -> deadlock. > > + * > > + * The best solution probably is to enforce the life > > + * time rule of all entities having to be torn down > > + * before their scheduler. Then, however, locking could > > + * be dropped alltogether from this function. > > "Enforce the rule" is correct, since factually it's there, as a dependency in > the code. > > Do we know which drivers violate this lifetime rule? I've got no idea :( > > @Christian: What about amdgpu (for which the below was added to begin with)? +1 P. > > > + * For now, this remains a potential race in all > > + * drivers that keep entities alive for longer than > > + * the scheduler. > > */ > > s_entity->stopped = true; > > spin_unlock(&rq->lock); > > -- > > 2.49.0 >
On 13.08.25 10:56, Philipp Stanner wrote: > In drm_sched_fini() all entities are marked as stopped - without taking > the appropriate lock, because that would deadlock. That means that > drm_sched_fini() and drm_sched_entity_push_job() can race against each > other. > > This should most likely be fixed by establishing the rule that all > entities associated with a scheduler must be torn down first. Then, > however, the locking should be removed from drm_sched_fini() alltogether > with an appropriate comment. > > Reported-by: James Flowers <bold.zone2373@fastmail.com> > Link: https://lore.kernel.org/dri-devel/20250720235748.2798-1-bold.zone2373@fastmail.com/ > Signed-off-by: Philipp Stanner <phasta@kernel.org> Reviewed-by: Christian König <christian.koenig@amd.com> > --- > Changes in v2: > - Fix typo. > --- > drivers/gpu/drm/scheduler/sched_main.c | 16 ++++++++++++++++ > 1 file changed, 16 insertions(+) > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > index 5a550fd76bf0..46119aacb809 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -1424,6 +1424,22 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched) > * Prevents reinsertion and marks job_queue as idle, > * it will be removed from the rq in drm_sched_entity_fini() > * eventually > + * > + * FIXME: > + * This lacks the proper spin_lock(&s_entity->lock) and > + * is, therefore, a race condition. Most notably, it > + * can race with drm_sched_entity_push_job(). The lock > + * cannot be taken here, however, because this would > + * lead to lock inversion -> deadlock. > + * > + * The best solution probably is to enforce the life > + * time rule of all entities having to be torn down > + * before their scheduler. Then, however, locking could > + * be dropped alltogether from this function. > + * > + * For now, this remains a potential race in all > + * drivers that keep entities alive for longer than > + * the scheduler. > */ > s_entity->stopped = true; > spin_unlock(&rq->lock);
© 2016 - 2025 Red Hat, Inc.