If a job from a ready entity needs more credits than are currently
available, drm_sched_run_job_work() (a work item) simply returns and
doesn't reschedule itself. The scheduler is only woken up again when the
next job gets pushed with drm_sched_entity_push_job().

If someone submits a job that needs too many credits and doesn't submit
more jobs afterwards, this would lead to the scheduler never pulling the
too-expensive job, effectively hanging forever.

Document this problem as a FIXME.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 492e8af639db..eaf8d17b2a66 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1237,6 +1237,16 @@ static void drm_sched_run_job_work(struct work_struct *w)
/* Find entity with a ready job */
entity = drm_sched_select_entity(sched);
+ /*
+ * FIXME:
+ * The entity can be NULL when the scheduler currently has no capacity
+ * (credits) for more jobs. If that happens, the work item terminates
+ * itself here, without rescheduling itself.
+ *
+ * It only gets started again in drm_sched_entity_push_job(). IOW, the
+ * scheduler might hang forever if a job that needs too many credits
+ * gets submitted to an entity and no other, subsequent jobs are.
+ */
if (!entity) {
/*
* Either no more work to do, or the next ready job needs more
--
2.49.0
On Tue, Oct 28, 2025 at 02:46:02PM +0100, Philipp Stanner wrote:
> If a job from a ready entity needs more credits than are currently
> available, drm_sched_run_job_work() (a work item) simply returns and
> doesn't reschedule itself. The scheduler is only woken up again when the
> next job gets pushed with drm_sched_entity_push_job().
>
> If someone submits a job that needs too many credits and doesn't submit
> more jobs afterwards, this would lead to the scheduler never pulling the
> too-expensive job, effectively hanging forever.
>
> Document this problem as a FIXME.
>
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 492e8af639db..eaf8d17b2a66 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1237,6 +1237,16 @@ static void drm_sched_run_job_work(struct work_struct *w)
>
> /* Find entity with a ready job */
> entity = drm_sched_select_entity(sched);
> + /*
> + * FIXME:
> + * The entity can be NULL when the scheduler currently has no capacity
> + * (credits) for more jobs. If that happens, the work item terminates
> + * itself here, without rescheduling itself.
> + *
> + * It only gets started again in drm_sched_entity_push_job(). IOW, the
> + * scheduler might hang forever if a job that needs too many credits
> + * gets submitted to an entity and no other, subsequent jobs are.
> + */

drm_sched_job_done frees the credits, which triggers
drm_sched_free_job_work, and that in turn triggers
drm_sched_run_job_work.

This flow could be refined a bit, but I do believe it works, unless I'm
missing something. I'm pretty sure we have tests in Xe that exhaust the
credits, though it might be continuous submissions; I'd have to check.

Matt

> if (!entity) {
> /*
> * Either no more work to do, or the next ready job needs more
> --
> 2.49.0
>
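
The wake-up chain described in the reply above can be illustrated with a small userspace toy model. This is only a sketch under stated assumptions: every toy_* name is hypothetical, the entity is reduced to a single pending slot, and the "work items" are plain function calls rather than queued work. It is not the code in sched_main.c; it only mirrors the claimed ordering (job done returns credits, then the free-job path runs, then the run-job path is kicked again).

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these types or helpers exist in the kernel. */
struct toy_job {
	unsigned int credits;	/* credits this job needs */
	bool running;		/* true once handed to the "hardware" */
};

struct toy_sched {
	unsigned int credit_limit;	/* total capacity */
	unsigned int credit_count;	/* credits currently in flight */
	struct toy_job *pending;	/* single-slot stand-in for an entity */
};

/* Stand-in for the run-job work item: pull the pending job if it fits. */
static void toy_run_job_work(struct toy_sched *s)
{
	struct toy_job *job = s->pending;

	if (!job)
		return;
	/* Not enough free credits: return without rescheduling. */
	if (job->credits > s->credit_limit - s->credit_count)
		return;

	s->pending = NULL;
	s->credit_count += job->credits;
	job->running = true;
}

/* Stand-in for the free-job work item: it kicks the run-job path. */
static void toy_free_job_work(struct toy_sched *s)
{
	toy_run_job_work(s);
}

/* Stand-in for job completion: return credits, then trigger free-job. */
static void toy_job_done(struct toy_sched *s, struct toy_job *job)
{
	job->running = false;
	s->credit_count -= job->credits;
	toy_free_job_work(s);
}

/* Stand-in for pushing a job to the entity, which wakes the scheduler. */
static void toy_push_job(struct toy_sched *s, struct toy_job *job)
{
	s->pending = job;
	toy_run_job_work(s);
}

/*
 * Scenario: job A occupies 3 of 4 credits, job B (3 credits) is pushed
 * and cannot run yet, and nothing else is submitted afterwards.
 * Returns true if B was stalled first and picked up after A completed.
 */
static bool toy_scenario(void)
{
	struct toy_sched s = { .credit_limit = 4 };
	struct toy_job a = { .credits = 3 };
	struct toy_job b = { .credits = 3 };
	bool stalled;

	toy_push_job(&s, &a);
	toy_push_job(&s, &b);
	stalled = a.running && !b.running;

	toy_job_done(&s, &a);
	return stalled && b.running;
}
```

In this model, B is not stuck even though no further job is pushed, because completing A re-enters the run-job path. If toy_job_done() did not call toy_free_job_work(), B would stay pending, which is the hang the FIXME worries about.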
On Tue, 2025-10-28 at 12:43 -0700, Matthew Brost wrote:
> On Tue, Oct 28, 2025 at 02:46:02PM +0100, Philipp Stanner wrote:
> > If a job from a ready entity needs more credits than are currently
> > available, drm_sched_run_job_work() (a work item) simply returns and
> > doesn't reschedule itself. The scheduler is only woken up again when the
> > next job gets pushed with drm_sched_entity_push_job().
> >
> > If someone submits a job that needs too many credits and doesn't submit
> > more jobs afterwards, this would lead to the scheduler never pulling the
> > too-expensive job, effectively hanging forever.
> >
> > Document this problem as a FIXME.
> >
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> > drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 492e8af639db..eaf8d17b2a66 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1237,6 +1237,16 @@ static void drm_sched_run_job_work(struct work_struct *w)
> >
> > /* Find entity with a ready job */
> > entity = drm_sched_select_entity(sched);
> > + /*
> > + * FIXME:
> > + * The entity can be NULL when the scheduler currently has no capacity
> > + * (credits) for more jobs. If that happens, the work item terminates
> > + * itself here, without rescheduling itself.
> > + *
> > + * It only gets started again in drm_sched_entity_push_job(). IOW, the
> > + * scheduler might hang forever if a job that needs too many credits
> > + * gets submitted to an entity and no other, subsequent jobs are.
> > + */
>
> drm_sched_job_done frees the credits, which triggers
> drm_sched_free_job_work, and that in turn triggers
> drm_sched_run_job_work.

Sounds correct to me. We can still merge #1, though, for a bit more
clarity.

P.