drm/sched: Document potential forever-hang

[PATCH 2/2] drm/sched: Add FIXME detailing potential hang

Posted by Philipp Stanner 3 months, 1 week ago

If a job from a ready entity needs more credits than are currently
available, drm_sched_run_job_work() (a work item) simply returns and
doesn't reschedule itself. The scheduler is only woken up again when the
next job gets pushed with drm_sched_entity_push_job().

If someone submits a job that needs too many credits and doesn't submit
more jobs afterwards, this would lead to the scheduler never pulling the
too-expensive job, effectively hanging forever.

Document this problem as a FIXME.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 492e8af639db..eaf8d17b2a66 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1237,6 +1237,16 @@ static void drm_sched_run_job_work(struct work_struct *w)
 
 	/* Find entity with a ready job */
 	entity = drm_sched_select_entity(sched);
+	/*
+	 * FIXME:
+	 * The entity can be NULL when the scheduler currently has no capacity
+	 * (credits) for more jobs. If that happens, the work item terminates
+	 * itself here, without rescheduling itself.
+	 *
+	 * It only gets started again in drm_sched_entity_push_job(). IOW, the
+	 * scheduler might hang forever if a job that needs too many credits
+	 * gets submitted to an entity and no other, subsequent jobs are.
+	 */
 	if (!entity) {
 		/*
 		 * Either no more work to do, or the next ready job needs more
-- 
2.49.0

Re: [PATCH 2/2] drm/sched: Add FIXME detailing potential hang

Posted by Matthew Brost 3 months, 1 week ago

On Tue, Oct 28, 2025 at 02:46:02PM +0100, Philipp Stanner wrote:
> If a job from a ready entity needs more credits than are currently
> available, drm_sched_run_job_work() (a work item) simply returns and
> doesn't reschedule itself. The scheduler is only woken up again when the
> next job gets pushed with drm_sched_entity_push_job().
> 
> If someone submits a job that needs too many credits and doesn't submit
> more jobs afterwards, this would lead to the scheduler never pulling the
> too-expensive job, effectively hanging forever.
> 
> Document this problem as a FIXME.
> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 492e8af639db..eaf8d17b2a66 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1237,6 +1237,16 @@ static void drm_sched_run_job_work(struct work_struct *w)
>  
>  	/* Find entity with a ready job */
>  	entity = drm_sched_select_entity(sched);
> +	/*
> +	 * FIXME:
> +	 * The entity can be NULL when the scheduler currently has no capacity
> +	 * (credits) for more jobs. If that happens, the work item terminates
> +	 * itself here, without rescheduling itself.
> +	 *
> +	 * It only gets started again in drm_sched_entity_push_job(). IOW, the
> +	 * scheduler might hang forever if a job that needs too many credits
> +	 * gets submitted to an entity and no other, subsequent jobs are.
> +	 */

drm_sched_job_done frees the credits, which triggers
drm_sched_free_job_work, and that in turn triggers
drm_sched_run_job_work.

This flow could be refined a bit, but I do believe it works, unless I'm
missing something. I'm pretty sure we have tests in Xe that exhaust the
credits, though it might be continuous submissions; I'd have to check.

Matt 

>  	if (!entity) {
>  		/*
>  		 * Either no more work to do, or the next ready job needs more
> -- 
> 2.49.0
>

Re: [PATCH 2/2] drm/sched: Add FIXME detailing potential hang

Posted by Philipp Stanner 3 months, 1 week ago

On Tue, 2025-10-28 at 12:43 -0700, Matthew Brost wrote:
> On Tue, Oct 28, 2025 at 02:46:02PM +0100, Philipp Stanner wrote:
> > If a job from a ready entity needs more credits than are currently
> > available, drm_sched_run_job_work() (a work item) simply returns and
> > doesn't reschedule itself. The scheduler is only woken up again when the
> > next job gets pushed with drm_sched_entity_push_job().
> > 
> > If someone submits a job that needs too many credits and doesn't submit
> > more jobs afterwards, this would lead to the scheduler never pulling the
> > too-expensive job, effectively hanging forever.
> > 
> > Document this problem as a FIXME.
> > 
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 492e8af639db..eaf8d17b2a66 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1237,6 +1237,16 @@ static void drm_sched_run_job_work(struct work_struct *w)
> >  
> >  	/* Find entity with a ready job */
> >  	entity = drm_sched_select_entity(sched);
> > +	/*
> > +	 * FIXME:
> > +	 * The entity can be NULL when the scheduler currently has no capacity
> > +	 * (credits) for more jobs. If that happens, the work item terminates
> > +	 * itself here, without rescheduling itself.
> > +	 *
> > +	 * It only gets started again in drm_sched_entity_push_job(). IOW, the
> > +	 * scheduler might hang forever if a job that needs too many credits
> > +	 * gets submitted to an entity and no other, subsequent jobs are.
> > +	 */
> 
> drm_sched_job_done frees the credits, which triggers
> drm_sched_free_job_work, and that in turn triggers
> drm_sched_run_job_work.

Sounds correct to me.

We can still merge #1, though, for a bit more clarity.

P.
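
The wake-up chain discussed in this thread can be sketched as a toy
user-space model. This is not kernel code: struct toy_sched and every
toy_*() helper are hypothetical names made up for illustration, and the
single pending-job slot is a simplification. It only assumes what the
thread states: the run-job work returns without requeueing itself when
credits are short, and the completion path releases credits and then
kicks the run-job work again.

```c
#include <stdbool.h>

/* Toy model only; all names here are hypothetical, not the DRM scheduler API. */
struct toy_sched {
	unsigned int credit_limit;    /* total credits available */
	unsigned int credits_in_use;  /* credits held by in-flight jobs */
	unsigned int pending_credits; /* cost of the one queued job, 0 if none */
	bool run_work_queued;         /* stands in for a queued work item */
	unsigned int jobs_started;
};

static void toy_queue_run_work(struct toy_sched *s)
{
	s->run_work_queued = true;
}

/* Mirrors the behaviour described for drm_sched_run_job_work(). */
static void toy_run_job_work(struct toy_sched *s)
{
	s->run_work_queued = false;

	if (!s->pending_credits)
		return; /* nothing to do */

	if (s->credits_in_use + s->pending_credits > s->credit_limit)
		return; /* not enough credits: return without requeueing */

	s->credits_in_use += s->pending_credits;
	s->pending_credits = 0;
	s->jobs_started++;
}

/* Submission path: queue the job and kick the run-job work. */
static void toy_push_job(struct toy_sched *s, unsigned int credits)
{
	s->pending_credits = credits;
	toy_queue_run_work(s);
}

/* Completion path: release the credits, then kick the run-job work again. */
static void toy_job_done(struct toy_sched *s, unsigned int credits)
{
	s->credits_in_use -= credits;
	toy_queue_run_work(s);
}

/* Run queued work until none is left, like a workqueue draining. */
static void toy_flush(struct toy_sched *s)
{
	while (s->run_work_queued)
		toy_run_job_work(s);
}

/* Returns the number of jobs started, or -1 if the stall did not occur. */
int toy_demo(void)
{
	struct toy_sched s = { .credit_limit = 4 };

	toy_push_job(&s, 3);
	toy_flush(&s);          /* first job starts, 3 of 4 credits in use */

	toy_push_job(&s, 2);
	toy_flush(&s);          /* 3 + 2 > 4: second job stays queued */
	if (s.jobs_started != 1)
		return -1;

	toy_job_done(&s, 3);    /* no further push happens from here on */
	toy_flush(&s);          /* completion alone restarts the run work */

	return (int)s.jobs_started;
}
```

In this model the second job is picked up after toy_job_done() without
any further toy_push_job() call, which is the behaviour described in the
reply above. The model says nothing about a job that costs more than
credit_limit on its own.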