[v1] drm/sched: Fix dynamic job-flow control race

[PATCH] drm/sched: Fix dynamic job-flow control race

Posted by Rob Clark 1 year, 4 months ago

From: Rob Clark <robdclark@chromium.org>

Fixes a race condition reported here: https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609

The whole premise of lockless access to a single-producer-single-
consumer queue is that there is just a single producer and single
consumer.  That means we can't call drm_sched_can_queue() (which is
about queueing more work to the hw, not to the spsc queue) from
anywhere other than the consumer (wq).

This call in the producer is just an optimization to avoid scheduling
the consuming worker if it cannot yet queue more work to the hw.  It
is safe to drop this optimization to avoid the race condition.

Suggested-by: Asahi Lina <lina@asahilina.net>
Fixes: a78422e9dff3 ("drm/sched: implement dynamic job-flow control")
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index ab53ab486fe6..1af1dbe757d5 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
 		      struct drm_sched_entity *entity)
 {
-	if (drm_sched_can_queue(sched, entity))
-		drm_sched_run_job_queue(sched);
+	drm_sched_run_job_queue(sched);
 }
 
 /**
-- 
2.46.0

Re: [PATCH] drm/sched: Fix dynamic job-flow control race

Posted by Philipp Stanner 1 year, 4 months ago

On Fri, 2024-09-13 at 09:53 -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> Fixes a race condition reported here:
> https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609

As Danilo suggested before, I'd put this in a Fixes: section at the
bottom and instead have a sentence here detailing what the race
consists of, i.e., who is racing with whom.

P.

> 
> The whole premise of lockless access to a single-producer-single-
> consumer queue is that there is just a single producer and single
> consumer.  That means we can't call drm_sched_can_queue() (which is
> about queueing more work to the hw, not to the spsc queue) from
> anywhere other than the consumer (wq).
> 
> This call in the producer is just an optimization to avoid scheduling
> the consuming worker if it cannot yet queue more work to the hw.  It
> is safe to drop this optimization to avoid the race condition.
> 
> Suggested-by: Asahi Lina <lina@asahilina.net>
> Fixes: a78422e9dff3 ("drm/sched: implement dynamic job-flow control")
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index ab53ab486fe6..1af1dbe757d5 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>  		      struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_can_queue(sched, entity))
> -		drm_sched_run_job_queue(sched);
> +	drm_sched_run_job_queue(sched);
>  }
>  
>  /**

Re: [PATCH] drm/sched: Fix dynamic job-flow control race

Posted by Danilo Krummrich 1 year, 4 months ago

On Mon, Sep 16, 2024 at 10:21:21AM +0200, Philipp Stanner wrote:
> On Fri, 2024-09-13 at 09:53 -0700, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> > 
> > Fixes a race condition reported here:
> > https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609
> 
> As Danilo suggested before, I'd put this in a Fixes: section at the

I think you mean 'Closes'.

Please note that there's a v2 [1] already, which does that.

[1] https://lore.kernel.org/dri-devel/20240913202301.16772-1-robdclark@gmail.com/

> bottom and instead have a sentence here detailing what the race
> consists of, i.e., who is racing with whom.

That's written right below, isn't it?

> 
> P.
> 
> > 
> > The whole premise of lockless access to a single-producer-single-
> > consumer queue is that there is just a single producer and single
> > consumer.  That means we can't call drm_sched_can_queue() (which is
> > about queueing more work to the hw, not to the spsc queue) from
> > anywhere other than the consumer (wq).
> > 
> > This call in the producer is just an optimization to avoid scheduling
> > the consuming worker if it cannot yet queue more work to the hw.  It
> > is safe to drop this optimization to avoid the race condition.
> > 
> > Suggested-by: Asahi Lina <lina@asahilina.net>
> > Fixes: a78422e9dff3 ("drm/sched: implement dynamic job-flow control")
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index ab53ab486fe6..1af1dbe757d5 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
> >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
> >  		      struct drm_sched_entity *entity)
> >  {
> > -	if (drm_sched_can_queue(sched, entity))
> > -		drm_sched_run_job_queue(sched);
> > +	drm_sched_run_job_queue(sched);
> >  }
> >  
> >  /**
>

Re: [PATCH] drm/sched: Fix dynamic job-flow control race

Posted by Danilo Krummrich 1 year, 4 months ago

On Fri, Sep 13, 2024 at 09:53:25AM -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> Fixes a race condition reported here: https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609

Good catch! Please add a 'Closes' tag with this link.

> 
> The whole premise of lockless access to a single-producer-single-
> consumer queue is that there is just a single producer and single
> consumer.  That means we can't call drm_sched_can_queue() (which is
> about queueing more work to the hw, not to the spsc queue) from
> anywhere other than the consumer (wq).
> 
> This call in the producer is just an optimization to avoid scheduling
> the consuming worker if it cannot yet queue more work to the hw.  It
> is safe to drop this optimization to avoid the race condition.
> 
> Suggested-by: Asahi Lina <lina@asahilina.net>
> Fixes: a78422e9dff3 ("drm/sched: implement dynamic job-flow control")

You may want to explicitly CC stable.

> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index ab53ab486fe6..1af1dbe757d5 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>  		      struct drm_sched_entity *entity)

Please also remove the entity parameter. For the other refactoring, I agree it
should be in a different patch.

With that,

Reviewed-by: Danilo Krummrich <dakr@kernel.org>

Thanks for fixing this.

- Danilo

>  {
> -	if (drm_sched_can_queue(sched, entity))
> -		drm_sched_run_job_queue(sched);
> +	drm_sched_run_job_queue(sched);
>  }
>  
>  /**
> -- 
> 2.46.0
>

Re: [PATCH] drm/sched: Fix dynamic job-flow control race

Posted by Michel Dänzer 1 year, 4 months ago

On 2024-09-13 18:53, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> Fixes a race condition reported here: https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609
> 
> The whole premise of lockless access to a single-producer-single-
> consumer queue is that there is just a single producer and single
> consumer.  That means we can't call drm_sched_can_queue() (which is
> about queueing more work to the hw, not to the spsc queue) from
> anywhere other than the consumer (wq).
> 
> This call in the producer is just an optimization to avoid scheduling
> the consuming worker if it cannot yet queue more work to the hw.  It
> is safe to drop this optimization to avoid the race condition.
> 
> Suggested-by: Asahi Lina <lina@asahilina.net>
> Fixes: a78422e9dff3 ("drm/sched: implement dynamic job-flow control")
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index ab53ab486fe6..1af1dbe757d5 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>  		      struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_can_queue(sched, entity))
> -		drm_sched_run_job_queue(sched);
> +	drm_sched_run_job_queue(sched);
>  }
>  
>  /**

The entity parameter is unused now.


-- 
Earthling Michel Dänzer       \        GNOME / Xwayland / Mesa developer
https://redhat.com             \               Libre software enthusiast

Re: [PATCH] drm/sched: Fix dynamic job-flow control race

Posted by Rob Clark 1 year, 4 months ago

On Fri, Sep 13, 2024 at 10:03 AM Michel Dänzer
<michel.daenzer@mailbox.org> wrote:
>
> On 2024-09-13 18:53, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > Fixes a race condition reported here: https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609
> >
> > The whole premise of lockless access to a single-producer-single-
> > consumer queue is that there is just a single producer and single
> > consumer.  That means we can't call drm_sched_can_queue() (which is
> > about queueing more work to the hw, not to the spsc queue) from
> > anywhere other than the consumer (wq).
> >
> > This call in the producer is just an optimization to avoid scheduling
> > the consuming worker if it cannot yet queue more work to the hw.  It
> > is safe to drop this optimization to avoid the race condition.
> >
> > Suggested-by: Asahi Lina <lina@asahilina.net>
> > Fixes: a78422e9dff3 ("drm/sched: implement dynamic job-flow control")
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index ab53ab486fe6..1af1dbe757d5 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
> >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
> >                     struct drm_sched_entity *entity)
> >  {
> > -     if (drm_sched_can_queue(sched, entity))
> > -             drm_sched_run_job_queue(sched);
> > +     drm_sched_run_job_queue(sched);
> >  }
> >
> >  /**
>
> The entity parameter is unused now.

Right.. and we probably should collapse drm_sched_wakeup() and
drm_sched_run_job_queue()..

But this fix needs to be cherry picked back to a bunch of release
branches, so I intentionally avoided refactoring as part of the fix.

BR,
-R

>
>
> --
> Earthling Michel Dänzer       \        GNOME / Xwayland / Mesa developer
> https://redhat.com             \               Libre software enthusiast

Re: [PATCH] drm/sched: Fix dynamic job-flow control race

Posted by Michel Dänzer 1 year, 4 months ago

On 2024-09-13 19:28, Rob Clark wrote:
> On Fri, Sep 13, 2024 at 10:03 AM Michel Dänzer
> <michel.daenzer@mailbox.org> wrote:
>>
>> On 2024-09-13 18:53, Rob Clark wrote:
>>> From: Rob Clark <robdclark@chromium.org>
>>>
>>> Fixes a race condition reported here: https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609
>>>
>>> The whole premise of lockless access to a single-producer-single-
>>> consumer queue is that there is just a single producer and single
>>> consumer.  That means we can't call drm_sched_can_queue() (which is
>>> about queueing more work to the hw, not to the spsc queue) from
>>> anywhere other than the consumer (wq).
>>>
>>> This call in the producer is just an optimization to avoid scheduling
>>> the consuming worker if it cannot yet queue more work to the hw.  It
>>> is safe to drop this optimization to avoid the race condition.
>>>
>>> Suggested-by: Asahi Lina <lina@asahilina.net>
>>> Fixes: a78422e9dff3 ("drm/sched: implement dynamic job-flow control")
>>> Signed-off-by: Rob Clark <robdclark@chromium.org>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 3 +--
>>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index ab53ab486fe6..1af1dbe757d5 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>>>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>>>                     struct drm_sched_entity *entity)
>>>  {
>>> -     if (drm_sched_can_queue(sched, entity))
>>> -             drm_sched_run_job_queue(sched);
>>> +     drm_sched_run_job_queue(sched);
>>>  }
>>>
>>>  /**
>>
>> The entity parameter is unused now.
> 
> Right.. and we probably should collapse drm_sched_wakeup() and
> drm_sched_run_job_queue()..

I looked into that as well, seems fine to leave them separate for now though.


> But this fix needs to be cherry picked back to a bunch of release
> branches, so I intentionally avoided refactoring as part of the fix.

Fixing up the drm_sched_wakeup caller(s) when backporting doesn't seem like a big deal.


-- 
Earthling Michel Dänzer       \        GNOME / Xwayland / Mesa developer
https://redhat.com             \               Libre software enthusiast