From: Rob Clark <robdclark@chromium.org>
Fixes a race condition reported here: https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609
The whole premise of lockless access to a single-producer-single-
consumer queue is that there is just a single producer and single
consumer. That means we can't call drm_sched_can_queue() (which is
about queueing more work to the hw, not to the spsc queue) from
anywhere other than the consumer (wq).
This call in the producer is just an optimization to avoid scheduling
the consuming worker if it cannot yet queue more work to the hw. It
is safe to drop this optimization to avoid the race condition.
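
To make the hazard concrete, a minimal userspace sketch of the broken
pattern follows. This is a toy illustration only, not the kernel code:
the one-slot queue, struct job, and can_queue() are invented stand-ins
for the real spsc queue, scheduler job, and drm_sched_can_queue().

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdlib.h>

	struct job { int credits; };

	/* One-slot SPSC "queue": NULL means empty. */
	static _Atomic(struct job *) head;

	/* Consumer (the wq worker in the real code): pops and frees jobs. */
	static void *consumer(void *arg)
	{
		(void)arg;
		for (;;) {
			struct job *j = atomic_exchange(&head, NULL); /* pop */
			if (j)
				free(j);	/* job retired */
		}
		return NULL;
	}

	/* BUG: producer-side peek at state only the consumer may touch. */
	static int can_queue(void)
	{
		struct job *j = atomic_load(&head);
		return j && j->credits > 0;	/* may read freed memory */
	}

	int main(void)
	{
		pthread_t t;
		pthread_create(&t, NULL, consumer, NULL);

		for (int i = 0; i < 1000000; i++) {
			struct job *j = malloc(sizeof(*j));

			j->credits = 1;
			while (atomic_load(&head))	/* wait for a free slot */
				;
			atomic_store(&head, j);		/* push */
			if (can_queue())		/* the racy optimization */
				;	/* wake consumer (elided) */
		}
		return 0;	/* process exit tears down the consumer thread */
	}

A run of this sketch under TSan or ASan should eventually flag the
cross-thread access in can_queue(), which is exactly what dropping the
producer-side check avoids.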
Suggested-by: Asahi Lina <lina@asahilina.net>
Fixes: a78422e9dff3 ("drm/sched: implement dynamic job-flow control")
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
drivers/gpu/drm/scheduler/sched_main.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index ab53ab486fe6..1af1dbe757d5 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
 		      struct drm_sched_entity *entity)
 {
-	if (drm_sched_can_queue(sched, entity))
-		drm_sched_run_job_queue(sched);
+	drm_sched_run_job_queue(sched);
 }
 
 /**
--
2.46.0
On Fri, 2024-09-13 at 09:53 -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
>
> Fixes a race condition reported here:
> https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609

As Danilo suggested before, I'd put this in a Fixes: section at the
bottom and instead have a sentence here detailing what the race
consists of, i.e., who is racing with whom.

P.

> [...]
On Mon, Sep 16, 2024 at 10:21:21AM +0200, Philipp Stanner wrote:
> On Fri, 2024-09-13 at 09:53 -0700, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > Fixes a race condition reported here:
> > https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609
>
> As Danilo suggested before, I'd put this in a Fixes: section at the

I think you mean 'Closes'.

Please note that there's a v2 [1] already, which does that.

[1] https://lore.kernel.org/dri-devel/20240913202301.16772-1-robdclark@gmail.com/

> bottom and instead have a sentence here detailing what the race
> consists of, i.e., who is racing with whom.

That's written right below, isn't it?

>
> P.
>
> > [...]
On Fri, Sep 13, 2024 at 09:53:25AM -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
>
> Fixes a race condition reported here:
> https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609

Good catch! Please add a 'Closes' tag with this link.

> [...]
>
> Suggested-by: Asahi Lina <lina@asahilina.net>
> Fixes: a78422e9dff3 ("drm/sched: implement dynamic job-flow control")

You may want to explicitly CC stable.

> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
> [...]
> @@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>  		      struct drm_sched_entity *entity)

Please also remove the entity parameter.

For the other refactoring, I agree it should be in a different patch.

With that,

Reviewed-by: Danilo Krummrich <dakr@kernel.org>

Thanks for fixing this.

- Danilo

> [...]
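
For reference, dropping the parameter as requested would presumably
leave the helper looking like the sketch below (a hypothetical v2
shape, not the code actually posted); its remaining callers would need
the extra argument removed as well:

	/* Sketch: drm_sched_wakeup() with the now-unused entity dropped. */
	void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
	{
		drm_sched_run_job_queue(sched);
	}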
On 2024-09-13 18:53, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
>
> [...]
>
> @@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>  		      struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_can_queue(sched, entity))
> -		drm_sched_run_job_queue(sched);
> +	drm_sched_run_job_queue(sched);
>  }
>
>  /**

The entity parameter is unused now.


-- 
Earthling Michel Dänzer            \        GNOME / Xwayland / Mesa developer
https://redhat.com                  \               Libre software enthusiast
On Fri, Sep 13, 2024 at 10:03 AM Michel Dänzer
<michel.daenzer@mailbox.org> wrote:
>
> On 2024-09-13 18:53, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > [...]
>
> The entity parameter is unused now.

Right.. and we probably should collapse drm_sched_wakeup() and
drm_sched_run_job_queue()..

But this fix needs to be cherry picked back to a bunch of release
branches, so I intentionally avoided refactoring as part of the fix.

BR,
-R

> [...]
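
For illustration, collapsing the two would mean former
drm_sched_wakeup() callers invoke the queueing helper directly. A
hypothetical sketch (the body paraphrases the scheduler's
worker-queueing pattern and may not match the exact upstream code):

	/*
	 * Hypothetical collapse: former drm_sched_wakeup() callers would
	 * call this directly.  It queues the run-job worker unless
	 * submission is paused.
	 */
	void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
	{
		if (!READ_ONCE(sched->pause_submit))
			queue_work(sched->submit_wq, &sched->work_run_job);
	}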
On 2024-09-13 19:28, Rob Clark wrote:
> On Fri, Sep 13, 2024 at 10:03 AM Michel Dänzer
> <michel.daenzer@mailbox.org> wrote:
>>
>> On 2024-09-13 18:53, Rob Clark wrote:
>>> [...]
>>
>> The entity parameter is unused now.
>
> Right.. and we probably should collapse drm_sched_wakeup() and
> drm_sched_run_job_queue()..

I looked into that as well, seems fine to leave them separate for now
though.

> But this fix needs to be cherry picked back to a bunch of release
> branches, so I intentionally avoided refactoring as part of the fix.

Fixing up the drm_sched_wakeup caller(s) when backporting doesn't seem
like a big deal.


-- 
Earthling Michel Dänzer            \        GNOME / Xwayland / Mesa developer
https://redhat.com                  \               Libre software enthusiast