Problem: If prev(last_scheduled) was already signaled I encountred a
memory leak in drm_sched_entity_fini. This is because the
prev(last_scheduled) fence is not free properly.
Fix: Balance the prev(last_scheduled) fence refcnt when
dma_fence_add_callback failed.
Signed-off-by: qianyi liu <liuqianyi125@gmail.com>
---
drivers/gpu/drm/scheduler/sched_entity.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 69bcf0e99d57..1c0c14bcf726 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -259,9 +259,12 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
 		struct drm_sched_fence *s_fence = job->s_fence;
 
 		dma_fence_get(&s_fence->finished);
-		if (!prev || dma_fence_add_callback(prev, &job->finish_cb,
-						    drm_sched_entity_kill_jobs_cb))
+		if (!prev ||
+		    dma_fence_add_callback(prev, &job->finish_cb,
+					   drm_sched_entity_kill_jobs_cb)) {
+			dma_fence_put(prev);
 			drm_sched_entity_kill_jobs_cb(NULL, &job->finish_cb);
+		}
 
 		prev = &s_fence->finished;
 	}
--
2.25.1
> Problem: If prev(last_scheduled) was already signaled I encountred a
signalled? encountered?
> memory leak in drm_sched_entity_fini. This is because the
> prev(last_scheduled) fence is not free properly.
freed?
> Fix: Balance the prev(last_scheduled) fence refcnt when
…
reference count?
Would a summary phrase like “Fix memory leak when last_scheduled signalled”
be more appropriate?
What do you think about adding tags (like “Fixes” and “Cc”) accordingly?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.14-rc4#n145
Regards,
Markus
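For illustration, such a tag block sits directly above the Signed-off-by
in the commit message; the sha and subject below are placeholders, not
the actual culprit commit:

	Fixes: 0123456789ab ("drm/sched: <subject of the commit that introduced the leak>")
	Cc: stable@vger.kernel.org
	Signed-off-by: qianyi liu <liuqianyi125@gmail.com>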
On Tue, 2025-02-25 at 15:00 +0100, Markus Elfring wrote:
> > Problem: If prev(last_scheduled) was already signaled I encountred a
> 
> signalled? encountered?
> 
> > memory leak in drm_sched_entity_fini. This is because the
> > prev(last_scheduled) fence is not free properly.
> 
> freed?
> 
> > Fix: Balance the prev(last_scheduled) fence refcnt when
> …
> reference count?
> 
> Would a summary phrase like “Fix memory leak when last_scheduled
> signalled” be more appropriate?
> 
> What do you think about adding tags (like “Fixes” and “Cc”)
> accordingly?
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.14-rc4#n145

He has already addressed that in v2.
https://lore.kernel.org/all/20250225094125.224580-1-liuqianyi125@gmail.com/

Besides, Matthew and I made those remarks already here in v1.

P.

> Regards,
> Markus
Hello,
subject line: please write "drm/sched" instead of "drm/scheduler". It
has become the norm
On Fri, 2025-02-21 at 14:27 +0800, qianyi liu wrote:
> Problem: If prev(last_scheduled) was already signaled I encountred a
prev(last_scheduled) almost reads like a function call. Maybe write
"prev / last_scheduled"?
> memory leak in drm_sched_entity_fini. This is because the
> prev(last_scheduled) fence is not free properly.
s/free/freed
>
> Fix: Balance the prev(last_scheduled) fence refcnt when
> dma_fence_add_callback failed.
>
> Signed-off-by: qianyi liu <liuqianyi125@gmail.com>
> ---
> drivers/gpu/drm/scheduler/sched_entity.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 69bcf0e99d57..1c0c14bcf726 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -259,9 +259,12 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>  		struct drm_sched_fence *s_fence = job->s_fence;
>  
>  		dma_fence_get(&s_fence->finished);
> -		if (!prev || dma_fence_add_callback(prev, &job->finish_cb,
> -						    drm_sched_entity_kill_jobs_cb))
> +		if (!prev ||
> +		    dma_fence_add_callback(prev, &job->finish_cb,
> +					   drm_sched_entity_kill_jobs_cb)) {
> +			dma_fence_put(prev);
But now the fence will also be put when prev == NULL. Is that
intentional? It doesn't seem correct to me from looking at the commit
message, which states "Balance […] refcnt when dma_fence_add_callback
failed"
It wasn't immediately clear to me which dma_fence_get() your new
dma_fence_put() balances. Can you elaborate on that or maybe write a
comment?
But also be handy of could share the kmemleak trace.
Thanks
P.
>  			drm_sched_entity_kill_jobs_cb(NULL, &job->finish_cb);
> +		}
>  
>  		prev = &s_fence->finished;
>  	}
On Mon, 2025-02-24 at 10:52 +0100, Philipp Stanner wrote:
> Hello,
>
> subject line: please write "drm/sched" instead of "drm/scheduler". It
> has become the norm
>
> On Fri, 2025-02-21 at 14:27 +0800, qianyi liu wrote:
> > Problem: If prev(last_scheduled) was already signaled I encountred
> > a
>
> prev(last_scheduled) almost reads like a function call. Maybe write
> "prev / last_scheduled"?
>
> > memory leak in drm_sched_entity_fini. This is because the
> > prev(last_scheduled) fence is not free properly.
>
> s/free/freed
>
> >
> > Fix: Balance the prev(last_scheduled) fence refcnt when
> > dma_fence_add_callback failed.
Oh, and importantly, I forgot:
Since this is clearly a bug fix, it needs a "Fixes: " tag and put the
stable kernel on Cc.
P.
> >
> > Signed-off-by: qianyi liu <liuqianyi125@gmail.com>
> > ---
> > drivers/gpu/drm/scheduler/sched_entity.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > index 69bcf0e99d57..1c0c14bcf726 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -259,9 +259,12 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
> >  		struct drm_sched_fence *s_fence = job->s_fence;
> >  
> >  		dma_fence_get(&s_fence->finished);
> > -		if (!prev || dma_fence_add_callback(prev, &job->finish_cb,
> > -						    drm_sched_entity_kill_jobs_cb))
> > +		if (!prev ||
> > +		    dma_fence_add_callback(prev, &job->finish_cb,
> > +					   drm_sched_entity_kill_jobs_cb)) {
> > +			dma_fence_put(prev);
>
> But now the fence will also be put when prev == NULL. Is that
> intentional? It doesn't seem correct to me from looking at the commit
> message, which states "Balance […] refcnt when dma_fence_add_callback
> failed"
>
> It wasn't immediately clear to me which dma_fence_get() your new
> dma_fence_put() balances. Can you elaborate on that or maybe write a
> comment?
>
> But also be handy of could share the kmemleak trace.
>
>
> Thanks
> P.
>
> >  			drm_sched_entity_kill_jobs_cb(NULL, &job->finish_cb);
> > +		}
> >  
> >  		prev = &s_fence->finished;
> >  	}
>
> Oh, and importantly, I forgot:
> 
> Since this is clearly a bug fix, it needs a "Fixes: " tag and put the
> stable kernel on Cc.

OK, thanks for the reminder.

Qianyi.
On Mon, Feb 24, 2025 at 10:52:56AM +0100, Philipp Stanner wrote:
> Hello,
>
> subject line: please write "drm/sched" instead of "drm/scheduler". It
> has become the norm
>
> On Fri, 2025-02-21 at 14:27 +0800, qianyi liu wrote:
> > Problem: If prev(last_scheduled) was already signaled I encountred a
>
> prev(last_scheduled) almost reads like a function call. Maybe write
> "prev / last_scheduled"?
>
> > memory leak in drm_sched_entity_fini. This is because the
> > prev(last_scheduled) fence is not free properly.
>
> s/free/freed
>
> >
> > Fix: Balance the prev(last_scheduled) fence refcnt when
> > dma_fence_add_callback failed.
> >
> > Signed-off-by: qianyi liu <liuqianyi125@gmail.com>
> > ---
> > drivers/gpu/drm/scheduler/sched_entity.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > index 69bcf0e99d57..1c0c14bcf726 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -259,9 +259,12 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
> >  		struct drm_sched_fence *s_fence = job->s_fence;
> >  
> >  		dma_fence_get(&s_fence->finished);
> > -		if (!prev || dma_fence_add_callback(prev, &job->finish_cb,
> > -						    drm_sched_entity_kill_jobs_cb))
> > +		if (!prev ||
> > +		    dma_fence_add_callback(prev, &job->finish_cb,
> > +					   drm_sched_entity_kill_jobs_cb)) {
> > +			dma_fence_put(prev);
>
> But now the fence will also be put when prev == NULL. Is that
dma_fence_put(NULL) is a NOP [1].
[1] https://elixir.bootlin.com/linux/v6.13.4/source/include/linux/dma-fence.h#L290
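For reference, that helper is just a NULL-tolerant kref wrapper. A
sketch from the linked header (modulo version drift):

	/* From include/linux/dma-fence.h: dropping a NULL fence is safe. */
	static inline void dma_fence_put(struct dma_fence *fence)
	{
		if (fence)
			kref_put(&fence->refcount, dma_fence_release);
	}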
> intentional? It doesn't seem correct to me from looking at the commit
> message, which states "Balance […] refcnt when dma_fence_add_callback
> failed"
>
> It wasn't immediately clear to me which dma_fence_get() your new
> dma_fence_put() balances. Can you elaborate on that or maybe write a
> comment?
drm_sched_entity_kill_jobs_cb(prev, ...) - Calls put 'prev'
drm_sched_entity_kill_jobs_cb(NULL, ...) - Does not.
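In other words, an abridged sketch of the callback (the dependency walk
and the work-queue hand-off are elided here, so this is not the full
upstream body):

	static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
						  struct dma_fence_cb *cb)
	{
		struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
							 finish_cb);

		/* When the fence framework invokes this, f == prev, so the
		 * reference held on prev is dropped here. The direct call
		 * passes f == NULL, and dma_fence_put(NULL) is a NOP -
		 * hence the leaked reference. */
		dma_fence_put(f);

		/* ... walk job->dependencies, then queue the job for freeing ... */
	}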
>
> But also be handy of could share the kmemleak trace.
>
Agreed, a kmemleak trace would be good to include in the commit
message, but the patch looks correct to me.
I also think the commit message needs a bit of work, as Philipp
suggests.
Matt
>
> Thanks
> P.
>
> >  			drm_sched_entity_kill_jobs_cb(NULL, &job->finish_cb);
> > +		}
> >  
> >  		prev = &s_fence->finished;
> >  	}
>
Hello Matt,

> > But also be handy of could share the kmemleak trace.
> 
> Agreed, a kmemleak trace would be good to include in the commit
> message, but the patch looks correct to me.

Unfortunately, the trace involves private code of our driver, which has
not officially entered the community yet, though it is already on the
way. So I cannot provide the trace information at this time. But I will
update the commit message in v2.

Best Regards.
QianYi.
Hello Philipp,

Thank you for your patient reply. Let's first clarify the issue and
send a new patch if necessary.

As soon as it enters the drm_sched_entity_kill function, the
entity->last_scheduled reference count is incremented by 1. If there
are still jobs in the current entity, it will enter the while loop;
assume there is only one job left. If entity->last_scheduled has
already been signaled, it will enter drm_sched_entity_kill_jobs_cb,
but because NULL is passed in, the last_scheduled reference count
will not be correctly reduced by 1.

Because the prev pointer has been updated to &s_fence->finished, the
dma_fence_put in the last line only reduces the reference count of
s_fence->finished. The reference count of entity->last_scheduled was
not reduced by 1, causing a memory leak.

We should subtract 1 from the reference count of prev when
dma_fence_add_callback fails; that is what "balance" means here.

Best Regards.
QianYi.
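In code form, the flow described above looks roughly like this
(condensed sketch of drm_sched_entity_kill() with the proposed fix
applied; the job-queue pop and RCU details are simplified):

	prev = rcu_dereference_check(entity->last_scheduled, true);
	dma_fence_get(prev);				/* +1 on last_scheduled */
	while ((job = /* pop the next queued job */)) {
		struct drm_sched_fence *s_fence = job->s_fence;

		dma_fence_get(&s_fence->finished);	/* +1, becomes the next prev */
		if (!prev ||
		    dma_fence_add_callback(prev, &job->finish_cb,
					   drm_sched_entity_kill_jobs_cb)) {
			/* prev is NULL or already signaled: the callback will
			 * never run, so drop prev's reference here before
			 * entering the teardown path with NULL. */
			dma_fence_put(prev);
			drm_sched_entity_kill_jobs_cb(NULL, &job->finish_cb);
		}
		prev = &s_fence->finished;		/* ownership of the +1 moves to prev */
	}
	dma_fence_put(prev);				/* drop the final reference */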
On Tue, 2025-02-25 at 11:12 +0800, Qianyi Liu wrote:
> Hello Philipp,
> 
> Thank you for your patient reply. Let's first clarify the issue and
> send a new patch if necessary.
> 
> As soon as it enters the drm_sched_entity_kill function, the
> entity->last_scheduled reference count is incremented by 1. If there
> are still jobs in the current entity, it will enter the while loop;
> assume there is only one job left. If entity->last_scheduled has
> already been signaled, it will enter drm_sched_entity_kill_jobs_cb,
> but because NULL is passed in, the last_scheduled reference count
> will not be correctly reduced by 1.
> 
> Because the prev pointer has been updated to &s_fence->finished, the
> dma_fence_put in the last line only reduces the reference count of
> s_fence->finished. The reference count of entity->last_scheduled was
> not reduced by 1, causing a memory leak.
> 
> We should subtract 1 from the reference count of prev when
> dma_fence_add_callback fails; that is what "balance" means here.
> 
> Best Regards.
> QianYi.

OK, thanks for the clarification.

I think, next to the other feedback, it would be good to have a brief
version of the above explanation in your commit message, since it
clearly describes the problem and the proposed solution.

Please address the feedback by Matt and myself in a v2.

(And btw., please don't remove the original e-mail content when
answering – that's uncommon on-list. Only when a thread gets huge does
one trim the parts not addressed by the answer from the quoted
content. ;)

Thanks,
P.
On Mon, 2025-02-24 at 10:52 +0100, Philipp Stanner wrote:
> Hello,
>
> subject line: please write "drm/sched" instead of "drm/scheduler". It
> has become the norm
>
> On Fri, 2025-02-21 at 14:27 +0800, qianyi liu wrote:
> > Problem: If prev(last_scheduled) was already signaled I encountred
> > a
>
> prev(last_scheduled) almost reads like a function call. Maybe write
> "prev / last_scheduled"?
>
> > memory leak in drm_sched_entity_fini. This is because the
> > prev(last_scheduled) fence is not free properly.
>
> s/free/freed
>
> >
> > Fix: Balance the prev(last_scheduled) fence refcnt when
> > dma_fence_add_callback failed.
> >
> > Signed-off-by: qianyi liu <liuqianyi125@gmail.com>
> > ---
> > drivers/gpu/drm/scheduler/sched_entity.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > index 69bcf0e99d57..1c0c14bcf726 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -259,9 +259,12 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
> >  		struct drm_sched_fence *s_fence = job->s_fence;
> >  
> >  		dma_fence_get(&s_fence->finished);
> > -		if (!prev || dma_fence_add_callback(prev, &job->finish_cb,
> > -						    drm_sched_entity_kill_jobs_cb))
> > +		if (!prev ||
> > +		    dma_fence_add_callback(prev, &job->finish_cb,
> > +					   drm_sched_entity_kill_jobs_cb)) {
> > +			dma_fence_put(prev);
>
> But now the fence will also be put when prev == NULL. Is that
> intentional? It doesn't seem correct to me from looking at the commit
> message, which states "Balance […] refcnt when dma_fence_add_callback
> failed"
>
> It wasn't immediately clear to me which dma_fence_get() your new
> dma_fence_put() balances. Can you elaborate on that or maybe write a
> comment?
>
> But also be handy of could share the kmemleak trace.
Argh.
-> "It would also be handy if you could share the kmemleak trace"
I should drink less…
P.
>
>
> Thanks
> P.
>
> >  			drm_sched_entity_kill_jobs_cb(NULL, &job->finish_cb);
> > +		}
> >  
> >  		prev = &s_fence->finished;
> >  	}
>