[RFC PATCH 0/6] drm/sched: Avoid memory leaks by canceling job-by-job

Philipp Stanner posted 6 patches 8 months, 1 week ago
drivers/gpu/drm/nouveau/nouveau_fence.c       | 35 +++++----
drivers/gpu/drm/nouveau/nouveau_fence.h       |  7 ++
drivers/gpu/drm/nouveau/nouveau_sched.c       | 35 +++++----
drivers/gpu/drm/nouveau/nouveau_sched.h       |  9 +--
drivers/gpu/drm/nouveau/nouveau_uvmm.c        |  8 +--
drivers/gpu/drm/scheduler/sched_main.c        | 37 ++++++----
.../gpu/drm/scheduler/tests/mock_scheduler.c  | 71 +++++++------------
drivers/gpu/drm/scheduler/tests/sched_tests.h |  4 +-
include/drm/gpu_scheduler.h                   |  9 +++
9 files changed, 115 insertions(+), 100 deletions(-)
[RFC PATCH 0/6] drm/sched: Avoid memory leaks by canceling job-by-job
Posted by Philipp Stanner 8 months, 1 week ago
An alternative version to [1], based on Tvrtko's suggestion from [2].

I tested this for Nouveau. Works.

I'm having, however, bigger problems properly porting the unit tests and
have seen various explosions. In the process I noticed that some things
in the unit tests aren't right and a bit of a larger rework will be
necessary (for example, the timedout job callback must signal the
timedout fence, remove it from the list and so on).

Anyways. Please comment on the general idea.

@Tvrtko: As briefly brainstormed about on IRC, if you'd be willing to
take care of the unit tests patch, I could remove that one (and,
maaaaybe, the warning print patch) from the series and we could merge
this RFC's successor version %N once it's ready. What do you think?

P.

[1] https://lore.kernel.org/dri-devel/20250522082742.148191-2-phasta@kernel.org/
[2] https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin@igalia.com/

Philipp Stanner (6):
  drm/sched: Avoid memory leaks with cancel_job() callback
  drm/sched/tests: Implement cancel_job()
  drm/sched: Warn if pending list is not empty
  drm/nouveau: Make fence container helper usable driver-wide
  drm/nouveau: Add new callback for scheduler teardown
  drm/nouveau: Remove waitque for sched teardown

 drivers/gpu/drm/nouveau/nouveau_fence.c       | 35 +++++----
 drivers/gpu/drm/nouveau/nouveau_fence.h       |  7 ++
 drivers/gpu/drm/nouveau/nouveau_sched.c       | 35 +++++----
 drivers/gpu/drm/nouveau/nouveau_sched.h       |  9 +--
 drivers/gpu/drm/nouveau/nouveau_uvmm.c        |  8 +--
 drivers/gpu/drm/scheduler/sched_main.c        | 37 ++++++----
 .../gpu/drm/scheduler/tests/mock_scheduler.c  | 71 +++++++------------
 drivers/gpu/drm/scheduler/tests/sched_tests.h |  4 +-
 include/drm/gpu_scheduler.h                   |  9 +++
 9 files changed, 115 insertions(+), 100 deletions(-)

-- 
2.49.0
Re: [RFC PATCH 0/6] drm/sched: Avoid memory leaks by canceling job-by-job
Posted by Tvrtko Ursulin 8 months, 1 week ago
On 03/06/2025 10:31, Philipp Stanner wrote:
> An alternative version to [1], based on Tvrtko's suggestion from [2].
> 
> I tested this for Nouveau. Works.
> 
> I'm having, however, bigger problems properly porting the unit tests and
> have seen various explosions. In the process I noticed that some things
> in the unit tests aren't right and a bit of a larger rework will be
> necessary (for example, the timedout job callback must signal the
> timedout fence, remove it from the list and so on).

General approach I follow when implementing any mock component is to 
implement only as much is needed for a test to pass. Only add more and 
rework when a test/functionality is added which requires it.

Specifically for timedout callback signaling I see that I had exactly 
that added in the patch you linked as [2].
  > Anyways. Please comment on the general idea.

I am obviously okay with it. :) Especially now that you verified it 
works well for nouveau.

What I am not that ecstatic about is only getting the Suggested-by 
credit in 1/6. Given it is basically my patch with some cosmetic changes 
like the kernel doc and the cancel loop extracted to a helper.

> @Tvrtko: As briefly brainstormed about on IRC, if you'd be willing to
> take care of the unit tests patch, I could remove that one (and,
> maaaaybe, the warning print patch) from the series and we could merge
> this RFC's successor version %N once it's ready. What do you think?

Okay in principle but the first thing I would suggest you could try is 
to take my unit tests adaptations from [2] verbatim. Benefit of keeping 
everything in one series is more confidence we are merging a solid 
thing. But I can take it on myself as a follow up too if you want.

Regards,

Tvrtko

> 
> P.
> 
> [1] https://lore.kernel.org/dri-devel/20250522082742.148191-2-phasta@kernel.org/
> [2] https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin@igalia.com/
> 
> Philipp Stanner (6):
>    drm/sched: Avoid memory leaks with cancel_job() callback
>    drm/sched/tests: Implement cancel_job()
>    drm/sched: Warn if pending list is not empty
>    drm/nouveau: Make fence container helper usable driver-wide
>    drm/nouveau: Add new callback for scheduler teardown
>    drm/nouveau: Remove waitque for sched teardown
> 
>   drivers/gpu/drm/nouveau/nouveau_fence.c       | 35 +++++----
>   drivers/gpu/drm/nouveau/nouveau_fence.h       |  7 ++
>   drivers/gpu/drm/nouveau/nouveau_sched.c       | 35 +++++----
>   drivers/gpu/drm/nouveau/nouveau_sched.h       |  9 +--
>   drivers/gpu/drm/nouveau/nouveau_uvmm.c        |  8 +--
>   drivers/gpu/drm/scheduler/sched_main.c        | 37 ++++++----
>   .../gpu/drm/scheduler/tests/mock_scheduler.c  | 71 +++++++------------
>   drivers/gpu/drm/scheduler/tests/sched_tests.h |  4 +-
>   include/drm/gpu_scheduler.h                   |  9 +++
>   9 files changed, 115 insertions(+), 100 deletions(-)
>
Re: [RFC PATCH 0/6] drm/sched: Avoid memory leaks by canceling job-by-job
Posted by Philipp Stanner 8 months, 1 week ago
On Tue, 2025-06-03 at 13:27 +0100, Tvrtko Ursulin wrote:
> 
> On 03/06/2025 10:31, Philipp Stanner wrote:
> > An alternative version to [1], based on Tvrtko's suggestion from
> > [2].
> > 
> > I tested this for Nouveau. Works.
> > 
> > I'm having, however, bigger problems properly porting the unit
> > tests and
> > have seen various explosions. In the process I noticed that some
> > things
> > in the unit tests aren't right and a bit of a larger rework will be
> > necessary (for example, the timedout job callback must signal the
> > timedout fence, remove it from the list and so on).
> 
> General approach I follow when implementing any mock component is to 
> implement only as much is needed for a test to pass. Only add more
> and 
> rework when a test/functionality is added which requires it.
> 
> Specifically for timedout callback signaling I see that I had exactly
> that added in the patch you linked as [2].
>   > Anyways. Please comment on the general idea.
> 
> I am obviously okay with it. :) Especially now that you verified it 
> works well for nouveau.
> 
> What I am not that ecstatic about is only getting the Suggested-by 
> credit in 1/6. Given it is basically my patch with some cosmetic
> changes 
> like the kernel doc and the cancel loop extracted to a helper.

Sign the patch off and I give you the authorship if you want.

> 
> > @Tvrtko: As briefly brainstormed about on IRC, if you'd be willing
> > to
> > take care of the unit tests patch, I could remove that one (and,
> > maaaaybe, the warning print patch) from the series and we could
> > merge
> > this RFC's successor version %N once it's ready. What do you think?
> 
> Okay in principle but the first thing I would suggest you could try
> is 
> to take my unit tests adaptations from [2] verbatim. Benefit of
> keeping 
> everything in one series is more confidence we are merging a solid 
> thing. But I can take it on myself as a follow up too if you want.
> 
> Regards,
> 
> Tvrtko
> 
> > 
> > P.
> > 
> > [1]
> > https://lore.kernel.org/dri-devel/20250522082742.148191-2-phasta@kernel.org/
> > [2]
> > https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin@igalia.com/
> > 
> > Philipp Stanner (6):
> >    drm/sched: Avoid memory leaks with cancel_job() callback
> >    drm/sched/tests: Implement cancel_job()
> >    drm/sched: Warn if pending list is not empty
> >    drm/nouveau: Make fence container helper usable driver-wide
> >    drm/nouveau: Add new callback for scheduler teardown
> >    drm/nouveau: Remove waitque for sched teardown
> > 
> >   drivers/gpu/drm/nouveau/nouveau_fence.c       | 35 +++++----
> >   drivers/gpu/drm/nouveau/nouveau_fence.h       |  7 ++
> >   drivers/gpu/drm/nouveau/nouveau_sched.c       | 35 +++++----
> >   drivers/gpu/drm/nouveau/nouveau_sched.h       |  9 +--
> >   drivers/gpu/drm/nouveau/nouveau_uvmm.c        |  8 +--
> >   drivers/gpu/drm/scheduler/sched_main.c        | 37 ++++++----
> >   .../gpu/drm/scheduler/tests/mock_scheduler.c  | 71 +++++++-------
> > -----
> >   drivers/gpu/drm/scheduler/tests/sched_tests.h |  4 +-
> >   include/drm/gpu_scheduler.h                   |  9 +++
> >   9 files changed, 115 insertions(+), 100 deletions(-)
> > 
> 
Re: [RFC PATCH 0/6] drm/sched: Avoid memory leaks by canceling job-by-job
Posted by Danilo Krummrich 8 months ago
> On Tue, 2025-06-03 at 13:27 +0100, Tvrtko Ursulin wrote:
> > On 03/06/2025 10:31, Philipp Stanner wrote:
> > What I am not that ecstatic about is only getting the Suggested-by 
> > credit in 1/6. Given it is basically my patch with some cosmetic
> > changes 
> > like the kernel doc and the cancel loop extracted to a helper.
> 
> Sign the patch off and I give you the authorship if you want.

AFAICS, the proposal of having cancel_job() has been a review comment which has
been clarified with a reference patch.

IMO, the fact that after some discussion Philipp decided to go with this
suggestion and implement the suggestion in his patch series does not result in
an obligation for him to hand over authorship of the patch he wrote to the
person who suggested the change in the context of the code review.

Anyways, it seems that Philipp did offer it however, so this seems to be
resolved?
Re: [RFC PATCH 0/6] drm/sched: Avoid memory leaks by canceling job-by-job
Posted by Tvrtko Ursulin 8 months ago
On 11/06/2025 22:21, Danilo Krummrich wrote:
>> On Tue, 2025-06-03 at 13:27 +0100, Tvrtko Ursulin wrote:
>>> On 03/06/2025 10:31, Philipp Stanner wrote:
>>> What I am not that ecstatic about is only getting the Suggested-by
>>> credit in 1/6. Given it is basically my patch with some cosmetic
>>> changes
>>> like the kernel doc and the cancel loop extracted to a helper.
>>
>> Sign the patch off and I give you the authorship if you want.
> 
> AFAICS, the proposal of having cancel_job() has been a review comment which has
> been clarified with a reference patch.

Right, this one:

https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin@igalia.com/

> IMO, the fact that after some discussion Philipp decided to go with this
> suggestion and implement the suggestion in his patch series does not result in
> an obligation for him to hand over authorship of the patch he wrote to the
> person who suggested the change in the context of the code review.

It is fine. Just that instead of rewriting we could have also said 
something along the lines of "Okay lets go with your version after all, 
just please tweak this or that". Which in my experience would have been 
more typical.

> Anyways, it seems that Philipp did offer it however, so this seems to be
> resolved?

At the end of the day the very fact a neater solution is going in is the 
main thing for me. Authorship is not that important, only that the way 
of working I follow, both as a maintainer and a colleague, aspires to be 
more like what I described in the previous paragraph.

I am not sure I can review this version though. It feels it would be too 
much like reviewing my own code so wouldn't carry the fully weight of 
review. Technically I probably could, but in reality someone else should 
probably better do it.

Regards,

Tvrtko