[PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Philipp Stanner 10 months ago
drm_sched_fini() can leak jobs under certain circumstances.

Warn if that happens.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 6b72278c4b72..ae3152beca14 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 	sched->ready = false;
 	kfree(sched->sched_rq);
 	sched->sched_rq = NULL;
+
+	if (!list_empty(&sched->pending_list))
+		dev_err(sched->dev, "%s: Tearing down scheduler while jobs are pending!\n",
+			__func__);
 }
 EXPORT_SYMBOL(drm_sched_fini);
 
-- 
2.48.1
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Philipp Stanner 9 months, 3 weeks ago
On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
> drm_sched_fini() can leak jobs under certain circumstances.
> 
> Warn if that happens.
> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 4 ++++

I hear a lot of amazed silence for this series ^_^

If there's no major pushback, my intention is to pull this in once the
blocking Nouveau-bug has been fixed (by pulling my patch).


In the meantime, let me review my own stuff:

>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 6b72278c4b72..ae3152beca14 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler
> *sched)
>  	sched->ready = false;
>  	kfree(sched->sched_rq);
>  	sched->sched_rq = NULL;
> +
> +	if (!list_empty(&sched->pending_list))
> +		dev_err(sched->dev, "%s: Tearing down scheduler
> while jobs are pending!\n",
> +			__func__);

The "%s" here will be removed since dev_err() alreday prints the
function name.
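
For reference, without the "%s" the hunk would then look roughly like
this (just a sketch of the intended change, not the final patch):

	if (!list_empty(&sched->pending_list))
		dev_err(sched->dev, "Tearing down scheduler while jobs are pending!\n");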

I find this dev_err() print more useful than the stack trace a WARN_ON
prints (telling you about invalid_asm_exec_op or stuff like that).

Plus, I guess the places where drivers call drm_sched_fini() are very
well defined and known, so a callstack wouldn't really be useful in the
first place.


P.

>  }
>  EXPORT_SYMBOL(drm_sched_fini);
>  
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Tvrtko Ursulin 9 months, 3 weeks ago
On 17/04/2025 08:45, Philipp Stanner wrote:
> On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
>> drm_sched_fini() can leak jobs under certain circumstances.
>>
>> Warn if that happens.
>>
>> Signed-off-by: Philipp Stanner <phasta@kernel.org>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 4 ++++
> 
> I hear a lot of amazed silence for this series ^_^
> 
> If there's no major pushback, my intention is to pull this in once the
> blocking Nouveau-bug has been fixed (by pulling my patch).

It was on my todo list to read it, but I failed to get to it due to
various reasons.

I only managed to skim over it and am not quite convinced. But because
I did not have time to think about it very much, my feedback at this
point is not very deep.

On the high level I understand that for nouveau the series is "net zero",
right? No leaks before, no leaks after. What about other drivers? Which
ones have known leaks that could be addressed by them implementing this
new callback?

Presumably you would document that the callback should only be implemented
by drivers which are 100% sure the cleanup can always be done reliably?
Otherwise an unkillable process on stuck hardware, or on drivers without
good enough reset support, can still happen. (Which would be worse than a
leak on shutdown.)

Have you thought about any observable effects from the userspace point
of view? For example, could something which works now, such as submitting
a job which writes to a shared buffer and then exiting, stop working after
the callback is implemented? I don't see it, because it would be unreliable
even today (timing dependent on whether the job is in the queue or scheduled
at exit time), so I'm just thinking out loud.

Also, since the cover letter mentions job reference counting was one idea
that was discarded, another related problem is about the lifetimes. I think
it would be good to discuss potentially reference counting "everything",
because for instance today I can crash the kernel trivially with the xe
driver*. Probably other drivers too.

Problem exactly is that jobs can outlive the entities and the scheduler, 
while some userspace may have a dma fence reference to the job via sync 
file. This new callback would not solve it for xe, but if everything 
required was reference counted it would.

Back to the series.

On the design level it feels like it adds too much state machine and 
makes things hard to follow/maintain. It would be nice to find a 
solution where we end up with less state machine and not more.

On the low level there are some patch ordering, naming, spelling and 
other small improvements to be made.

But as said at the start, I would need to set aside more time to provide 
better comments and/or ideas.

*)
https://patchwork.freedesktop.org/patch/642709/?series=146211&rev=2

> In the mean time, let me review my own stuff:
> 
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 6b72278c4b72..ae3152beca14 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler
>> *sched)
>>   	sched->ready = false;
>>   	kfree(sched->sched_rq);
>>   	sched->sched_rq = NULL;
>> +
>> +	if (!list_empty(&sched->pending_list))
>> +		dev_err(sched->dev, "%s: Tearing down scheduler
>> while jobs are pending!\n",
>> +			__func__);

It isn't fair to add this error since it would out of the blue start 
firing for everyone except nouveau, no? Regardless of whether there is a 
leak or not.

> 
> The "%s" here will be removed since dev_err() alreday prints the
> function name.

It does? But anyway, the function name is redundant and irrelevant and 
should not be logged IMO. I would rather prefer we replaced sched->dev 
with sched->drm so we could use drm loggers for clarity throughout.

> I find this dev_err() print more useful than the stack a WARN_ON prints
> (telling you about invalid_asm_exec_op or stuff like that).
> 
> Plus, I guess the places where drivers call drm_sched_fini() are very
> well defined and known, so a callstack wouldn't really be useful in the
> first place.

Agreed.

Regards,

Tvrtko

> 
> P.
> 
>>   }
>>   EXPORT_SYMBOL(drm_sched_fini);
>>   
> 

Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Danilo Krummrich 9 months, 3 weeks ago
On Thu, Apr 17, 2025 at 12:27:29PM +0100, Tvrtko Ursulin wrote:
> 
> On 17/04/2025 08:45, Philipp Stanner wrote:
> > On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
> 
> Problem exactly is that jobs can outlive the entities and the scheduler,
> while some userspace may have a dma fence reference to the job via sync
> file. This new callback would not solve it for xe, but if everything
> required was reference counted it would.

I think you're mixing up the job and the dma_fence here; if a job outlives the
scheduler, it clearly is a bug and always has been.

AFAIK, Xe reference counts its driver-specific job structures *and* the
driver-specific scheduler structure, such that drm_sched_fini() won't be called
before all jobs have finished.

The reference counting solution, however, potentially creates subsequent
lifetime problems, which I think have already been discussed in a different
thread. Philipp can probably link in the corresponding discussion.

> On the design level it feels like it adds too much state machine and makes
> things hard to follow/maintain. It would be nice to find a solutiuon where
> we end up with less state machine and not more.

Multiple solutions have been discussed already, e.g. just waiting for the
pending list to be empty, or reference counting the scheduler for every pending
job. Those all had significant downsides, which I don't see with this proposal.

I'm all for better ideas though -- what do you propose?

> 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 6b72278c4b72..ae3152beca14 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler
> > > *sched)
> > >   	sched->ready = false;
> > >   	kfree(sched->sched_rq);
> > >   	sched->sched_rq = NULL;
> > > +
> > > +	if (!list_empty(&sched->pending_list))
> > > +		dev_err(sched->dev, "%s: Tearing down scheduler
> > > while jobs are pending!\n",
> > > +			__func__);
> 
> It isn't fair to add this error since it would out of the blue start firing
> for everyone expect nouveau, no? Regardless if there is a leak or not.

I think it is pretty fair to warn when detecting a guaranteed bug, no?

If drm_sched_fini() is called while jobs are still on the pending_list, they
won't ever be freed, because all workqueues are stopped.
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Tvrtko Ursulin 9 months, 3 weeks ago
On 17/04/2025 13:11, Danilo Krummrich wrote:
> On Thu, Apr 17, 2025 at 12:27:29PM +0100, Tvrtko Ursulin wrote:
>>
>> On 17/04/2025 08:45, Philipp Stanner wrote:
>>> On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
>>
>> Problem exactly is that jobs can outlive the entities and the scheduler,
>> while some userspace may have a dma fence reference to the job via sync
>> file. This new callback would not solve it for xe, but if everything
>> required was reference counted it would.
> 
> I think you're mixing up the job and the dma_fence here, if a job outlives the
> scheduler, it clearly is a bug, always has been.
> 
> AFAIK, Xe reference counts it's driver specific job structures *and* the driver
> specific scheduler structure, such that drm_sched_fini() won't be called before
> all jobs have finished.

Yes, sorry, dma fence. But it is not enough to postpone drm_sched_fini() 
until the job has finished. The problem is that an exported dma fence holds 
a pointer to the drm_sched_fence (and so oopses in 
drm_sched_fence_get_timeline_name() on fence->sched->name) *after* the job 
has finished and the driver was free to tear everything down.

That is what the "fini callback" wouldn't solve but reference counting 
would.

I don't see why any other driver which can export a fence couldn't 
suffer from the same problem, and even if I agree this maybe isn't 
strictly a scheduler problem, I cannot see it being solvable without 
involving the scheduler, or in any other way than reference counting. 
And if reference counting could solve both problems, then it feels 
attractive.

> The reference counting solution, however, potentially creates subsequent
> lifetime problems, which I think have been discussed already in a different
> thread already. Philipp can probably link in the corresponding discussion.

I don't doubt it causes problems but maybe there will be no other way 
than to tackle them.

>> On the design level it feels like it adds too much state machine and makes
>> things hard to follow/maintain. It would be nice to find a solutiuon where
>> we end up with less state machine and not more.
> 
> Multiple solutions have been discussed already, e.g. just wait for the pending
> list to be empty, reference count the scheduler for every pending job. Those all
> had significant downsides, which I don't see with this proposal.
> 
> I'm all for better ideas though -- what do you propose?

I think we need to brainstorm both issues and see if there is a solution 
which solves them both, with bonus points for being elegant.

Use after free is IMO an even worse problem, so if reference counting is 
the only way then let's try that. But I will wait for those links to 
catch up on the past discussions.

>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 6b72278c4b72..ae3152beca14 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler
>>>> *sched)
>>>>    	sched->ready = false;
>>>>    	kfree(sched->sched_rq);
>>>>    	sched->sched_rq = NULL;
>>>> +
>>>> +	if (!list_empty(&sched->pending_list))
>>>> +		dev_err(sched->dev, "%s: Tearing down scheduler
>>>> while jobs are pending!\n",
>>>> +			__func__);
>>
>> It isn't fair to add this error since it would out of the blue start firing
>> for everyone expect nouveau, no? Regardless if there is a leak or not.
> 
> I think it is pretty fair to warn when detecting a guaranteed bug, no?
> 
> If drm_sched_fini() is call while jobs are still on the pending_list, they won't
> ever be freed, because all workqueues are stopped.

Is it a guaranteed bug for drivers which are aware of the drm_sched_fini() 
limitation and clean up after themselves?

In other words, if you apply the series up to here, would it trigger for 
nouveau? Reportedly it triggers for the mock scheduler, which also has no 
leak.

Also, I asked in my initial reply if we have a list of which of the 
current drivers suffer from memory leaks. Is it all or some etc.

Regards,

Tvrtko

Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Danilo Krummrich 9 months, 3 weeks ago
On Thu, Apr 17, 2025 at 03:20:44PM +0100, Tvrtko Ursulin wrote:
> 
> On 17/04/2025 13:11, Danilo Krummrich wrote:
> > On Thu, Apr 17, 2025 at 12:27:29PM +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 17/04/2025 08:45, Philipp Stanner wrote:
> > > > On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
> > > 
> > > Problem exactly is that jobs can outlive the entities and the scheduler,
> > > while some userspace may have a dma fence reference to the job via sync
> > > file. This new callback would not solve it for xe, but if everything
> > > required was reference counted it would.
> > 
> > I think you're mixing up the job and the dma_fence here, if a job outlives the
> > scheduler, it clearly is a bug, always has been.
> > 
> > AFAIK, Xe reference counts it's driver specific job structures *and* the driver
> > specific scheduler structure, such that drm_sched_fini() won't be called before
> > all jobs have finished.
> 
> Yes, sorry, dma fence. But it is not enough to postpone drm_sched_fini until
> the job is not finished. Problem is exported dma fence holds the pointer to
> drm_sched_fence (and so oopses in drm_sched_fence_get_timeline_name on
> fence->sched->name) *after* job had finished and driver was free to tear
> everything down.

Well, that's a bug in drm_sched_fence then and independent of the other topic.
Once the finished fence in a struct drm_sched_fence has been signaled, it must
live independently of the scheduler.

The lifetime of the drm_sched_fence is entirely independent from the scheduler
itself, as you correctly point out.

Starting to reference count things to keep the whole scheduler etc. alive as
long as the drm_sched_fence lives is not the correct solution.

> > Multiple solutions have been discussed already, e.g. just wait for the pending
> > list to be empty, reference count the scheduler for every pending job. Those all
> > had significant downsides, which I don't see with this proposal.
> > 
> > I'm all for better ideas though -- what do you propose?
> 
> I think we need to brainstorm both issues and see if there is a solution
> which solves them both, with bonus points for being elegant.

The problems are not related. As mentioned above, once signaled a
drm_sched_fence must not depend on the scheduler any longer.

> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > index 6b72278c4b72..ae3152beca14 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler
> > > > > *sched)
> > > > >    	sched->ready = false;
> > > > >    	kfree(sched->sched_rq);
> > > > >    	sched->sched_rq = NULL;
> > > > > +
> > > > > +	if (!list_empty(&sched->pending_list))
> > > > > +		dev_err(sched->dev, "%s: Tearing down scheduler
> > > > > while jobs are pending!\n",
> > > > > +			__func__);
> > > 
> > > It isn't fair to add this error since it would out of the blue start firing
> > > for everyone expect nouveau, no? Regardless if there is a leak or not.
> > 
> > I think it is pretty fair to warn when detecting a guaranteed bug, no?
> > 
> > If drm_sched_fini() is call while jobs are still on the pending_list, they won't
> > ever be freed, because all workqueues are stopped.
> 
> Is it a guaranteed bug for drivers are aware of the drm_sched_fini()
> limitation and are cleaning up upon themselves?

How could a driver clean up on itself (unless the driver holds its own list of
pending jobs)?

Once a job is in flight (i.e. it's on the pending_list) we must guarantee that
free_job() is called by the scheduler, which it can't do if we call
drm_sched_fini() before the pending_list is empty.

> In other words if you apply the series up to here would it trigger for
> nouveau?

No, because nouveau does something very stupid, i.e. it replicates the pending_list.
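
Roughly, that pattern looks like this (hypothetical names, not the actual
nouveau code):

	/* The driver itself tracks every job it pushes to the scheduler. */
	spin_lock(&drv->job_lock);
	list_add_tail(&job->drv_link, &drv->job_list);
	spin_unlock(&drv->job_lock);

	/*
	 * The driver's job-done path removes the job from that list again and
	 * wakes up waiters. Teardown first waits for the driver's copy of the
	 * list to drain ...
	 */
	wait_event(drv->job_wq, list_empty(&drv->job_list));

	/*
	 * ... so that the scheduler's own pending_list is already empty by
	 * the time it is torn down.
	 */
	drm_sched_fini(&drv->sched);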

> Reportedly it triggers for the mock scheduler which also has no
> leak.

That sounds impossible. How do you ensure you do *not* leak memory when you tear
down the scheduler while it still has pending jobs? Or in other words, who calls
free_job() if not the scheduler itself?

> Also, I asked in my initial reply if we have a list of which of the current
> drivers suffer from memory leaks. Is it all or some etc.

Not all, but quite some I think. The last time I looked (which is about a year
ago) amdgpu for instance could leak memory when you unbind the driver while
enough jobs are in flight.
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Tvrtko Ursulin 9 months, 3 weeks ago
On 17/04/2025 15:48, Danilo Krummrich wrote:
> On Thu, Apr 17, 2025 at 03:20:44PM +0100, Tvrtko Ursulin wrote:
>>
>> On 17/04/2025 13:11, Danilo Krummrich wrote:
>>> On Thu, Apr 17, 2025 at 12:27:29PM +0100, Tvrtko Ursulin wrote:
>>>>
>>>> On 17/04/2025 08:45, Philipp Stanner wrote:
>>>>> On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
>>>>
>>>> Problem exactly is that jobs can outlive the entities and the scheduler,
>>>> while some userspace may have a dma fence reference to the job via sync
>>>> file. This new callback would not solve it for xe, but if everything
>>>> required was reference counted it would.
>>>
>>> I think you're mixing up the job and the dma_fence here, if a job outlives the
>>> scheduler, it clearly is a bug, always has been.
>>>
>>> AFAIK, Xe reference counts it's driver specific job structures *and* the driver
>>> specific scheduler structure, such that drm_sched_fini() won't be called before
>>> all jobs have finished.
>>
>> Yes, sorry, dma fence. But it is not enough to postpone drm_sched_fini until
>> the job is not finished. Problem is exported dma fence holds the pointer to
>> drm_sched_fence (and so oopses in drm_sched_fence_get_timeline_name on
>> fence->sched->name) *after* job had finished and driver was free to tear
>> everything down.
> 
> Well, that's a bug in drm_sched_fence then and independent from the other topic.
> Once the finished fence in a struct drm_sched_fence has been signaled it must
> live independent of the scheduler.
> 
> The lifetime of the drm_sched_fence is entirely independent from the scheduler
> itself, as you correctly point out.

The connection I made (re. independent or not) was: *if* drm_sched were 
reference counted, would that satisfy both the requirement to keep 
drm_sched_fence_get_timeline_name() working and allow a different 
flavour of the memory leak fix?

I agree drm_sched_fence_get_timeline_name can also be fixed by removing 
the fence->sched dereference and losing the (pretty) name. Historically 
there has been a lot of trouble with those names so maybe that would be 
acceptable.

Revoking s_fence->sched on job completion as an alternative does not 
sound feasible.

To further complicate matters, I suspect rmmod gpu-sched.ko is also 
something which would break exported fences since that would remove the 
fence ops. But that is solvable by module_get/put().

> Starting to reference count things to keep the whole scheduler etc. alive as
> long as the drm_sched_fence lives is not the correct solution.

To catch up on why, it would be helpful if you could dig out the links 
to the past discussions.

I repeat that there is a lot of attractiveness to reference counting. 
Besides the already mentioned memory leak and the s_fence oops, not 
having to clear job->entity could be useful for things like tracking 
per-entity submission stats (imagine CFS-like scheduling, or a generic 
scheduling DRM cgroup controller). So it would be good for me to hear 
what pitfalls were identified in this space.

>>> Multiple solutions have been discussed already, e.g. just wait for the pending
>>> list to be empty, reference count the scheduler for every pending job. Those all
>>> had significant downsides, which I don't see with this proposal.
>>>
>>> I'm all for better ideas though -- what do you propose?
>>
>> I think we need to brainstorm both issues and see if there is a solution
>> which solves them both, with bonus points for being elegant.
> 
> The problems are not related. As mentioned above, once signaled a
> drm_sched_fence must not depend on the scheduler any longer.
> 
>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> index 6b72278c4b72..ae3152beca14 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler
>>>>>> *sched)
>>>>>>     	sched->ready = false;
>>>>>>     	kfree(sched->sched_rq);
>>>>>>     	sched->sched_rq = NULL;
>>>>>> +
>>>>>> +	if (!list_empty(&sched->pending_list))
>>>>>> +		dev_err(sched->dev, "%s: Tearing down scheduler
>>>>>> while jobs are pending!\n",
>>>>>> +			__func__);
>>>>
>>>> It isn't fair to add this error since it would out of the blue start firing
>>>> for everyone expect nouveau, no? Regardless if there is a leak or not.
>>>
>>> I think it is pretty fair to warn when detecting a guaranteed bug, no?
>>>
>>> If drm_sched_fini() is call while jobs are still on the pending_list, they won't
>>> ever be freed, because all workqueues are stopped.
>>
>> Is it a guaranteed bug for drivers are aware of the drm_sched_fini()
>> limitation and are cleaning up upon themselves?
> 
> How could a driver clean up on itself (unless the driver holds its own list of
> pending jobs)?
> 
> Once a job is in flight (i.e. it's on the pending_list) we must guarantee that
> free_job() is called by the scheduler, which it can't do if we call
> drm_sched_fini() before the pending_list is empty.
> 
>> In other words if you apply the series up to here would it trigger for
>> nouveau?
> 
> No, because nouveau does something very stupid, i.e. replicate the pending_list.

Ah okay, I see it now: it waits for all jobs to finish before calling 
drm_sched_fini(). For some reason I did not think it was doing that, 
given the cover letter starts with how that is a big no-no.

>> Reportedly it triggers for the mock scheduler which also has no
>> leak.
> 
> That sounds impossible. How do you ensure you do *not* leak memory when you tear
> down the scheduler while it still has pending jobs? Or in other words, who calls
> free_job() if not the scheduler itself?

Well the cover letter says it triggers so it is possible. :)

The mock scheduler also tracks the pending jobs itself, but differently 
from nouveau it does not wait for jobs to finish and for the free worker 
to process them all; instead, having stopped the "hw" backend, it cancels 
them and calls the free_job vfunc directly.

Going back to the topic of this series, if we go with a solution along 
the lines of the proposed one, I wonder if it would be doable without 
mandating that drivers keep a list parallel to pending_list. Instead, 
have a vfunc the DRM scheduler would call to cancel one job at a time 
from *its* pending list. It would go nicely together with 
prepare/run/timedout/free.

Would it allow getting rid of the new state machinery and just 
cancelling and freeing in one go directly from drm_sched_fini()?
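
Very roughly, I am thinking of something like this (callback name, field
names and exact semantics made up here for illustration only):

	/* New optional backend op, next to run_job/timedout_job/free_job: */
	void (*cancel_job)(struct drm_sched_job *sched_job);

	/*
	 * drm_sched_fini(), with all workers already stopped, could then
	 * drain its own pending list in one go:
	 */
	list_for_each_entry_safe(job, next, &sched->pending_list, list) {
		list_del_init(&job->list);
		if (sched->ops->cancel_job)
			sched->ops->cancel_job(job);
		sched->ops->free_job(job);
	}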

Regards,

Tvrtko

>> Also, I asked in my initial reply if we have a list of which of the current
>> drivers suffer from memory leaks. Is it all or some etc.
> 
> Not all, but quite some I think. The last time I looked (which is about a year
> ago) amdgpu for instance could leak memory when you unbind the driver while
> enough jobs are in flight.

Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Philipp Stanner 9 months, 3 weeks ago
On Thu, 2025-04-17 at 17:08 +0100, Tvrtko Ursulin wrote:
> 
> On 17/04/2025 15:48, Danilo Krummrich wrote:
> > On Thu, Apr 17, 2025 at 03:20:44PM +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 17/04/2025 13:11, Danilo Krummrich wrote:
> > > > On Thu, Apr 17, 2025 at 12:27:29PM +0100, Tvrtko Ursulin wrote:
> > > > > 
> > > > > On 17/04/2025 08:45, Philipp Stanner wrote:
> > > > > > On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
> > > > > 
> > > > > Problem exactly is that jobs can outlive the entities and the
> > > > > scheduler,
> > > > > while some userspace may have a dma fence reference to the
> > > > > job via sync
> > > > > file. This new callback would not solve it for xe, but if
> > > > > everything
> > > > > required was reference counted it would.
> > > > 
> > > > I think you're mixing up the job and the dma_fence here, if a
> > > > job outlives the
> > > > scheduler, it clearly is a bug, always has been.
> > > > 
> > > > AFAIK, Xe reference counts it's driver specific job structures
> > > > *and* the driver
> > > > specific scheduler structure, such that drm_sched_fini() won't
> > > > be called before
> > > > all jobs have finished.
> > > 
> > > Yes, sorry, dma fence. But it is not enough to postpone
> > > drm_sched_fini until
> > > the job is not finished. Problem is exported dma fence holds the
> > > pointer to
> > > drm_sched_fence (and so oopses in
> > > drm_sched_fence_get_timeline_name on
> > > fence->sched->name) *after* job had finished and driver was free
> > > to tear
> > > everything down.
> > 
> > Well, that's a bug in drm_sched_fence then and independent from the
> > other topic.
> > Once the finished fence in a struct drm_sched_fence has been
> > signaled it must
> > live independent of the scheduler.
> > 
> > The lifetime of the drm_sched_fence is entirely independent from
> > the scheduler
> > itself, as you correctly point out.
> 
> Connection (re. independent or not) I made was *if* drm_sched would
> be 
> reference counted, would that satisfy both the requirement to keep 
> working drm_sched_fence_get_timeline_name and to allow a different 
> flavour of the memory leak fix.
> 
> I agree drm_sched_fence_get_timeline_name can also be fixed by
> removing 
> the fence->sched dereference and losing the (pretty) name.
> Historically 
> there has been a lot of trouble with those names so maybe that would
> be 
> acceptable.
> 
> Revoking s_fence->sched on job completion as an alternative does not 
> sound feasible.
> 
> To further complicate matters, I suspect rmmod gpu-sched.ko is also 
> something which would break exported fences since that would remove
> the 
> fence ops. But that is solvable by module_get/put().

Is it a common kernel policy to be immune to crazy people just
calling rmmod on central, shared kernel infrastructure?

We cannot protect people from everything, especially from themselves.

> 
> > Starting to reference count things to keep the whole scheduler etc.
> > alive as
> > long as the drm_sched_fence lives is not the correct solution.
> 
> To catch up on why if you could dig out the links to past discussions
> it 
> would be helpful.
> 
> I repeat how there is a lot of attractiveness to reference counting. 
> Already mentioned memory leak, s_fence oops, and also not having to 
> clear job->entity could be useful for things like tracking per entity
> submission stats (imagine CFS like scheduling, generic scheduling DRM
> cgroup controller). So it would be good for me to hear what pitfalls 
> were identified in this space.

Reference counting does _appear_ attractive, but our (mostly internal)
attempts and discussions revealed that it's not feasible.

There was an RFC by me about it last year, but it was incomplete and
few people reacted:

https://lore.kernel.org/dri-devel/20240903094531.29893-2-pstanner@redhat.com/

When you try to implement refcounting, you quickly discover that it's
not enough if submitted jobs refcount the scheduler.

If the scheduler is refcounted, this means that it can now outlive the
driver, which means that it can continue calling into the driver with
run_job(), free_job(), timedout_job() and so on.

Since this would fire 500 million UAFs, you would hence have to guard
*all* drivers' callbacks in some way against the driver being unloaded.
You might even end up having to refcount things like entire modules,
depending on the driver.
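
To illustrate with a hypothetical driver (made-up names): even if every
job pinned the scheduler, the callbacks would still point into the driver:

	/* Every job takes a reference, deferring the scheduler's teardown: */
	kref_get(&fsched->refcount);			/* in foo_job_create() */
	kref_put(&fsched->refcount, foo_sched_release);	/* in foo_free_job() */

	/*
	 * But run_job(), free_job() and timedout_job() are still driver
	 * functions. If the driver has been unbound (or its module unloaded)
	 * while the refcounted scheduler lives on, every one of those
	 * callbacks becomes a potential use-after-free; each driver would
	 * have to guard its callbacks (or refcount itself / its module) too.
	 */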


So effectively, with refcounting, you would fix a problem in central
infrastructure (the scheduler) in all users (the drivers) – therefore,
you would not fix the problem at all, but just work around the
scheduler being broken in the drivers. IOW, everything would be exactly
as it is now, where everyone works around the problem in their own way.
Nouveau uses its redundant pending list, Xe refcounts and amdgpu does…
IDK, things.


Another problem is the kernel workflow and kernel politics. The
solution I propose here allows for phasing the problem out, first in
Nouveau, then step by step in others. It's the only minimalist,
backward-compatible solution I can see.


That said, if you have the capacity to look deeper into a refcount
solution, feel free to go for it. But you'd have to find a way to solve
it in a centralized manner, otherwise we could just leave everything
be.


> 
> > > > Multiple solutions have been discussed already, e.g. just wait
> > > > for the pending
> > > > list to be empty, reference count the scheduler for every
> > > > pending job. Those all
> > > > had significant downsides, which I don't see with this
> > > > proposal.
> > > > 
> > > > I'm all for better ideas though -- what do you propose?
> > > 
> > > I think we need to brainstorm both issues and see if there is a
> > > solution
> > > which solves them both, with bonus points for being elegant.
> > 
> > The problems are not related. As mentioned above, once signaled a
> > drm_sched_fence must not depend on the scheduler any longer.
> > 
> > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > index 6b72278c4b72..ae3152beca14 100644
> > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct
> > > > > > > drm_gpu_scheduler
> > > > > > > *sched)
> > > > > > >      sched->ready = false;
> > > > > > >      kfree(sched->sched_rq);
> > > > > > >      sched->sched_rq = NULL;
> > > > > > > +
> > > > > > > + if (!list_empty(&sched->pending_list))
> > > > > > > + dev_err(sched->dev, "%s: Tearing down scheduler
> > > > > > > while jobs are pending!\n",
> > > > > > > + __func__);
> > > > > 
> > > > > It isn't fair to add this error since it would out of the
> > > > > blue start firing
> > > > > for everyone expect nouveau, no? Regardless if there is a
> > > > > leak or not.
> > > > 
> > > > I think it is pretty fair to warn when detecting a guaranteed
> > > > bug, no?
> > > > 
> > > > If drm_sched_fini() is call while jobs are still on the
> > > > pending_list, they won't
> > > > ever be freed, because all workqueues are stopped.
> > > 
> > > Is it a guaranteed bug for drivers are aware of the
> > > drm_sched_fini()
> > > limitation and are cleaning up upon themselves?
> > 
> > How could a driver clean up on itself (unless the driver holds its
> > own list of
> > pending jobs)?
> > 
> > Once a job is in flight (i.e. it's on the pending_list) we must
> > guarantee that
> > free_job() is called by the scheduler, which it can't do if we call
> > drm_sched_fini() before the pending_list is empty.
> > 
> > > In other words if you apply the series up to here would it
> > > trigger for
> > > nouveau?
> > 
> > No, because nouveau does something very stupid, i.e. replicate the
> > pending_list.
> 
> Ah okay I see it now, it waits for all jobs to finish before calling 
> drm_sched_fini(). For some reason I did not think it was doing that 
> given the cover letter starts with how that is a big no-no.
> 
> > > Reportedly it triggers for the mock scheduler which also has no
> > > leak.
> > 
> > That sounds impossible. How do you ensure you do *not* leak memory
> > when you tear
> > down the scheduler while it still has pending jobs? Or in other
> > words, who calls
> > free_job() if not the scheduler itself?
> 
> Well the cover letter says it triggers so it is possible. :)

Where does it say that?

I mean, the warning would trigger if a driver is leaking jobs, which is
clearly a bug (and since a list is leaked, it might then even be a
false-negative in kmemleak).

It's actually quite simple:
   1. jobs are being freed by free_job()
   2. If you call drm_sched_fini(), the work item for free_job() gets
      deactivated.
   3. Depending on the load (race), you leak the jobs.

That's not that much of a problem for hardware schedulers, but firmware
schedulers, which tear down scheduler instances all the time, could leak
quite some amount of memory over time.
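
As a concrete (hypothetical) teardown sequence:

	/* Driver teardown; jobs may still sit on sched->pending_list here. */
	drm_sched_fini(&fdev->sched);
	/*
	 * The free_job work item is now stopped, so free_job() will never be
	 * called for those jobs again: their memory is leaked. This is
	 * exactly the situation the new dev_err() flags.
	 */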

I can't see why a warning print would ever be bad there. Even Christian
wants it.

P.

> 
> Mock scheduler also tracks the pending jobs itself, but different
> from 
> nouveau it does not wait for jobs to finish and free worker to
> process 
> them all, but having stopped the "hw" backend it cancels them and
> calls 
> the free_job vfunc directly.
> 
> Going back to the topic of this series, if we go with a solution
> along 
> the lines of the proposed, I wonder if it would be doable without 
> mandating that drivers keep a list parallel to pending_list. Instead 
> have a vfunc DRM scheduler would call to cancel job at a time from
> *its* 
> pending list. It would go nicely together with
> prepare/run/timedout/free.
> 
> Would it allow getting rid of the new state machinery and just 
> cancelling and freeing in one go directly from drm_sched_fini()?
> 
> Regards,
> 
> Tvrtko
> 
> > > Also, I asked in my initial reply if we have a list of which of
> > > the current
> > > drivers suffer from memory leaks. Is it all or some etc.
> > 
> > Not all, but quite some I think. The last time I looked (which is
> > about a year
> > ago) amdgpu for instance could leak memory when you unbind the
> > driver while
> > enough jobs are in flight.
> 
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Tvrtko Ursulin 9 months, 3 weeks ago
On 22/04/2025 07:06, Philipp Stanner wrote:
> On Thu, 2025-04-17 at 17:08 +0100, Tvrtko Ursulin wrote:
>>
>> On 17/04/2025 15:48, Danilo Krummrich wrote:
>>> On Thu, Apr 17, 2025 at 03:20:44PM +0100, Tvrtko Ursulin wrote:
>>>>
>>>> On 17/04/2025 13:11, Danilo Krummrich wrote:
>>>>> On Thu, Apr 17, 2025 at 12:27:29PM +0100, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 17/04/2025 08:45, Philipp Stanner wrote:
>>>>>>> On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
>>>>>>
>>>>>> Problem exactly is that jobs can outlive the entities and the
>>>>>> scheduler,
>>>>>> while some userspace may have a dma fence reference to the
>>>>>> job via sync
>>>>>> file. This new callback would not solve it for xe, but if
>>>>>> everything
>>>>>> required was reference counted it would.
>>>>>
>>>>> I think you're mixing up the job and the dma_fence here, if a
>>>>> job outlives the
>>>>> scheduler, it clearly is a bug, always has been.
>>>>>
>>>>> AFAIK, Xe reference counts it's driver specific job structures
>>>>> *and* the driver
>>>>> specific scheduler structure, such that drm_sched_fini() won't
>>>>> be called before
>>>>> all jobs have finished.
>>>>
>>>> Yes, sorry, dma fence. But it is not enough to postpone
>>>> drm_sched_fini until
>>>> the job is not finished. Problem is exported dma fence holds the
>>>> pointer to
>>>> drm_sched_fence (and so oopses in
>>>> drm_sched_fence_get_timeline_name on
>>>> fence->sched->name) *after* job had finished and driver was free
>>>> to tear
>>>> everything down.
>>>
>>> Well, that's a bug in drm_sched_fence then and independent from the
>>> other topic.
>>> Once the finished fence in a struct drm_sched_fence has been
>>> signaled it must
>>> live independent of the scheduler.
>>>
>>> The lifetime of the drm_sched_fence is entirely independent from
>>> the scheduler
>>> itself, as you correctly point out.
>>
>> Connection (re. independent or not) I made was *if* drm_sched would
>> be
>> reference counted, would that satisfy both the requirement to keep
>> working drm_sched_fence_get_timeline_name and to allow a different
>> flavour of the memory leak fix.
>>
>> I agree drm_sched_fence_get_timeline_name can also be fixed by
>> removing
>> the fence->sched dereference and losing the (pretty) name.
>> Historically
>> there has been a lot of trouble with those names so maybe that would
>> be
>> acceptable.
>>
>> Revoking s_fence->sched on job completion as an alternative does not
>> sound feasible.
>>
>> To further complicate matters, I suspect rmmod gpu-sched.ko is also
>> something which would break exported fences since that would remove
>> the
>> fence ops. But that is solvable by module_get/put().
> 
> Is it a common kernel policy to be immune against crazy people just
> calling rmmod on central, shared kernel infrastructure?
> 
> We cannot protect people from everything, especially from themselves.

I would say if DRM supports driver unbind, and it does, then we need to 
decide on the module unload. Otherwise it remains possible and allows bad 
things to happen.

Even if the solution might be to just perma-raise the module refcount, or 
something, it does not feel valid to dismiss the discussion straight away. 
Hence I have sent 
https://lore.kernel.org/dri-devel/20250418164246.72426-1-tvrtko.ursulin@igalia.com/ 
to discuss this topic separately.

>>> Starting to reference count things to keep the whole scheduler etc.
>>> alive as
>>> long as the drm_sched_fence lives is not the correct solution.
>>
>> To catch up on why if you could dig out the links to past discussions
>> it
>> would be helpful.
>>
>> I repeat how there is a lot of attractiveness to reference counting.
>> Already mentioned memory leak, s_fence oops, and also not having to
>> clear job->entity could be useful for things like tracking per entity
>> submission stats (imagine CFS like scheduling, generic scheduling DRM
>> cgroup controller). So it would be good for me to hear what pitfalls
>> were identified in this space.
> 
> Reference counting does _appear_ attractive, but our (mostly internal)
> attempts and discussions revealed that it's not feasable.
> 
> There was an RFC by me about it last year, but it was incomplete and
> few people reacted:
> 
> https://lore.kernel.org/dri-devel/20240903094531.29893-2-pstanner@redhat.com/
> 
> When you try to implement refcounting, you quickly discover that it's
> not enough if submitted jobs refcount the scheduler.

Right, I did imply in my initial reply that "everything" would need to 
be reference counted.

> If the scheduler is refcounted, this means that it can now outlive the
> driver. This means that it can continue calling into the driver with
> run_job(), free_job(), timedout_job() and so on.
> 
> Since this would fire 500 million UAFs, you, hence, now would have to
> guard *all* drivers' callbacks in some way against the driver being
> unloaded. You might even end up having to refcount thinks like entire
> modules, depending on the driver.
> 
> 
> So effectively, with refcounting, you would fix a problem in central
> infrastructure (the scheduler) in all users (the driver) – therefore,
> you would not fix the problem at all, but just work around the
> scheduler being broken in the drivers. IOW, everything would be exactly
> as it is now, where everyone works around the problem in their own way.
> Nouveau uses its redundant pending list, Xe refcounts and amdgpu does……
> IDK, things.

Xe _fails_ to handle it, as demonstrated in the above linked RFC. I don't 
think any driver can actually handle it on its own. Hence what I am 
discussing is the complete opposite of "not fixing the problem at all" - 
it is about what, if anything, we can do to fix it robustly.

> Another problem is the kernel workflow and kernel politics. The
> solution I propose here allows for phasing the problem out, first in
> Nouveau, then step by step others. It's the only minimalist backward-
> compatible solution I can see.

These are two separate things: the memory leaks in this series, and the 
refcounting angle as a potential fix for _both_ memory leaks _and_ 
use-after-free problems.

> That said, if you have the capacity to look deeper into a refcount
> solution, feel free to go for it. But you'd have to find a way to solve
> it in a centralized manner, otherwise we could just leave everything
> be.

I will leave discussion for my linked RFC.

>>>>> Multiple solutions have been discussed already, e.g. just wait
>>>>> for the pending
>>>>> list to be empty, reference count the scheduler for every
>>>>> pending job. Those all
>>>>> had significant downsides, which I don't see with this
>>>>> proposal.
>>>>>
>>>>> I'm all for better ideas though -- what do you propose?
>>>>
>>>> I think we need to brainstorm both issues and see if there is a
>>>> solution
>>>> which solves them both, with bonus points for being elegant.
>>>
>>> The problems are not related. As mentioned above, once signaled a
>>> drm_sched_fence must not depend on the scheduler any longer.
>>>
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> index 6b72278c4b72..ae3152beca14 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct
>>>>>>>> drm_gpu_scheduler
>>>>>>>> *sched)
>>>>>>>>       sched->ready = false;
>>>>>>>>       kfree(sched->sched_rq);
>>>>>>>>       sched->sched_rq = NULL;
>>>>>>>> +
>>>>>>>> + if (!list_empty(&sched->pending_list))
>>>>>>>> + dev_err(sched->dev, "%s: Tearing down scheduler
>>>>>>>> while jobs are pending!\n",
>>>>>>>> + __func__);
>>>>>>
>>>>>> It isn't fair to add this error since it would out of the
>>>>>> blue start firing
>>>>>> for everyone expect nouveau, no? Regardless if there is a
>>>>>> leak or not.
>>>>>
>>>>> I think it is pretty fair to warn when detecting a guaranteed
>>>>> bug, no?
>>>>>
>>>>> If drm_sched_fini() is call while jobs are still on the
>>>>> pending_list, they won't
>>>>> ever be freed, because all workqueues are stopped.
>>>>
>>>> Is it a guaranteed bug for drivers are aware of the
>>>> drm_sched_fini()
>>>> limitation and are cleaning up upon themselves?
>>>
>>> How could a driver clean up on itself (unless the driver holds its
>>> own list of
>>> pending jobs)?
>>>
>>> Once a job is in flight (i.e. it's on the pending_list) we must
>>> guarantee that
>>> free_job() is called by the scheduler, which it can't do if we call
>>> drm_sched_fini() before the pending_list is empty.
>>>
>>>> In other words if you apply the series up to here would it
>>>> trigger for
>>>> nouveau?
>>>
>>> No, because nouveau does something very stupid, i.e. replicate the
>>> pending_list.
>>
>> Ah okay I see it now, it waits for all jobs to finish before calling
>> drm_sched_fini(). For some reason I did not think it was doing that
>> given the cover letter starts with how that is a big no-no.
>>
>>>> Reportedly it triggers for the mock scheduler which also has no
>>>> leak.
>>>
>>> That sounds impossible. How do you ensure you do *not* leak memory
>>> when you tear
>>> down the scheduler while it still has pending jobs? Or in other
>>> words, who calls
>>> free_job() if not the scheduler itself?
>>
>> Well the cover letter says it triggers so it is possible. :)
> 
> Where does it say that?

"""
Tvrtko's unit tests also run as expected (except for the new warning
print in patch 3), which is not surprising since they don't provide the
callback.
"""

> I mean, the warning would trigger if a driver is leaking jobs, which is
> clearly a bug (and since a list is leaked, it might then even be a
> false-negative in kmemleak).
> 
> It's actually quite simple:
>     1. jobs are being freed by free_job()
>     2. If you call drm_sched_fini(), the work item for free_job() gets
>        deactivated.
>     3. Depending on the load (race), you leak the jobs.
> 
> That's not that much of a problem for hardware schedulers, but firmware
> schedulers who tear down scheduler instances all the time could leak
> quite some amount of memory over time.
> 
> I can't see why a warning print would ever be bad there. Even Christian
> wants it.

The question I raised is whether there are other drivers which manage to 
clean up everything correctly (like the mock scheduler does) but trigger 
that warning. Maybe there are not, and maybe the mock scheduler is the 
only false positive.

Regards,

Tvrtko

>> Mock scheduler also tracks the pending jobs itself, but different
>> from
>> nouveau it does not wait for jobs to finish and free worker to
>> process
>> them all, but having stopped the "hw" backend it cancels them and
>> calls
>> the free_job vfunc directly.
>>
>> Going back to the topic of this series, if we go with a solution
>> along
>> the lines of the proposed, I wonder if it would be doable without
>> mandating that drivers keep a list parallel to pending_list. Instead
>> have a vfunc DRM scheduler would call to cancel job at a time from
>> *its*
>> pending list. It would go nicely together with
>> prepare/run/timedout/free.
>>
>> Would it allow getting rid of the new state machinery and just
>> cancelling and freeing in one go directly from drm_sched_fini()?
>>
>> Regards,
>>
>> Tvrtko
>>
>>>> Also, I asked in my initial reply if we have a list of which of
>>>> the current
>>>> drivers suffer from memory leaks. Is it all or some etc.
>>>
>>> Not all, but quite some I think. The last time I looked (which is
>>> about a year
>>> ago) amdgpu for instance could leak memory when you unbind the
>>> driver while
>>> enough jobs are in flight.
>>
> 

Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Danilo Krummrich 9 months, 3 weeks ago
On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
> Question I raised is if there are other drivers which manage to clean up
> everything correctly (like the mock scheduler does), but trigger that
> warning. Maybe there are not and maybe mock scheduler is the only false
> positive.

So far the scheduler simply does not give any guideline on how to address the
problem, hence every driver simply does something (or nothing, effectively
ignoring the problem). This is what we want to fix.

The mock scheduler keeps its own list of pending jobs and on teardown stops
the scheduler's workqueues, traverses its own list and eventually frees the
pending jobs without updating the scheduler's internal pending list.

So yes, it does avoid memory leaks, but it also leaves the scheduler's internal
structures in an invalid state, i.e. the pending list of the scheduler has
pointers to already freed memory.

What if drm_sched_fini() starts touching the pending list? Then you'd end up
with UAF bugs with this implementation. We cannot invalidate the scheduler's
internal structures and yet call scheduler functions - e.g. drm_sched_fini() -
subsequently.

Hence, the current implementation of the mock scheduler is fundamentally flawed.
And so would be *every* driver that still has entries within the scheduler's
pending list.

This is not a false positive, it already caught a real bug -- in the mock
scheduler.

- Danilo
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Tvrtko Ursulin 9 months, 3 weeks ago
On 22/04/2025 12:13, Danilo Krummrich wrote:
> On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
>> Question I raised is if there are other drivers which manage to clean up
>> everything correctly (like the mock scheduler does), but trigger that
>> warning. Maybe there are not and maybe mock scheduler is the only false
>> positive.
> 
> So far the scheduler simply does not give any guideline on how to address the
> problem, hence every driver simply does something (or nothing, effectively
> ignoring the problem). This is what we want to fix.
> 
> The mock scheduler keeps it's own list of pending jobs and on tear down stops
> the scheduler's workqueues, traverses it's own list and eventually frees the
> pending jobs without updating the scheduler's internal pending list.
> 
> So yes, it does avoid memory leaks, but it also leaves the schedulers internal
> structures with an invalid state, i.e. the pending list of the scheduler has
> pointers to already freed memory.
> 
> What if the drm_sched_fini() starts touching the pending list? Then you'd end up
> with UAF bugs with this implementation. We cannot invalidate the schedulers
> internal structures and yet call scheduler functions - e.g. drm_sched_fini() -
> subsequently.
> 
> Hence, the current implementation of the mock scheduler is fundamentally flawed.
> And so would be *every* driver that still has entries within the scheduler's
> pending list.
> 
> This is not a false positive, it already caught a real bug -- in the mock
> scheduler.

To avoid further splitting hairs on whether real bugs need to be able to 
manifest or not, let's move past this with the conclusion that there are 
two potential things to do here:

The first one is to either send separately, or include in this series, 
something like:

diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
index f999c8859cf7..7c4df0e890ac 100644
--- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
+++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
@@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
                 drm_mock_sched_job_complete(job);
         spin_unlock_irqrestore(&sched->lock, flags);
 
+       drm_sched_fini(&sched->base);
+
         /*
          * Free completed jobs and jobs not yet processed by the DRM scheduler
          * free worker.
@@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
 
         list_for_each_entry_safe(job, next, &list, link)
                 mock_sched_free_job(&job->base);
-
-       drm_sched_fini(&sched->base);
 }
 
 /**

That should satisfy the requirement to "clear" memory about to be freed 
and be 100% compliant with drm_sched_fini() kerneldoc (guideline b).

But the new warning from 3/5 here will still be there AFAICT and would 
you then agree it is a false positive?

Secondly, the series should modify all drivers (including the unit 
tests) which are known to trigger this false positive.

Regards,

Tvrtko
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Danilo Krummrich 9 months, 3 weeks ago
On Tue, Apr 22, 2025 at 01:07:47PM +0100, Tvrtko Ursulin wrote:
> 
> On 22/04/2025 12:13, Danilo Krummrich wrote:
> > On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
> > > Question I raised is if there are other drivers which manage to clean up
> > > everything correctly (like the mock scheduler does), but trigger that
> > > warning. Maybe there are not and maybe mock scheduler is the only false
> > > positive.
> > 
> > So far the scheduler simply does not give any guideline on how to address the
> > problem, hence every driver simply does something (or nothing, effectively
> > ignoring the problem). This is what we want to fix.
> > 
> > The mock scheduler keeps it's own list of pending jobs and on tear down stops
> > the scheduler's workqueues, traverses it's own list and eventually frees the
> > pending jobs without updating the scheduler's internal pending list.
> > 
> > So yes, it does avoid memory leaks, but it also leaves the schedulers internal
> > structures with an invalid state, i.e. the pending list of the scheduler has
> > pointers to already freed memory.
> > 
> > What if the drm_sched_fini() starts touching the pending list? Then you'd end up
> > with UAF bugs with this implementation. We cannot invalidate the schedulers
> > internal structures and yet call scheduler functions - e.g. drm_sched_fini() -
> > subsequently.
> > 
> > Hence, the current implementation of the mock scheduler is fundamentally flawed.
> > And so would be *every* driver that still has entries within the scheduler's
> > pending list.
> > 
> > This is not a false positive, it already caught a real bug -- in the mock
> > scheduler.
> 
> To avoid furher splitting hairs on whether real bugs need to be able to
> manifest or not, lets move past this with a conclusion that there are two
> potential things to do here:

This is not about splitting hairs, it is about understanding that abusing
knowledge about internals of a component to clean things up is *never* valid.

> First one is to either send separately or include in this series something
> like:
> 
> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> index f999c8859cf7..7c4df0e890ac 100644
> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> @@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct drm_mock_scheduler
> *sched)
>                 drm_mock_sched_job_complete(job);
>         spin_unlock_irqrestore(&sched->lock, flags);
> 
> +       drm_sched_fini(&sched->base);
> +
>         /*
>          * Free completed jobs and jobs not yet processed by the DRM
> scheduler
>          * free worker.
> @@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct drm_mock_scheduler
> *sched)
> 
>         list_for_each_entry_safe(job, next, &list, link)
>                 mock_sched_free_job(&job->base);
> -
> -       drm_sched_fini(&sched->base);
>  }
> 
>  /**
> 
> That should satisfy the requirement to "clear" memory about to be freed and
> be 100% compliant with drm_sched_fini() kerneldoc (guideline b).
> 
> But the new warning from 3/5 here will still be there AFAICT and would you
> then agree it is a false positive?

No, I do not agree.

Even if a driver does what you describe, it is not the correct thing to do, and
having a warning call it out makes sense.

This way of cleaning things up relies entirely on knowing specific scheduler
internals and may fall apart if those internals change.

> Secondly, the series should modify all drivers (including the unit tests)
> which are known to trigger this false positive.

Again, there are no false positives. It is the scheduler that needs to call
free_job() and other potential cleanups. You can't just stop the scheduler,
leave it in an intermediate state and try to clean it up by hand relying on
knowledge about internals.

Consequently, when the pending list isn't empty when drm_sched_fini() is called,
it *always* is a bug.
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Tvrtko Ursulin 9 months, 3 weeks ago
On 22/04/2025 13:32, Danilo Krummrich wrote:
> On Tue, Apr 22, 2025 at 01:07:47PM +0100, Tvrtko Ursulin wrote:
>>
>> On 22/04/2025 12:13, Danilo Krummrich wrote:
>>> On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
>>>> Question I raised is if there are other drivers which manage to clean up
>>>> everything correctly (like the mock scheduler does), but trigger that
>>>> warning. Maybe there are not and maybe mock scheduler is the only false
>>>> positive.
>>>
>>> So far the scheduler simply does not give any guideline on how to address the
>>> problem, hence every driver simply does something (or nothing, effectively
>>> ignoring the problem). This is what we want to fix.
>>>
>>> The mock scheduler keeps its own list of pending jobs and on tear down stops
>>> the scheduler's workqueues, traverses its own list and eventually frees the
>>> pending jobs without updating the scheduler's internal pending list.
>>>
>>> So yes, it does avoid memory leaks, but it also leaves the scheduler's internal
>>> structures with an invalid state, i.e. the pending list of the scheduler has
>>> pointers to already freed memory.
>>>
>>> What if the drm_sched_fini() starts touching the pending list? Then you'd end up
>>> with UAF bugs with this implementation. We cannot invalidate the scheduler's
>>> internal structures and yet call scheduler functions - e.g. drm_sched_fini() -
>>> subsequently.
>>>
>>> Hence, the current implementation of the mock scheduler is fundamentally flawed.
>>> And so would be *every* driver that still has entries within the scheduler's
>>> pending list.
>>>
>>> This is not a false positive, it already caught a real bug -- in the mock
>>> scheduler.
>>
>> To avoid furher splitting hairs on whether real bugs need to be able to
>> manifest or not, lets move past this with a conclusion that there are two
>> potential things to do here:
> 
> This is not about splitting hairs, it is about understanding that abusing
> knowledge about internals of a component to clean things up is *never* valid.
> 
>> First one is to either send separately or include in this series something
>> like:
>>
>> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>> b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>> index f999c8859cf7..7c4df0e890ac 100644
>> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>> @@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct drm_mock_scheduler
>> *sched)
>>                  drm_mock_sched_job_complete(job);
>>          spin_unlock_irqrestore(&sched->lock, flags);
>>
>> +       drm_sched_fini(&sched->base);
>> +
>>          /*
>>           * Free completed jobs and jobs not yet processed by the DRM
>> scheduler
>>           * free worker.
>> @@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct drm_mock_scheduler
>> *sched)
>>
>>          list_for_each_entry_safe(job, next, &list, link)
>>                  mock_sched_free_job(&job->base);
>> -
>> -       drm_sched_fini(&sched->base);
>>   }
>>
>>   /**
>>
>> That should satisfy the requirement to "clear" memory about to be freed and
>> be 100% compliant with drm_sched_fini() kerneldoc (guideline b).
>>
>> But the new warning from 3/5 here will still be there AFAICT and would you
>> then agree it is a false positive?
> 
> No, I do not agree.
> 
> Even if a driver does what you describe it is not the correct thing to do and
> having a warning call it out makes sense.
> 
> This way of cleaning things up entirely relies on knowing specific scheduler
> internals, which if changed, may fall apart.
> 
>> Secondly, the series should modify all drivers (including the unit tests)
>> which are known to trigger this false positive.
> 
> Again, there are no false positives. It is the scheduler that needs to call
> free_job() and other potential cleanups. You can't just stop the scheduler,
> leave it in an intermediate state and try to clean it up by hand relying on
> knowledge about internals.

Sorry, I don't see the argument for the claim that this relies on the 
internals with the re-positioned drm_sched_fini() call. In that case it is 
fully compliant with:

/**
  * drm_sched_fini - Destroy a gpu scheduler
  *
  * @sched: scheduler instance
  *
  * Tears down and cleans up the scheduler.
  *
  * This stops submission of new jobs to the hardware through
  * drm_sched_backend_ops.run_job(). Consequently, 
drm_sched_backend_ops.free_job()
  * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
  * There is no solution for this currently. Thus, it is up to the 
driver to make
  * sure that:
  *
  *  a) drm_sched_fini() is only called after for all submitted jobs
  *     drm_sched_backend_ops.free_job() has been called or that
  *  b) the jobs for which drm_sched_backend_ops.free_job() has not been 
called
  *
  * FIXME: Take care of the above problem and prevent this function from 
leaking
  * the jobs in drm_gpu_scheduler.pending_list under any circumstances.

^^^ recommended solution b).

> Consequently, when the pending list isn't empty when drm_sched_fini() is called,
> it *always* is a bug.

I am simply arguing that a quick audit of the drivers should be done to 
see that the dev_err is not added for drivers which clean up in 
compliance with drm_sched_fini() kerneldoc.

Starting to log errors from those would add work for many people in the 
bug handling chain. Sure, you can say let's add the dev_err and then we 
don't have to look into the code base, just wait for bug reports to come 
our way. That works too (on some level), but let's please state the 
intent clearly and explicitly.

Regards,

Tvrtko
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Danilo Krummrich 9 months, 3 weeks ago
On Tue, Apr 22, 2025 at 02:39:21PM +0100, Tvrtko Ursulin wrote:
> 
> On 22/04/2025 13:32, Danilo Krummrich wrote:
> > On Tue, Apr 22, 2025 at 01:07:47PM +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 22/04/2025 12:13, Danilo Krummrich wrote:
> > > > On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
> > > > > Question I raised is if there are other drivers which manage to clean up
> > > > > everything correctly (like the mock scheduler does), but trigger that
> > > > > warning. Maybe there are not and maybe mock scheduler is the only false
> > > > > positive.
> > > > 
> > > > So far the scheduler simply does not give any guideline on how to address the
> > > > problem, hence every driver simply does something (or nothing, effectively
> > > > ignoring the problem). This is what we want to fix.
> > > > 
> > > > The mock scheduler keeps it's own list of pending jobs and on tear down stops
> > > > the scheduler's workqueues, traverses it's own list and eventually frees the
> > > > pending jobs without updating the scheduler's internal pending list.
> > > > 
> > > > So yes, it does avoid memory leaks, but it also leaves the schedulers internal
> > > > structures with an invalid state, i.e. the pending list of the scheduler has
> > > > pointers to already freed memory.
> > > > 
> > > > What if the drm_sched_fini() starts touching the pending list? Then you'd end up
> > > > with UAF bugs with this implementation. We cannot invalidate the schedulers
> > > > internal structures and yet call scheduler functions - e.g. drm_sched_fini() -
> > > > subsequently.
> > > > 
> > > > Hence, the current implementation of the mock scheduler is fundamentally flawed.
> > > > And so would be *every* driver that still has entries within the scheduler's
> > > > pending list.
> > > > 
> > > > This is not a false positive, it already caught a real bug -- in the mock
> > > > scheduler.
> > > 
> > > To avoid furher splitting hairs on whether real bugs need to be able to
> > > manifest or not, lets move past this with a conclusion that there are two
> > > potential things to do here:
> > 
> > This is not about splitting hairs, it is about understanding that abusing
> > knowledge about internals of a component to clean things up is *never* valid.
> > 
> > > First one is to either send separately or include in this series something
> > > like:
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > index f999c8859cf7..7c4df0e890ac 100644
> > > --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > @@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct drm_mock_scheduler
> > > *sched)
> > >                  drm_mock_sched_job_complete(job);
> > >          spin_unlock_irqrestore(&sched->lock, flags);
> > > 
> > > +       drm_sched_fini(&sched->base);
> > > +
> > >          /*
> > >           * Free completed jobs and jobs not yet processed by the DRM
> > > scheduler
> > >           * free worker.
> > > @@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct drm_mock_scheduler
> > > *sched)
> > > 
> > >          list_for_each_entry_safe(job, next, &list, link)
> > >                  mock_sched_free_job(&job->base);
> > > -
> > > -       drm_sched_fini(&sched->base);
> > >   }
> > > 
> > >   /**
> > > 
> > > That should satisfy the requirement to "clear" memory about to be freed and
> > > be 100% compliant with drm_sched_fini() kerneldoc (guideline b).
> > > 
> > > But the new warning from 3/5 here will still be there AFAICT and would you
> > > then agree it is a false positive?
> > 
> > No, I do not agree.
> > 
> > Even if a driver does what you describe it is not the correct thing to do and
> > having a warning call it out makes sense.
> > 
> > This way of cleaning things up entirely relies on knowing specific scheduler
> > internals, which if changed, may fall apart.
> > 
> > > Secondly, the series should modify all drivers (including the unit tests)
> > > which are known to trigger this false positive.
> > 
> > Again, there are no false positives. It is the scheduler that needs to call
> > free_job() and other potential cleanups. You can't just stop the scheduler,
> > leave it in an intermediate state and try to clean it up by hand relying on
> > knowledge about internals.
> 
> Sorry I don't see the argument for the claim it is relying on the internals
> with the re-positioned drm_sched_fini call. In that case it is fully
> compliant with:
> 
> /**
>  * drm_sched_fini - Destroy a gpu scheduler
>  *
>  * @sched: scheduler instance
>  *
>  * Tears down and cleans up the scheduler.
>  *
>  * This stops submission of new jobs to the hardware through
>  * drm_sched_backend_ops.run_job(). Consequently,
> drm_sched_backend_ops.free_job()
>  * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
>  * There is no solution for this currently. Thus, it is up to the driver to
> make
>  * sure that:
>  *
>  *  a) drm_sched_fini() is only called after for all submitted jobs
>  *     drm_sched_backend_ops.free_job() has been called or that
>  *  b) the jobs for which drm_sched_backend_ops.free_job() has not been
> called
>  *
>  * FIXME: Take care of the above problem and prevent this function from
> leaking
>  * the jobs in drm_gpu_scheduler.pending_list under any circumstances.
> 
> ^^^ recommended solution b).

This has been introduced recently with commit baf4afc58314 ("drm/sched: Improve
teardown documentation") and I do not agree with this. The scheduler should
*not* make any promises about implementation details to enable drivers to abuse
their knowledge about component internals.

This makes the problem *worse* as it encourages drivers to rely on
implementation details, making maintainability of the scheduler even worse.

For instance, what if I change the scheduler implementation, such that for every
entry in the pending_list the scheduler allocates another internal object for
${something}? Then drivers would already fall apart leaking those internal
objects.

Now, obviously that's pretty unlikely, but I assume you get the idea.

The b) paragraph in drm_sched_fini() should be removed for the given reasons.

AFAICS, since the introduction of this commit, driver implementations haven't
changed in this regard, hence we should be good.

So, for me this doesn't change the fact that every driver implementation that
just stops the scheduler at an arbitrary point of time and tries to clean things
up manually relying on knowledge about component internals is broken.

However, this doesn't mean we can't do a brief audit.

- Danilo
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Philipp Stanner 9 months, 3 weeks ago
On Tue, 2025-04-22 at 16:08 +0200, Danilo Krummrich wrote:
> On Tue, Apr 22, 2025 at 02:39:21PM +0100, Tvrtko Ursulin wrote:
> > 
> > On 22/04/2025 13:32, Danilo Krummrich wrote:
> > > On Tue, Apr 22, 2025 at 01:07:47PM +0100, Tvrtko Ursulin wrote:
> > > > 
> > > > On 22/04/2025 12:13, Danilo Krummrich wrote:
> > > > > On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin
> > > > > wrote:
> > > > > > Question I raised is if there are other drivers which
> > > > > > manage to clean up
> > > > > > everything correctly (like the mock scheduler does), but
> > > > > > trigger that
> > > > > > warning. Maybe there are not and maybe mock scheduler is
> > > > > > the only false
> > > > > > positive.
> > > > > 
> > > > > So far the scheduler simply does not give any guideline on
> > > > > how to address the
> > > > > problem, hence every driver simply does something (or
> > > > > nothing, effectively
> > > > > ignoring the problem). This is what we want to fix.
> > > > > 
> > > > > The mock scheduler keeps it's own list of pending jobs and on
> > > > > tear down stops
> > > > > the scheduler's workqueues, traverses it's own list and
> > > > > eventually frees the
> > > > > pending jobs without updating the scheduler's internal
> > > > > pending list.
> > > > > 
> > > > > So yes, it does avoid memory leaks, but it also leaves the
> > > > > schedulers internal
> > > > > structures with an invalid state, i.e. the pending list of
> > > > > the scheduler has
> > > > > pointers to already freed memory.
> > > > > 
> > > > > What if the drm_sched_fini() starts touching the pending
> > > > > list? Then you'd end up
> > > > > with UAF bugs with this implementation. We cannot invalidate
> > > > > the schedulers
> > > > > internal structures and yet call scheduler functions - e.g.
> > > > > drm_sched_fini() -
> > > > > subsequently.
> > > > > 
> > > > > Hence, the current implementation of the mock scheduler is
> > > > > fundamentally flawed.
> > > > > And so would be *every* driver that still has entries within
> > > > > the scheduler's
> > > > > pending list.
> > > > > 
> > > > > This is not a false positive, it already caught a real bug --
> > > > > in the mock
> > > > > scheduler.
> > > > 
> > > > To avoid furher splitting hairs on whether real bugs need to be
> > > > able to
> > > > manifest or not, lets move past this with a conclusion that
> > > > there are two
> > > > potential things to do here:
> > > 
> > > This is not about splitting hairs, it is about understanding that
> > > abusing
> > > knowledge about internals of a component to clean things up is
> > > *never* valid.
> > > 
> > > > First one is to either send separately or include in this
> > > > series something
> > > > like:
> > > > 
> > > > diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > > b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > > index f999c8859cf7..7c4df0e890ac 100644
> > > > --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > > +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > > @@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct
> > > > drm_mock_scheduler
> > > > *sched)
> > > >                  drm_mock_sched_job_complete(job);
> > > >          spin_unlock_irqrestore(&sched->lock, flags);
> > > > 
> > > > +       drm_sched_fini(&sched->base);
> > > > +
> > > >          /*
> > > >           * Free completed jobs and jobs not yet processed by
> > > > the DRM
> > > > scheduler
> > > >           * free worker.
> > > > @@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct
> > > > drm_mock_scheduler
> > > > *sched)
> > > > 
> > > >          list_for_each_entry_safe(job, next, &list, link)
> > > >                  mock_sched_free_job(&job->base);
> > > > -
> > > > -       drm_sched_fini(&sched->base);
> > > >   }
> > > > 
> > > >   /**
> > > > 
> > > > That should satisfy the requirement to "clear" memory about to
> > > > be freed and
> > > > be 100% compliant with drm_sched_fini() kerneldoc (guideline
> > > > b).
> > > > 
> > > > But the new warning from 3/5 here will still be there AFAICT
> > > > and would you
> > > > then agree it is a false positive?
> > > 
> > > No, I do not agree.
> > > 
> > > Even if a driver does what you describe it is not the correct
> > > thing to do and
> > > having a warning call it out makes sense.
> > > 
> > > This way of cleaning things up entirely relies on knowing
> > > specific scheduler
> > > internals, which if changed, may fall apart.
> > > 
> > > > Secondly, the series should modify all drivers (including the
> > > > unit tests)
> > > > which are known to trigger this false positive.
> > > 
> > > Again, there are no false positives. It is the scheduler that
> > > needs to call
> > > free_job() and other potential cleanups. You can't just stop the
> > > scheduler,
> > > leave it in an intermediate state and try to clean it up by hand
> > > relying on
> > > knowledge about internals.
> > 
> > Sorry I don't see the argument for the claim it is relying on the
> > internals
> > with the re-positioned drm_sched_fini call. In that case it is
> > fully
> > compliant with:
> > 
> > /**
> >  * drm_sched_fini - Destroy a gpu scheduler
> >  *
> >  * @sched: scheduler instance
> >  *
> >  * Tears down and cleans up the scheduler.
> >  *
> >  * This stops submission of new jobs to the hardware through
> >  * drm_sched_backend_ops.run_job(). Consequently,
> > drm_sched_backend_ops.free_job()
> >  * will not be called for all jobs still in
> > drm_gpu_scheduler.pending_list.
> >  * There is no solution for this currently. Thus, it is up to the
> > driver to
> > make
> >  * sure that:
> >  *
> >  *  a) drm_sched_fini() is only called after for all submitted jobs
> >  *     drm_sched_backend_ops.free_job() has been called or that
> >  *  b) the jobs for which drm_sched_backend_ops.free_job() has not
> > been
> > called
> >  *
> >  * FIXME: Take care of the above problem and prevent this function
> > from
> > leaking
> >  * the jobs in drm_gpu_scheduler.pending_list under any
> > circumstances.
> > 
> > ^^^ recommended solution b).
> 
> This has been introduced recently with commit baf4afc58314
> ("drm/sched: Improve
> teardown documentation") and I do not agree with this. The scheduler
> should
> *not* make any promises about implementation details to enable
> drivers to abuse
> their knowledge about component internals.
> 
> This makes the problem *worse* as it encourages drivers to rely on
> implementation details, making maintainability of the scheduler even
> worse.
> 
> For instance, what if I change the scheduler implementation, such
> that for every
> entry in the pending_list the scheduler allocates another internal
> object for
> ${something}? Then drivers would already fall apart leaking those
> internal
> objects.
> 
> Now, obviously that's pretty unlikely, but I assume you get the idea.
> 
> The b) paragraph in drm_sched_fini() should be removed for the given
> reasons.
> 
> AFAICS, since the introduction of this commit, driver implementations
> haven't
> changed in this regard, hence we should be good.
> 
> So, for me this doesn't change the fact that every driver
> implementation that
> just stops the scheduler at an arbitrary point of time and tries to
> clean things
> up manually relying on knowledge about component internals is broken.

To elaborate on that, this documentation has been written so that we at
least have *some* documentation about the problem, instead of just
letting new drivers run into the knife.

The commit explicitly introduced the FIXME, marking those two hacky
workarounds as undesirable.

But back then we couldn't fix the problem quickly, so it was either
document the issue at least a bit, or leave it completely undocumented.

P.

> 
> However, this doesn't mean we can't do a brief audit.
> 
> - Danilo
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Danilo Krummrich 9 months, 3 weeks ago
On Tue, Apr 22, 2025 at 04:16:48PM +0200, Philipp Stanner wrote:
> On Tue, 2025-04-22 at 16:08 +0200, Danilo Krummrich wrote:
> > On Tue, Apr 22, 2025 at 02:39:21PM +0100, Tvrtko Ursulin wrote:
> 
> > > Sorry I don't see the argument for the claim it is relying on the
> > > internals
> > > with the re-positioned drm_sched_fini call. In that case it is
> > > fully
> > > compliant with:
> > > 
> > > /**
> > >  * drm_sched_fini - Destroy a gpu scheduler
> > >  *
> > >  * @sched: scheduler instance
> > >  *
> > >  * Tears down and cleans up the scheduler.
> > >  *
> > >  * This stops submission of new jobs to the hardware through
> > >  * drm_sched_backend_ops.run_job(). Consequently,
> > > drm_sched_backend_ops.free_job()
> > >  * will not be called for all jobs still in
> > > drm_gpu_scheduler.pending_list.
> > >  * There is no solution for this currently. Thus, it is up to the
> > > driver to
> > > make
> > >  * sure that:
> > >  *
> > >  *  a) drm_sched_fini() is only called after for all submitted jobs
> > >  *     drm_sched_backend_ops.free_job() has been called or that
> > >  *  b) the jobs for which drm_sched_backend_ops.free_job() has not
> > > been
> > > called
> > >  *
> > >  * FIXME: Take care of the above problem and prevent this function
> > > from
> > > leaking
> > >  * the jobs in drm_gpu_scheduler.pending_list under any
> > > circumstances.
> > > 
> > > ^^^ recommended solution b).
> > 
> > This has been introduced recently with commit baf4afc58314
> > ("drm/sched: Improve
> > teardown documentation") and I do not agree with this. The scheduler
> > should
> > *not* make any promises about implementation details to enable
> > drivers to abuse
> > their knowledge about component internals.
> > 
> > This makes the problem *worse* as it encourages drivers to rely on
> > implementation details, making maintainability of the scheduler even
> > worse.
> > 
> > For instance, what if I change the scheduler implementation, such
> > that for every
> > entry in the pending_list the scheduler allocates another internal
> > object for
> > ${something}? Then drivers would already fall apart leaking those
> > internal
> > objects.
> > 
> > Now, obviously that's pretty unlikely, but I assume you get the idea.
> > 
> > The b) paragraph in drm_sched_fini() should be removed for the given
> > reasons.
> > 
> > AFAICS, since the introduction of this commit, driver implementations
> > haven't
> > changed in this regard, hence we should be good.
> > 
> > So, for me this doesn't change the fact that every driver
> > implementation that
> > just stops the scheduler at an arbitrary point of time and tries to
> > clean things
> > up manually relying on knowledge about component internals is broken.
> 
> To elaborate on that, this documentation has been written so that we at
> least have *some* documentation about the problem, instead of just
> letting new drivers run into the knife.
> 
> The commit explicitly introduced the FIXME, marking those two hacky
> workarounds as undesirable.
> 
> But back then we couldn't fix the problem quickly, so it was either
> document the issue at least a bit, or leave it completely undocumented.

Agreed, but b) really sounds like an invitation (or even justification) for
doing the wrong thing; let's remove it.
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Tvrtko Ursulin 9 months, 3 weeks ago
On 22/04/2025 15:52, Danilo Krummrich wrote:
> On Tue, Apr 22, 2025 at 04:16:48PM +0200, Philipp Stanner wrote:
>> On Tue, 2025-04-22 at 16:08 +0200, Danilo Krummrich wrote:
>>> On Tue, Apr 22, 2025 at 02:39:21PM +0100, Tvrtko Ursulin wrote:
>>
>>>> Sorry I don't see the argument for the claim it is relying on the
>>>> internals
>>>> with the re-positioned drm_sched_fini call. In that case it is
>>>> fully
>>>> compliant with:
>>>>
>>>> /**
>>>>   * drm_sched_fini - Destroy a gpu scheduler
>>>>   *
>>>>   * @sched: scheduler instance
>>>>   *
>>>>   * Tears down and cleans up the scheduler.
>>>>   *
>>>>   * This stops submission of new jobs to the hardware through
>>>>   * drm_sched_backend_ops.run_job(). Consequently,
>>>> drm_sched_backend_ops.free_job()
>>>>   * will not be called for all jobs still in
>>>> drm_gpu_scheduler.pending_list.
>>>>   * There is no solution for this currently. Thus, it is up to the
>>>> driver to
>>>> make
>>>>   * sure that:
>>>>   *
>>>>   *  a) drm_sched_fini() is only called after for all submitted jobs
>>>>   *     drm_sched_backend_ops.free_job() has been called or that
>>>>   *  b) the jobs for which drm_sched_backend_ops.free_job() has not
>>>> been
>>>> called
>>>>   *
>>>>   * FIXME: Take care of the above problem and prevent this function
>>>> from
>>>> leaking
>>>>   * the jobs in drm_gpu_scheduler.pending_list under any
>>>> circumstances.
>>>>
>>>> ^^^ recommended solution b).
>>>
>>> This has been introduced recently with commit baf4afc58314
>>> ("drm/sched: Improve
>>> teardown documentation") and I do not agree with this. The scheduler
>>> should
>>> *not* make any promises about implementation details to enable
>>> drivers to abuse
>>> their knowledge about component internals.
>>>
>>> This makes the problem *worse* as it encourages drivers to rely on
>>> implementation details, making maintainability of the scheduler even
>>> worse.
>>>
>>> For instance, what if I change the scheduler implementation, such
>>> that for every
>>> entry in the pending_list the scheduler allocates another internal
>>> object for
>>> ${something}? Then drivers would already fall apart leaking those
>>> internal
>>> objects.
>>>
>>> Now, obviously that's pretty unlikely, but I assume you get the idea.
>>>
>>> The b) paragraph in drm_sched_fini() should be removed for the given
>>> reasons.
>>>
>>> AFAICS, since the introduction of this commit, driver implementations
>>> haven't
>>> changed in this regard, hence we should be good.
>>>
>>> So, for me this doesn't change the fact that every driver
>>> implementation that
>>> just stops the scheduler at an arbitrary point of time and tries to
>>> clean things
>>> up manually relying on knowledge about component internals is broken.
>>
>> To elaborate on that, this documentation has been written so that we at
>> least have *some* documentation about the problem, instead of just
>> letting new drivers run into the knife.
>>
>> The commit explicitly introduced the FIXME, marking those two hacky
>> workarounds as undesirable.
>>
>> But back then we couldn't fix the problem quickly, so it was either
>> document the issue at least a bit, or leave it completely undocumented.
> 
> Agreed, but b) really sounds like an invitation (or even justification) for
> doing the wrong thing, let's removed it.

IMO it is better to leave it. Regardless of whether it was added because 
some driver is actually operating like that, it does describe a 
_currently_ workable option to avoid memory leaks. Once a better method 
is there, i.e. the FIXME is addressed, it can be removed or replaced.

Regards,

Tvrtko

Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Danilo Krummrich 9 months, 3 weeks ago
On Wed, Apr 23, 2025 at 08:34:08AM +0100, Tvrtko Ursulin wrote:
> 
> IMO it is better to leave it. Regardless of whether it was added because
> some driver is actually operating like that, it does describe a _currently_
> workable option to avoid memory leaks. Once a better method is there, ie.
> FIXME is addressed, then it can be removed or replaced.

I'm not willing to sign off on encouraging drivers to rely on scheduler
internals -- also not in this case, sorry.

Our primary goal with the scheduler is to *remove* such broken contracts where
drivers rely on scheduler internal implementation details, mess with scheduler
internal data structures etc. This is clearly a step back.

And AFAICT, as of now drivers either do a) or simply do nothing (with the
exception of the mock scheduler). Drivers can do a) in the meantime; there's no
reason at all to additionally offer b).
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Tvrtko Ursulin 9 months, 2 weeks ago
On 23/04/2025 09:48, Danilo Krummrich wrote:
> On Wed, Apr 23, 2025 at 08:34:08AM +0100, Tvrtko Ursulin wrote:
>>
>> IMO it is better to leave it. Regardless of whether it was added because
>> some driver is actually operating like that, it does describe a _currently_
>> workable option to avoid memory leaks. Once a better method is there, ie.
>> FIXME is addressed, then it can be removed or replaced.
> 
> I'm not willing to sign off on encouraging drivers to rely on scheduler
> internals -- also not in this case, sorry.
> 
> Our primary goal with the scheduler is to *remove* such broken contracts where
> drivers rely on scheduler internal implementation details, mess with scheduler
> internal data structures etc. This is clearly a step back.
> 
> And AFAICT, as by now drivers either do a) or simply nothing (with the exception
> of the mock scheduler). Drivers can do a) in the meantime, there's no reason at
> all to additionally offer b).

What mechanism do we currently have that enables a) and that you would 
not consider to require knowledge of scheduler internals?

Regards,

Tvrtko
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Danilo Krummrich 9 months, 2 weeks ago
On Wed, Apr 23, 2025 at 11:10:51AM +0100, Tvrtko Ursulin wrote:
> 
> On 23/04/2025 09:48, Danilo Krummrich wrote:
> > On Wed, Apr 23, 2025 at 08:34:08AM +0100, Tvrtko Ursulin wrote:
> > > 
> > > IMO it is better to leave it. Regardless of whether it was added because
> > > some driver is actually operating like that, it does describe a _currently_
> > > workable option to avoid memory leaks. Once a better method is there, ie.
> > > FIXME is addressed, then it can be removed or replaced.
> > 
> > I'm not willing to sign off on encouraging drivers to rely on scheduler
> > internals -- also not in this case, sorry.
> > 
> > Our primary goal with the scheduler is to *remove* such broken contracts where
> > drivers rely on scheduler internal implementation details, mess with scheduler
> > internal data structures etc. This is clearly a step back.
> > 
> > And AFAICT, as by now drivers either do a) or simply nothing (with the exception
> > of the mock scheduler). Drivers can do a) in the meantime, there's no reason at
> > all to additionally offer b).
> 
> What mechanism do we currently have to enable using a), and which you would
> not consider needing knowledge of scheduler internals?

The rule is that drivers must not call drm_sched_fini() before all jobs are
processed.

For this, drivers may reference count their scheduler "subclass" for each job
in flight and only call drm_sched_fini() once the reference count becomes zero,
or keep their own list of jobs in flight and just wait for the list to become
empty.

This is possible with the regular API and none of this requires relying on
scheduler internals or messing with scheduler internal data structures.

The problem is limited to the fact that the GPU scheduler does not provide an
API for drivers to solve this.
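
For illustration, a minimal sketch of the approach described above, assuming a
hypothetical driver that embeds the scheduler in its own structure; all names
(my_sched, my_free_job, jobs_in_flight) are made up for the example and not
taken from any real driver:

#include <drm/gpu_scheduler.h>
#include <linux/atomic.h>
#include <linux/wait.h>

struct my_sched {
	struct drm_gpu_scheduler base;
	atomic_t jobs_in_flight;	/* incremented on every job submission */
	wait_queue_head_t idle_wq;
};

static void my_free_job(struct drm_sched_job *sched_job)
{
	struct my_sched *s = container_of(sched_job->sched,
					  struct my_sched, base);

	/* ... free the driver's own job object here ... */

	/* The last job is gone, teardown may proceed. */
	if (atomic_dec_and_test(&s->jobs_in_flight))
		wake_up(&s->idle_wq);
}

static void my_sched_teardown(struct my_sched *s)
{
	/*
	 * Wait until free_job() has run for every submitted job, i.e. the
	 * scheduler's pending_list has drained, before drm_sched_fini().
	 */
	wait_event(s->idle_wq, atomic_read(&s->jobs_in_flight) == 0);
	drm_sched_fini(&s->base);
}

Whether a driver uses an atomic counter, a kref on its scheduler subclass or its
own list of jobs in flight is an implementation detail; the point is that the
wait happens through the driver's own bookkeeping and the regular callbacks,
without touching the scheduler's pending_list.
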
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Philipp Stanner 9 months, 3 weeks ago
On Tue, 2025-04-22 at 14:39 +0100, Tvrtko Ursulin wrote:
> 
> On 22/04/2025 13:32, Danilo Krummrich wrote:
> > On Tue, Apr 22, 2025 at 01:07:47PM +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 22/04/2025 12:13, Danilo Krummrich wrote:
> > > > On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
> > > > > Question I raised is if there are other drivers which manage
> > > > > to clean up
> > > > > everything correctly (like the mock scheduler does), but
> > > > > trigger that
> > > > > warning. Maybe there are not and maybe mock scheduler is the
> > > > > only false
> > > > > positive.
> > > > 
> > > > So far the scheduler simply does not give any guideline on how
> > > > to address the
> > > > problem, hence every driver simply does something (or nothing,
> > > > effectively
> > > > ignoring the problem). This is what we want to fix.
> > > > 
> > > > The mock scheduler keeps it's own list of pending jobs and on
> > > > tear down stops
> > > > the scheduler's workqueues, traverses it's own list and
> > > > eventually frees the
> > > > pending jobs without updating the scheduler's internal pending
> > > > list.
> > > > 
> > > > So yes, it does avoid memory leaks, but it also leaves the
> > > > schedulers internal
> > > > structures with an invalid state, i.e. the pending list of the
> > > > scheduler has
> > > > pointers to already freed memory.
> > > > 
> > > > What if the drm_sched_fini() starts touching the pending list?
> > > > Then you'd end up
> > > > with UAF bugs with this implementation. We cannot invalidate
> > > > the schedulers
> > > > internal structures and yet call scheduler functions - e.g.
> > > > drm_sched_fini() -
> > > > subsequently.
> > > > 
> > > > Hence, the current implementation of the mock scheduler is
> > > > fundamentally flawed.
> > > > And so would be *every* driver that still has entries within
> > > > the scheduler's
> > > > pending list.
> > > > 
> > > > This is not a false positive, it already caught a real bug --
> > > > in the mock
> > > > scheduler.
> > > 
> > > To avoid furher splitting hairs on whether real bugs need to be
> > > able to
> > > manifest or not, lets move past this with a conclusion that there
> > > are two
> > > potential things to do here:
> > 
> > This is not about splitting hairs, it is about understanding that
> > abusing
> > knowledge about internals of a component to clean things up is
> > *never* valid.
> > 
> > > First one is to either send separately or include in this series
> > > something
> > > like:
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > index f999c8859cf7..7c4df0e890ac 100644
> > > --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > @@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct
> > > drm_mock_scheduler
> > > *sched)
> > >                  drm_mock_sched_job_complete(job);
> > >          spin_unlock_irqrestore(&sched->lock, flags);
> > > 
> > > +       drm_sched_fini(&sched->base);
> > > +
> > >          /*
> > >           * Free completed jobs and jobs not yet processed by the
> > > DRM
> > > scheduler
> > >           * free worker.
> > > @@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct
> > > drm_mock_scheduler
> > > *sched)
> > > 
> > >          list_for_each_entry_safe(job, next, &list, link)
> > >                  mock_sched_free_job(&job->base);
> > > -
> > > -       drm_sched_fini(&sched->base);
> > >   }
> > > 
> > >   /**
> > > 
> > > That should satisfy the requirement to "clear" memory about to be
> > > freed and
> > > be 100% compliant with drm_sched_fini() kerneldoc (guideline b).
> > > 
> > > But the new warning from 3/5 here will still be there AFAICT and
> > > would you
> > > then agree it is a false positive?
> > 
> > No, I do not agree.
> > 
> > Even if a driver does what you describe it is not the correct thing
> > to do and
> > having a warning call it out makes sense.
> > 
> > This way of cleaning things up entirely relies on knowing specific
> > scheduler
> > internals, which if changed, may fall apart.
> > 
> > > Secondly, the series should modify all drivers (including the
> > > unit tests)
> > > which are known to trigger this false positive.
> > 
> > Again, there are no false positives. It is the scheduler that needs
> > to call
> > free_job() and other potential cleanups. You can't just stop the
> > scheduler,
> > leave it in an intermediate state and try to clean it up by hand
> > relying on
> > knowledge about internals.
> 
> Sorry I don't see the argument for the claim it is relying on the 
> internals with the re-positioned drm_sched_fini call. In that case it
> is 
> fully compliant with:
> 
> /**
>   * drm_sched_fini - Destroy a gpu scheduler
>   *
>   * @sched: scheduler instance
>   *
>   * Tears down and cleans up the scheduler.
>   *
>   * This stops submission of new jobs to the hardware through
>   * drm_sched_backend_ops.run_job(). Consequently, 
> drm_sched_backend_ops.free_job()
>   * will not be called for all jobs still in
> drm_gpu_scheduler.pending_list.
>   * There is no solution for this currently. Thus, it is up to the 
> driver to make
>   * sure that:
>   *
>   *  a) drm_sched_fini() is only called after for all submitted jobs
>   *     drm_sched_backend_ops.free_job() has been called or that
>   *  b) the jobs for which drm_sched_backend_ops.free_job() has not
> been 
> called
>   *
>   * FIXME: Take care of the above problem and prevent this function
> from 
> leaking
>   * the jobs in drm_gpu_scheduler.pending_list under any
> circumstances.
> 
> ^^^ recommended solution b).
> 
> > Consequently, when the pending list isn't empty when
> > drm_sched_fini() is called,
> > it *always* is a bug.
> 
> I am simply arguing that a quick audit of the drivers should be done
> to 
> see that the dev_err is not added for drivers which clean up in 
> compliance with drm_sched_fini() kerneldoc.
> 
> Starting to log errors from those would be adding work for many
> people 
> in the bug handling chain. Sure you can say lets add the dev_err and 
> then we don't have to look into the code base, just wait for bug
> reports 
> to come our way. That works too (on some level) but lets please state
> the intent clearly and explicitly.

Well, yes, that exactly is my intention.

All drivers must currently ensure in some custom way that a) all
fences get signaled and b) that the scheduler had time to call
free_job() for all jobs in the pending_list.

If any driver in-tree currently has len(pending_list) > 0 after
drm_sched_fini() ran, those are clearly memory leaks that need to
be fixed.

And, thus, firing the warning for all those drivers is appropriate.

I think it's unlikely to happen though. The hardware schedulers rarely
call drm_sched_fini(), and the firmware schedulers would have memory
leaks so large that they are likely to have been discovered by now.

P.

> 
> Regards,
> 
> Tvrtko
> 
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Philipp Stanner 9 months, 3 weeks ago
On Tue, 2025-04-22 at 13:07 +0100, Tvrtko Ursulin wrote:
> 
> On 22/04/2025 12:13, Danilo Krummrich wrote:
> > On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
> > > Question I raised is if there are other drivers which manage to
> > > clean up
> > > everything correctly (like the mock scheduler does), but trigger
> > > that
> > > warning. Maybe there are not and maybe mock scheduler is the only
> > > false
> > > positive.
> > 
> > So far the scheduler simply does not give any guideline on how to
> > address the
> > problem, hence every driver simply does something (or nothing,
> > effectively
> > ignoring the problem). This is what we want to fix.
> > 
> > The mock scheduler keeps it's own list of pending jobs and on tear
> > down stops
> > the scheduler's workqueues, traverses it's own list and eventually
> > frees the
> > pending jobs without updating the scheduler's internal pending
> > list.
> > 
> > So yes, it does avoid memory leaks, but it also leaves the
> > schedulers internal
> > structures with an invalid state, i.e. the pending list of the
> > scheduler has
> > pointers to already freed memory.
> > 
> > What if the drm_sched_fini() starts touching the pending list? Then
> > you'd end up
> > with UAF bugs with this implementation. We cannot invalidate the
> > schedulers
> > internal structures and yet call scheduler functions - e.g.
> > drm_sched_fini() -
> > subsequently.
> > 
> > Hence, the current implementation of the mock scheduler is
> > fundamentally flawed.
> > And so would be *every* driver that still has entries within the
> > scheduler's
> > pending list.
> > 
> > This is not a false positive, it already caught a real bug -- in
> > the mock
> > scheduler.
> 
> To avoid furher splitting hairs on whether real bugs need to be able
> to 
> manifest or not, lets move past this with a conclusion that there are
> two potential things to do here:
> 
> First one is to either send separately or include in this series 
> something like:
> 
> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c 
> b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> index f999c8859cf7..7c4df0e890ac 100644
> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> @@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct
> drm_mock_scheduler 
> *sched)
>                  drm_mock_sched_job_complete(job);
>          spin_unlock_irqrestore(&sched->lock, flags);
> 
> +       drm_sched_fini(&sched->base);
> +
>          /*
>           * Free completed jobs and jobs not yet processed by the DRM
> scheduler
>           * free worker.
> @@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct
> drm_mock_scheduler 
> *sched)
> 
>          list_for_each_entry_safe(job, next, &list, link)
>                  mock_sched_free_job(&job->base);
> -
> -       drm_sched_fini(&sched->base);
>   }
> 
>   /**
> 
> That should satisfy the requirement to "clear" memory about to be
> freed 
> and be 100% compliant with drm_sched_fini() kerneldoc (guideline b).
> 
> But the new warning from 3/5 here will still be there AFAICT and
> would 
> you then agree it is a false positive?
> 
> Secondly, the series should modify all drivers (including the unit 
> tests) which are known to trigger this false positive.

There is no false positive. I just stated that in my other answer.

That said, I agree that the unit tests should be a second user of this
code in addition to Nouveau. They then should also provide the
fence_context_kill() callback, though.

And it seems we also agree that the memory leaks must be solved
centrally in drm_sched_fini().


P.

> 
> Regards,
> 
> Tvrtko
> 
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Philipp Stanner 9 months, 3 weeks ago
On Tue, 2025-04-22 at 13:13 +0200, Danilo Krummrich wrote:
> On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
> > Question I raised is if there are other drivers which manage to
> > clean up
> > everything correctly (like the mock scheduler does), but trigger
> > that
> > warning. Maybe there are not and maybe mock scheduler is the only
> > false
> > positive.

For clarification:

I messed up the comment from the cover letter.

What I did was run the *old* unit tests (v5 IIRC) from Tvrtko that
still had the memory leaks. Those then trigger the warning print, as is
expected, since they don't provide fence_context_kill().

The current unit tests are fine memory-leak-wise.

IOW, both with Nouveau and the unit tests, everything behaves as
expected, without issues.

Sorry for the confusion.

P.

> 
> So far the scheduler simply does not give any guideline on how to
> address the
> problem, hence every driver simply does something (or nothing,
> effectively
> ignoring the problem). This is what we want to fix.
> 
> The mock scheduler keeps it's own list of pending jobs and on tear
> down stops
> the scheduler's workqueues, traverses it's own list and eventually
> frees the
> pending jobs without updating the scheduler's internal pending list.
> 
> So yes, it does avoid memory leaks, but it also leaves the schedulers
> internal
> structures with an invalid state, i.e. the pending list of the
> scheduler has
> pointers to already freed memory.
> 
> What if the drm_sched_fini() starts touching the pending list? Then
> you'd end up
> with UAF bugs with this implementation. We cannot invalidate the
> schedulers
> internal structures and yet call scheduler functions - e.g.
> drm_sched_fini() -
> subsequently.
> 
> Hence, the current implementation of the mock scheduler is
> fundamentally flawed.
> And so would be *every* driver that still has entries within the
> scheduler's
> pending list.
> 
> This is not a false positive, it already caught a real bug -- in the
> mock
> scheduler.
> 
> - Danilo
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Tvrtko Ursulin 9 months, 3 weeks ago
On 22/04/2025 13:00, Philipp Stanner wrote:
> On Tue, 2025-04-22 at 13:13 +0200, Danilo Krummrich wrote:
>> On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
>>> Question I raised is if there are other drivers which manage to
>>> clean up
>>> everything correctly (like the mock scheduler does), but trigger
>>> that
>>> warning. Maybe there are not and maybe mock scheduler is the only
>>> false
>>> positive.
> 
> For clarification:
> 
> I messed up the comment from the cover letter.
> 
> What I did was run the *old* unit tests (v5 IIRC) from Tvrtko that
> still had the memory leaks. Those then trigger the warning print, as is
> expected, since they don't provide fence_context_kill().
> 
> The current unit tests are fine memory-leak-wise.
> 
> IOW, both with Nouveau and the unit tests, everything behaves as
> expected, without issues.

Hmm, how? The pending list can be non-empty, so it should be possible to 
hit that warning.

Regards,

Tvrtko

>> So far the scheduler simply does not give any guideline on how to
>> address the
>> problem, hence every driver simply does something (or nothing,
>> effectively
>> ignoring the problem). This is what we want to fix.
>>
>> The mock scheduler keeps it's own list of pending jobs and on tear
>> down stops
>> the scheduler's workqueues, traverses it's own list and eventually
>> frees the
>> pending jobs without updating the scheduler's internal pending list.
>>
>> So yes, it does avoid memory leaks, but it also leaves the schedulers
>> internal
>> structures with an invalid state, i.e. the pending list of the
>> scheduler has
>> pointers to already freed memory.
>>
>> What if the drm_sched_fini() starts touching the pending list? Then
>> you'd end up
>> with UAF bugs with this implementation. We cannot invalidate the
>> schedulers
>> internal structures and yet call scheduler functions - e.g.
>> drm_sched_fini() -
>> subsequently.
>>
>> Hence, the current implementation of the mock scheduler is
>> fundamentally flawed.
>> And so would be *every* driver that still has entries within the
>> scheduler's
>> pending list.
>>
>> This is not a false positive, it already caught a real bug -- in the
>> mock
>> scheduler.
>>
>> - Danilo
>
Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Posted by Danilo Krummrich 9 months, 3 weeks ago
On Thu, Apr 17, 2025 at 05:08:12PM +0100, Tvrtko Ursulin wrote:
> To catch up on why if you could dig out the links to past discussions it
> would be helpful.

I can't look it up currently, sorry. That's why I said Philipp will loop you in
once he's back. :)

> I repeat how there is a lot of attractiveness to reference counting. Already
> mentioned memory leak, s_fence oops, and also not having to clear
> job->entity could be useful for things like tracking per entity submission
> stats (imagine CFS like scheduling, generic scheduling DRM cgroup
> controller). So it would be good for me to hear what pitfalls were
> identified in this space.

<snip>

> > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > index 6b72278c4b72..ae3152beca14 100644
> > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler
> > > > > > > *sched)
> > > > > > >     	sched->ready = false;
> > > > > > >     	kfree(sched->sched_rq);
> > > > > > >     	sched->sched_rq = NULL;
> > > > > > > +
> > > > > > > +	if (!list_empty(&sched->pending_list))
> > > > > > > +		dev_err(sched->dev, "%s: Tearing down scheduler
> > > > > > > while jobs are pending!\n",
> > > > > > > +			__func__);
> > > > > 
> > > > > It isn't fair to add this error since it would out of the blue start firing
> > > > > for everyone except nouveau, no?
> > > > 
> > > > I think it is pretty fair to warn when detecting a guaranteed bug, no?
> > > > 
> > > > If drm_sched_fini() is called while jobs are still on the pending_list, they won't
> > > > ever be freed, because all workqueues are stopped.
> > > 
> > > Is it a guaranteed bug for drivers that are aware of the drm_sched_fini()
> > > limitation and are cleaning up after themselves?
> > 
> > How could a driver clean up on itself (unless the driver holds its own list of
> > pending jobs)?
> > 
> > Once a job is in flight (i.e. it's on the pending_list) we must guarantee that
> > free_job() is called by the scheduler, which it can't do if we call
> > drm_sched_fini() before the pending_list is empty.
> > 
> > > In other words if you apply the series up to here would it trigger for
> > > nouveau?
> > 
> > No, because nouveau does something very stupid, i.e. replicate the pending_list.
> 
> Ah okay I see it now, it waits for all jobs to finish before calling
> drm_sched_fini(). For some reason I did not think it was doing that given
> the cover letter starts with how that is a big no-no.
> 
> > > Reportedly it triggers for the mock scheduler which also has no
> > > leak.
> > 
> > That sounds impossible. How do you ensure you do *not* leak memory when you tear
> > down the scheduler while it still has pending jobs? Or in other words, who calls
> > free_job() if not the scheduler itself?
> 
> Well the cover letter says it triggers so it is possible. :)

I mean it should be impossible to have jobs on the pending_list when calling
drm_sched_fini(), but not have memory leaks, unless you let the driver do weird
things, such as peeking into implementation details of the scheduler, etc.

> Mock scheduler also tracks the pending jobs itself, but unlike nouveau it
> does not wait for jobs to finish and for the free worker to process them
> all; instead, having stopped the "hw" backend, it cancels them and calls
> the free_job vfunc directly.

That seems very wrong to me. This is exactly what I mean by the driver peeking
into implementation details.

If the API of a component requires knowing about and modifying the internals of
the component, it's pretty much broken and must be fixed.

> Going back to the topic of this series, if we go with a solution along the
> lines of the one proposed, I wonder if it would be doable without mandating
> that drivers keep a list parallel to pending_list. Instead have a vfunc the
> DRM scheduler would call to cancel one job at a time from *its* pending list.
> It would go nicely together with prepare/run/timedout/free.

That's pretty much what this series does, no? With the new callback the driver
is supposed to kill the corresponding fence context, which signals all
associated fences and hence the pending list will clear subsequently.

> Would it allow getting rid of the new state machinery and just cancelling
> and freeing in one go directly from drm_sched_fini()?

Again, I think that's what it does, unless I misunderstand you.
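
To make that direction a bit more concrete, here is a rough sketch of what such
a driver-side callback could do. The callback name, the my_* structures and
my_hw_stop() are made-up placeholders for a hypothetical driver and not the
actual API proposed in this series; the assumption is that my_sched (as in the
sketch further up) keeps a list of the hardware fences of still-pending jobs:

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>
#include <linux/list.h>

struct my_hw_fence {
	struct dma_fence base;
	struct list_head link;	/* entry in my_sched's pending_hw_fences list */
};

static void my_kill_fence_context(struct drm_gpu_scheduler *sched)
{
	struct my_sched *s = container_of(sched, struct my_sched, base);
	struct my_hw_fence *fence;

	/*
	 * Stop the (mock) hardware first so that no fence can signal
	 * concurrently; afterwards the fence list is stable.
	 */
	my_hw_stop(s);

	list_for_each_entry(fence, &s->pending_hw_fences, link) {
		dma_fence_set_error(&fence->base, -ECANCELED);
		dma_fence_signal(&fence->base);
	}
}

Once every hardware fence has been signaled with an error, the scheduler's free
path can call free_job() for all jobs on the pending_list, and the teardown in
drm_sched_fini() no longer has anything left to leak.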