[v5] drm/sched: Documentation and refcount improvements

[PATCH v5 2/3] drm/sched: Adjust outdated docu for run_job()

Posted by Philipp Stanner 11 months, 3 weeks ago

The documentation for drm_sched_backend_ops.run_job() mentions a certain
function called drm_sched_job_recovery(). This function does not exist.
What's actually meant is drm_sched_resubmit_jobs(), which is by now also
deprecated.

Remove the mention of the removed function.

Discourage the behavior of drm_sched_backend_ops.run_job() being called
multiple times for the same job.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 include/drm/gpu_scheduler.h | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 916279b5aa00..29e5bda91806 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -421,20 +421,27 @@ struct drm_sched_backend_ops {
 
 	/**
 	 * @run_job: Called to execute the job once all of the dependencies
-	 * have been resolved. This may be called multiple times, if
-	 * timedout_job() has happened and drm_sched_job_recovery() decides to
-	 * try it again.
+	 * have been resolved.
+	 *
+	 * The deprecated drm_sched_resubmit_jobs() (called from
+	 * drm_sched_backend_ops.timedout_job()) can invoke this again with the
+	 * same parameters. Using this is discouraged because it, presumably,
+	 * violates dma_fence rules.
+	 *
+	 * TODO: Document which fence rules above.
 	 *
 	 * @sched_job: the job to run
 	 *
-	 * Returns: dma_fence the driver must signal once the hardware has
-	 *	completed the job ("hardware fence").
-	 *
 	 * Note that the scheduler expects to 'inherit' its own reference to
 	 * this fence from the callback. It does not invoke an extra
 	 * dma_fence_get() on it. Consequently, this callback must take a
 	 * reference for the scheduler, and additional ones for the driver's
 	 * respective needs.
+	 *
+	 * Return:
+	 * * On success: dma_fence the driver must signal once the hardware has
+	 * completed the job ("hardware fence").
+	 * * On failure: NULL or an ERR_PTR.
 	 */
 	struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
 
-- 
2.47.1

Re: [PATCH v5 2/3] drm/sched: Adjust outdated docu for run_job()

Posted by Maíra Canal 11 months, 3 weeks ago

Hi Philipp,

On 20/02/25 08:28, Philipp Stanner wrote:
> The documentation for drm_sched_backend_ops.run_job() mentions a certain
> function called drm_sched_job_recovery(). This function does not exist.
> What's actually meant is drm_sched_resubmit_jobs(), which is by now also
> deprecated.
> 
> Remove the mention of the removed function.
> 
> Discourage the behavior of drm_sched_backend_ops.run_job() being called
> multiple times for the same job.

It looks odd to me that this patch removes lines that were added in
patch 1/3. Maybe you could change the patchset order and place this one
as the first.

> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
>   include/drm/gpu_scheduler.h | 19 +++++++++++++------
>   1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 916279b5aa00..29e5bda91806 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -421,20 +421,27 @@ struct drm_sched_backend_ops {
>   
>   	/**
>   	 * @run_job: Called to execute the job once all of the dependencies
> -	 * have been resolved. This may be called multiple times, if
> -	 * timedout_job() has happened and drm_sched_job_recovery() decides to
> -	 * try it again.
> +	 * have been resolved.
> +	 *
> +	 * The deprecated drm_sched_resubmit_jobs() (called from
> +	 * drm_sched_backend_ops.timedout_job()) can invoke this again with the

I think it would be "@timedout_job".

> +	 * same parameters. Using this is discouraged because it, presumably,
> +	 * violates dma_fence rules.

I believe it would be "struct dma_fence".

> +	 *
> +	 * TODO: Document which fence rules above.
>   	 *
>   	 * @sched_job: the job to run
>   	 *
> -	 * Returns: dma_fence the driver must signal once the hardware has
> -	 *	completed the job ("hardware fence").
> -	 *
>   	 * Note that the scheduler expects to 'inherit' its own reference to
>   	 * this fence from the callback. It does not invoke an extra
>   	 * dma_fence_get() on it. Consequently, this callback must take a
>   	 * reference for the scheduler, and additional ones for the driver's
>   	 * respective needs.

Would it be possible to add a comment that `run_job()` must check if
`s_fence->finished.error` is different than 0? If you increase the karma
of a job and don't check for `s_fence->finished.error`, you might run a
cancelled job.

> +	 *
> +	 * Return:
> +	 * * On success: dma_fence the driver must signal once the hardware has
> +	 * completed the job ("hardware fence").

A suggestion: "the fence that the driver must signal once the hardware
has completed the job".

Best Regards,
- Maíra

> +	 * * On failure: NULL or an ERR_PTR.
>   	 */
>   	struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>

Re: [PATCH v5 2/3] drm/sched: Adjust outdated docu for run_job()

Posted by Philipp Stanner 11 months, 3 weeks ago

On Thu, 2025-02-20 at 10:28 -0300, Maíra Canal wrote:
> Hi Philipp,
> 
> On 20/02/25 08:28, Philipp Stanner wrote:
> > The documentation for drm_sched_backend_ops.run_job() mentions a
> > certain
> > function called drm_sched_job_recovery(). This function does not
> > exist.
> > What's actually meant is drm_sched_resubmit_jobs(), which is by now
> > also
> > deprecated.
> > 
> > Remove the mention of the removed function.
> > 
> > Discourage the behavior of drm_sched_backend_ops.run_job() being
> > called
> > multiple times for the same job.
> 
> It looks odd to me that this patch removes lines that were added in
> patch 1/3. Maybe you could change the patchset order and place this
> one
> as the first.
> 
> > 
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> >   include/drm/gpu_scheduler.h | 19 +++++++++++++------
> >   1 file changed, 13 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/drm/gpu_scheduler.h
> > b/include/drm/gpu_scheduler.h
> > index 916279b5aa00..29e5bda91806 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -421,20 +421,27 @@ struct drm_sched_backend_ops {
> >   
> >   	/**
> >   	 * @run_job: Called to execute the job once all of the
> > dependencies
> > -	 * have been resolved. This may be called multiple times,
> > if
> > -	 * timedout_job() has happened and
> > drm_sched_job_recovery() decides to
> > -	 * try it again.
> > +	 * have been resolved.
> > +	 *
> > +	 * The deprecated drm_sched_resubmit_jobs() (called from
> > +	 * drm_sched_backend_ops.timedout_job()) can invoke this
> > again with the
> 
> I think it would be "@timedout_job".

Not sure, isn't referencing in docstrings done with '&'?

> 
> > +	 * same parameters. Using this is discouraged because it,
> > presumably,
> > +	 * violates dma_fence rules.
> 
> I believe it would be "struct dma_fence".

Well, in this case strictly speaking not IMO, because it's about the
rules of the "DMA Fence Subsystem", not about the struct itself.

I'd just keep it that way or call it "dma fence"

> 
> > +	 *
> > +	 * TODO: Document which fence rules above.
> >   	 *
> >   	 * @sched_job: the job to run
> >   	 *
> > -	 * Returns: dma_fence the driver must signal once the
> > hardware has
> > -	 *	completed the job ("hardware fence").
> > -	 *
> >   	 * Note that the scheduler expects to 'inherit' its own
> > reference to
> >   	 * this fence from the callback. It does not invoke an
> > extra
> >   	 * dma_fence_get() on it. Consequently, this callback must
> > take a
> >   	 * reference for the scheduler, and additional ones for
> > the driver's
> >   	 * respective needs.
> 
> Would it be possible to add a comment that `run_job()` must check if
> `s_fence->finished.error` is different than 0? If you increase the
> karma
> of a job and don't check for `s_fence->finished.error`, you might run
> a
> cancelled job.

s_fence->finished is only signaled and its error set once the hardware
fence got signaled; or when the entity is killed.

In any case, signaling "finished" will cause the job to be prevented
from being executed (again), and will never reach run_job() in the
first place.

Correct me if I am mistaken.

Or are you suggesting that there is a race?


P.

> 
> > +	 *
> > +	 * Return:
> > +	 * * On success: dma_fence the driver must signal once the
> > hardware has
> > +	 * completed the job ("hardware fence").
> 
> A suggestion: "the fence that the driver must signal once the
> hardware
> has completed the job".
> 
> Best Regards,
> - Maíra
> 
> > +	 * * On failure: NULL or an ERR_PTR.
> >   	 */
> >   	struct dma_fence *(*run_job)(struct drm_sched_job
> > *sched_job);
> >   
>

Re: [PATCH v5 2/3] drm/sched: Adjust outdated docu for run_job()

Posted by Maíra Canal 11 months, 2 weeks ago

Hi Philipp,

On 20/02/25 12:28, Philipp Stanner wrote:
> On Thu, 2025-02-20 at 10:28 -0300, Maíra Canal wrote:
>> Hi Philipp,
>>
>> On 20/02/25 08:28, Philipp Stanner wrote:
>>> The documentation for drm_sched_backend_ops.run_job() mentions a
>>> certain
>>> function called drm_sched_job_recovery(). This function does not
>>> exist.
>>> What's actually meant is drm_sched_resubmit_jobs(), which is by now
>>> also
>>> deprecated.
>>>
>>> Remove the mention of the removed function.
>>>
>>> Discourage the behavior of drm_sched_backend_ops.run_job() being
>>> called
>>> multiple times for the same job.
>>
>> It looks odd to me that this patch removes lines that were added in
>> patch 1/3. Maybe you could change the patchset order and place this
>> one
>> as the first.
>>
>>>
>>> Signed-off-by: Philipp Stanner <phasta@kernel.org>
>>> ---
>>>    include/drm/gpu_scheduler.h | 19 +++++++++++++------
>>>    1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/drm/gpu_scheduler.h
>>> b/include/drm/gpu_scheduler.h
>>> index 916279b5aa00..29e5bda91806 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -421,20 +421,27 @@ struct drm_sched_backend_ops {
>>>    
>>>    	/**
>>>    	 * @run_job: Called to execute the job once all of the
>>> dependencies
>>> -	 * have been resolved. This may be called multiple times,
>>> if
>>> -	 * timedout_job() has happened and
>>> drm_sched_job_recovery() decides to
>>> -	 * try it again.
>>> +	 * have been resolved.
>>> +	 *
>>> +	 * The deprecated drm_sched_resubmit_jobs() (called from
>>> +	 * drm_sched_backend_ops.timedout_job()) can invoke this
>>> again with the
>>
>> I think it would be "@timedout_job".
> 
> Not sure, isn't referencing in docstrings done with '&'?

`timedout_job` is a member of the same struct, so I believe it should be
@. But, I'm no kernel-doc expert, it's just my understanding of [1]. If
we don't use @, it should be at least
"&drm_sched_backend_ops.timedout_job".

[1] https://docs.kernel.org/doc-guide/kernel-doc.html
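
To make the two notations under discussion concrete, here is a minimal sketch of how they would look in a kernel-doc comment. The struct `demo_backend_ops` and its members are hypothetical and exist only for illustration; the sketch assumes the reading of [1] that '@' gives plain formatting for a name in the current context, while '&struct_name.member' produces a cross-reference to the struct definition.

```c
#include <stddef.h>

struct demo_job;	/* opaque, hypothetical job type */

/**
 * struct demo_backend_ops - hypothetical callback table (illustration only)
 */
struct demo_backend_ops {
	/**
	 * @run_job: Called to execute a job. Could be invoked again after
	 * @timedout_job has run ('@' form: sibling member, formatting only).
	 */
	int (*run_job)(struct demo_job *job);

	/**
	 * @timedout_job: Called when a job exceeds its timeout. See also
	 * &demo_backend_ops.run_job ('&' form: cross-reference to the struct).
	 */
	int (*timedout_job)(struct demo_job *job);
};
```

Either form should render; which one is preferable for a sibling member is a style question for the maintainers rather than something this sketch settles.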

> 
>>
>>> +	 * same parameters. Using this is discouraged because it,
>>> presumably,
>>> +	 * violates dma_fence rules.
>>
>> I believe it would be "struct dma_fence".
> 
> Well, in this case strictly speaking not IMO, because it's about the
> rules of the "DMA Fence Subsystem", not about the struct itself.
> 
> I'd just keep it that way or call it "dma fence"
> 
>>
>>> +	 *
>>> +	 * TODO: Document which fence rules above.
>>>    	 *
>>>    	 * @sched_job: the job to run
>>>    	 *
>>> -	 * Returns: dma_fence the driver must signal once the
>>> hardware has
>>> -	 *	completed the job ("hardware fence").
>>> -	 *
>>>    	 * Note that the scheduler expects to 'inherit' its own
>>> reference to
>>>    	 * this fence from the callback. It does not invoke an
>>> extra
>>>    	 * dma_fence_get() on it. Consequently, this callback must
>>> take a
>>>    	 * reference for the scheduler, and additional ones for
>>> the driver's
>>>    	 * respective needs.
>>
>> Would it be possible to add a comment that `run_job()` must check if
>> `s_fence->finished.error` is different than 0? If you increase the
>> karma
>> of a job and don't check for `s_fence->finished.error`, you might run
>> a
>> cancelled job.
> 
> s_fence->finished is only signaled and its error set once the hardware
> fence got signaled; or when the entity is killed.

If you have a timeout, increase the karma of that job with
`drm_sched_increase_karma()` and call `drm_sched_resubmit_jobs()`, the
latter will flag an error in the dma fence. If you don't check for it in
`run_job()`, you will run the guilty job again.
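
As a rough illustration of the check described above, the following userspace sketch models the flow. The struct names mirror the kernel's, but they are heavily simplified stand-ins, not the real definitions. `mock_flag_if_guilty()`, `mock_run_job()`, `hw_fence` and `jobs_pushed_to_hw` are hypothetical names, and the use of -ECANCELED as the flagged error is an assumption about what the resubmit path sets.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified userspace stand-ins; NOT the kernel definitions. */
struct dma_fence { int error; };
struct drm_sched_fence { struct dma_fence finished; };
struct drm_sched_job {
	struct drm_sched_fence *s_fence;
	int karma;
};

struct dma_fence hw_fence;	/* pretend hardware fence */
int jobs_pushed_to_hw;		/* counts jobs that reached the "hardware" */

/* Models the resubmit path flagging a job whose karma exceeds the limit. */
void mock_flag_if_guilty(struct drm_sched_job *job, int hang_limit)
{
	if (job->karma > hang_limit)
		job->s_fence->finished.error = -ECANCELED;
}

/* run_job() sketch: bail out before touching the hardware if flagged. */
struct dma_fence *mock_run_job(struct drm_sched_job *job)
{
	if (job->s_fence->finished.error)
		return NULL;

	jobs_pushed_to_hw++;
	return &hw_fence;
}
```

With a guilty job (karma above the limit) and an innocent one, only the innocent job reaches the mock hardware. Without the early return, both would be pushed again.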

I'm still talking about `drm_sched_resubmit_jobs()`, because I'm
currently fixing an issue in V3D with the GPU reset and we still use
`drm_sched_resubmit_jobs()`. I read the documentation of `run_job()` and
`timedout_job()` and the information I commented here (which was crucial
to fix the bug) wasn't available there.

`drm_sched_resubmit_jobs()` was deprecated in 2022, but Xe introduced a
new use in 2023, for example. The commit that deprecated it just
mentions AMD's case, but do we know if the function works as expected
for the other users? For V3D, it does. Also, we need to make it clear
which dma fence requirements the function violates.

If we shouldn't use `drm_sched_resubmit_jobs()`, would it be possible to
provide a common interface for job resubmission?

Best Regards,
- Maíra

> 
> In any case, signaling "finished" will cause the job to be prevented
> from being executed (again), and will never reach run_job() in the
> first place.
> 
> Correct me if I am mistaken.
> 
> Or are you suggesting that there is a race?
> 
> 
> P.
> 
>>
>>> +	 *
>>> +	 * Return:
>>> +	 * * On success: dma_fence the driver must signal once the
>>> hardware has
>>> +	 * completed the job ("hardware fence").
>>
>> A suggestion: "the fence that the driver must signal once the
>> hardware
>> has completed the job".
>>
>> Best Regards,
>> - Maíra
>>
>>> +	 * * On failure: NULL or an ERR_PTR.
>>>    	 */
>>>    	struct dma_fence *(*run_job)(struct drm_sched_job
>>> *sched_job);
>>>    
>>
>

Re: [PATCH v5 2/3] drm/sched: Adjust outdated docu for run_job()

Posted by Danilo Krummrich 11 months, 2 weeks ago

On Mon, Feb 24, 2025 at 10:29:26AM -0300, Maíra Canal wrote:
> On 20/02/25 12:28, Philipp Stanner wrote:
> > On Thu, 2025-02-20 at 10:28 -0300, Maíra Canal wrote:
> > > Would it be possible to add a comment that `run_job()` must check if
> > > `s_fence->finished.error` is different than 0? If you increase the
> > > karma
> > > of a job and don't check for `s_fence->finished.error`, you might run
> > > a
> > > cancelled job.
> > 
> > s_fence->finished is only signaled and its error set once the hardware
> > fence got signaled; or when the entity is killed.
> 
> If you have a timeout, increase the karma of that job with
> `drm_sched_increase_karma()` and call `drm_sched_resubmit_jobs()`, the
> latter will flag an error in the dma fence. If you don't check for it in
> `run_job()`, you will run the guilty job again.

Considering that drm_sched_resubmit_jobs() is deprecated I don't think we need
to add this hint to the documentation; the drivers that are still using the API
hopefully got it right.

> I'm still talking about `drm_sched_resubmit_jobs()`, because I'm
> currently fixing an issue in V3D with the GPU reset and we still use
> `drm_sched_resubmit_jobs()`. I read the documentation of `run_job()` and
> `timedout_job()` and the information I commented here (which was crucial
> to fix the bug) wasn't available there.

Well, hopefully... :-)

> 
> `drm_sched_resubmit_jobs()` was deprecated in 2022, but Xe introduced a
> new use in 2023

Yeah, that's a bit odd, since Xe relies on a firmware scheduler and uses a 1:1
scheduler - entity setup. I'm a bit surprised Xe does use this function.

> for example. The commit that deprecated it just
> mentions AMD's case, but do we know if the function works as expected
> for the other users?

I read the comment [1] you're referring to differently. It says that
"Re-submitting jobs was a concept AMD came up as cheap way to implement recovery
after a job timeout".

It further explains that "there are many problem with the dma_fence
implementation and requirements. Either the implementation is risking deadlocks
with core memory management or violating documented implementation details of
the dma_fence object", which doesn't give any hint to me that the conceptual
issues are limited to amdgpu.

> For V3D, it does. Also, we need to make it clear which
> dma fence requirements the function violates.

This I fully agree with, unfortunately the comment does not explain what's the
issue at all.

While I do think I have a vague idea of what's the potential issue with this
approach, I think it would be way better to get Christian, as the expert for DMA
fence rules to comment on this.

@Christian: Can you please shed some light on this?

> 
> If we shouldn't use `drm_sched_resubmit_jobs()`, would it be possible to
> provide a common interface for job resubmission?

I wonder why this question did not come up when drm_sched_resubmit_jobs() was
deprecated two years ago, did it?

Anyway, let's shed some light on the difficulties with drm_sched_resubmit_jobs()
and then we can figure out how we can do better.

I think it would also be interesting to know how amdgpu handles jobs from
unrelated entities being discarded by not re-submitting them when a job from
another entity hangs the HW ring.

[1] https://patchwork.freedesktop.org/patch/msgid/20221109095010.141189-5-christian.koenig@amd.com

Re: [PATCH v5 2/3] drm/sched: Adjust outdated docu for run_job()

Posted by Matthew Brost 11 months, 2 weeks ago

On Mon, Feb 24, 2025 at 03:43:49PM +0100, Danilo Krummrich wrote:
> On Mon, Feb 24, 2025 at 10:29:26AM -0300, Maíra Canal wrote:
> > On 20/02/25 12:28, Philipp Stanner wrote:
> > > On Thu, 2025-02-20 at 10:28 -0300, Maíra Canal wrote:
> > > > Would it be possible to add a comment that `run_job()` must check if
> > > > `s_fence->finished.error` is different than 0? If you increase the
> > > > karma
> > > > of a job and don't check for `s_fence->finished.error`, you might run
> > > > a
> > > > cancelled job.
> > > 
> > > s_fence->finished is only signaled and its error set once the hardware
> > > fence got signaled; or when the entity is killed.
> > 
> > If you have a timeout, increase the karma of that job with
> > `drm_sched_increase_karma()` and call `drm_sched_resubmit_jobs()`, the
> > latter will flag an error in the dma fence. If you don't check for it in
> > `run_job()`, you will run the guilty job again.
> 
> Considering that drm_sched_resubmit_jobs() is deprecated I don't think we need
> to add this hint to the documentation; the drivers that are still using the API
> hopefully got it right.
> 
> > I'm still talking about `drm_sched_resubmit_jobs()`, because I'm
> > currently fixing an issue in V3D with the GPU reset and we still use
> > `drm_sched_resubmit_jobs()`. I read the documentation of `run_job()` and
> > `timedout_job()` and the information I commented here (which was crucial
> > to fix the bug) wasn't available there.
> 
> Well, hopefully... :-)
> 
> > 
> > `drm_sched_resubmit_jobs()` was deprecated in 2022, but Xe introduced a
> > new use in 2023
> 
> Yeah, that's a bit odd, since Xe relies on a firmware scheduler and uses a 1:1
> scheduler - entity setup. I'm a bit surprised Xe does use this function.
> 

To clarify Xe's usage: we use this function to resubmit jobs after a
device reset for queues which had nothing to do with the device reset.
In practice, a device reset should never occur, as we have per-queue
resets in our hardware. If a per-queue reset occurs, we ban the queue
rather than doing a resubmit.

Matt  

> > for example. The commit that deprecated it just
> > mentions AMD's case, but do we know if the function works as expected
> > for the other users?
> 
> I read the comment [1] you're referring to differently. It says that
> "Re-submitting jobs was a concept AMD came up as cheap way to implement recovery
> after a job timeout".
> 
> It further explains that "there are many problem with the dma_fence
> implementation and requirements. Either the implementation is risking deadlocks
> with core memory management or violating documented implementation details of
> the dma_fence object", which doesn't give any hint to me that the conceptual
> issues are limited to amdgpu.
> 
> > For V3D, it does. Also, we need to make it clear which
> > dma fence requirements the function violates.
> 
> This I fully agree with, unfortunately the comment does not explain what's the
> issue at all.
> 
> While I do think I have a vague idea of what's the potential issue with this
> approach, I think it would be way better to get Christian, as the expert for DMA
> fence rules to comment on this.
> 
> @Christian: Can you please shed some light on this?
> 
> > 
> > If we shouldn't use `drm_sched_resubmit_jobs()`, would it be possible to
> > provide a common interface for job resubmission?
> 
> I wonder why this question did not come up when drm_sched_resubmit_jobs() was
> deprecated two years ago, did it?
> 
> Anyway, let's shed some light on the difficulties with drm_sched_resubmit_jobs()
> and then we can figure out how we can do better.
> 
> I think it would also be interesting to know how amdgpu handles jobs from
> unrelated entities being discarded by not re-submitting them when a job from
> another entity hangs the HW ring.
> 
> [1] https://patchwork.freedesktop.org/patch/msgid/20221109095010.141189-5-christian.koenig@amd.com

Re: [PATCH v5 2/3] drm/sched: Adjust outdated docu for run_job()

Posted by Christian König 11 months, 1 week ago

Am 24.02.25 um 17:25 schrieb Matthew Brost:
> On Mon, Feb 24, 2025 at 03:43:49PM +0100, Danilo Krummrich wrote:
>> On Mon, Feb 24, 2025 at 10:29:26AM -0300, Maíra Canal wrote:
>>> On 20/02/25 12:28, Philipp Stanner wrote:
>>>> On Thu, 2025-02-20 at 10:28 -0300, Maíra Canal wrote:
>>>>> Would it be possible to add a comment that `run_job()` must check if
>>>>> `s_fence->finished.error` is different than 0? If you increase the
>>>>> karma
>>>>> of a job and don't check for `s_fence->finished.error`, you might run
>>>>> a
>>>>> cancelled job.
>>>> s_fence->finished is only signaled and its error set once the hardware
>>>> fence got signaled; or when the entity is killed.
>>> If you have a timeout, increase the karma of that job with
>>> `drm_sched_increase_karma()` and call `drm_sched_resubmit_jobs()`, the
>>> latter will flag an error in the dma fence. If you don't check for it in
>>> `run_job()`, you will run the guilty job again.
>> Considering that drm_sched_resubmit_jobs() is deprecated I don't think we need
>> to add this hint to the documentation; the drivers that are still using the API
>> hopefully got it right.
>>
>>> I'm still talking about `drm_sched_resubmit_jobs()`, because I'm
>>> currently fixing an issue in V3D with the GPU reset and we still use
>>> `drm_sched_resubmit_jobs()`. I read the documentation of `run_job()` and
>>> `timedout_job()` and the information I commented here (which was crucial
>>> to fix the bug) wasn't available there.
>> Well, hopefully... :-)
>>
>>> `drm_sched_resubmit_jobs()` was deprecated in 2022, but Xe introduced a
>>> new use in 2023
>> Yeah, that's a bit odd, since Xe relies on a firmware scheduler and uses a 1:1
>> scheduler - entity setup. I'm a bit surprised Xe does use this function.
>>
> To clarify Xe's usage. We use this function to resubmit jobs after
> device reset for queues which had nothing to do with the device reset.
> In practice, a device reset should never occur, as we have per-queue resets in
> our hardware. If a per-queue reset occurs, we ban the queue rather than
> doing a resubmit.

That's still invalid usage. Re-submitting jobs by the scheduler is a completely broken concept in general.

What you can do is to re-create the queue content after device reset inside your driver, but *never* use drm_sched_resubmit_jobs() for that.

>
> Matt  
>
>>> for example. The commit that deprecated it just
>>> mentions AMD's case, but do we know if the function works as expected
>>> for the other users?
>> I read the comment [1] you're referring to differently. It says that
>> "Re-submitting jobs was a concept AMD came up as cheap way to implement recovery
>> after a job timeout".
>>
>> It further explains that "there are many problem with the dma_fence
>> implementation and requirements. Either the implementation is risking deadlocks
>> with core memory management or violating documented implementation details of
>> the dma_fence object", which doesn't give any hint to me that the conceptual
>> issues are limited to amdgpu.
>>
>>> For V3D, it does. Also, we need to make it clear which
>>> dma fence requirements the function violates.
>> This I fully agree with, unfortunately the comment does not explain what's the
>> issue at all.
>>
>> While I do think I have a vague idea of what's the potential issue with this
>> approach, I think it would be way better to get Christian, as the expert for DMA
>> fence rules to comment on this.
>>
>> @Christian: Can you please shed some light on this?
>>
>>> If we shouldn't use `drm_sched_resubmit_jobs()`, would it be possible to
>>> provide a common interface for job resubmission?
>> I wonder why this question did not come up when drm_sched_resubmit_jobs() was
>> deprecated two years ago, did it?

That is exactly the reason drm_sched_resubmit_jobs() was deprecated.

It is not possible to provide a common interface to re-submit jobs (with switching of hardware dma_fences) without breaking dma_fence rules.

The idea behind the scheduler is that you pack your submission state into a job object which as soon as it is picked up is converted into a hardware dma_fence for execution. This hardware dma_fence is then the object which represents execution of the submission on the hardware.

So on re-submission you either use the same dma_fence multiple times, which results in a *horrible* kref_init() on an already initialized reference (it's a wonder that this doesn't crash all the time in amdgpu). Or you do things like starting to allocate memory while the memory management potentially waits for the reset to complete.
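
The first of the two failure modes can be sketched with a tiny userspace model. `struct mock_ref` and its helpers are hypothetical stand-ins for a reference count, not the kernel's struct kref or dma_fence API; the sketch only assumes that initializing a fence sets its reference count back to one.

```c
#include <stdbool.h>

/* Minimal userspace model of a reference count; not the kernel's kref. */
struct mock_ref { int count; };

void mock_ref_init(struct mock_ref *r) { r->count = 1; }
void mock_ref_get(struct mock_ref *r)  { r->count++; }

/* Returns true when the last reference was dropped (object would be freed). */
bool mock_ref_put(struct mock_ref *r)  { return --r->count == 0; }

/*
 * Models reusing one fence for a second submission: the fence is
 * initialized again while a waiter from the first submission still
 * holds a reference. Returns true if the object would be released
 * even though that waiter never dropped its reference.
 */
bool reinit_frees_under_waiter(void)
{
	struct mock_ref fence;

	mock_ref_init(&fence);	/* first submission: count == 1 */
	mock_ref_get(&fence);	/* a waiter takes a reference: count == 2 */

	mock_ref_init(&fence);	/* re-submission resets count to 1 */

	/* The submitter drops its reference; the waiter's one is forgotten. */
	return mock_ref_put(&fence);
}
```

In this model the count reaches zero while the waiter still believes it holds a reference, which in real code would correspond to a use-after-free.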

What we could do is to provide a helper for the device drivers in the form of an iterator which gives you all the hardware fences the scheduler is waiting for, but in general device drivers should have this information by themselves.

>>
>> Anyway, let's shed some light on the difficulties with drm_sched_resubmit_jobs()
>> and then we can figure out how we can do better.
>>
>> I think it would also be interesting to know how amdgpu handles jobs from
>> unrelated entities being discarded by not re-submitting them when a job from
>> another entity hangs the HW ring.

Quite simple: this case never happens in the first place.

When you have individual queues for each process (e.g. like Xe and the upcoming amdgpu HW generation) you should always be able to reset the device without losing everything.

Otherwise things like userspace queues also don't work at all, because then neither the kernel nor the DRM scheduler is involved in the submission any more.

Regards,
Christian.

>>
>> [1] https://patchwork.freedesktop.org/patch/msgid/20221109095010.141189-5-christian.koenig@amd.com

Re: [PATCH v5 2/3] drm/sched: Adjust outdated docu for run_job()

Posted by Philipp Stanner 11 months, 1 week ago

On Tue, 2025-03-04 at 10:05 +0100, Christian König wrote:
> Am 24.02.25 um 17:25 schrieb Matthew Brost:
> > On Mon, Feb 24, 2025 at 03:43:49PM +0100, Danilo Krummrich wrote:
> > > On Mon, Feb 24, 2025 at 10:29:26AM -0300, Maíra Canal wrote:
> > > > On 20/02/25 12:28, Philipp Stanner wrote:
> > > > > On Thu, 2025-02-20 at 10:28 -0300, Maíra Canal wrote:
> > > > > > Would it be possible to add a comment that `run_job()` must
> > > > > > check if
> > > > > > `s_fence->finished.error` is different than 0? If you
> > > > > > increase the
> > > > > > karma
> > > > > > of a job and don't check for `s_fence->finished.error`, you
> > > > > > might run
> > > > > > a
> > > > > > cancelled job.
> > > > > s_fence->finished is only signaled and its error set once the
> > > > > hardware
> > > > > fence got signaled; or when the entity is killed.
> > > > If you have a timeout, increase the karma of that job with
> > > > `drm_sched_increase_karma()` and call
> > > > `drm_sched_resubmit_jobs()`, the
> > > > latter will flag an error in the dma fence. If you don't check
> > > > for it in
> > > > `run_job()`, you will run the guilty job again.
> > > Considering that drm_sched_resubmit_jobs() is deprecated I don't
> > > think we need
> > > to add this hint to the documentation; the drivers that are still
> > > using the API
> > > hopefully got it right.
> > > 
> > > > I'm still talking about `drm_sched_resubmit_jobs()`, because
> > > > I'm
> > > > currently fixing an issue in V3D with the GPU reset and we
> > > > still use
> > > > `drm_sched_resubmit_jobs()`. I read the documentation of
> > > > `run_job()` and
> > > > `timedout_job()` and the information I commented here (which was
> > > > crucial
> > > > to fix the bug) wasn't available there.
> > > Well, hopefully... :-)
> > > 
> > > > `drm_sched_resubmit_jobs()` was deprecated in 2022, but Xe
> > > > introduced a
> > > > new use in 2023
> > > Yeah, that's a bit odd, since Xe relies on a firmware scheduler
> > > and uses a 1:1
> > > scheduler - entity setup. I'm a bit surprised Xe does use this
> > > function.
> > > 
> > To clarify Xe's usage. We use this function to resubmit jobs after
> > device reset for queues which had nothing to do with the device
> > reset.
> > In practice, a device reset should never occur, as we have per-queue
> > resets in
> > our hardware. If a per-queue reset occurs, we ban the queue rather
> > than
> > doing a resubmit.
> 
> That's still invalid usage. Re-submitting jobs by the scheduler is a
> completely broken concept in general.
> 
> What you can do is to re-create the queue content after device reset
> inside your driver, but *never* use drm_sched_resubmit_jobs() for
> that.
> 
> > 
> > Matt  
> > 
> > > > for example. The commit that deprecated it just
> > > > mentions AMD's case, but do we know if the function works as
> > > > expected
> > > > for the other users?
> > > I read the comment [1] you're referring to differently. It says
> > > that
> > > "Re-submitting jobs was a concept AMD came up as cheap way to
> > > implement recovery
> > > after a job timeout".
> > > 
> > > It further explains that "there are many problem with the
> > > dma_fence
> > > implementation and requirements. Either the implementation is
> > > risking deadlocks
> > > with core memory management or violating documented
> > > implementation details of
> > > the dma_fence object", which doesn't give any hint to me that the
> > > conceptual
> > > issues are limited to amdgpu.
> > > 
> > > > For V3D, it does. Also, we need to make it clear which
> > > > dma fence requirements the function violates.
> > > This I fully agree with, unfortunately the comment does not
> > > explain what's the
> > > issue at all.
> > > 
> > > While I do think I have a vague idea of what's the potential
> > > issue with this
> > > approach, I think it would be way better to get Christian, as the
> > > expert for DMA
> > > fence rules to comment on this.
> > > 
> > > @Christian: Can you please shed some light on this?
> > > 
> > > > If we shouldn't use `drm_sched_resubmit_jobs()`, would it be
> > > > possible to
> > > > provide a common interface for job resubmission?
> > > I wonder why this question did not come up when
> > > drm_sched_resubmit_jobs() was
> > > deprecated two years ago, did it?
> 
> That is exactly the reason drm_sched_resubmit_jobs() was
> deprecated.
> 
> It is not possible to provide a common interface to re-submit jobs
> (with switching of hardware dma_fences) without breaking dma_fence
> rules.
> 
> The idea behind the scheduler is that you pack your submission state
> into a job object which as soon as it is picked up is converted into
> a hardware dma_fence for execution. This hardware dma_fence is then
> the object which represents execution of the submission on the
> hardware.
> 
> So on re-submission you either use the same dma_fence multiple times
> which results in a *horrible* kref_init() on an already initialized
> reference (it's a wonder that this doesn't crash all the time in
> amdgpu). Or you do things like starting to allocate memory while the
> memory management potentially waits for the reset to complete.
> 
> What we could do is to provide a helper for the device drivers in the
> form of an iterator which gives you all the hardware fences the
> scheduler is waiting for, but in general device drivers should have
> this information by themselves.

What we should work out in this patch series first is some lines of
documentation telling the drivers what the current state is and what
they should do.

Maíra is not OK with me just removing mention of
drm_sched_resubmit_jobs().

So the question is what they should do instead and thus, what, e.g.,
amdgpu does instead. See also below.

> 
> > > 
> > > Anyway, let's shed some light on the difficulties with
> > > drm_sched_resubmit_jobs()
> > > and then we can figure out how we can do better.
> > > 
> > > I think it would also be interesting to know how amdgpu handles
> > > jobs from
> > > unrelated entities being discarded by not re-submitting them when
> > > a job from
> > > another entity hangs the HW ring.
> 
> Quite simple: this case never happens in the first place.
> 
> When you have individual queues for each process (e.g. like Xe and
> the upcoming amdgpu HW generation)

If amdgpu's *current* HW generation does not have individual queues,
why then can it never happen currently?

How does amdgpu make sure that jobs from innocent entities get
rescheduled after a GPU reset? AFAIK AMD cards currently have 4 run
queues, which are shared by many entities from many processes.

P.

>  you should always be able to reset the device without losing
> everything.
> 
> Otherwise things like userspace queues also don't work at all
> because then neither the kernel nor the DRM scheduler is involved in
> the submission any more.
> 
> Regards,
> Christian.
> 
> > > 
> > > [1]
> > > https://patchwork.freedesktop.org/patch/msgid/20221109095010.141189-5-christian.koenig@amd.com
>

[PATCH v5 1/3] drm/sched: Document run_job() refcount hazard
[PATCH v5 2/3] drm/sched: Adjust outdated docu for run_job()
[PATCH v5 3/3] drm/sched: Update timedout_job()'s documentation