[PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW

Jacek Lawrynowicz posted 1 patch 1 month ago
drivers/accel/ivpu/ivpu_job.c | 5 +++++
1 file changed, 5 insertions(+)
[PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW
Posted by Jacek Lawrynowicz 1 month ago
From: Karol Wachowski <karol.wachowski@intel.com>

commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream.

Trigger recovery of the NPU upon receiving HW context violation from
the firmware. The context violation error is a fatal error that prevents
any subsequent jobs from being executed. Without this fix it is
necessary to reload the driver to restore the NPU operational state.

This is simplified version of upstream commit as the full implementation
would require all engine reset/resume logic to be backported.

Signed-off-by: Karol Wachowski <karol.wachowski@intel.com>
Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250107173238.381120-13-maciej.falkowski@linux.intel.com
Fixes: 0adff3b0ef12 ("accel/ivpu: Share NPU busy time in sysfs")
Cc: <stable@vger.kernel.org> # v6.11+
---
 drivers/accel/ivpu/ivpu_job.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c
index be2e2bf0f43f0..70b3676974407 100644
--- a/drivers/accel/ivpu/ivpu_job.c
+++ b/drivers/accel/ivpu/ivpu_job.c
@@ -482,6 +482,8 @@ static struct ivpu_job *ivpu_job_remove_from_submitted_jobs(struct ivpu_device *
 	return job;
 }
 
+#define VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW 0xEU
+
 static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32 job_status)
 {
 	struct ivpu_job *job;
@@ -490,6 +492,9 @@ static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32
 	if (!job)
 		return -ENOENT;
 
+	if (job_status == VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW)
+		ivpu_pm_trigger_recovery(vdev, "HW context violation");
+
 	if (job->file_priv->has_mmu_faults)
 		job_status = DRM_IVPU_JOB_STATUS_ABORTED;
 
-- 
2.45.1
Re: [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW
Posted by Greg KH 2 weeks, 6 days ago
On Tue, Apr 08, 2025 at 11:57:11AM +0200, Jacek Lawrynowicz wrote:
> From: Karol Wachowski <karol.wachowski@intel.com>
> 
> commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream.
> 
> Trigger recovery of the NPU upon receiving HW context violation from
> the firmware. The context violation error is a fatal error that prevents
> any subsequent jobs from being executed. Without this fix it is
> necessary to reload the driver to restore the NPU operational state.
> 
> This is simplified version of upstream commit as the full implementation
> would require all engine reset/resume logic to be backported.

We REALLY do not like taking patches that are not upstream.  Why not
backport all of the needed patches instead, how many would that be?
Taking one-off patches like this just makes it harder/impossible to
maintain the code over time as further fixes in this same area will NOT
apply properly at all.

Think about what you want to be touching 5 years from now, a one-off
change that doesn't match the rest of the kernel tree, or something that
is the same?

thanks,

greg k-h
Re: [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW
Posted by Jacek Lawrynowicz 2 weeks, 4 days ago
Hi,

On 4/22/2025 2:17 PM, Greg KH wrote:
> On Tue, Apr 08, 2025 at 11:57:11AM +0200, Jacek Lawrynowicz wrote:
>> From: Karol Wachowski <karol.wachowski@intel.com>
>>
>> commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream.
>>
>> Trigger recovery of the NPU upon receiving HW context violation from
>> the firmware. The context violation error is a fatal error that prevents
>> any subsequent jobs from being executed. Without this fix it is
>> necessary to reload the driver to restore the NPU operational state.
>>
>> This is simplified version of upstream commit as the full implementation
>> would require all engine reset/resume logic to be backported.
> 
> We REALLY do not like taking patches that are not upstream.  Why not
> backport all of the needed patches instead, how many would that be?
> Taking one-off patches like this just makes it harder/impossible to
> maintain the code over time as further fixes in this same area will NOT
> apply properly at all.
> 
> Think about what you want to be touching 5 years from now, a one-off
> change that doesn't match the rest of the kernel tree, or something that
> is the same?

Sure, I'm totally on board with backporting all required patches.
I thought it was not possible due to 100 line limit.

This would be the minimum set of patches:

Patch 1:
 drivers/accel/ivpu/ivpu_drv.c   | 32 +++-----------
 drivers/accel/ivpu/ivpu_drv.h   |  2 +
 drivers/accel/ivpu/ivpu_job.c   | 78 ++++++++++++++++++++++++++-------
 drivers/accel/ivpu/ivpu_job.h   |  1 +
 drivers/accel/ivpu/ivpu_mmu.c   |  3 +-
 drivers/accel/ivpu/ivpu_sysfs.c |  5 ++-
 6 files changed, 75 insertions(+), 46 deletions(-)

Patch 2:
 drivers/accel/ivpu/ivpu_job.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

Patch 3:
 drivers/accel/ivpu/ivpu_job.c     |   2 +-
 drivers/accel/ivpu/ivpu_jsm_msg.c |   3 +-
 drivers/accel/ivpu/vpu_boot_api.h |  45 +++--
 drivers/accel/ivpu/vpu_jsm_api.h  | 303 +++++++++++++++++++++++++-----
 4 files changed, 293 insertions(+), 60 deletions(-)

Patch 4:
 drivers/accel/ivpu/ivpu_job.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

First patch needs some changes to apply correctly to 6.12 but the rest of them apply pretty cleanly.
Is this acceptable?

Regards,
Jacek
Re: [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW
Posted by Greg KH 2 weeks, 4 days ago
On Thu, Apr 24, 2025 at 12:22:31PM +0200, Jacek Lawrynowicz wrote:
> Hi,
> 
> On 4/22/2025 2:17 PM, Greg KH wrote:
> > On Tue, Apr 08, 2025 at 11:57:11AM +0200, Jacek Lawrynowicz wrote:
> >> From: Karol Wachowski <karol.wachowski@intel.com>
> >>
> >> commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream.
> >>
> >> Trigger recovery of the NPU upon receiving HW context violation from
> >> the firmware. The context violation error is a fatal error that prevents
> >> any subsequent jobs from being executed. Without this fix it is
> >> necessary to reload the driver to restore the NPU operational state.
> >>
> >> This is simplified version of upstream commit as the full implementation
> >> would require all engine reset/resume logic to be backported.
> > 
> > We REALLY do not like taking patches that are not upstream.  Why not
> > backport all of the needed patches instead, how many would that be?
> > Taking one-off patches like this just makes it harder/impossible to
> > maintain the code over time as further fixes in this same area will NOT
> > apply properly at all.
> > 
> > Think about what you want to be touching 5 years from now, a one-off
> > change that doesn't match the rest of the kernel tree, or something that
> > is the same?
> 
> Sure, I'm totally on board with backporting all required patches.
> I thought it was not possible due to 100 line limit.
> 
> This would be the minimum set of patches:
> 
> Patch 1:
>  drivers/accel/ivpu/ivpu_drv.c   | 32 +++-----------
>  drivers/accel/ivpu/ivpu_drv.h   |  2 +
>  drivers/accel/ivpu/ivpu_job.c   | 78 ++++++++++++++++++++++++++-------
>  drivers/accel/ivpu/ivpu_job.h   |  1 +
>  drivers/accel/ivpu/ivpu_mmu.c   |  3 +-
>  drivers/accel/ivpu/ivpu_sysfs.c |  5 ++-
>  6 files changed, 75 insertions(+), 46 deletions(-)
> 
> Patch 2:
>  drivers/accel/ivpu/ivpu_job.c | 15 ++++++---------
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> Patch 3:
>  drivers/accel/ivpu/ivpu_job.c     |   2 +-
>  drivers/accel/ivpu/ivpu_jsm_msg.c |   3 +-
>  drivers/accel/ivpu/vpu_boot_api.h |  45 +++--
>  drivers/accel/ivpu/vpu_jsm_api.h  | 303 +++++++++++++++++++++++++-----
>  4 files changed, 293 insertions(+), 60 deletions(-)
> 
> Patch 4:
>  drivers/accel/ivpu/ivpu_job.c | 27 ++++++++++++++++++++++++++-
>  1 file changed, 26 insertions(+), 1 deletion(-)
> 
> First patch needs some changes to apply correctly to 6.12 but the rest of them apply pretty cleanly.
> Is this acceptable?

Totally acceptable, that's trivial compared to many of the larger
backports we have taken over the years :)

thanks,

greg k-h
Re: [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW
Posted by Jacek Lawrynowicz 1 week, 5 days ago
Hi,

On 4/24/2025 12:34 PM, Greg KH wrote:
> On Thu, Apr 24, 2025 at 12:22:31PM +0200, Jacek Lawrynowicz wrote:
>> Hi,
>>
>> On 4/22/2025 2:17 PM, Greg KH wrote:
>>> On Tue, Apr 08, 2025 at 11:57:11AM +0200, Jacek Lawrynowicz wrote:
>>>> From: Karol Wachowski <karol.wachowski@intel.com>
>>>>
>>>> commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream.
>>>>
>>>> Trigger recovery of the NPU upon receiving HW context violation from
>>>> the firmware. The context violation error is a fatal error that prevents
>>>> any subsequent jobs from being executed. Without this fix it is
>>>> necessary to reload the driver to restore the NPU operational state.
>>>>
>>>> This is simplified version of upstream commit as the full implementation
>>>> would require all engine reset/resume logic to be backported.
>>>
>>> We REALLY do not like taking patches that are not upstream.  Why not
>>> backport all of the needed patches instead, how many would that be?
>>> Taking one-off patches like this just makes it harder/impossible to
>>> maintain the code over time as further fixes in this same area will NOT
>>> apply properly at all.
>>>
>>> Think about what you want to be touching 5 years from now, a one-off
>>> change that doesn't match the rest of the kernel tree, or something that
>>> is the same?
>>
>> Sure, I'm totally on board with backporting all required patches.
>> I thought it was not possible due to 100 line limit.
>>
>> This would be the minimum set of patches:
>>
>> Patch 1:
>>  drivers/accel/ivpu/ivpu_drv.c   | 32 +++-----------
>>  drivers/accel/ivpu/ivpu_drv.h   |  2 +
>>  drivers/accel/ivpu/ivpu_job.c   | 78 ++++++++++++++++++++++++++-------
>>  drivers/accel/ivpu/ivpu_job.h   |  1 +
>>  drivers/accel/ivpu/ivpu_mmu.c   |  3 +-
>>  drivers/accel/ivpu/ivpu_sysfs.c |  5 ++-
>>  6 files changed, 75 insertions(+), 46 deletions(-)
>>
>> Patch 2:
>>  drivers/accel/ivpu/ivpu_job.c | 15 ++++++---------
>>  1 file changed, 6 insertions(+), 9 deletions(-)
>>
>> Patch 3:
>>  drivers/accel/ivpu/ivpu_job.c     |   2 +-
>>  drivers/accel/ivpu/ivpu_jsm_msg.c |   3 +-
>>  drivers/accel/ivpu/vpu_boot_api.h |  45 +++--
>>  drivers/accel/ivpu/vpu_jsm_api.h  | 303 +++++++++++++++++++++++++-----
>>  4 files changed, 293 insertions(+), 60 deletions(-)
>>
>> Patch 4:
>>  drivers/accel/ivpu/ivpu_job.c | 27 ++++++++++++++++++++++++++-
>>  1 file changed, 26 insertions(+), 1 deletion(-)
>>
>> First patch needs some changes to apply correctly to 6.12 but the rest of them apply pretty cleanly.
>> Is this acceptable?
> 
> Totally acceptable, that's trivial compared to many of the larger
> backports we have taken over the years :)

OK, I've sent two separate patchses for 6.12 and 6.14 that contain minimal number of patches.
I've rebased only two patches in each patchsets and the rest is as-is from upstream.
Let me know in case I messed something up.

Regards,
Jacek
Re: [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW
Posted by Jacek Lawrynowicz 1 month ago
Hi,

This is an important patch for the Intel NPU.
Is there anything it is missing to be included in stable?

Regards,
Jacek

On 4/8/2025 11:57 AM, Jacek Lawrynowicz wrote:
> From: Karol Wachowski <karol.wachowski@intel.com>
> 
> commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream.
> 
> Trigger recovery of the NPU upon receiving HW context violation from
> the firmware. The context violation error is a fatal error that prevents
> any subsequent jobs from being executed. Without this fix it is
> necessary to reload the driver to restore the NPU operational state.
> 
> This is simplified version of upstream commit as the full implementation
> would require all engine reset/resume logic to be backported.
> 
> Signed-off-by: Karol Wachowski <karol.wachowski@intel.com>
> Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
> Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
> Link: https://patchwork.freedesktop.org/patch/msgid/20250107173238.381120-13-maciej.falkowski@linux.intel.com
> Fixes: 0adff3b0ef12 ("accel/ivpu: Share NPU busy time in sysfs")
> Cc: <stable@vger.kernel.org> # v6.11+
> ---
>  drivers/accel/ivpu/ivpu_job.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c
> index be2e2bf0f43f0..70b3676974407 100644
> --- a/drivers/accel/ivpu/ivpu_job.c
> +++ b/drivers/accel/ivpu/ivpu_job.c
> @@ -482,6 +482,8 @@ static struct ivpu_job *ivpu_job_remove_from_submitted_jobs(struct ivpu_device *
>  	return job;
>  }
>  
> +#define VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW 0xEU
> +
>  static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32 job_status)
>  {
>  	struct ivpu_job *job;
> @@ -490,6 +492,9 @@ static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32
>  	if (!job)
>  		return -ENOENT;
>  
> +	if (job_status == VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW)
> +		ivpu_pm_trigger_recovery(vdev, "HW context violation");
> +
>  	if (job->file_priv->has_mmu_faults)
>  		job_status = DRM_IVPU_JOB_STATUS_ABORTED;
>
Re: [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW
Posted by Greg Kroah-Hartman 1 month ago
On Thu, Apr 10, 2025 at 09:49:37AM +0200, Jacek Lawrynowicz wrote:
> Hi,
> 
> This is an important patch for the Intel NPU.
> Is there anything it is missing to be included in stable?

Patience, you only sent this:

> On 4/8/2025 11:57 AM, Jacek Lawrynowicz wrote:

2 days ago, AFTER the latest of -rc releases was sent out for review,
and those kernels have NOT even been released yet!

[rant about how you all know this process works deleted as it was
 just snarky on my side, although quite cathartic, thanks for letting me
 vent...]

Relax, it will get handled when we can get to it.  To help out, please
take the time to review pending stable backported patches that have been
submitted to the mailing list ahead of yours.

greg k-h
Re: [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW
Posted by Jacek Lawrynowicz 1 month ago
Hi,

On 4/10/2025 10:03 AM, Greg Kroah-Hartman wrote:
> On Thu, Apr 10, 2025 at 09:49:37AM +0200, Jacek Lawrynowicz wrote:
>> Hi,
>>
>> This is an important patch for the Intel NPU.
>> Is there anything it is missing to be included in stable?
> 
> Patience, you only sent this:
> 
>> On 4/8/2025 11:57 AM, Jacek Lawrynowicz wrote:
> 
> 2 days ago, AFTER the latest of -rc releases was sent out for review,
> and those kernels have NOT even been released yet!
> 
> [rant about how you all know this process works deleted as it was
>  just snarky on my side, although quite cathartic, thanks for letting me
>  vent...]
> 
> Relax, it will get handled when we can get to it.  To help out, please
> take the time to review pending stable backported patches that have been
> submitted to the mailing list ahead of yours.

Yeah, sorry about that. It seems I'm still learning the art of patience in the kernel processes.
I didn't mean to rush anything. I was just worried my patch might have been overlooked due to a typo or a wrong commit message tag.

I'll make sure to review the backported patches in the meantime.

And thanks for holding back on the rant. I really think that the standard set by Linus is not healthy.

Regards,
Jacek