* [PATCH] accel/ivpu: Trigger device recovery on engine reset/resume failure
From: Jacek Lawrynowicz @ 2025-05-28 15:42 UTC
To: dri-devel
Cc: jeff.hugo, lizhi.hou, Karol Wachowski, stable, Jacek Lawrynowicz
From: Karol Wachowski <karol.wachowski@intel.com>
Trigger full device recovery when the driver fails to restore device state
via engine reset and resume operations. This is necessary because, even if
submissions from a faulty context are blocked, the NPU may still process
previously submitted faulty jobs if the engine reset fails to abort them.
Such jobs can continue to generate faults and occupy device resources.
When engine reset is ineffective, the only way to recover is to perform
a full device recovery.
Fixes: dad945c27a42 ("accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW")
Cc: <stable@vger.kernel.org> # v6.15+
Signed-off-by: Karol Wachowski <karol.wachowski@intel.com>
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
---
drivers/accel/ivpu/ivpu_job.c | 6 ++++--
drivers/accel/ivpu/ivpu_jsm_msg.c | 9 +++++++--
2 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c
index 1c8e283ad9854..fae8351aa3309 100644
--- a/drivers/accel/ivpu/ivpu_job.c
+++ b/drivers/accel/ivpu/ivpu_job.c
@@ -986,7 +986,8 @@ void ivpu_context_abort_work_fn(struct work_struct *work)
 		return;
 
 	if (vdev->fw->sched_mode == VPU_SCHEDULING_MODE_HW)
-		ivpu_jsm_reset_engine(vdev, 0);
+		if (ivpu_jsm_reset_engine(vdev, 0))
+			return;
 
 	mutex_lock(&vdev->context_list_lock);
 	xa_for_each(&vdev->context_xa, ctx_id, file_priv) {
@@ -1009,7 +1010,8 @@ void ivpu_context_abort_work_fn(struct work_struct *work)
 	if (vdev->fw->sched_mode != VPU_SCHEDULING_MODE_HW)
 		goto runtime_put;
 
-	ivpu_jsm_hws_resume_engine(vdev, 0);
+	if (ivpu_jsm_hws_resume_engine(vdev, 0))
+		return;
 	/*
 	 * In hardware scheduling mode NPU already has stopped processing jobs
 	 * and won't send us any further notifications, thus we have to free job related resources
diff --git a/drivers/accel/ivpu/ivpu_jsm_msg.c b/drivers/accel/ivpu/ivpu_jsm_msg.c
index 219ab8afefabd..0256b2dfefc10 100644
--- a/drivers/accel/ivpu/ivpu_jsm_msg.c
+++ b/drivers/accel/ivpu/ivpu_jsm_msg.c
@@ -7,6 +7,7 @@
 #include "ivpu_hw.h"
 #include "ivpu_ipc.h"
 #include "ivpu_jsm_msg.h"
+#include "ivpu_pm.h"
 #include "vpu_jsm_api.h"
 
 const char *ivpu_jsm_msg_type_to_str(enum vpu_ipc_msg_type type)
@@ -163,8 +164,10 @@ int ivpu_jsm_reset_engine(struct ivpu_device *vdev, u32 engine)
 
 	ret = ivpu_ipc_send_receive(vdev, &req, VPU_JSM_MSG_ENGINE_RESET_DONE, &resp,
 				    VPU_IPC_CHAN_ASYNC_CMD, vdev->timeout.jsm);
-	if (ret)
+	if (ret) {
 		ivpu_err_ratelimited(vdev, "Failed to reset engine %d: %d\n", engine, ret);
+		ivpu_pm_trigger_recovery(vdev, "Engine reset failed");
+	}
 
 	return ret;
 }
@@ -354,8 +357,10 @@ int ivpu_jsm_hws_resume_engine(struct ivpu_device *vdev, u32 engine)
 
 	ret = ivpu_ipc_send_receive(vdev, &req, VPU_JSM_MSG_HWS_RESUME_ENGINE_DONE, &resp,
 				    VPU_IPC_CHAN_ASYNC_CMD, vdev->timeout.jsm);
-	if (ret)
+	if (ret) {
 		ivpu_err_ratelimited(vdev, "Failed to resume engine %d: %d\n", engine, ret);
+		ivpu_pm_trigger_recovery(vdev, "Engine resume failed");
+	}
 
 	return ret;
 }
--
2.45.1
* Re: [PATCH] accel/ivpu: Trigger device recovery on engine reset/resume failure
From: Lizhi Hou @ 2025-05-28 17:53 UTC
To: Jacek Lawrynowicz, dri-devel; +Cc: jeff.hugo, Karol Wachowski, stable
On 5/28/25 08:42, Jacek Lawrynowicz wrote:
> From: Karol Wachowski <karol.wachowski@intel.com>
>
> Trigger full device recovery when the driver fails to restore device state
> via engine reset and resume operations. This is necessary because, even if
> submissions from a faulty context are blocked, the NPU may still process
> previously submitted faulty jobs if the engine reset fails to abort them.
> Such jobs can continue to generate faults and occupy device resources.
> When engine reset is ineffective, the only way to recover is to perform
> a full device recovery.
>
> Fixes: dad945c27a42 ("accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW")
> Cc: <stable@vger.kernel.org> # v6.15+
> Signed-off-by: Karol Wachowski <karol.wachowski@intel.com>
> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
> ---
> drivers/accel/ivpu/ivpu_job.c | 6 ++++--
> drivers/accel/ivpu/ivpu_jsm_msg.c | 9 +++++++--
> 2 files changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c
> index 1c8e283ad9854..fae8351aa3309 100644
> --- a/drivers/accel/ivpu/ivpu_job.c
> +++ b/drivers/accel/ivpu/ivpu_job.c
> @@ -986,7 +986,8 @@ void ivpu_context_abort_work_fn(struct work_struct *work)
> return;
>
> if (vdev->fw->sched_mode == VPU_SCHEDULING_MODE_HW)
> - ivpu_jsm_reset_engine(vdev, 0);
> + if (ivpu_jsm_reset_engine(vdev, 0))
> + return;
Is it possible the context aborting is entered again before the full
device recovery work is executed?
Thanks,
Lizhi
>
> mutex_lock(&vdev->context_list_lock);
> xa_for_each(&vdev->context_xa, ctx_id, file_priv) {
> @@ -1009,7 +1010,8 @@ void ivpu_context_abort_work_fn(struct work_struct *work)
> if (vdev->fw->sched_mode != VPU_SCHEDULING_MODE_HW)
> goto runtime_put;
>
> - ivpu_jsm_hws_resume_engine(vdev, 0);
> + if (ivpu_jsm_hws_resume_engine(vdev, 0))
> + return;
> /*
> * In hardware scheduling mode NPU already has stopped processing jobs
> * and won't send us any further notifications, thus we have to free job related resources
> diff --git a/drivers/accel/ivpu/ivpu_jsm_msg.c b/drivers/accel/ivpu/ivpu_jsm_msg.c
> index 219ab8afefabd..0256b2dfefc10 100644
> --- a/drivers/accel/ivpu/ivpu_jsm_msg.c
> +++ b/drivers/accel/ivpu/ivpu_jsm_msg.c
> @@ -7,6 +7,7 @@
> #include "ivpu_hw.h"
> #include "ivpu_ipc.h"
> #include "ivpu_jsm_msg.h"
> +#include "ivpu_pm.h"
> #include "vpu_jsm_api.h"
>
> const char *ivpu_jsm_msg_type_to_str(enum vpu_ipc_msg_type type)
> @@ -163,8 +164,10 @@ int ivpu_jsm_reset_engine(struct ivpu_device *vdev, u32 engine)
>
> ret = ivpu_ipc_send_receive(vdev, &req, VPU_JSM_MSG_ENGINE_RESET_DONE, &resp,
> VPU_IPC_CHAN_ASYNC_CMD, vdev->timeout.jsm);
> - if (ret)
> + if (ret) {
> ivpu_err_ratelimited(vdev, "Failed to reset engine %d: %d\n", engine, ret);
> + ivpu_pm_trigger_recovery(vdev, "Engine reset failed");
> + }
>
> return ret;
> }
> @@ -354,8 +357,10 @@ int ivpu_jsm_hws_resume_engine(struct ivpu_device *vdev, u32 engine)
>
> ret = ivpu_ipc_send_receive(vdev, &req, VPU_JSM_MSG_HWS_RESUME_ENGINE_DONE, &resp,
> VPU_IPC_CHAN_ASYNC_CMD, vdev->timeout.jsm);
> - if (ret)
> + if (ret) {
> ivpu_err_ratelimited(vdev, "Failed to resume engine %d: %d\n", engine, ret);
> + ivpu_pm_trigger_recovery(vdev, "Engine resume failed");
> + }
>
> return ret;
> }
* Re: [PATCH] accel/ivpu: Trigger device recovery on engine reset/resume failure
From: Jacek Lawrynowicz @ 2025-06-02 13:05 UTC
To: Lizhi Hou, dri-devel; +Cc: jeff.hugo, Karol Wachowski, stable
Hi,
On 5/28/2025 7:53 PM, Lizhi Hou wrote:
>
> On 5/28/25 08:42, Jacek Lawrynowicz wrote:
>> From: Karol Wachowski <karol.wachowski@intel.com>
>>
>> Trigger full device recovery when the driver fails to restore device state
>> via engine reset and resume operations. This is necessary because, even if
>> submissions from a faulty context are blocked, the NPU may still process
>> previously submitted faulty jobs if the engine reset fails to abort them.
>> Such jobs can continue to generate faults and occupy device resources.
>> When engine reset is ineffective, the only way to recover is to perform
>> a full device recovery.
>>
>> Fixes: dad945c27a42 ("accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW")
>> Cc: <stable@vger.kernel.org> # v6.15+
>> Signed-off-by: Karol Wachowski <karol.wachowski@intel.com>
>> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
>> ---
>> drivers/accel/ivpu/ivpu_job.c | 6 ++++--
>> drivers/accel/ivpu/ivpu_jsm_msg.c | 9 +++++++--
>> 2 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c
>> index 1c8e283ad9854..fae8351aa3309 100644
>> --- a/drivers/accel/ivpu/ivpu_job.c
>> +++ b/drivers/accel/ivpu/ivpu_job.c
>> @@ -986,7 +986,8 @@ void ivpu_context_abort_work_fn(struct work_struct *work)
>> return;
>> if (vdev->fw->sched_mode == VPU_SCHEDULING_MODE_HW)
>> - ivpu_jsm_reset_engine(vdev, 0);
>> + if (ivpu_jsm_reset_engine(vdev, 0))
>> + return;
>
> Is it possible the context aborting is entered again before the full device recovery work is executed?
This is a good point, but ivpu_context_abort_work_fn() is triggered by an IRQ, and the first thing we do when triggering recovery is disable IRQs.
The recovery work also flushes context_abort_work before starting to tear everything down, so we should be safe.
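The ordering argument above can be illustrated with a minimal userspace sketch. It is only a model under stated assumptions, not the driver code: `struct fake_dev` and every `fake_*` function are hypothetical names invented for illustration, and plain flags stand in for the real IRQ masking and workqueue machinery.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are ivpu driver symbols. */
struct fake_dev {
	bool irq_enabled;      /* cleared as the first step of triggering recovery */
	bool abort_pending;    /* context-abort work queued but not yet run */
	bool recovery_pending; /* recovery work queued but not yet run */
	bool torn_down;        /* set once recovery has torn the device down */
	int abort_runs;        /* number of times the abort work has executed */
};

/* Models the IRQ path: abort work can only be queued while IRQs are enabled. */
static bool fake_irq(struct fake_dev *d)
{
	if (!d->irq_enabled)
		return false;
	d->abort_pending = true;
	return true;
}

/* Models triggering recovery: disable IRQs first, then queue recovery work. */
static void fake_trigger_recovery(struct fake_dev *d)
{
	d->irq_enabled = false;
	d->recovery_pending = true;
}

/* Models the abort work; reset_fails simulates a failed engine reset. */
static void fake_abort_work(struct fake_dev *d, bool reset_fails)
{
	d->abort_pending = false;
	d->abort_runs++;
	if (reset_fails)
		fake_trigger_recovery(d); /* bail out; recovery cleans up */
}

/* Models the recovery work: flush queued abort work, then tear down. */
static void fake_recovery_work(struct fake_dev *d)
{
	if (d->abort_pending)
		fake_abort_work(d, true);
	/* By now no abort work is queued and no IRQ can queue more. */
	assert(!d->irq_enabled && !d->abort_pending);
	d->torn_down = true;
	d->recovery_pending = false;
}
```

In this model, once `fake_abort_work()` has hit the failure path, a later `fake_irq()` returns false and queues nothing, and any abort work queued earlier is drained by `fake_recovery_work()` before teardown begins.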
Regards,
Jacek
* Re: [PATCH] accel/ivpu: Trigger device recovery on engine reset/resume failure
From: Lizhi Hou @ 2025-06-05 8:51 UTC
To: Jacek Lawrynowicz, dri-devel; +Cc: jeff.hugo, Karol Wachowski, stable
On 6/2/25 06:05, Jacek Lawrynowicz wrote:
> Hi,
>
> On 5/28/2025 7:53 PM, Lizhi Hou wrote:
>> On 5/28/25 08:42, Jacek Lawrynowicz wrote:
>>> From: Karol Wachowski <karol.wachowski@intel.com>
>>>
>>> Trigger full device recovery when the driver fails to restore device state
>>> via engine reset and resume operations. This is necessary because, even if
>>> submissions from a faulty context are blocked, the NPU may still process
>>> previously submitted faulty jobs if the engine reset fails to abort them.
>>> Such jobs can continue to generate faults and occupy device resources.
>>> When engine reset is ineffective, the only way to recover is to perform
>>> a full device recovery.
>>>
>>> Fixes: dad945c27a42 ("accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW")
>>> Cc: <stable@vger.kernel.org> # v6.15+
>>> Signed-off-by: Karol Wachowski <karol.wachowski@intel.com>
>>> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
>>> ---
>>> drivers/accel/ivpu/ivpu_job.c | 6 ++++--
>>> drivers/accel/ivpu/ivpu_jsm_msg.c | 9 +++++++--
>>> 2 files changed, 11 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c
>>> index 1c8e283ad9854..fae8351aa3309 100644
>>> --- a/drivers/accel/ivpu/ivpu_job.c
>>> +++ b/drivers/accel/ivpu/ivpu_job.c
>>> @@ -986,7 +986,8 @@ void ivpu_context_abort_work_fn(struct work_struct *work)
>>> return;
>>> if (vdev->fw->sched_mode == VPU_SCHEDULING_MODE_HW)
>>> - ivpu_jsm_reset_engine(vdev, 0);
>>> + if (ivpu_jsm_reset_engine(vdev, 0))
>>> + return;
>> Is it possible the context aborting is entered again before the full device recovery work is executed?
> This is a good point, but ivpu_context_abort_work_fn() is triggered by an IRQ, and the first thing we do when triggering recovery is disable IRQs.
> The recovery work also flushes context_abort_work before starting to tear everything down, so we should be safe.
Reviewed-by: Lizhi Hou <lizhi.hou@amd.com>
>
> Regards,
> Jacek
>
* Re: [PATCH] accel/ivpu: Trigger device recovery on engine reset/resume failure
From: Jacek Lawrynowicz @ 2025-06-05 12:45 UTC
To: dri-devel; +Cc: jeff.hugo, lizhi.hou, Karol Wachowski, stable
Applied to drm-misc-fixes