* [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW
@ 2025-04-08 9:57 Jacek Lawrynowicz
2025-04-10 7:49 ` Jacek Lawrynowicz
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Jacek Lawrynowicz @ 2025-04-08 9:57 UTC (permalink / raw)
To: linux-kernel; +Cc: stable, Karol Wachowski, Jacek Lawrynowicz
From: Karol Wachowski <karol.wachowski@intel.com>
commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream.
Trigger recovery of the NPU upon receiving HW context violation from
the firmware. The context violation error is a fatal error that prevents
any subsequent jobs from being executed. Without this fix it is
necessary to reload the driver to restore the NPU operational state.
This is simplified version of upstream commit as the full implementation
would require all engine reset/resume logic to be backported.
Signed-off-by: Karol Wachowski <karol.wachowski@intel.com>
Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250107173238.381120-13-maciej.falkowski@linux.intel.com
Fixes: 0adff3b0ef12 ("accel/ivpu: Share NPU busy time in sysfs")
Cc: <stable@vger.kernel.org> # v6.11+
---
drivers/accel/ivpu/ivpu_job.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c
index be2e2bf0f43f0..70b3676974407 100644
--- a/drivers/accel/ivpu/ivpu_job.c
+++ b/drivers/accel/ivpu/ivpu_job.c
@@ -482,6 +482,8 @@ static struct ivpu_job *ivpu_job_remove_from_submitted_jobs(struct ivpu_device *
return job;
}
+#define VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW 0xEU
+
static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32 job_status)
{
struct ivpu_job *job;
@@ -490,6 +492,9 @@ static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32
if (!job)
return -ENOENT;
+ if (job_status == VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW)
+ ivpu_pm_trigger_recovery(vdev, "HW context violation");
+
if (job->file_priv->has_mmu_faults)
job_status = DRM_IVPU_JOB_STATUS_ABORTED;
--
2.45.1
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW 2025-04-08 9:57 [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW Jacek Lawrynowicz @ 2025-04-10 7:49 ` Jacek Lawrynowicz 2025-04-10 8:03 ` Greg Kroah-Hartman 2025-04-10 15:53 ` Sasha Levin 2025-04-22 12:17 ` Greg KH 2 siblings, 1 reply; 10+ messages in thread From: Jacek Lawrynowicz @ 2025-04-10 7:49 UTC (permalink / raw) To: linux-kernel, Greg Kroah-Hartman; +Cc: stable, Karol Wachowski Hi, This is an important patch for the Intel NPU. Is there anything it is missing to be included in stable? Regards, Jacek On 4/8/2025 11:57 AM, Jacek Lawrynowicz wrote: > From: Karol Wachowski <karol.wachowski@intel.com> > > commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream. > > Trigger recovery of the NPU upon receiving HW context violation from > the firmware. The context violation error is a fatal error that prevents > any subsequent jobs from being executed. Without this fix it is > necessary to reload the driver to restore the NPU operational state. > > This is simplified version of upstream commit as the full implementation > would require all engine reset/resume logic to be backported. > > Signed-off-by: Karol Wachowski <karol.wachowski@intel.com> > Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> > Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> > Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> > Link: https://patchwork.freedesktop.org/patch/msgid/20250107173238.381120-13-maciej.falkowski@linux.intel.com > Fixes: 0adff3b0ef12 ("accel/ivpu: Share NPU busy time in sysfs") > Cc: <stable@vger.kernel.org> # v6.11+ > --- > drivers/accel/ivpu/ivpu_job.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c > index be2e2bf0f43f0..70b3676974407 100644 > --- a/drivers/accel/ivpu/ivpu_job.c > +++ b/drivers/accel/ivpu/ivpu_job.c > @@ -482,6 +482,8 @@ static struct ivpu_job *ivpu_job_remove_from_submitted_jobs(struct ivpu_device * > return job; > } > > +#define VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW 0xEU > + > static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32 job_status) > { > struct ivpu_job *job; > @@ -490,6 +492,9 @@ static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32 > if (!job) > return -ENOENT; > > + if (job_status == VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW) > + ivpu_pm_trigger_recovery(vdev, "HW context violation"); > + > if (job->file_priv->has_mmu_faults) > job_status = DRM_IVPU_JOB_STATUS_ABORTED; > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW 2025-04-10 7:49 ` Jacek Lawrynowicz @ 2025-04-10 8:03 ` Greg Kroah-Hartman 2025-04-10 9:27 ` Jacek Lawrynowicz 0 siblings, 1 reply; 10+ messages in thread From: Greg Kroah-Hartman @ 2025-04-10 8:03 UTC (permalink / raw) To: Jacek Lawrynowicz; +Cc: linux-kernel, stable, Karol Wachowski On Thu, Apr 10, 2025 at 09:49:37AM +0200, Jacek Lawrynowicz wrote: > Hi, > > This is an important patch for the Intel NPU. > Is there anything it is missing to be included in stable? Patience, you only sent this: > On 4/8/2025 11:57 AM, Jacek Lawrynowicz wrote: 2 days ago, AFTER the latest of -rc releases was sent out for review, and those kernels have NOT even been released yet! [rant about how you all know this process works deleted as it was just snarky on my side, although quite cathartic, thanks for letting me vent...] Relax, it will get handled when we can get to it. To help out, please take the time to review pending stable backported patches that have been submitted to the mailing list ahead of yours. greg k-h ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW 2025-04-10 8:03 ` Greg Kroah-Hartman @ 2025-04-10 9:27 ` Jacek Lawrynowicz 0 siblings, 0 replies; 10+ messages in thread From: Jacek Lawrynowicz @ 2025-04-10 9:27 UTC (permalink / raw) To: Greg Kroah-Hartman; +Cc: linux-kernel, stable, Karol Wachowski Hi, On 4/10/2025 10:03 AM, Greg Kroah-Hartman wrote: > On Thu, Apr 10, 2025 at 09:49:37AM +0200, Jacek Lawrynowicz wrote: >> Hi, >> >> This is an important patch for the Intel NPU. >> Is there anything it is missing to be included in stable? > > Patience, you only sent this: > >> On 4/8/2025 11:57 AM, Jacek Lawrynowicz wrote: > > 2 days ago, AFTER the latest of -rc releases was sent out for review, > and those kernels have NOT even been released yet! > > [rant about how you all know this process works deleted as it was > just snarky on my side, although quite cathartic, thanks for letting me > vent...] > > Relax, it will get handled when we can get to it. To help out, please > take the time to review pending stable backported patches that have been > submitted to the mailing list ahead of yours. Yeah, sorry about that. It seems I'm still learning the art of patience in the kernel processes. I didn't mean to rush anything. I was just worried my patch might have been overlooked due to a typo or a wrong commit message tag. I'll make sure to review the backported patches in the meantime. And thanks for holding back on the rant. I really think that the standard set by Linus is not healthy. Regards, Jacek ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW 2025-04-08 9:57 [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW Jacek Lawrynowicz 2025-04-10 7:49 ` Jacek Lawrynowicz @ 2025-04-10 15:53 ` Sasha Levin 2025-04-22 12:17 ` Greg KH 2 siblings, 0 replies; 10+ messages in thread From: Sasha Levin @ 2025-04-10 15:53 UTC (permalink / raw) To: stable, jacek.lawrynowicz; +Cc: Sasha Levin [ Sasha's backport helper bot ] Hi, Summary of potential issues: ❌ Build failures detected The upstream commit SHA1 provided is correct: dad945c27a42dfadddff1049cf5ae417209a8996 WARNING: Author mismatch between patch and upstream commit: Backport author: Jacek Lawrynowicz<jacek.lawrynowicz@linux.intel.com> Commit author: Karol Wachowski<karol.wachowski@intel.com> Note: The patch differs from the upstream commit: --- 1: dad945c27a42d < -: ------------- accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW -: ------------- > 1: 09299f7b8afa8 accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW --- Results of testing on various branches: | Branch | Patch Apply | Build Test | |---------------------------|-------------|------------| | stable/linux-6.14.y | Success | Success | | stable/linux-6.13.y | Success | Success | | stable/linux-6.12.y | Success | Success | | stable/linux-6.6.y | Success | Success | | stable/linux-6.1.y | Success | Success | | stable/linux-5.15.y | Success | Success | | stable/linux-5.10.y | Success | Success | | stable/linux-5.4.y | Success | Failed | Build Errors: Build error for stable/linux-5.4.y: arch/arm64/boot/dts/exynos/exynos5433.dtsi:254.3-29: Warning (reg_format): /gpu@14ac0000:reg: property has invalid length (8 bytes) (#address-cells == 2, #size-cells == 2) arch/arm64/boot/dts/exynos/exynos5433-tm2.dtb: Warning (pci_device_bus_num): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos5433-tm2.dtb: Warning (i2c_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos5433-tm2.dtb: Warning (spi_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos7.dtsi:83.3-29: Warning (reg_format): /gpu@14ac0000:reg: property has invalid length (8 bytes) (#address-cells == 2, #size-cells == 2) arch/arm64/boot/dts/exynos/exynos7-espresso.dtb: Warning (pci_device_bus_num): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos7-espresso.dtb: Warning (i2c_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos7-espresso.dtb: Warning (spi_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos5433.dtsi:254.3-29: Warning (reg_format): /gpu@14ac0000:reg: property has invalid length (8 bytes) (#address-cells == 2, #size-cells == 2) arch/arm64/boot/dts/exynos/exynos5433-tm2e.dtb: Warning (pci_device_bus_num): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos5433-tm2e.dtb: Warning (i2c_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos5433-tm2e.dtb: Warning (spi_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/qcom/ipq8074-hk01.dts:61.5-15: Warning (reg_format): /soc/nand@79b0000/nand@0:reg: property has invalid length (4 bytes) (#address-cells == 2, #size-cells == 1) arch/arm64/boot/dts/qcom/ipq8074-hk01.dtb: Warning (pci_device_bus_num): Failed prerequisite 'reg_format' arch/arm64/boot/dts/qcom/ipq8074-hk01.dtb: Warning (i2c_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/qcom/ipq8074-hk01.dtb: Warning (spi_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/qcom/ipq8074-hk01.dts:60.11-65.6: Warning (avoid_default_addr_size): /soc/nand@79b0000/nand@0: Relying on default #address-cells value arch/arm64/boot/dts/qcom/ipq8074-hk01.dts:60.11-65.6: Warning (avoid_default_addr_size): /soc/nand@79b0000/nand@0: Relying on default #size-cells value In file included from ./include/linux/list.h:9, from ./include/linux/kobject.h:19, from ./include/linux/of.h:17, from ./include/linux/clk-provider.h:9, from drivers/clk/qcom/clk-rpmh.c:6: drivers/clk/qcom/clk-rpmh.c: In function 'clk_rpmh_bcm_send_cmd': ./include/linux/kernel.h:843:43: warning: comparison of distinct pointer types lacks a cast [-Wcompare-distinct-pointer-types] 843 | (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1))) | ^~ ./include/linux/kernel.h:857:18: note: in expansion of macro '__typecheck' 857 | (__typecheck(x, y) && __no_side_effects(x, y)) | ^~~~~~~~~~~ ./include/linux/kernel.h:867:31: note: in expansion of macro '__safe_cmp' 867 | __builtin_choose_expr(__safe_cmp(x, y), \ | ^~~~~~~~~~ ./include/linux/kernel.h:876:25: note: in expansion of macro '__careful_cmp' 876 | #define min(x, y) __careful_cmp(x, y, <) | ^~~~~~~~~~~~~ drivers/clk/qcom/clk-rpmh.c:273:21: note: in expansion of macro 'min' 273 | cmd_state = min(cmd_state, BCM_TCS_CMD_VOTE_MASK); | ^~~ fs/xfs/libxfs/xfs_inode_fork.c: In function 'xfs_ifork_verify_attr': fs/xfs/libxfs/xfs_inode_fork.c:735:13: warning: the comparison will always evaluate as 'true' for the address of 'i_df' will never be NULL [-Waddress] 735 | if (!XFS_IFORK_PTR(ip, XFS_ATTR_FORK)) | ^ In file included from fs/xfs/libxfs/xfs_inode_fork.c:14: ./fs/xfs/xfs_inode.h:38:33: note: 'i_df' declared here 38 | struct xfs_ifork i_df; /* data fork */ | ^~~~ drivers/net/dsa/microchip/ksz9477.c: In function 'ksz9477_reset_switch': drivers/net/dsa/microchip/ksz9477.c:198:12: warning: unused variable 'data8' [-Wunused-variable] 198 | u8 data8; | ^~~~~ In file included from ./include/linux/bitops.h:5, from ./include/linux/kernel.h:12, from ./include/linux/list.h:9, from ./include/linux/module.h:9, from drivers/net/ethernet/qlogic/qed/qed_debug.c:6: drivers/net/ethernet/qlogic/qed/qed_debug.c: In function 'qed_grc_dump_addr_range': ./include/linux/bits.h:8:33: warning: overflow in conversion from 'long unsigned int' to 'u8' {aka 'unsigned char'} changes value from '(long unsigned int)((int)vf_id << 8 | 128)' to '128' [-Woverflow] 8 | #define BIT(nr) (UL(1) << (nr)) | ^ drivers/net/ethernet/qlogic/qed/qed_debug.c:2572:31: note: in expansion of macro 'BIT' 2572 | fid = BIT(PXP_PRETEND_CONCRETE_FID_VFVALID_SHIFT) | | ^~~ drivers/gpu/drm/nouveau/dispnv50/wndw.c:628:1: warning: conflicting types for 'nv50_wndw_new_' due to enum/integer mismatch; have 'int(const struct nv50_wndw_func *, struct drm_device *, enum drm_plane_type, const char *, int, const u32 *, u32, enum nv50_disp_interlock_type, u32, struct nv50_wndw **)' {aka 'int(const struct nv50_wndw_func *, struct drm_device *, enum drm_plane_type, const char *, int, const unsigned int *, unsigned int, enum nv50_disp_interlock_type, unsigned int, struct nv50_wndw **)'} [-Wenum-int-mismatch] 628 | nv50_wndw_new_(const struct nv50_wndw_func *func, struct drm_device *dev, | ^~~~~~~~~~~~~~ In file included from drivers/gpu/drm/nouveau/dispnv50/wndw.c:22: drivers/gpu/drm/nouveau/dispnv50/wndw.h:39:5: note: previous declaration of 'nv50_wndw_new_' with type 'int(const struct nv50_wndw_func *, struct drm_device *, enum drm_plane_type, const char *, int, const u32 *, enum nv50_disp_interlock_type, u32, u32, struct nv50_wndw **)' {aka 'int(const struct nv50_wndw_func *, struct drm_device *, enum drm_plane_type, const char *, int, const unsigned int *, enum nv50_disp_interlock_type, unsigned int, unsigned int, struct nv50_wndw **)'} 39 | int nv50_wndw_new_(const struct nv50_wndw_func *, struct drm_device *, | ^~~~~~~~~~~~~~ Terminated make: *** [Makefile:1121: vmlinux] Error 143 make: Target '_all' not remade because of errors. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW 2025-04-08 9:57 [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW Jacek Lawrynowicz 2025-04-10 7:49 ` Jacek Lawrynowicz 2025-04-10 15:53 ` Sasha Levin @ 2025-04-22 12:17 ` Greg KH 2025-04-24 10:22 ` Jacek Lawrynowicz 2 siblings, 1 reply; 10+ messages in thread From: Greg KH @ 2025-04-22 12:17 UTC (permalink / raw) To: Jacek Lawrynowicz; +Cc: linux-kernel, stable, Karol Wachowski On Tue, Apr 08, 2025 at 11:57:11AM +0200, Jacek Lawrynowicz wrote: > From: Karol Wachowski <karol.wachowski@intel.com> > > commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream. > > Trigger recovery of the NPU upon receiving HW context violation from > the firmware. The context violation error is a fatal error that prevents > any subsequent jobs from being executed. Without this fix it is > necessary to reload the driver to restore the NPU operational state. > > This is simplified version of upstream commit as the full implementation > would require all engine reset/resume logic to be backported. We REALLY do not like taking patches that are not upstream. Why not backport all of the needed patches instead, how many would that be? Taking one-off patches like this just makes it harder/impossible to maintain the code over time as further fixes in this same area will NOT apply properly at all. Think about what you want to be touching 5 years from now, a one-off change that doesn't match the rest of the kernel tree, or something that is the same? thanks, greg k-h ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW 2025-04-22 12:17 ` Greg KH @ 2025-04-24 10:22 ` Jacek Lawrynowicz 2025-04-24 10:34 ` Greg KH 0 siblings, 1 reply; 10+ messages in thread From: Jacek Lawrynowicz @ 2025-04-24 10:22 UTC (permalink / raw) To: Greg KH; +Cc: linux-kernel, stable, Karol Wachowski Hi, On 4/22/2025 2:17 PM, Greg KH wrote: > On Tue, Apr 08, 2025 at 11:57:11AM +0200, Jacek Lawrynowicz wrote: >> From: Karol Wachowski <karol.wachowski@intel.com> >> >> commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream. >> >> Trigger recovery of the NPU upon receiving HW context violation from >> the firmware. The context violation error is a fatal error that prevents >> any subsequent jobs from being executed. Without this fix it is >> necessary to reload the driver to restore the NPU operational state. >> >> This is simplified version of upstream commit as the full implementation >> would require all engine reset/resume logic to be backported. > > We REALLY do not like taking patches that are not upstream. Why not > backport all of the needed patches instead, how many would that be? > Taking one-off patches like this just makes it harder/impossible to > maintain the code over time as further fixes in this same area will NOT > apply properly at all. > > Think about what you want to be touching 5 years from now, a one-off > change that doesn't match the rest of the kernel tree, or something that > is the same? Sure, I'm totally on board with backporting all required patches. I thought it was not possible due to 100 line limit. This would be the minimum set of patches: Patch 1: drivers/accel/ivpu/ivpu_drv.c | 32 +++----------- drivers/accel/ivpu/ivpu_drv.h | 2 + drivers/accel/ivpu/ivpu_job.c | 78 ++++++++++++++++++++++++++------- drivers/accel/ivpu/ivpu_job.h | 1 + drivers/accel/ivpu/ivpu_mmu.c | 3 +- drivers/accel/ivpu/ivpu_sysfs.c | 5 ++- 6 files changed, 75 insertions(+), 46 deletions(-) Patch 2: drivers/accel/ivpu/ivpu_job.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) Patch 3: drivers/accel/ivpu/ivpu_job.c | 2 +- drivers/accel/ivpu/ivpu_jsm_msg.c | 3 +- drivers/accel/ivpu/vpu_boot_api.h | 45 +++-- drivers/accel/ivpu/vpu_jsm_api.h | 303 +++++++++++++++++++++++++----- 4 files changed, 293 insertions(+), 60 deletions(-) Patch 4: drivers/accel/ivpu/ivpu_job.c | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-) First patch needs some changes to apply correctly to 6.12 but the rest of them apply pretty cleanly. Is this acceptable? Regards, Jacek ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW 2025-04-24 10:22 ` Jacek Lawrynowicz @ 2025-04-24 10:34 ` Greg KH 2025-04-30 12:50 ` Jacek Lawrynowicz 0 siblings, 1 reply; 10+ messages in thread From: Greg KH @ 2025-04-24 10:34 UTC (permalink / raw) To: Jacek Lawrynowicz; +Cc: linux-kernel, stable, Karol Wachowski On Thu, Apr 24, 2025 at 12:22:31PM +0200, Jacek Lawrynowicz wrote: > Hi, > > On 4/22/2025 2:17 PM, Greg KH wrote: > > On Tue, Apr 08, 2025 at 11:57:11AM +0200, Jacek Lawrynowicz wrote: > >> From: Karol Wachowski <karol.wachowski@intel.com> > >> > >> commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream. > >> > >> Trigger recovery of the NPU upon receiving HW context violation from > >> the firmware. The context violation error is a fatal error that prevents > >> any subsequent jobs from being executed. Without this fix it is > >> necessary to reload the driver to restore the NPU operational state. > >> > >> This is simplified version of upstream commit as the full implementation > >> would require all engine reset/resume logic to be backported. > > > > We REALLY do not like taking patches that are not upstream. Why not > > backport all of the needed patches instead, how many would that be? > > Taking one-off patches like this just makes it harder/impossible to > > maintain the code over time as further fixes in this same area will NOT > > apply properly at all. > > > > Think about what you want to be touching 5 years from now, a one-off > > change that doesn't match the rest of the kernel tree, or something that > > is the same? > > Sure, I'm totally on board with backporting all required patches. > I thought it was not possible due to 100 line limit. > > This would be the minimum set of patches: > > Patch 1: > drivers/accel/ivpu/ivpu_drv.c | 32 +++----------- > drivers/accel/ivpu/ivpu_drv.h | 2 + > drivers/accel/ivpu/ivpu_job.c | 78 ++++++++++++++++++++++++++------- > drivers/accel/ivpu/ivpu_job.h | 1 + > drivers/accel/ivpu/ivpu_mmu.c | 3 +- > drivers/accel/ivpu/ivpu_sysfs.c | 5 ++- > 6 files changed, 75 insertions(+), 46 deletions(-) > > Patch 2: > drivers/accel/ivpu/ivpu_job.c | 15 ++++++--------- > 1 file changed, 6 insertions(+), 9 deletions(-) > > Patch 3: > drivers/accel/ivpu/ivpu_job.c | 2 +- > drivers/accel/ivpu/ivpu_jsm_msg.c | 3 +- > drivers/accel/ivpu/vpu_boot_api.h | 45 +++-- > drivers/accel/ivpu/vpu_jsm_api.h | 303 +++++++++++++++++++++++++----- > 4 files changed, 293 insertions(+), 60 deletions(-) > > Patch 4: > drivers/accel/ivpu/ivpu_job.c | 27 ++++++++++++++++++++++++++- > 1 file changed, 26 insertions(+), 1 deletion(-) > > First patch needs some changes to apply correctly to 6.12 but the rest of them apply pretty cleanly. > Is this acceptable? Totally acceptable, that's trivial compared to many of the larger backports we have taken over the years :) thanks, greg k-h ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW 2025-04-24 10:34 ` Greg KH @ 2025-04-30 12:50 ` Jacek Lawrynowicz 0 siblings, 0 replies; 10+ messages in thread From: Jacek Lawrynowicz @ 2025-04-30 12:50 UTC (permalink / raw) To: Greg KH; +Cc: linux-kernel, stable, Karol Wachowski Hi, On 4/24/2025 12:34 PM, Greg KH wrote: > On Thu, Apr 24, 2025 at 12:22:31PM +0200, Jacek Lawrynowicz wrote: >> Hi, >> >> On 4/22/2025 2:17 PM, Greg KH wrote: >>> On Tue, Apr 08, 2025 at 11:57:11AM +0200, Jacek Lawrynowicz wrote: >>>> From: Karol Wachowski <karol.wachowski@intel.com> >>>> >>>> commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream. >>>> >>>> Trigger recovery of the NPU upon receiving HW context violation from >>>> the firmware. The context violation error is a fatal error that prevents >>>> any subsequent jobs from being executed. Without this fix it is >>>> necessary to reload the driver to restore the NPU operational state. >>>> >>>> This is simplified version of upstream commit as the full implementation >>>> would require all engine reset/resume logic to be backported. >>> >>> We REALLY do not like taking patches that are not upstream. Why not >>> backport all of the needed patches instead, how many would that be? >>> Taking one-off patches like this just makes it harder/impossible to >>> maintain the code over time as further fixes in this same area will NOT >>> apply properly at all. >>> >>> Think about what you want to be touching 5 years from now, a one-off >>> change that doesn't match the rest of the kernel tree, or something that >>> is the same? >> >> Sure, I'm totally on board with backporting all required patches. >> I thought it was not possible due to 100 line limit. >> >> This would be the minimum set of patches: >> >> Patch 1: >> drivers/accel/ivpu/ivpu_drv.c | 32 +++----------- >> drivers/accel/ivpu/ivpu_drv.h | 2 + >> drivers/accel/ivpu/ivpu_job.c | 78 ++++++++++++++++++++++++++------- >> drivers/accel/ivpu/ivpu_job.h | 1 + >> drivers/accel/ivpu/ivpu_mmu.c | 3 +- >> drivers/accel/ivpu/ivpu_sysfs.c | 5 ++- >> 6 files changed, 75 insertions(+), 46 deletions(-) >> >> Patch 2: >> drivers/accel/ivpu/ivpu_job.c | 15 ++++++--------- >> 1 file changed, 6 insertions(+), 9 deletions(-) >> >> Patch 3: >> drivers/accel/ivpu/ivpu_job.c | 2 +- >> drivers/accel/ivpu/ivpu_jsm_msg.c | 3 +- >> drivers/accel/ivpu/vpu_boot_api.h | 45 +++-- >> drivers/accel/ivpu/vpu_jsm_api.h | 303 +++++++++++++++++++++++++----- >> 4 files changed, 293 insertions(+), 60 deletions(-) >> >> Patch 4: >> drivers/accel/ivpu/ivpu_job.c | 27 ++++++++++++++++++++++++++- >> 1 file changed, 26 insertions(+), 1 deletion(-) >> >> First patch needs some changes to apply correctly to 6.12 but the rest of them apply pretty cleanly. >> Is this acceptable? > > Totally acceptable, that's trivial compared to many of the larger > backports we have taken over the years :) OK, I've sent two separate patchses for 6.12 and 6.14 that contain minimal number of patches. I've rebased only two patches in each patchsets and the rest is as-is from upstream. Let me know in case I messed something up. Regards, Jacek ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW
@ 2025-04-01 11:08 Jacek Lawrynowicz
0 siblings, 0 replies; 10+ messages in thread
From: Jacek Lawrynowicz @ 2025-04-01 11:08 UTC (permalink / raw)
To: stable; +Cc: Karol Wachowski, Jacek Lawrynowicz
From: Karol Wachowski <karol.wachowski@intel.com>
commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream.
Trigger recovery of the NPU upon receiving HW context violation from
the firmware. The context violation error is a fatal error that prevents
any subsequent jobs from being executed. Without this fix it is
necessary to reload the driver to restore the NPU operational state.
This is simplified version of upstream commit as the full implementation
would require all engine reset/resume logic to be backported.
Signed-off-by: Karol Wachowski <karol.wachowski@intel.com>
Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250107173238.381120-13-maciej.falkowski@linux.intel.com
Fixes: 0adff3b0ef12 ("accel/ivpu: Share NPU busy time in sysfs")
Cc: <stable@vger.kernel.org> # v6.11+
---
drivers/accel/ivpu/ivpu_job.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c
index be2e2bf0f43f0..70b3676974407 100644
--- a/drivers/accel/ivpu/ivpu_job.c
+++ b/drivers/accel/ivpu/ivpu_job.c
@@ -482,6 +482,8 @@ static struct ivpu_job *ivpu_job_remove_from_submitted_jobs(struct ivpu_device *
return job;
}
+#define VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW 0xEU
+
static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32 job_status)
{
struct ivpu_job *job;
@@ -490,6 +492,9 @@ static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32
if (!job)
return -ENOENT;
+ if (job_status == VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW)
+ ivpu_pm_trigger_recovery(vdev, "HW context violation");
+
if (job->file_priv->has_mmu_faults)
job_status = DRM_IVPU_JOB_STATUS_ABORTED;
--
2.45.1
^ permalink raw reply related [flat|nested] 10+ messages in threadend of thread, other threads:[~2025-04-30 12:50 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-04-08 9:57 [PATCH] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW Jacek Lawrynowicz 2025-04-10 7:49 ` Jacek Lawrynowicz 2025-04-10 8:03 ` Greg Kroah-Hartman 2025-04-10 9:27 ` Jacek Lawrynowicz 2025-04-10 15:53 ` Sasha Levin 2025-04-22 12:17 ` Greg KH 2025-04-24 10:22 ` Jacek Lawrynowicz 2025-04-24 10:34 ` Greg KH 2025-04-30 12:50 ` Jacek Lawrynowicz -- strict thread matches above, loose matches on Subject: below -- 2025-04-01 11:08 Jacek Lawrynowicz
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox