[PATCH] accel/rocket: Fix dma_fence reference leak on error paths in rocket_job

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] accel/rocket: Fix dma_fence reference leak on error paths in rocket_job_run()
@ 2026-06-08  8:38 ZhaoJinming
  2026-06-08  8:50 ` sashiko-bot
  0 siblings, 1 reply; 21+ messages in thread
From: ZhaoJinming @ 2026-06-08  8:38 UTC (permalink / raw)
  To: Tomeu Vizoso, Oded Gabbay
  Cc: Sumit Semwal, Christian König, Jeff Hugo, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, ZhaoJinming, stable

In rocket_job_run(), after creating a fence at line 305 and taking an
extra reference for job->done_fence at line 311 via dma_fence_get(),
two error paths fail to release that extra reference before returning:

  - Line 314: When pm_runtime_get_sync() fails, returns fence without
    putting job->done_fence.

  - Line 318: When iommu_attach_group() fails, returns fence without
    putting job->done_fence, and also omits pm_runtime_put() to balance
    the successful pm_runtime_get_sync() at line 313.

The leaked fence reference prevents the fence and its underlying rocket
device from being freed while job->done_fence still points to it.
Repeated failures will accumulate leaked fences, consuming kernel memory
and holding device resources indefinitely.

Fix both error paths by calling dma_fence_put(job->done_fence) and
setting job->done_fence to NULL before returning. Also add the missing
pm_runtime_put() on the iommu_attach_group error path.

Cc: stable@vger.kernel.org
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
---
 drivers/accel/rocket/rocket_job.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index ac51bff39833..66e4a1d57842 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -311,12 +311,19 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
 	job->done_fence = dma_fence_get(fence);
 
 	ret = pm_runtime_get_sync(core->dev);
-	if (ret < 0)
+	if (ret < 0) {
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
 		return fence;
+	}
 
 	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
-	if (ret < 0)
+	if (ret < 0) {
+		pm_runtime_put(core->dev);
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
 		return fence;
+	}
 
 	scoped_guard(mutex, &core->job_lock) {
 		core->in_flight_job = job;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] accel/rocket: Fix dma_fence reference leak on error paths in rocket_job_run()
  2026-06-08  8:38 [PATCH] accel/rocket: Fix dma_fence reference leak on error paths in rocket_job_run() ZhaoJinming
@ 2026-06-08  8:50 ` sashiko-bot
  2026-06-08 10:00   ` [PATCH v2] accel/rocket: Fix fence leak, runtime PM leak and WARN in rocket_job_run() error paths ZhaoJinming
  0 siblings, 1 reply; 21+ messages in thread
From: sashiko-bot @ 2026-06-08  8:50 UTC (permalink / raw)
  To: ZhaoJinming; +Cc: dri-devel, linux-media

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [High] The `pm_runtime_get_sync()` error path fails to decrement the runtime PM usage counter, permanently leaking the reference.
- [High] Returning an unsignaled dma_fence instead of an error pointer on job submission failure triggers a kernel WARN.
--

commit f0df2d483139ad216e9d43923cbf9543a8f4e81c
Author: ZhaoJinming <zhaojinming@uniontech.com>

accel/rocket: Fix dma_fence reference leak on error paths in rocket_job_run()

In rocket_job_run(), after creating a fence at line 305 and taking an
extra reference for job->done_fence at line 311 via dma_fence_get(),
two error paths fail to release that extra reference before returning.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index 2f1861f960ccb..fc10703534e37 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
> @@ -311,12 +311,19 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
>  	job->done_fence = dma_fence_get(fence);
>  
>  	ret = pm_runtime_get_sync(core->dev);
> -	if (ret < 0)
> +	if (ret < 0) {
> +		dma_fence_put(job->done_fence);
> +		job->done_fence = NULL;

[Severity: High]
This is a pre-existing issue, but does this error path permanently leak the
runtime PM usage counter? 

pm_runtime_get_sync() increments the device's usage counter before attempting
to resume the device. If the resume fails, it returns a negative error code
but the usage counter remains incremented. 

Without explicitly dropping this usage counter (e.g., using
pm_runtime_put_noidle()), could repeated failures prevent the NPU from ever
suspending again?

>  		return fence;

[Severity: High]
This is also a pre-existing issue, but will returning the valid, unsignaled
fence instead of ERR_PTR(ret) trigger a kernel WARN?

Since the hardware wasn't submitted, the fence refcount immediately hits 0
after the DRM scheduler drops its initial reference. dma_fence_release()
detects pending callbacks on an unsignaled fence and triggers a loud kernel
WARN(1, "Fence ... released with pending signals!\n"), forcibly signaling it
with -EDEADLK.

Should this path destroy the fence and return ERR_PTR(ret) so the scheduler
cleanly aborts the job?

> +	}
>  
>  	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
> -	if (ret < 0)
> +	if (ret < 0) {
> +		pm_runtime_put(core->dev);
> +		dma_fence_put(job->done_fence);
> +		job->done_fence = NULL;
>  		return fence;

[Severity: High]
This is a pre-existing issue, but similar to the above, does returning the
unsignaled fence here instead of ERR_PTR(ret) cause the same DRM scheduler
WARN when the fence is released?

> +	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608083824.775261-1-zhaojinming@uniontech.com?part=1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2] accel/rocket: Fix fence leak, runtime PM leak and WARN in rocket_job_run() error paths
  2026-06-08  8:50 ` sashiko-bot
@ 2026-06-08 10:00   ` ZhaoJinming
  2026-06-08 10:38     ` sashiko-bot
  0 siblings, 1 reply; 21+ messages in thread
From: ZhaoJinming @ 2026-06-08 10:00 UTC (permalink / raw)
  To: Tomeu Vizoso, Oded Gabbay
  Cc: Sumit Semwal, Christian König, Jeff Hugo, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, ZhaoJinming, stable

Three issues exist in the error paths of rocket_job_run():

1) dma_fence reference leak: After creating a fence and taking an extra
   reference for job->done_fence via dma_fence_get(), the error paths
   return without releasing the extra reference held by job->done_fence.
   The leaked reference prevents the fence from being freed, causing
   resource accumulation on repeated failures.

2) pm_runtime_get_sync() usage counter leak: pm_runtime_get_sync()
   increments the runtime PM usage counter before attempting to resume
   the device. If the resume fails and returns an error, the usage
   counter remains incremented. The original error path does not call
   pm_runtime_put_noidle() to balance it. Repeated failures will
   permanently prevent the NPU from entering suspend.

3) Unsignaled fence returned on failure: The error paths return a valid
   but unsignaled dma_fence to the DRM scheduler. Since the hardware
   was never submitted, the fence is never signaled. When the scheduler
   eventually drops its reference, dma_fence_release() detects the
   unsignaled fence and triggers:
     WARN(1, "Fence ... released with pending signals!")
   and forcibly signals it with -EDEADLK.

Fix all three issues by:

- Replace pm_runtime_get_sync() with pm_runtime_resume_and_get(), which
  automatically decrements the usage counter on failure, eliminating
  the need for a manual pm_runtime_put_noidle() call and avoiding the
  usage counter leak. The pm_runtime_get_sync() documentation itself
  recommends pm_runtime_resume_and_get() as the preferred alternative
  when the return value is checked by the caller.

- Release both fence references (job->done_fence and the local fence)
  before returning ERR_PTR(ret) so the DRM scheduler cleanly aborts
  the job without triggering the unsignaled fence WARN.

- Add pm_runtime_put() on the iommu_attach_group error path to release
  the runtime PM reference that was successfully acquired.

Cc: stable@vger.kernel.org
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
---
 drivers/accel/rocket/rocket_job.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index ac51bff39833..e8a073e22ac2 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
 		dma_fence_put(job->done_fence);
 	job->done_fence = dma_fence_get(fence);
 
-	ret = pm_runtime_get_sync(core->dev);
-	if (ret < 0)
-		return fence;
+	ret = pm_runtime_resume_and_get(core->dev);
+	if (ret < 0) {
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
-	if (ret < 0)
-		return fence;
+	if (ret < 0) {
+		pm_runtime_put(core->dev);
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	scoped_guard(mutex, &core->job_lock) {
 		core->in_flight_job = job;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] accel/rocket: Fix fence leak, runtime PM leak and WARN in rocket_job_run() error paths
  2026-06-08 10:00   ` [PATCH v2] accel/rocket: Fix fence leak, runtime PM leak and WARN in rocket_job_run() error paths ZhaoJinming
@ 2026-06-08 10:38     ` sashiko-bot
  2026-06-09  9:33       ` [PATCH v3 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
  0 siblings, 1 reply; 21+ messages in thread
From: sashiko-bot @ 2026-06-08 10:38 UTC (permalink / raw)
  To: ZhaoJinming; +Cc: linux-media, dri-devel

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [High] IOMMU group reference leak in rocket_job_handle_irq()
- [High] Unsafe hardware register access in shared IRQ handler while runtime suspended
--

commit fca53bbe333680ce9d4e0368b39e684506994081
Author: ZhaoJinming <zhaojinming@uniontech.com>
Subject: accel/rocket: Fix fence leak, runtime PM leak and WARN in rocket_job_run() error paths

This commit fixes multiple resource leaks in the error paths of rocket_job_run().
It replaces pm_runtime_get_sync() with pm_runtime_resume_and_get() to handle
the usage counter, correctly releases dma_fence references on failure, and
adds pm_runtime_put() during IOMMU attachment failures.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index 2f1861f960ccb..cf9f61682dd8b 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c

[ ... ]

> @@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
>  		dma_fence_put(job->done_fence);
>  	job->done_fence = dma_fence_get(fence);
>  
> -	ret = pm_runtime_get_sync(core->dev);
> -	if (ret < 0)
> -		return fence;
> +	ret = pm_runtime_resume_and_get(core->dev);
> +	if (ret < 0) {
> +		dma_fence_put(job->done_fence);
> +		job->done_fence = NULL;
> +		dma_fence_put(fence);
> +		return ERR_PTR(ret);
> +	}
>  
>  	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
> -	if (ret < 0)
> -		return fence;
> +	if (ret < 0) {
> +		pm_runtime_put(core->dev);
> +		dma_fence_put(job->done_fence);
> +		job->done_fence = NULL;
> +		dma_fence_put(fence);
> +		return ERR_PTR(ret);
> +	}
>  
>  	scoped_guard(mutex, &core->job_lock) {
>  		core->in_flight_job = job;

[ ... ]

>  static void rocket_job_handle_irq(struct rocket_core *core)
>  {
>  	pm_runtime_mark_last_busy(core->dev);
>  
>  	rocket_pc_writel(core, OPERATION_ENABLE, 0x0);
>  	rocket_pc_writel(core, INTERRUPT_CLEAR, 0x1ffff);
>  
>  	scoped_guard(mutex, &core->job_lock)
>  		if (core->in_flight_job) {
>  			if (core->in_flight_job->next_task_idx < core->in_flight_job->task_count) {
>  				rocket_job_hw_submit(core, core->in_flight_job);
>  				return;
>  			}
>  
>  			iommu_detach_group(NULL, iommu_group_get(core->dev));

[Severity: High]
This isn't a bug introduced by this patch, but does this leak an IOMMU
group reference?

It looks like iommu_group_get() increments the kobject reference count for the
group. However, iommu_detach_group() doesn't consume this reference, and there
is no subsequent iommu_group_put() call.

Since this runs on every completed job, could this eventually overflow the
reference counter and prevent the group from being freed?

[ ... ]

>  static irqreturn_t rocket_job_irq_handler(int irq, void *data)
>  {
>  	struct rocket_core *core = data;
>  	u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);

[Severity: High]
This is a pre-existing issue, but is it safe to access hardware registers
here without first checking if the device is active?

Since the driver registers this as a shared interrupt handler, it could be
invoked if another device on the same IRQ line triggers an interrupt while the
NPU device is runtime suspended.

Reading from a powered-down device typically returns 0xffffffff, which would
spuriously trigger the WARN_ON macros below and falsely return 
IRQ_WAKE_THREAD.

Should this use pm_runtime_get_if_active() to verify the device is powered
before reading INTERRUPT_RAW_STATUS?

>  
>  	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);
>  	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_WRITE_ERROR);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608100040.985526-1-zhaojinming@uniontech.com?part=1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 1/2] accel/rocket: Fix error path handling in rocket_job_run()
  2026-06-08 10:38     ` sashiko-bot
@ 2026-06-09  9:33       ` ZhaoJinming
  2026-06-09  9:33         ` [PATCH v3 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
  2026-06-09  9:50         ` [PATCH v3 " sashiko-bot
  0 siblings, 2 replies; 21+ messages in thread
From: ZhaoJinming @ 2026-06-09  9:33 UTC (permalink / raw)
  To: Tomeu Vizoso, Oded Gabbay
  Cc: Sumit Semwal, Christian König, Jeff Hugo, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, ZhaoJinming, stable

In rocket_job_run(), after taking an extra fence reference for
job->done_fence via dma_fence_get(), the error paths have three bugs:

- The dma_fence reference held by job->done_fence is never released,
  causing a reference leak.
- pm_runtime_get_sync() increments the usage counter even on failure,
  but the error path does not decrement it, leaking the runtime PM
  reference and preventing the NPU from suspending.
- A valid but unsignaled fence is returned to the DRM scheduler,
  which triggers WARN("Fence ... released with pending signals!")
  when the scheduler drops its reference.

Fix by replacing pm_runtime_get_sync() with pm_runtime_resume_and_get()
which auto-balances the usage counter on failure, releasing both fence
references on error, and returning ERR_PTR(ret) instead of the
unsignaled fence.

Cc: stable@vger.kernel.org
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
---
 drivers/accel/rocket/rocket_job.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index ac51bff39833..e8a073e22ac2 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
 		dma_fence_put(job->done_fence);
 	job->done_fence = dma_fence_get(fence);
 
-	ret = pm_runtime_get_sync(core->dev);
-	if (ret < 0)
-		return fence;
+	ret = pm_runtime_resume_and_get(core->dev);
+	if (ret < 0) {
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
-	if (ret < 0)
-		return fence;
+	if (ret < 0) {
+		pm_runtime_put(core->dev);
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	scoped_guard(mutex, &core->job_lock) {
 		core->in_flight_job = job;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access
  2026-06-09  9:33       ` [PATCH v3 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
@ 2026-06-09  9:33         ` ZhaoJinming
  2026-06-09  9:47           ` sashiko-bot
  2026-06-09  9:50         ` [PATCH v3 " sashiko-bot
  1 sibling, 1 reply; 21+ messages in thread
From: ZhaoJinming @ 2026-06-09  9:33 UTC (permalink / raw)
  To: Tomeu Vizoso, Oded Gabbay
  Cc: Sumit Semwal, Christian König, Jeff Hugo, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, ZhaoJinming, stable

Two bugs in the IRQ handling path:

1) iommu_group reference leak in rocket_job_handle_irq():
   iommu_group_get() increments the reference count but the returned
   pointer is passed directly to iommu_detach_group() which does not
   consume it. Since this runs on every completed job, the reference
   count accumulates and prevents the group from being freed. Use
   core->iommu_group instead, consistent with rocket_reset().

2) Unsafe hardware register access in shared IRQ handler:
   rocket_job_irq_handler() is registered with IRQF_SHARED but accesses
   hardware registers without checking runtime PM status. If another
   device on the same IRQ line triggers an interrupt while the NPU is
   suspended, register reads return 0xffffffff, spuriously triggering
   WARN_ON macros and falsely returning IRQ_WAKE_THREAD.

   Replace with pm_runtime_get_if_active(), which atomically checks the
   device status under spinlock and increments the usage count to keep
   the device active during register access. Balance the extra reference:
   - If the interrupt is not for this device (raw_status does not match
     DPU_0/DPU_1), put in the hardirq handler before returning IRQ_NONE.
   - If the interrupt is for this device, the thread handler puts after
     completing register writes.

Cc: stable@vger.kernel.org
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
---
 drivers/accel/rocket/rocket_job.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index e8a073e22ac2..30c677952b64 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -349,7 +349,7 @@ static void rocket_job_handle_irq(struct rocket_core *core)
 				return;
 			}
 
-			iommu_detach_group(NULL, iommu_group_get(core->dev));
+			iommu_detach_group(NULL, core->iommu_group);
 			dma_fence_signal(core->in_flight_job->done_fence);
 			pm_runtime_put_autosuspend(core->dev);
 			core->in_flight_job = NULL;
@@ -421,6 +421,7 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
 	struct rocket_core *core = data;
 
 	rocket_job_handle_irq(core);
+	pm_runtime_put(core->dev);
 
 	return IRQ_HANDLED;
 }
@@ -428,14 +429,20 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
 static irqreturn_t rocket_job_irq_handler(int irq, void *data)
 {
 	struct rocket_core *core = data;
+
+	if (!pm_runtime_get_if_active(core->dev))
+		return IRQ_NONE;
+
 	u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
 
 	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);
 	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_WRITE_ERROR);
 
 	if (!(raw_status & PC_INTERRUPT_RAW_STATUS_DPU_0 ||
-	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1))
+	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1)) {
+		pm_runtime_put(core->dev);
 		return IRQ_NONE;
+	}
 
 	rocket_pc_writel(core, INTERRUPT_MASK, 0x0);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access
  2026-06-09  9:33         ` [PATCH v3 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
@ 2026-06-09  9:47           ` sashiko-bot
  2026-06-10  3:03             ` [PATCH v4 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
  0 siblings, 1 reply; 21+ messages in thread
From: sashiko-bot @ 2026-06-09  9:47 UTC (permalink / raw)
  To: ZhaoJinming; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:

New issues:
- [High] Incorrect error handling of pm_runtime_get_if_active() return value causes unbalanced pm_runtime_put() and unsafe hardware access.
- [High] PM usage count leak due to threaded IRQ coalescing on shared interrupts.

Pre-existing issues:
- [High] Missing synchronize_irq() in the runtime suspend callback for a device with a shared IRQ handler.
- [High] Attacker-controlled DMA faults leave the device interrupt asserted without clearing it, causing an interrupt storm and kernel log spam.
--

commit 57f30fe728d7f765520f0d3ccb419368e6f945d0
Author: ZhaoJinming <zhaojinming@uniontech.com>

accel/rocket: Fix iommu_group leak and unsafe IRQ register access

This commit fixes an iommu_group reference leak in rocket_job_handle_irq()
by using the stored iommu_group instead of getting a new reference. It also
attempts to fix unsafe hardware register accesses in the shared IRQ handler
by adding pm_runtime_get_if_active() checks to ensure the device is awake.

> 2) Unsafe hardware register access in shared IRQ handler:
>    rocket_job_irq_handler() is registered with IRQF_SHARED but accesses
>    hardware registers without checking runtime PM status.

[Severity: High]
This is a pre-existing issue, but since the device uses a shared IRQ handler,
does the runtime suspend callback (rocket_device_runtime_suspend() in
rocket_drv.c) need to call synchronize_irq() before disabling clocks? 

Without it, the IRQ handler could execute mid-flight while the hardware is
turned off, leading to invalid register accesses and spurious interrupts.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index cf9f61682dd8b..8eaa3d427a51d 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
[ ... ]
> @@ -428,14 +429,20 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
>  static irqreturn_t rocket_job_irq_handler(int irq, void *data)
>  {
>  	struct rocket_core *core = data;
> +
> +	if (!pm_runtime_get_if_active(core->dev))
> +		return IRQ_NONE;
> +

[Severity: High]
Does this correctly handle negative error codes? 

pm_runtime_get_if_active() returns negative error codes (e.g., -EINVAL) when
runtime PM is disabled. The condition here only traps a return value of 0.
When it returns a negative error code, the condition evaluates to false,
allowing execution to proceed to read hardware registers on a potentially
suspended device.

Because the PM usage counter is not incremented on error, wouldn't the
subsequent unconditional pm_runtime_put() calls cause a PM usage count
underflow?

>  	u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
>  
>  	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);
>  	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_WRITE_ERROR);
>  

[Severity: High]
This is a pre-existing issue, but if a DMA error occurs and the DPU_0 and
DPU_1 bits are not set, the handler returns IRQ_NONE without clearing the
interrupt.

Since this is a shared level-triggered IRQ line, won't this leave the
interrupt asserted and cause an interrupt storm, permanently disabling the
shared IRQ line?

>  	if (!(raw_status & PC_INTERRUPT_RAW_STATUS_DPU_0 ||
> -	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1))
> +	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1)) {
> +		pm_runtime_put(core->dev);
>  		return IRQ_NONE;
> +	}
>  
>  	rocket_pc_writel(core, INTERRUPT_MASK, 0x0);
>  

[Severity: High]
Does this leak the PM usage count if the IRQ subsystem coalesces the threaded
interrupt?

If another device triggers an interrupt on this shared line while
rocket_job_irq_handler_thread() is already pending, this hardirq handler
will run again. It will increment the PM usage count via
pm_runtime_get_if_active() and eventually return IRQ_WAKE_THREAD.

Because the thread is already pending, the IRQ core coalesces the request
and the thread only runs once. Doesn't this mean pm_runtime_put() is only
called once, permanently leaking the extra PM reference acquired here?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260609093346.380396-1-zhaojinming@uniontech.com?part=2

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] accel/rocket: Fix error path handling in rocket_job_run()
  2026-06-09  9:33       ` [PATCH v3 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
  2026-06-09  9:33         ` [PATCH v3 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
@ 2026-06-09  9:50         ` sashiko-bot
  1 sibling, 0 replies; 21+ messages in thread
From: sashiko-bot @ 2026-06-09  9:50 UTC (permalink / raw)
  To: ZhaoJinming; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:

Pre-existing issues:
- [High] Calling `drm_sched_job_cleanup()` after `drm_sched_job_arm()` violates the DRM scheduler API contract.
- [High] The runtime suspend callback fails to synchronize the shared IRQ before powering down.
- [High] Potential NULL pointer dereference in `iommu_attach_group()` if the device lacks an IOMMU group.
--

commit a91812825621124e6ae3a091adc6205cbeefed80
Author: ZhaoJinming <zhaojinming@uniontech.com>

accel/rocket: Fix error path handling in rocket_job_run()

This commit fixes multiple issues in the error paths of rocket_job_run(). It
resolves a dma_fence reference leak, correctly balances the runtime PM
usage counter on failure by switching to pm_runtime_resume_and_get(), and
returns ERR_PTR(ret) instead of an unsignaled fence to avoid DRM scheduler
warnings.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index 2f1861f960ccb..cf9f61682dd8b 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
> @@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)

[ ... ]

>  	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
> -	if (ret < 0)
> -		return fence;
> +	if (ret < 0) {
> +		pm_runtime_put(core->dev);
> +		dma_fence_put(job->done_fence);
> +		job->done_fence = NULL;
> +		dma_fence_put(fence);
> +		return ERR_PTR(ret);
> +	}

[Severity: High]
This is a pre-existing issue, but can core->iommu_group be NULL here?

If the device does not have an IOMMU group (e.g., IOMMU is disabled or not
present in the device tree), iommu_group_get() returns NULL. Calling
iommu_attach_group() unconditionally with a NULL group parameter will cause a
kernel crash when it attempts to dereference the pointer.

Should there be a check for a valid iommu_group before attaching?

[ ... ]

I also noticed a couple of pre-existing issues in this driver:

[Severity: High]
This isn't a bug introduced by this patch, but is it safe to call
drm_sched_job_cleanup() after drm_sched_job_arm() has already been called?

In drivers/accel/rocket/rocket_job.c:rocket_job_push(), if
rocket_acquire_object_fences() fails:

	drm_sched_job_arm(&job->base);
	...
	ret = rocket_acquire_object_fences(job->in_bos, job->in_bo_count,
					   &job->base, false);
	if (ret)
		goto err_unlock;

The error path returns to rocket_ioctl_submit_job(), which then calls:

out_cleanup_job:
	if (ret)
		drm_sched_job_cleanup(&rjob->base);

The DRM scheduler API contract states that drm_sched_job_arm() is a point of
no return that initializes fences and sequence numbers. Calling
drm_sched_job_cleanup() after this breaks the dma_fence sequence numbers,
potentially causing any context waiting on the aborted sequence number to hang
indefinitely.

[Severity: High]
This is a pre-existing issue, but does the runtime suspend callback need to
synchronize the shared IRQ before powering down?

In drivers/accel/rocket/rocket_drv.c:rocket_device_runtime_suspend():

	clk_bulk_disable_unprepare(ARRAY_SIZE(rdev->cores[core].clks),
				   rdev->cores[core].clks);

The Rocket driver registers its IRQ handler with IRQF_SHARED. According to
the power management subsystem rules, drivers using shared IRQs must call
synchronize_irq() in their runtime suspend callback before disabling clocks.

Without this, could the hardware be powered off while the IRQ handler is
mid-flight and reading registers, potentially leading to system hangs or
spurious wakeups?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260609093346.380396-1-zhaojinming@uniontech.com?part=1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v4 1/2] accel/rocket: Fix error path handling in rocket_job_run()
  2026-06-09  9:47           ` sashiko-bot
@ 2026-06-10  3:03             ` ZhaoJinming
  2026-06-10  3:03               ` [PATCH v4 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
  2026-06-10  3:19               ` [PATCH v4 " sashiko-bot
  0 siblings, 2 replies; 21+ messages in thread
From: ZhaoJinming @ 2026-06-10  3:03 UTC (permalink / raw)
  To: Tomeu Vizoso, Oded Gabbay
  Cc: Sumit Semwal, Christian König, Jeff Hugo, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, ZhaoJinming, stable

In rocket_job_run(), after taking an extra fence reference for
job->done_fence via dma_fence_get(), the error paths have three bugs:

- The dma_fence reference held by job->done_fence is never released,
  causing a reference leak.
- pm_runtime_get_sync() increments the usage counter even on failure,
  but the error path does not decrement it, leaking the runtime PM
  reference and preventing the NPU from suspending.
- A valid but unsignaled fence is returned to the DRM scheduler,
  which triggers WARN("Fence ... released with pending signals!")
  when the scheduler drops its reference.

Fix by replacing pm_runtime_get_sync() with pm_runtime_resume_and_get()
which auto-balances the usage counter on failure, releasing both fence
references on error, and returning ERR_PTR(ret) instead of the
unsignaled fence.

Cc: stable@vger.kernel.org
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
---
 drivers/accel/rocket/rocket_job.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index ac51bff39833..e8a073e22ac2 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
 		dma_fence_put(job->done_fence);
 	job->done_fence = dma_fence_get(fence);
 
-	ret = pm_runtime_get_sync(core->dev);
-	if (ret < 0)
-		return fence;
+	ret = pm_runtime_resume_and_get(core->dev);
+	if (ret < 0) {
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
-	if (ret < 0)
-		return fence;
+	if (ret < 0) {
+		pm_runtime_put(core->dev);
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	scoped_guard(mutex, &core->job_lock) {
 		core->in_flight_job = job;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access
  2026-06-10  3:03             ` [PATCH v4 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
@ 2026-06-10  3:03               ` ZhaoJinming
  2026-06-10  3:14                 ` sashiko-bot
  2026-06-10  3:19               ` [PATCH v4 " sashiko-bot
  1 sibling, 1 reply; 21+ messages in thread
From: ZhaoJinming @ 2026-06-10  3:03 UTC (permalink / raw)
  To: Tomeu Vizoso, Oded Gabbay
  Cc: Sumit Semwal, Christian König, Jeff Hugo, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, ZhaoJinming, stable

Two bugs in the IRQ handling path:

1) iommu_group reference leak in rocket_job_handle_irq():
   iommu_group_get() increments the reference count but the returned
   pointer is passed directly to iommu_detach_group() which does not
   consume it. Since this runs on every completed job, the reference
   count accumulates and prevents the group from being freed. Use
   core->iommu_group instead, consistent with rocket_reset().

2) Unsafe hardware register access in shared IRQ handler:
   rocket_job_irq_handler() is registered with IRQF_SHARED but accesses
   hardware registers without checking runtime PM status. If another
   device on the same IRQ line triggers an interrupt while the NPU is
   suspended, register reads return 0xffffffff, spuriously triggering
   WARN_ON macros and falsely returning IRQ_WAKE_THREAD.

   Add pm_runtime_get_if_active() in the hardirq handler to atomically
   verify the device is active before accessing registers. Each handler
   (hardirq and threaded) independently acquires and releases its own
   runtime PM reference to avoid coalescing-related leaks when the
   IRQ core coalesces multiple wakeups into a single thread execution.

Cc: stable@vger.kernel.org
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
---
 drivers/accel/rocket/rocket_job.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index e8a073e22ac2..f05043619169 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -349,7 +349,7 @@ static void rocket_job_handle_irq(struct rocket_core *core)
 				return;
 			}

-			iommu_detach_group(NULL, iommu_group_get(core->dev));
+			iommu_detach_group(NULL, core->iommu_group);
 			dma_fence_signal(core->in_flight_job->done_fence);
 			pm_runtime_put_autosuspend(core->dev);
 			core->in_flight_job = NULL;
@@ -420,7 +420,10 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
 {
 	struct rocket_core *core = data;

-	rocket_job_handle_irq(core);
+	if (pm_runtime_get_if_active(core->dev)) {
+		rocket_job_handle_irq(core);
+		pm_runtime_put(core->dev);
+	}

 	return IRQ_HANDLED;
 }
@@ -428,16 +431,24 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
 static irqreturn_t rocket_job_irq_handler(int irq, void *data)
 {
 	struct rocket_core *core = data;
-	u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
+	u32 raw_status;
+
+	if (pm_runtime_get_if_active(core->dev) != 1)
+		return IRQ_NONE;
+
+	raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);

 	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);
 	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_WRITE_ERROR);

 	if (!(raw_status & PC_INTERRUPT_RAW_STATUS_DPU_0 ||
-	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1))
+	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1)) {
+		pm_runtime_put(core->dev);
 		return IRQ_NONE;
+	}

 	rocket_pc_writel(core, INTERRUPT_MASK, 0x0);
+	pm_runtime_put(core->dev);

 	return IRQ_WAKE_THREAD;
 }
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access
  2026-06-10  3:03               ` [PATCH v4 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
@ 2026-06-10  3:14                 ` sashiko-bot
  2026-06-10  3:25                   ` 赵金明
  2026-06-10  6:01                   ` [PATCH v5 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
  0 siblings, 2 replies; 21+ messages in thread
From: sashiko-bot @ 2026-06-10  3:14 UTC (permalink / raw)
  To: ZhaoJinming; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [High] Runtime PM usage counter underflow due to incorrect error handling of `pm_runtime_get_if_active()` in threaded IRQ handler.

Pre-existing issues:
- [High] Returning `IRQ_NONE` for valid DMA error interrupts without clearing the interrupt status causes a hardirq storm.
--

commit 5d8993f6abe579de7eec7f4c98365e11dd141bf8
Author: ZhaoJinming <zhaojinming@uniontech.com>

accel/rocket: Fix iommu_group leak and unsafe IRQ register access

This commit fixes an iommu_group reference leak in the IRQ handling path
and introduces runtime PM checks in the IRQ handlers to prevent unsafe
hardware register access when the device is suspended.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index cf9f61682dd8b..6ac79b8517319 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
[ ... ]
> @@ -420,7 +420,10 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
>  {
>  	struct rocket_core *core = data;
>  
> -	rocket_job_handle_irq(core);
> +	if (pm_runtime_get_if_active(core->dev)) {
> +		rocket_job_handle_irq(core);
> +		pm_runtime_put(core->dev);
> +	}
>  
>  	return IRQ_HANDLED;
>  }

[Severity: High]
Does this evaluation handle negative error codes correctly?

pm_runtime_get_if_active() returns 1 if the device is active, 0 if inactive,
and a negative error code (such as -EINVAL) if runtime PM is disabled. Because
any non-zero value evaluates to true in C, a negative return value will satisfy
this condition.

If that happens, the code proceeds to call pm_runtime_put(). Since the initial
get call failed and did not increment the usage counter, wouldn't this
unconditionally decrement the counter and lead to an underflow?

Could this explicitly check for success, such as evaluating if the return value
is strictly equal to 1?

> @@ -428,16 +431,24 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
>  static irqreturn_t rocket_job_irq_handler(int irq, void *data)
>  {
>  	struct rocket_core *core = data;
> -	u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
> +	u32 raw_status;
> +
> +	if (pm_runtime_get_if_active(core->dev) != 1)
> +		return IRQ_NONE;
> +
> +	raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
>  
>  	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);
>  	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_WRITE_ERROR);
>  
>  	if (!(raw_status & PC_INTERRUPT_RAW_STATUS_DPU_0 ||
> -	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1))
> +	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1)) {
> +		pm_runtime_put(core->dev);
>  		return IRQ_NONE;
> +	}
>  
>  	rocket_pc_writel(core, INTERRUPT_MASK, 0x0);
> +	pm_runtime_put(core->dev);
>  
>  	return IRQ_WAKE_THREAD;
>  }

[Severity: High]
This isn't a bug introduced by this patch, but will this cause a hardirq storm
if a DMA error occurs without a DPU completion?

If an interrupt is raised due to a DMA read or write error, the code logs a
warning via WARN_ON. However, if the DPU completion bits are not set, the
handler returns IRQ_NONE without clearing, acknowledging, or masking the
pending interrupt.

Since the interrupt remains asserted at the hardware level, won't the kernel's
IRQ core immediately re-invoke the handler in an infinite loop, hanging the CPU
until the spurious interrupt detector permanently disables the shared IRQ line?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260610030322.2802744-1-zhaojinming@uniontech.com?part=2

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/2] accel/rocket: Fix error path handling in rocket_job_run()
  2026-06-10  3:03             ` [PATCH v4 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
  2026-06-10  3:03               ` [PATCH v4 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
@ 2026-06-10  3:19               ` sashiko-bot
  1 sibling, 0 replies; 21+ messages in thread
From: sashiko-bot @ 2026-06-10  3:19 UTC (permalink / raw)
  To: ZhaoJinming; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Missing `synchronize_irq()` in the runtime suspend callback for a device using shared IRQs.
--

commit badac0ffd08d84278aad7a9068c91e38322fa44f
Author: ZhaoJinming <zhaojinming@uniontech.com>

accel/rocket: Fix error path handling in rocket_job_run()

This commit fixes bugs in the error paths of rocket_job_run() by replacing
pm_runtime_get_sync() with pm_runtime_resume_and_get() to prevent leaking
the runtime PM reference. It also properly releases fence references on
error and returns ERR_PTR instead of an unsignaled fence.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index 2f1861f960ccb..cf9f61682dd8b 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
[ ... ]
> @@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
>  		dma_fence_put(job->done_fence);
>  	job->done_fence = dma_fence_get(fence);
>  
> -	ret = pm_runtime_get_sync(core->dev);
> -	if (ret < 0)
> -		return fence;
> +	ret = pm_runtime_resume_and_get(core->dev);

[Severity: High]
This isn't a bug introduced by this patch, but since we are modifying the
runtime PM handling here, I noticed a regression in the runtime suspend path:

drivers/accel/rocket/rocket_drv.c:rocket_device_runtime_suspend() {
    ...
	if (!rocket_job_is_idle(&rdev->cores[core]))
		return -EBUSY;

	clk_bulk_disable_unprepare(ARRAY_SIZE(rdev->cores[core].clks),
				   rdev->cores[core].clks);

	return 0;
}

Since the device's IRQ is registered with the IRQF_SHARED flag, should
synchronize_irq() be called before gating the device clocks?

Without synchronization, could a concurrently running IRQ handler on another
CPU be halfway through executing when the clocks are abruptly disabled,
potentially causing an external abort when the handler attempts its next
MMIO access?

I saw that the IRQ handler was updated in a later commit ("accel/rocket: Fix
iommu_group leak and unsafe IRQ register access") to use
pm_runtime_get_if_active(), but doesn't the PM subsystem still mandate this
synchronization pattern for all drivers using IRQF_SHARED to guarantee no
handler is executing mid-flight during power down?

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260610030322.2802744-1-zhaojinming@uniontech.com?part=1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Re: [PATCH v4 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access
  2026-06-10  3:14                 ` sashiko-bot
@ 2026-06-10  3:25                   ` 赵金明
  2026-06-10  6:01                   ` [PATCH v5 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
  1 sibling, 0 replies; 21+ messages in thread
From: 赵金明 @ 2026-06-10  3:25 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: dri-devel





>Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:



>



>New issues:



>- [High] Runtime PM usage counter underflow due to incorrect error handling of `pm_runtime_get_if_active()` in threaded IRQ handler.



>
commits：

Regarding the -EINVAL case in the threaded IRQ handler's pm_runtime_get_if_active() call at line 423:

While it is true that a non-zero return value (including -EINVAL) would satisfy the if condition, the concern about underflow does not apply in practice. This is because pm_runtime_get_if_active() returns -EINVAL only when disable_depth > 0 (see runtime.c:1234), which means pm_runtime_disable() has been called on the device. In this driver, pm_runtime_disable() is only called in rocket_core_fini() (rocket_core.c:100), after the IRQ handler has been torn down by rocket_job_fini(). So the threaded IRQ handler can never execute while PM is disabled ― the two states are mutually exclusive in the driver's lifecycle.

Furthermore, even if this theoretical path were somehow reached, the kernel's runtime PM subsystem has an explicit underflow guard in rpm_drop_usage_count() (runtime.c:1079-1096): it atomically decrements the usage count, and if the result is negative, it immediately increments it back and emits a dev_warn. The usage count is never allowed to go negative.

That said, the hardirq handler at line 437 already uses != 1 for this exact reason. For consistency and defense-in-depth, changing the thread handler to use != 1 as well is a reasonable improvement and can be done in a follow-up if desired.



>Pre-existing issues:



>- [High] Returning `IRQ_NONE` for valid DMA error interrupts without clearing the interrupt status causes a hardirq storm.



>--



>



>commit 5d8993f6abe579de7eec7f4c98365e11dd141bf8



>Author: ZhaoJinming <zhaojinming@uniontech.com>



>



>accel/rocket: Fix iommu_group leak and unsafe IRQ register access



>



>This commit fixes an iommu_group reference leak in the IRQ handling path



>and introduces runtime PM checks in the IRQ handlers to prevent unsafe



>hardware register access when the device is suspended.



>



>> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c



>> index cf9f61682dd8b..6ac79b8517319 100644



>> --- a/drivers/accel/rocket/rocket_job.c



>> +++ b/drivers/accel/rocket/rocket_job.c



>[ ... ]



>> @@ -420,7 +420,10 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)



>>? {



>>? 	struct rocket_core *core = data;



>>? 



>> -	rocket_job_handle_irq(core);



>> +	if (pm_runtime_get_if_active(core->dev)) {



>> +		rocket_job_handle_irq(core);



>> +		pm_runtime_put(core->dev);



>> +	}



>>? 



>>? 	return IRQ_HANDLED;



>>? }



>



>[Severity: High]



>Does this evaluation handle negative error codes correctly?



>



>pm_runtime_get_if_active() returns 1 if the device is active, 0 if inactive,



>and a negative error code (such as -EINVAL) if runtime PM is disabled. Because



>any non-zero value evaluates to true in C, a negative return value will satisfy



>this condition.



>



>If that happens, the code proceeds to call pm_runtime_put(). Since the initial



>get call failed and did not increment the usage counter, wouldn't this



>unconditionally decrement the counter and lead to an underflow?



>



>Could this explicitly check for success, such as evaluating if the return value



>is strictly equal to 1?



>



>> @@ -428,16 +431,24 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)



>>? static irqreturn_t rocket_job_irq_handler(int irq, void *data)



>>? {



>>? 	struct rocket_core *core = data;



>> -	u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);



>> +	u32 raw_status;



>> +



>> +	if (pm_runtime_get_if_active(core->dev) != 1)



>> +		return IRQ_NONE;



>> +



>> +	raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);



>>? 



>>? 	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);



>>? 	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_WRITE_ERROR);



>>? 



>>? 	if (!(raw_status & PC_INTERRUPT_RAW_STATUS_DPU_0 ||



>> -	????? raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1))



>> +	????? raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1)) {



>> +		pm_runtime_put(core->dev);



>>? 		return IRQ_NONE;



>> +	}



>>? 



>>? 	rocket_pc_writel(core, INTERRUPT_MASK, 0x0);



>> +	pm_runtime_put(core->dev);



>>? 



>>? 	return IRQ_WAKE_THREAD;



>>? }



>



>[Severity: High]



>This isn't a bug introduced by this patch, but will this cause a hardirq storm



>if a DMA error occurs without a DPU completion?



>



>If an interrupt is raised due to a DMA read or write error, the code logs a



>warning via WARN_ON. However, if the DPU completion bits are not set, the



>handler returns IRQ_NONE without clearing, acknowledging, or masking the



>pending interrupt.



>



>Since the interrupt remains asserted at the hardware level, won't the kernel's



>IRQ core immediately re-invoke the handler in an infinite loop, hanging the CPU



>until the spurious interrupt detector permanently disables the shared IRQ line?



>



>-- 



>Sashiko AI review ・ https://sashiko.dev/#/patchset/20260610030322.2802744-1-zhaojinming@uniontech.com?part=2



>



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v5 1/2] accel/rocket: Fix error path handling in rocket_job_run()
  2026-06-10  3:14                 ` sashiko-bot
  2026-06-10  3:25                   ` 赵金明
@ 2026-06-10  6:01                   ` ZhaoJinming
  2026-06-10  6:01                     ` [PATCH v5 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
  1 sibling, 1 reply; 21+ messages in thread
From: ZhaoJinming @ 2026-06-10  6:01 UTC (permalink / raw)
  To: Tomeu Vizoso, Oded Gabbay
  Cc: Sumit Semwal, Christian König, Jeff Hugo, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, ZhaoJinming, stable

In rocket_job_run(), after taking an extra fence reference for
job->done_fence via dma_fence_get(), the error paths have three bugs:

- The dma_fence reference held by job->done_fence is never released,
  causing a reference leak.
- pm_runtime_get_sync() increments the usage counter even on failure,
  but the error path does not decrement it, leaking the runtime PM
  reference and preventing the NPU from suspending.
- A valid but unsignaled fence is returned to the DRM scheduler,
  which triggers WARN("Fence ... released with pending signals!")
  when the scheduler drops its reference.

Fix by replacing pm_runtime_get_sync() with pm_runtime_resume_and_get()
which auto-balances the usage counter on failure, releasing both fence
references on error, and returning ERR_PTR(ret) instead of the
unsignaled fence.

Cc: stable@vger.kernel.org
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
---
 drivers/accel/rocket/rocket_job.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index ac51bff39833..e8a073e22ac2 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
 		dma_fence_put(job->done_fence);
 	job->done_fence = dma_fence_get(fence);
 
-	ret = pm_runtime_get_sync(core->dev);
-	if (ret < 0)
-		return fence;
+	ret = pm_runtime_resume_and_get(core->dev);
+	if (ret < 0) {
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
-	if (ret < 0)
-		return fence;
+	if (ret < 0) {
+		pm_runtime_put(core->dev);
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	scoped_guard(mutex, &core->job_lock) {
 		core->in_flight_job = job;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v5 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access
  2026-06-10  6:01                   ` [PATCH v5 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
@ 2026-06-10  6:01                     ` ZhaoJinming
  2026-06-10  6:19                       ` sashiko-bot
  0 siblings, 1 reply; 21+ messages in thread
From: ZhaoJinming @ 2026-06-10  6:01 UTC (permalink / raw)
  To: Tomeu Vizoso, Oded Gabbay
  Cc: Sumit Semwal, Christian König, Jeff Hugo, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, ZhaoJinming, stable

Two bugs in the IRQ handling path:

1) iommu_group reference leak in rocket_job_handle_irq():
   iommu_group_get() increments the reference count but the returned
   pointer is passed directly to iommu_detach_group() which does not
   consume it. Since this runs on every completed job, the reference
   count accumulates and prevents the group from being freed. Use
   core->iommu_group instead, consistent with rocket_reset().

2) Unsafe hardware register access in shared IRQ handler:
   rocket_job_irq_handler() is registered with IRQF_SHARED but accesses
   hardware registers without checking runtime PM status. If another
   device on the same IRQ line triggers an interrupt while the NPU is
   suspended, register reads return 0xffffffff, spuriously triggering
   WARN_ON macros and falsely returning IRQ_WAKE_THREAD.

   Add pm_runtime_get_if_active() in the hardirq handler to atomically
   verify the device is active before accessing registers. Each handler
   (hardirq and threaded) independently acquires and releases its own
   runtime PM reference to avoid coalescing-related leaks when the
   IRQ core coalesces multiple wakeups into a single thread execution.

Cc: stable@vger.kernel.org
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
---
 drivers/accel/rocket/rocket_job.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index e8a073e22ac2..0ea3b3099704 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -349,7 +349,7 @@ static void rocket_job_handle_irq(struct rocket_core *core)
 				return;
 			}

-			iommu_detach_group(NULL, iommu_group_get(core->dev));
+			iommu_detach_group(NULL, core->iommu_group);
 			dma_fence_signal(core->in_flight_job->done_fence);
 			pm_runtime_put_autosuspend(core->dev);
 			core->in_flight_job = NULL;
@@ -420,7 +420,10 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
 {
 	struct rocket_core *core = data;

-	rocket_job_handle_irq(core);
+	if (pm_runtime_get_if_active(core->dev) == 1) {
+		rocket_job_handle_irq(core);
+		pm_runtime_put(core->dev);
+	}

 	return IRQ_HANDLED;
 }
@@ -428,16 +431,24 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
 static irqreturn_t rocket_job_irq_handler(int irq, void *data)
 {
 	struct rocket_core *core = data;
-	u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
+	u32 raw_status;
+
+	if (pm_runtime_get_if_active(core->dev) != 1)
+		return IRQ_NONE;
+
+	raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);

 	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);
 	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_WRITE_ERROR);

 	if (!(raw_status & PC_INTERRUPT_RAW_STATUS_DPU_0 ||
-	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1))
+	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1)) {
+		pm_runtime_put(core->dev);
 		return IRQ_NONE;
+	}

 	rocket_pc_writel(core, INTERRUPT_MASK, 0x0);
+	pm_runtime_put(core->dev);

 	return IRQ_WAKE_THREAD;
 }
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access
  2026-06-10  6:01                     ` [PATCH v5 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
@ 2026-06-10  6:19                       ` sashiko-bot
  2026-06-10  7:08                         ` 赵金明
  2026-06-10  7:10                         ` [PATCH v6 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
  0 siblings, 2 replies; 21+ messages in thread
From: sashiko-bot @ 2026-06-10  6:19 UTC (permalink / raw)
  To: ZhaoJinming; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:

New issues:
- [High] Unhandled interrupt storm during device removal due to `pm_runtime_get_if_active()`.
- [High] NULL pointer dereference in the IRQ thread during teardown (Use-After-Free).

Pre-existing issues:
- [Critical] Double free of `rjob->tasks` array during job submission error handling.
- [High] NULL pointer dereference when freeing an uninitialized `job->domain`.
--

commit 7c59a4d5e2e74264bcfe7bc8aa7c903516b1c1c0
Author: ZhaoJinming <zhaojinming@uniontech.com>

accel/rocket: Fix iommu_group leak and unsafe IRQ register access

This commit replaces iommu_group_get() with the cached core->iommu_group
to prevent a reference leak. It also adds pm_runtime_get_if_active() checks
in the IRQ handler and thread to prevent unsafe hardware register accesses
when the device is suspended.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index cf9f61682dd8b..21b433826b479 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
> @@ -349,7 +349,7 @@ static void rocket_job_handle_irq(struct rocket_core *core)
>  				return;
>  			}
>  
> -			iommu_detach_group(NULL, iommu_group_get(core->dev));
> +			iommu_detach_group(NULL, core->iommu_group);

[Severity: High]
Could this lead to a NULL pointer dereference during device teardown?

In rocket_core_fini(), core->iommu_group is set to NULL without acquiring
core->job_lock or waiting for the IRQ thread to finish:

rocket_core_fini()
    ...
    core->iommu_group = NULL;

If the threaded IRQ handler runs concurrently after passing the runtime PM
check, it will pass this newly nulled pointer to iommu_detach_group(),
causing the IOMMU core to dereference a NULL pointer when acquiring the
group's mutex.

>  			dma_fence_signal(core->in_flight_job->done_fence);
>  			pm_runtime_put_autosuspend(core->dev);
>  			core->in_flight_job = NULL;

[ ... ]

> @@ -428,16 +431,24 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
>  static irqreturn_t rocket_job_irq_handler(int irq, void *data)
>  {
>  	struct rocket_core *core = data;
> -	u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
> +	u32 raw_status;
> +
> +	if (pm_runtime_get_if_active(core->dev) != 1)
> +		return IRQ_NONE;

[Severity: High]
Will this cause an unhandled interrupt storm during module removal?

During module removal, rocket_core_fini() disables runtime PM:

rocket_core_fini()
    pm_runtime_disable(core->dev);

This happens before the hardware is stopped. If a hardware job completes
subsequently, the device raises a level-triggered interrupt. Since runtime
PM is disabled, pm_runtime_get_if_active() returns -EINVAL.

The handler then unconditionally returns IRQ_NONE without clearing the
interrupt. Because the IRQ is level-triggered, the unacknowledged interrupt
will immediately refire upon handler exit, potentially causing an infinite
interrupt storm that deadlocks the CPU.

[Severity: Critical]
This isn't a bug introduced by this patch, but does the error handling in
rocket_copy_tasks() cause a double free of the tasks array?

If validation of the incoming tasks fails, the array is freed but the
pointer is not cleared:

rocket_copy_tasks() {
    ...
fail:
    kvfree(rjob->tasks);
    return ret;
}

The caller then jumps to the error label and invokes rocket_job_put(),
which triggers rocket_job_cleanup(). This cleanup function frees the
dangling pointer again:

rocket_job_cleanup() {
    ...
    kvfree(job->tasks);
}

Can a local unprivileged attacker exploit this via the submit ioctl to
corrupt kernel memory?

[Severity: High]
This isn't a bug introduced by this patch, but is there a NULL pointer
dereference when freeing an uninitialized domain?

When the job structure is allocated, all fields are initialized to zero.
If setup fails early (for example, in drm_sched_job_init() or
rocket_copy_tasks()), job->domain remains NULL.

The error handling path invokes rocket_job_cleanup(), which unconditionally
releases the domain:

rocket_job_cleanup() {
    ...
    rocket_iommu_domain_put(job->domain);
}

This function immediately dereferences the pointer:

rocket_iommu_domain_put() {
    ...
    kref_put(&domain->kref, rocket_iommu_domain_destroy);
}

Could this cause a kernel panic if an unprivileged user passes invalid
arguments?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260610060132.3239648-1-zhaojinming@uniontech.com?part=2

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Re: [PATCH v5 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access
  2026-06-10  6:19                       ` sashiko-bot
@ 2026-06-10  7:08                         ` 赵金明
  2026-06-10  7:10                         ` [PATCH v6 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
  1 sibling, 0 replies; 21+ messages in thread
From: 赵金明 @ 2026-06-10  7:08 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: dri-devel





>Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:



>



>New issues:



>- [High] Unhandled interrupt storm during device removal due to `pm_runtime_get_if_active()`.
I try to fix it  in next version of patches.


>- [High] NULL pointer dereference in the IRQ thread during teardown (Use-After-Free).



Regarding the NULL pointer dereference during teardown concern:

This cannot occur in the current patch. The threaded IRQ handler has three layers of protection:

pm_runtime_get_if_active() == 1 (line 423): In rocket_core_fini(), pm_runtime_disable() is called before core->iommu_group = NULL. After pm_runtime_disable(), disable_depth > 0, so pm_runtime_get_if_active() returns -EINVAL. Since -EINVAL != 1 is false, the threaded handler never calls rocket_job_handle_irq() and thus never accesses core->iommu_group. This holds even if the thread is woken by coalescing after pm_runtime_disable().
scoped_guard(mutex, &core->job_lock) with in_flight_job check (line 345): Even if the first check were somehow bypassed, rocket_job_handle_irq() only accesses core->iommu_group inside a scoped_guard that checks core->in_flight_job != NULL. By the time rocket_core_fini() runs, drm_sched_fini() has already been called (via rocket_job_fini()), which stops the scheduler and releases all jobs, leaving in_flight_job as NULL.
Execution ordering guarantee: rocket_core_fini() calls pm_runtime_disable() before setting iommu_group = NULL. The pm_runtime_get_if_active() check in the thread handler is evaluated under spin_lock_irqsave (see pm_runtime_get_conditional() at runtime.c:1233), which provides the happens-before guarantee with respect to disable_depth modification in __pm_runtime_disable().
No code path exists where the threaded handler can reach iommu_detach_group(NULL, core->iommu_group) after core->iommu_group has been set to NULL.




>Pre-existing issues:



>- [Critical] Double free of `rjob->tasks` array during job submission error handling.



>- [High] NULL pointer dereference when freeing an uninitialized `job->domain`.



>--



>



>commit 7c59a4d5e2e74264bcfe7bc8aa7c903516b1c1c0



>Author: ZhaoJinming <zhaojinming@uniontech.com>



>



>accel/rocket: Fix iommu_group leak and unsafe IRQ register access



>



>This commit replaces iommu_group_get() with the cached core->iommu_group



>to prevent a reference leak. It also adds pm_runtime_get_if_active() checks



>in the IRQ handler and thread to prevent unsafe hardware register accesses



>when the device is suspended.



>



>> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c



>> index cf9f61682dd8b..21b433826b479 100644



>> --- a/drivers/accel/rocket/rocket_job.c



>> +++ b/drivers/accel/rocket/rocket_job.c



>> @@ -349,7 +349,7 @@ static void rocket_job_handle_irq(struct rocket_core *core)



>>? 				return;



>>? 			}



>>? 



>> -			iommu_detach_group(NULL, iommu_group_get(core->dev));



>> +			iommu_detach_group(NULL, core->iommu_group);



>



>[Severity: High]



>Could this lead to a NULL pointer dereference during device teardown?



>



>In rocket_core_fini(), core->iommu_group is set to NULL without acquiring



>core->job_lock or waiting for the IRQ thread to finish:



>



>rocket_core_fini()



>??? ...



>??? core->iommu_group = NULL;



>



>If the threaded IRQ handler runs concurrently after passing the runtime PM



>check, it will pass this newly nulled pointer to iommu_detach_group(),



>causing the IOMMU core to dereference a NULL pointer when acquiring the



>group's mutex.



>



>>? 			dma_fence_signal(core->in_flight_job->done_fence);



>>? 			pm_runtime_put_autosuspend(core->dev);



>>? 			core->in_flight_job = NULL;



>



>[ ... ]



>



>> @@ -428,16 +431,24 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)



>>? static irqreturn_t rocket_job_irq_handler(int irq, void *data)



>>? {



>>? 	struct rocket_core *core = data;



>> -	u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);



>> +	u32 raw_status;



>> +



>> +	if (pm_runtime_get_if_active(core->dev) != 1)



>> +		return IRQ_NONE;



>



>[Severity: High]



>Will this cause an unhandled interrupt storm during module removal?



>



>During module removal, rocket_core_fini() disables runtime PM:



>



>rocket_core_fini()



>??? pm_runtime_disable(core->dev);



>



>This happens before the hardware is stopped. If a hardware job completes



>subsequently, the device raises a level-triggered interrupt. Since runtime



>PM is disabled, pm_runtime_get_if_active() returns -EINVAL.



>



>The handler then unconditionally returns IRQ_NONE without clearing the



>interrupt. Because the IRQ is level-triggered, the unacknowledged interrupt



>will immediately refire upon handler exit, potentially causing an infinite



>interrupt storm that deadlocks the CPU.



>



>[Severity: Critical]



>This isn't a bug introduced by this patch, but does the error handling in



>rocket_copy_tasks() cause a double free of the tasks array?



>



>If validation of the incoming tasks fails, the array is freed but the



>pointer is not cleared:



>



>rocket_copy_tasks() {



>??? ...



>fail:



>??? kvfree(rjob->tasks);



>??? return ret;



>}



>



>The caller then jumps to the error label and invokes rocket_job_put(),



>which triggers rocket_job_cleanup(). This cleanup function frees the



>dangling pointer again:



>



>rocket_job_cleanup() {



>??? ...



>??? kvfree(job->tasks);



>}



>



>Can a local unprivileged attacker exploit this via the submit ioctl to



>corrupt kernel memory?



>



>[Severity: High]



>This isn't a bug introduced by this patch, but is there a NULL pointer



>dereference when freeing an uninitialized domain?



>



>When the job structure is allocated, all fields are initialized to zero.



>If setup fails early (for example, in drm_sched_job_init() or



>rocket_copy_tasks()), job->domain remains NULL.



>



>The error handling path invokes rocket_job_cleanup(), which unconditionally



>releases the domain:



>



>rocket_job_cleanup() {



>??? ...



>??? rocket_iommu_domain_put(job->domain);



>}



>



>This function immediately dereferences the pointer:



>



>rocket_iommu_domain_put() {



>??? ...



>??? kref_put(&domain->kref, rocket_iommu_domain_destroy);



>}



>



>Could this cause a kernel panic if an unprivileged user passes invalid



>arguments?



>



>-- 



>Sashiko AI review ・ https://sashiko.dev/#/patchset/20260610060132.3239648-1-zhaojinming@uniontech.com?part=2



>



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v6 1/2] accel/rocket: Fix error path handling in rocket_job_run()
  2026-06-10  6:19                       ` sashiko-bot
  2026-06-10  7:08                         ` 赵金明
@ 2026-06-10  7:10                         ` ZhaoJinming
  2026-06-10  7:10                           ` [PATCH v6 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
  2026-06-10  7:24                           ` [PATCH v6 1/2] accel/rocket: Fix error path handling in rocket_job_run() sashiko-bot
  1 sibling, 2 replies; 21+ messages in thread
From: ZhaoJinming @ 2026-06-10  7:10 UTC (permalink / raw)
  To: Tomeu Vizoso, Oded Gabbay
  Cc: Sumit Semwal, Christian König, Jeff Hugo, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, ZhaoJinming, stable

In rocket_job_run(), after taking an extra fence reference for
job->done_fence via dma_fence_get(), the error paths have three bugs:

- The dma_fence reference held by job->done_fence is never released,
  causing a reference leak.
- pm_runtime_get_sync() increments the usage counter even on failure,
  but the error path does not decrement it, leaking the runtime PM
  reference and preventing the NPU from suspending.
- A valid but unsignaled fence is returned to the DRM scheduler,
  which triggers WARN("Fence ... released with pending signals!")
  when the scheduler drops its reference.

Fix by replacing pm_runtime_get_sync() with pm_runtime_resume_and_get()
which auto-balances the usage counter on failure, releasing both fence
references on error, and returning ERR_PTR(ret) instead of the
unsignaled fence.

Cc: stable@vger.kernel.org
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
---
 drivers/accel/rocket/rocket_job.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index ac51bff39833..e8a073e22ac2 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
 		dma_fence_put(job->done_fence);
 	job->done_fence = dma_fence_get(fence);
 
-	ret = pm_runtime_get_sync(core->dev);
-	if (ret < 0)
-		return fence;
+	ret = pm_runtime_resume_and_get(core->dev);
+	if (ret < 0) {
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
-	if (ret < 0)
-		return fence;
+	if (ret < 0) {
+		pm_runtime_put(core->dev);
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	scoped_guard(mutex, &core->job_lock) {
 		core->in_flight_job = job;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v6 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access
  2026-06-10  7:10                         ` [PATCH v6 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
@ 2026-06-10  7:10                           ` ZhaoJinming
  2026-06-10  7:24                             ` sashiko-bot
  2026-06-10  7:24                           ` [PATCH v6 1/2] accel/rocket: Fix error path handling in rocket_job_run() sashiko-bot
  1 sibling, 1 reply; 21+ messages in thread
From: ZhaoJinming @ 2026-06-10  7:10 UTC (permalink / raw)
  To: Tomeu Vizoso, Oded Gabbay
  Cc: Sumit Semwal, Christian König, Jeff Hugo, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, ZhaoJinming, stable

Two bugs in the IRQ handling path:

1) iommu_group reference leak in rocket_job_handle_irq():
   iommu_group_get() increments the reference count but the returned
   pointer is passed directly to iommu_detach_group() which does not
   consume it. Since this runs on every completed job, the reference
   count accumulates and prevents the group from being freed. Use
   core->iommu_group instead, consistent with rocket_reset().

2) Unsafe hardware register access in shared IRQ handler:
   rocket_job_irq_handler() is registered with IRQF_SHARED but accesses
   hardware registers without checking runtime PM status. If another
   device on the same IRQ line triggers an interrupt while the NPU is
   suspended, register reads return 0xffffffff, spuriously triggering
   WARN_ON macros and falsely returning IRQ_WAKE_THREAD.

   Add pm_runtime_get_if_active() in the hardirq handler to atomically
   verify the device is active before accessing registers. Each handler
   (hardirq and threaded) independently acquires and releases its own
   runtime PM reference to avoid coalescing-related leaks when the
   IRQ core coalesces multiple wakeups into a single thread execution.

Cc: stable@vger.kernel.org
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
---
 drivers/accel/rocket/rocket_job.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index e8a073e22ac2..334bf01c3382 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -349,7 +349,7 @@ static void rocket_job_handle_irq(struct rocket_core *core)
 				return;
 			}
 
-			iommu_detach_group(NULL, iommu_group_get(core->dev));
+			iommu_detach_group(NULL, core->iommu_group);
 			dma_fence_signal(core->in_flight_job->done_fence);
 			pm_runtime_put_autosuspend(core->dev);
 			core->in_flight_job = NULL;
@@ -420,7 +420,10 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
 {
 	struct rocket_core *core = data;
 
-	rocket_job_handle_irq(core);
+	if (pm_runtime_get_if_active(core->dev) == 1) {
+		rocket_job_handle_irq(core);
+		pm_runtime_put(core->dev);
+	}
 
 	return IRQ_HANDLED;
 }
@@ -428,16 +431,28 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
 static irqreturn_t rocket_job_irq_handler(int irq, void *data)
 {
 	struct rocket_core *core = data;
-	u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
+	u32 raw_status;
+	int ret;
+
+	ret = pm_runtime_get_if_active(core->dev);
+	if (ret == 0)
+		return IRQ_NONE;
+
+	raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
 
 	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);
 	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_WRITE_ERROR);
 
 	if (!(raw_status & PC_INTERRUPT_RAW_STATUS_DPU_0 ||
-	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1))
+	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1)) {
+		if (ret > 0)
+			pm_runtime_put(core->dev);
 		return IRQ_NONE;
+	}
 
 	rocket_pc_writel(core, INTERRUPT_MASK, 0x0);
+	if (ret > 0)
+		pm_runtime_put(core->dev);
 
 	return IRQ_WAKE_THREAD;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 1/2] accel/rocket: Fix error path handling in rocket_job_run()
  2026-06-10  7:10                         ` [PATCH v6 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
  2026-06-10  7:10                           ` [PATCH v6 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
@ 2026-06-10  7:24                           ` sashiko-bot
  1 sibling, 0 replies; 21+ messages in thread
From: sashiko-bot @ 2026-06-10  7:24 UTC (permalink / raw)
  To: ZhaoJinming; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Aborting a DRM scheduler job after calling drm_sched_job_arm() corrupts scheduler state.
--

commit e1b7e4281cdb5c3f1d9985be73947531edf81dc5
Author: ZhaoJinming <zhaojinming@uniontech.com>

accel/rocket: Fix error path handling in rocket_job_run()

This commit replaces pm_runtime_get_sync() with pm_runtime_resume_and_get()
in rocket_job_run() to correctly balance the usage counter on failure. It
also fixes dma_fence reference leaks and prevents returning an unsignaled
fence on error paths.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index 2f1861f960ccb..cf9f61682dd8b 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c

[ ... ]

> 	scoped_guard(mutex, &rdev->sched_lock) {
> 		drm_sched_job_arm(&job->base);
> 
> 		job->inference_done_fence = dma_fence_get(&job->base.s_fence->finished);
> 
> 		ret = rocket_acquire_object_fences(job->in_bos, job->in_bo_count, &job->base, false);
> 		if (ret)
> 			goto err_unlock;

[Severity: High]
This isn't a bug introduced by this patch, but does this sequence corrupt the
scheduler state?

In rocket_job_push(), if rocket_acquire_object_fences() fails, the job
setup is aborted via the goto err_unlock path where the caller invokes
drm_sched_job_cleanup().

Because drm_sched_job_arm() allocates sequence numbers and initializes
fences, aborting the job after it has been armed leaves gaps in the
sequence numbers. Can this lead to system hangs where waiters may wait
indefinitely for a missing sequence number?

Should fence acquisition be performed before arming the job?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260610061915.1CA281F00893@smtp.kernel.org?part=1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access
  2026-06-10  7:10                           ` [PATCH v6 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
@ 2026-06-10  7:24                             ` sashiko-bot
  0 siblings, 0 replies; 21+ messages in thread
From: sashiko-bot @ 2026-06-10  7:24 UTC (permalink / raw)
  To: ZhaoJinming; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:

New issues:
- [High] Mismatched and strict return value checks for `pm_runtime_get_if_active()` break IRQ handling when Runtime PM is disabled, causing job hangs.

Pre-existing issues:
- [High] The runtime suspend callback powers down hardware without calling `synchronize_irq()`, violating shared IRQ rules.
- [High] Unhandled hardware DMA errors result in an unacknowledged interrupt, causing an immediate IRQ storm.
- [High] Calling `iommu_detach_group()` and `iommu_attach_group()` with a NULL group pointer causes a kernel panic if the device is not bound to an IOMMU.
--

commit 1a0ceb9d138ab6f44b71efac6f2c376438bd8d22
Author: ZhaoJinming <zhaojinming@uniontech.com>

accel/rocket: Fix iommu_group leak and unsafe IRQ register access

This commit fixes a reference leak of the iommu_group in the IRQ handler
by reusing the cached core->iommu_group. It also fixes unsafe hardware register
accesses in the shared IRQ handler by adding runtime PM checks to verify the
device is active before reading registers.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index cf9f61682dd8b..81b222198dac9 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
> @@ -349,7 +349,7 @@ static void rocket_job_handle_irq(struct rocket_core *core)
>  				return;
>  			}
>  
> -			iommu_detach_group(NULL, iommu_group_get(core->dev));
> +			iommu_detach_group(NULL, core->iommu_group);

[Severity: High]
This is a pre-existing issue, but calling iommu_detach_group() and
iommu_attach_group() with a NULL group pointer causes a kernel panic if
the device is not bound to an IOMMU.

If the device is not behind an IOMMU, iommu_group_get() returns NULL
during initialization.

If CONFIG_IOMMU_API is enabled, both functions unconditionally
dereference the group pointer to acquire its lock, which triggers a NULL
pointer dereference crash.

Does this path also need to ensure core->iommu_group is not NULL before
detaching, and similarly during job submission?

>  			dma_fence_signal(core->in_flight_job->done_fence);
>  			pm_runtime_put_autosuspend(core->dev);
>  			core->in_flight_job = NULL;

[ ... ]

> @@ -420,7 +420,10 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
>  {
>  	struct rocket_core *core = data;
>  
> -	rocket_job_handle_irq(core);
> +	if (pm_runtime_get_if_active(core->dev) == 1) {
> +		rocket_job_handle_irq(core);
> +		pm_runtime_put(core->dev);
> +	}

[Severity: High]
When CONFIG_PM is disabled or runtime PM is globally disabled for the
device, pm_runtime_get_if_active() returns -EINVAL.

Because -EINVAL != 1, the condition evaluates to false, silently
skipping rocket_job_handle_irq() entirely.

Since the threaded handler skips execution, the completed job is never
processed, the IOMMU group is not detached, the DMA fence is never
signaled, and waiting user processes hang permanently.

Could this strict check cause a regression in IRQ handling when Runtime
PM is disabled?

>  
>  	return IRQ_HANDLED;
>  }
> @@ -428,16 +431,28 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
>  static irqreturn_t rocket_job_irq_handler(int irq, void *data)
>  {
>  	struct rocket_core *core = data;
> -	u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
> +	u32 raw_status;
> +	int ret;
> +
> +	ret = pm_runtime_get_if_active(core->dev);
> +	if (ret == 0)
> +		return IRQ_NONE;
> +
> +	raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
>  
>  	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);
>  	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_WRITE_ERROR);
>  
>  	if (!(raw_status & PC_INTERRUPT_RAW_STATUS_DPU_0 ||
> -	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1))
> +	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1)) {
> +		if (ret > 0)
> +			pm_runtime_put(core->dev);
>  		return IRQ_NONE;
> +	}

[Severity: High]
This is a pre-existing issue, but unhandled hardware DMA errors result
in an unacknowledged interrupt, causing an immediate IRQ storm.

If the hardware triggers an interrupt with
PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR set, but the DPU completion bits
are not set, the condition !(raw_status & DPU) evaluates as true.

It executes the WARN_ON() and immediately returns IRQ_NONE without
writing to INTERRUPT_CLEAR or INTERRUPT_MASK.

Because the shared IRQ line is level-triggered and the hardware is still
asserting the unacknowledged error, the interrupt controller will
immediately re-invoke the handler as soon as it returns.

Will this create an infinite IRQ storm that permanently hogs the CPU
core until the kernel's spurious interrupt detector forcibly disables
the shared IRQ line?

>  
>  	rocket_pc_writel(core, INTERRUPT_MASK, 0x0);
> +	if (ret > 0)
> +		pm_runtime_put(core->dev);
>  
>  	return IRQ_WAKE_THREAD;
>  }

[Severity: High]
This is a pre-existing issue, but the runtime suspend callback powers
down hardware without calling synchronize_irq(), violating shared IRQ
rules.

The global subsystem guide explicitly states: "When devices use shared
IRQs (IRQF_SHARED), the runtime suspend callback must call
synchronize_irq() before powering down hardware to ensure no IRQ handler
is executing mid-flight."

The device requests its interrupt with IRQF_SHARED, but its runtime
suspend callback disables the device's clocks without waiting for
executing IRQ handlers to finish:

drivers/accel/rocket/rocket_drv.c:rocket_device_runtime_suspend() {
    ...
    clk_bulk_disable_unprepare(ARRAY_SIZE(rdev->cores[core].clks),
                               rdev->cores[core].clks);
    ...
}

Might this lead to mid-flight handlers reading powered-down hardware
registers, causing bus faults?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260610061915.1CA281F00893@smtp.kernel.org?part=2

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2026-06-10  7:58 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-08  8:38 [PATCH] accel/rocket: Fix dma_fence reference leak on error paths in rocket_job_run() ZhaoJinming
2026-06-08  8:50 ` sashiko-bot
2026-06-08 10:00   ` [PATCH v2] accel/rocket: Fix fence leak, runtime PM leak and WARN in rocket_job_run() error paths ZhaoJinming
2026-06-08 10:38     ` sashiko-bot
2026-06-09  9:33       ` [PATCH v3 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
2026-06-09  9:33         ` [PATCH v3 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
2026-06-09  9:47           ` sashiko-bot
2026-06-10  3:03             ` [PATCH v4 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
2026-06-10  3:03               ` [PATCH v4 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
2026-06-10  3:14                 ` sashiko-bot
2026-06-10  3:25                   ` 赵金明
2026-06-10  6:01                   ` [PATCH v5 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
2026-06-10  6:01                     ` [PATCH v5 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
2026-06-10  6:19                       ` sashiko-bot
2026-06-10  7:08                         ` 赵金明
2026-06-10  7:10                         ` [PATCH v6 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
2026-06-10  7:10                           ` [PATCH v6 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
2026-06-10  7:24                             ` sashiko-bot
2026-06-10  7:24                           ` [PATCH v6 1/2] accel/rocket: Fix error path handling in rocket_job_run() sashiko-bot
2026-06-10  3:19               ` [PATCH v4 " sashiko-bot
2026-06-09  9:50         ` [PATCH v3 " sashiko-bot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.