All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] accel/rocket: Fix dma_fence reference leak on error paths in rocket_job_run()
@ 2026-06-08  8:38 ZhaoJinming
  2026-06-08  8:50 ` sashiko-bot
  0 siblings, 1 reply; 8+ messages in thread
From: ZhaoJinming @ 2026-06-08  8:38 UTC (permalink / raw)
  To: Tomeu Vizoso, Oded Gabbay
  Cc: Sumit Semwal, Christian König, Jeff Hugo, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, ZhaoJinming, stable

In rocket_job_run(), after creating a fence at line 305 and taking an
extra reference for job->done_fence at line 311 via dma_fence_get(),
two error paths fail to release that extra reference before returning:

  - Line 314: When pm_runtime_get_sync() fails, returns fence without
    putting job->done_fence.

  - Line 318: When iommu_attach_group() fails, returns fence without
    putting job->done_fence, and also omits pm_runtime_put() to balance
    the successful pm_runtime_get_sync() at line 313.

The leaked fence reference prevents the fence and its underlying rocket
device from being freed while job->done_fence still points to it.
Repeated failures will accumulate leaked fences, consuming kernel memory
and holding device resources indefinitely.

Fix both error paths by calling dma_fence_put(job->done_fence) and
setting job->done_fence to NULL before returning. Also add the missing
pm_runtime_put() on the iommu_attach_group error path.

Cc: stable@vger.kernel.org
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
---
 drivers/accel/rocket/rocket_job.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index ac51bff39833..66e4a1d57842 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -311,12 +311,19 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
 	job->done_fence = dma_fence_get(fence);
 
 	ret = pm_runtime_get_sync(core->dev);
-	if (ret < 0)
+	if (ret < 0) {
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
 		return fence;
+	}
 
 	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
-	if (ret < 0)
+	if (ret < 0) {
+		pm_runtime_put(core->dev);
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
 		return fence;
+	}
 
 	scoped_guard(mutex, &core->job_lock) {
 		core->in_flight_job = job;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] accel/rocket: Fix dma_fence reference leak on error paths in rocket_job_run()
  2026-06-08  8:38 [PATCH] accel/rocket: Fix dma_fence reference leak on error paths in rocket_job_run() ZhaoJinming
@ 2026-06-08  8:50 ` sashiko-bot
  2026-06-08 10:00   ` [PATCH v2] accel/rocket: Fix fence leak, runtime PM leak and WARN in rocket_job_run() error paths ZhaoJinming
  0 siblings, 1 reply; 8+ messages in thread
From: sashiko-bot @ 2026-06-08  8:50 UTC (permalink / raw)
  To: ZhaoJinming; +Cc: dri-devel, linux-media

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [High] The `pm_runtime_get_sync()` error path fails to decrement the runtime PM usage counter, permanently leaking the reference.
- [High] Returning an unsignaled dma_fence instead of an error pointer on job submission failure triggers a kernel WARN.
--

commit f0df2d483139ad216e9d43923cbf9543a8f4e81c
Author: ZhaoJinming <zhaojinming@uniontech.com>

accel/rocket: Fix dma_fence reference leak on error paths in rocket_job_run()

In rocket_job_run(), after creating a fence at line 305 and taking an
extra reference for job->done_fence at line 311 via dma_fence_get(),
two error paths fail to release that extra reference before returning.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index 2f1861f960ccb..fc10703534e37 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
> @@ -311,12 +311,19 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
>  	job->done_fence = dma_fence_get(fence);
>  
>  	ret = pm_runtime_get_sync(core->dev);
> -	if (ret < 0)
> +	if (ret < 0) {
> +		dma_fence_put(job->done_fence);
> +		job->done_fence = NULL;

[Severity: High]
This is a pre-existing issue, but does this error path permanently leak the
runtime PM usage counter? 

pm_runtime_get_sync() increments the device's usage counter before attempting
to resume the device. If the resume fails, it returns a negative error code
but the usage counter remains incremented. 

Without explicitly dropping this usage counter (e.g., using
pm_runtime_put_noidle()), could repeated failures prevent the NPU from ever
suspending again?

>  		return fence;

[Severity: High]
This is also a pre-existing issue, but will returning the valid, unsignaled
fence instead of ERR_PTR(ret) trigger a kernel WARN?

Since the hardware wasn't submitted, the fence refcount immediately hits 0
after the DRM scheduler drops its initial reference. dma_fence_release()
detects pending callbacks on an unsignaled fence and triggers a loud kernel
WARN(1, "Fence ... released with pending signals!\n"), forcibly signaling it
with -EDEADLK.

Should this path destroy the fence and return ERR_PTR(ret) so the scheduler
cleanly aborts the job?

> +	}
>  
>  	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
> -	if (ret < 0)
> +	if (ret < 0) {
> +		pm_runtime_put(core->dev);
> +		dma_fence_put(job->done_fence);
> +		job->done_fence = NULL;
>  		return fence;

[Severity: High]
This is a pre-existing issue, but similar to the above, does returning the
unsignaled fence here instead of ERR_PTR(ret) cause the same DRM scheduler
WARN when the fence is released?

> +	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608083824.775261-1-zhaojinming@uniontech.com?part=1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2] accel/rocket: Fix fence leak, runtime PM leak and WARN in rocket_job_run() error paths
  2026-06-08  8:50 ` sashiko-bot
@ 2026-06-08 10:00   ` ZhaoJinming
  2026-06-08 10:38     ` sashiko-bot
  0 siblings, 1 reply; 8+ messages in thread
From: ZhaoJinming @ 2026-06-08 10:00 UTC (permalink / raw)
  To: Tomeu Vizoso, Oded Gabbay
  Cc: Sumit Semwal, Christian König, Jeff Hugo, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, ZhaoJinming, stable

Three issues exist in the error paths of rocket_job_run():

1) dma_fence reference leak: After creating a fence and taking an extra
   reference for job->done_fence via dma_fence_get(), the error paths
   return without releasing the extra reference held by job->done_fence.
   The leaked reference prevents the fence from being freed, causing
   resource accumulation on repeated failures.

2) pm_runtime_get_sync() usage counter leak: pm_runtime_get_sync()
   increments the runtime PM usage counter before attempting to resume
   the device. If the resume fails and returns an error, the usage
   counter remains incremented. The original error path does not call
   pm_runtime_put_noidle() to balance it. Repeated failures will
   permanently prevent the NPU from entering suspend.

3) Unsignaled fence returned on failure: The error paths return a valid
   but unsignaled dma_fence to the DRM scheduler. Since the hardware
   was never submitted, the fence is never signaled. When the scheduler
   eventually drops its reference, dma_fence_release() detects the
   unsignaled fence and triggers:
     WARN(1, "Fence ... released with pending signals!")
   and forcibly signals it with -EDEADLK.

Fix all three issues by:

- Replace pm_runtime_get_sync() with pm_runtime_resume_and_get(), which
  automatically decrements the usage counter on failure, eliminating
  the need for a manual pm_runtime_put_noidle() call and avoiding the
  usage counter leak. The pm_runtime_get_sync() documentation itself
  recommends pm_runtime_resume_and_get() as the preferred alternative
  when the return value is checked by the caller.

- Release both fence references (job->done_fence and the local fence)
  before returning ERR_PTR(ret) so the DRM scheduler cleanly aborts
  the job without triggering the unsignaled fence WARN.

- Add pm_runtime_put() on the iommu_attach_group error path to release
  the runtime PM reference that was successfully acquired.

Cc: stable@vger.kernel.org
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
---
 drivers/accel/rocket/rocket_job.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index ac51bff39833..e8a073e22ac2 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
 		dma_fence_put(job->done_fence);
 	job->done_fence = dma_fence_get(fence);
 
-	ret = pm_runtime_get_sync(core->dev);
-	if (ret < 0)
-		return fence;
+	ret = pm_runtime_resume_and_get(core->dev);
+	if (ret < 0) {
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
-	if (ret < 0)
-		return fence;
+	if (ret < 0) {
+		pm_runtime_put(core->dev);
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	scoped_guard(mutex, &core->job_lock) {
 		core->in_flight_job = job;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] accel/rocket: Fix fence leak, runtime PM leak and WARN in rocket_job_run() error paths
  2026-06-08 10:00   ` [PATCH v2] accel/rocket: Fix fence leak, runtime PM leak and WARN in rocket_job_run() error paths ZhaoJinming
@ 2026-06-08 10:38     ` sashiko-bot
  2026-06-09  9:33       ` [PATCH v3 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
  0 siblings, 1 reply; 8+ messages in thread
From: sashiko-bot @ 2026-06-08 10:38 UTC (permalink / raw)
  To: ZhaoJinming; +Cc: linux-media, dri-devel

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [High] IOMMU group reference leak in rocket_job_handle_irq()
- [High] Unsafe hardware register access in shared IRQ handler while runtime suspended
--

commit fca53bbe333680ce9d4e0368b39e684506994081
Author: ZhaoJinming <zhaojinming@uniontech.com>
Subject: accel/rocket: Fix fence leak, runtime PM leak and WARN in rocket_job_run() error paths

This commit fixes multiple resource leaks in the error paths of rocket_job_run().
It replaces pm_runtime_get_sync() with pm_runtime_resume_and_get() to handle
the usage counter, correctly releases dma_fence references on failure, and
adds pm_runtime_put() during IOMMU attachment failures.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index 2f1861f960ccb..cf9f61682dd8b 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c

[ ... ]

> @@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
>  		dma_fence_put(job->done_fence);
>  	job->done_fence = dma_fence_get(fence);
>  
> -	ret = pm_runtime_get_sync(core->dev);
> -	if (ret < 0)
> -		return fence;
> +	ret = pm_runtime_resume_and_get(core->dev);
> +	if (ret < 0) {
> +		dma_fence_put(job->done_fence);
> +		job->done_fence = NULL;
> +		dma_fence_put(fence);
> +		return ERR_PTR(ret);
> +	}
>  
>  	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
> -	if (ret < 0)
> -		return fence;
> +	if (ret < 0) {
> +		pm_runtime_put(core->dev);
> +		dma_fence_put(job->done_fence);
> +		job->done_fence = NULL;
> +		dma_fence_put(fence);
> +		return ERR_PTR(ret);
> +	}
>  
>  	scoped_guard(mutex, &core->job_lock) {
>  		core->in_flight_job = job;

[ ... ]

>  static void rocket_job_handle_irq(struct rocket_core *core)
>  {
>  	pm_runtime_mark_last_busy(core->dev);
>  
>  	rocket_pc_writel(core, OPERATION_ENABLE, 0x0);
>  	rocket_pc_writel(core, INTERRUPT_CLEAR, 0x1ffff);
>  
>  	scoped_guard(mutex, &core->job_lock)
>  		if (core->in_flight_job) {
>  			if (core->in_flight_job->next_task_idx < core->in_flight_job->task_count) {
>  				rocket_job_hw_submit(core, core->in_flight_job);
>  				return;
>  			}
>  
>  			iommu_detach_group(NULL, iommu_group_get(core->dev));

[Severity: High]
This isn't a bug introduced by this patch, but does this leak an IOMMU
group reference?

It looks like iommu_group_get() increments the kobject reference count for the
group. However, iommu_detach_group() doesn't consume this reference, and there
is no subsequent iommu_group_put() call.

Since this runs on every completed job, could this eventually overflow the
reference counter and prevent the group from being freed?

[ ... ]

>  static irqreturn_t rocket_job_irq_handler(int irq, void *data)
>  {
>  	struct rocket_core *core = data;
>  	u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);

[Severity: High]
This is a pre-existing issue, but is it safe to access hardware registers
here without first checking if the device is active?

Since the driver registers this as a shared interrupt handler, it could be
invoked if another device on the same IRQ line triggers an interrupt while the
NPU device is runtime suspended.

Reading from a powered-down device typically returns 0xffffffff, which would
spuriously trigger the WARN_ON macros below and falsely return 
IRQ_WAKE_THREAD.

Should this use pm_runtime_get_if_active() to verify the device is powered
before reading INTERRUPT_RAW_STATUS?

>  
>  	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);
>  	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_WRITE_ERROR);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608100040.985526-1-zhaojinming@uniontech.com?part=1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v3 1/2] accel/rocket: Fix error path handling in rocket_job_run()
  2026-06-08 10:38     ` sashiko-bot
@ 2026-06-09  9:33       ` ZhaoJinming
  2026-06-09  9:33         ` [PATCH v3 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
  2026-06-09  9:50         ` [PATCH v3 1/2] accel/rocket: Fix error path handling in rocket_job_run() sashiko-bot
  0 siblings, 2 replies; 8+ messages in thread
From: ZhaoJinming @ 2026-06-09  9:33 UTC (permalink / raw)
  To: Tomeu Vizoso, Oded Gabbay
  Cc: Sumit Semwal, Christian König, Jeff Hugo, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, ZhaoJinming, stable

In rocket_job_run(), after taking an extra fence reference for
job->done_fence via dma_fence_get(), the error paths have three bugs:

- The dma_fence reference held by job->done_fence is never released,
  causing a reference leak.
- pm_runtime_get_sync() increments the usage counter even on failure,
  but the error path does not decrement it, leaking the runtime PM
  reference and preventing the NPU from suspending.
- A valid but unsignaled fence is returned to the DRM scheduler,
  which triggers WARN("Fence ... released with pending signals!")
  when the scheduler drops its reference.

Fix by replacing pm_runtime_get_sync() with pm_runtime_resume_and_get()
which auto-balances the usage counter on failure, releasing both fence
references on error, and returning ERR_PTR(ret) instead of the
unsignaled fence.

Cc: stable@vger.kernel.org
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
---
 drivers/accel/rocket/rocket_job.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index ac51bff39833..e8a073e22ac2 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
 		dma_fence_put(job->done_fence);
 	job->done_fence = dma_fence_get(fence);
 
-	ret = pm_runtime_get_sync(core->dev);
-	if (ret < 0)
-		return fence;
+	ret = pm_runtime_resume_and_get(core->dev);
+	if (ret < 0) {
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
-	if (ret < 0)
-		return fence;
+	if (ret < 0) {
+		pm_runtime_put(core->dev);
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	scoped_guard(mutex, &core->job_lock) {
 		core->in_flight_job = job;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v3 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access
  2026-06-09  9:33       ` [PATCH v3 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
@ 2026-06-09  9:33         ` ZhaoJinming
  2026-06-09  9:47           ` sashiko-bot
  2026-06-09  9:50         ` [PATCH v3 1/2] accel/rocket: Fix error path handling in rocket_job_run() sashiko-bot
  1 sibling, 1 reply; 8+ messages in thread
From: ZhaoJinming @ 2026-06-09  9:33 UTC (permalink / raw)
  To: Tomeu Vizoso, Oded Gabbay
  Cc: Sumit Semwal, Christian König, Jeff Hugo, dri-devel,
	linux-kernel, linux-media, linaro-mm-sig, ZhaoJinming, stable

Two bugs in the IRQ handling path:

1) iommu_group reference leak in rocket_job_handle_irq():
   iommu_group_get() increments the reference count but the returned
   pointer is passed directly to iommu_detach_group() which does not
   consume it. Since this runs on every completed job, the reference
   count accumulates and prevents the group from being freed. Use
   core->iommu_group instead, consistent with rocket_reset().

2) Unsafe hardware register access in shared IRQ handler:
   rocket_job_irq_handler() is registered with IRQF_SHARED but accesses
   hardware registers without checking runtime PM status. If another
   device on the same IRQ line triggers an interrupt while the NPU is
   suspended, register reads return 0xffffffff, spuriously triggering
   WARN_ON macros and falsely returning IRQ_WAKE_THREAD.

   Replace with pm_runtime_get_if_active(), which atomically checks the
   device status under spinlock and increments the usage count to keep
   the device active during register access. Balance the extra reference:
   - If the interrupt is not for this device (raw_status does not match
     DPU_0/DPU_1), put in the hardirq handler before returning IRQ_NONE.
   - If the interrupt is for this device, the thread handler puts after
     completing register writes.

Cc: stable@vger.kernel.org
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
---
 drivers/accel/rocket/rocket_job.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index e8a073e22ac2..30c677952b64 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -349,7 +349,7 @@ static void rocket_job_handle_irq(struct rocket_core *core)
 				return;
 			}
 
-			iommu_detach_group(NULL, iommu_group_get(core->dev));
+			iommu_detach_group(NULL, core->iommu_group);
 			dma_fence_signal(core->in_flight_job->done_fence);
 			pm_runtime_put_autosuspend(core->dev);
 			core->in_flight_job = NULL;
@@ -421,6 +421,7 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
 	struct rocket_core *core = data;
 
 	rocket_job_handle_irq(core);
+	pm_runtime_put(core->dev);
 
 	return IRQ_HANDLED;
 }
@@ -428,14 +429,20 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
 static irqreturn_t rocket_job_irq_handler(int irq, void *data)
 {
 	struct rocket_core *core = data;
+
+	if (!pm_runtime_get_if_active(core->dev))
+		return IRQ_NONE;
+
 	u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
 
 	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);
 	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_WRITE_ERROR);
 
 	if (!(raw_status & PC_INTERRUPT_RAW_STATUS_DPU_0 ||
-	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1))
+	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1)) {
+		pm_runtime_put(core->dev);
 		return IRQ_NONE;
+	}
 
 	rocket_pc_writel(core, INTERRUPT_MASK, 0x0);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access
  2026-06-09  9:33         ` [PATCH v3 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
@ 2026-06-09  9:47           ` sashiko-bot
  0 siblings, 0 replies; 8+ messages in thread
From: sashiko-bot @ 2026-06-09  9:47 UTC (permalink / raw)
  To: ZhaoJinming; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:

New issues:
- [High] Incorrect error handling of pm_runtime_get_if_active() return value causes unbalanced pm_runtime_put() and unsafe hardware access.
- [High] PM usage count leak due to threaded IRQ coalescing on shared interrupts.

Pre-existing issues:
- [High] Missing synchronize_irq() in the runtime suspend callback for a device with a shared IRQ handler.
- [High] Attacker-controlled DMA faults leave the device interrupt asserted without clearing it, causing an interrupt storm and kernel log spam.
--

commit 57f30fe728d7f765520f0d3ccb419368e6f945d0
Author: ZhaoJinming <zhaojinming@uniontech.com>

accel/rocket: Fix iommu_group leak and unsafe IRQ register access

This commit fixes an iommu_group reference leak in rocket_job_handle_irq()
by using the stored iommu_group instead of getting a new reference. It also
attempts to fix unsafe hardware register accesses in the shared IRQ handler
by adding pm_runtime_get_if_active() checks to ensure the device is awake.

> 2) Unsafe hardware register access in shared IRQ handler:
>    rocket_job_irq_handler() is registered with IRQF_SHARED but accesses
>    hardware registers without checking runtime PM status.

[Severity: High]
This is a pre-existing issue, but since the device uses a shared IRQ handler,
does the runtime suspend callback (rocket_device_runtime_suspend() in
rocket_drv.c) need to call synchronize_irq() before disabling clocks? 

Without it, the IRQ handler could execute mid-flight while the hardware is
turned off, leading to invalid register accesses and spurious interrupts.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index cf9f61682dd8b..8eaa3d427a51d 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
[ ... ]
> @@ -428,14 +429,20 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data)
>  static irqreturn_t rocket_job_irq_handler(int irq, void *data)
>  {
>  	struct rocket_core *core = data;
> +
> +	if (!pm_runtime_get_if_active(core->dev))
> +		return IRQ_NONE;
> +

[Severity: High]
Does this correctly handle negative error codes? 

pm_runtime_get_if_active() returns negative error codes (e.g., -EINVAL) when
runtime PM is disabled. The condition here only traps a return value of 0.
When it returns a negative error code, the condition evaluates to false,
allowing execution to proceed to read hardware registers on a potentially
suspended device.

Because the PM usage counter is not incremented on error, wouldn't the
subsequent unconditional pm_runtime_put() calls cause a PM usage count
underflow?

>  	u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
>  
>  	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);
>  	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_WRITE_ERROR);
>  

[Severity: High]
This is a pre-existing issue, but if a DMA error occurs and the DPU_0 and
DPU_1 bits are not set, the handler returns IRQ_NONE without clearing the
interrupt.

Since this is a shared level-triggered IRQ line, won't this leave the
interrupt asserted and cause an interrupt storm, permanently disabling the
shared IRQ line?

>  	if (!(raw_status & PC_INTERRUPT_RAW_STATUS_DPU_0 ||
> -	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1))
> +	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1)) {
> +		pm_runtime_put(core->dev);
>  		return IRQ_NONE;
> +	}
>  
>  	rocket_pc_writel(core, INTERRUPT_MASK, 0x0);
>  

[Severity: High]
Does this leak the PM usage count if the IRQ subsystem coalesces the threaded
interrupt?

If another device triggers an interrupt on this shared line while
rocket_job_irq_handler_thread() is already pending, this hardirq handler
will run again. It will increment the PM usage count via
pm_runtime_get_if_active() and eventually return IRQ_WAKE_THREAD.

Because the thread is already pending, the IRQ core coalesces the request
and the thread only runs once. Doesn't this mean pm_runtime_put() is only
called once, permanently leaking the extra PM reference acquired here?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260609093346.380396-1-zhaojinming@uniontech.com?part=2

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 1/2] accel/rocket: Fix error path handling in rocket_job_run()
  2026-06-09  9:33       ` [PATCH v3 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
  2026-06-09  9:33         ` [PATCH v3 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
@ 2026-06-09  9:50         ` sashiko-bot
  1 sibling, 0 replies; 8+ messages in thread
From: sashiko-bot @ 2026-06-09  9:50 UTC (permalink / raw)
  To: ZhaoJinming; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:

Pre-existing issues:
- [High] Calling `drm_sched_job_cleanup()` after `drm_sched_job_arm()` violates the DRM scheduler API contract.
- [High] The runtime suspend callback fails to synchronize the shared IRQ before powering down.
- [High] Potential NULL pointer dereference in `iommu_attach_group()` if the device lacks an IOMMU group.
--

commit a91812825621124e6ae3a091adc6205cbeefed80
Author: ZhaoJinming <zhaojinming@uniontech.com>

accel/rocket: Fix error path handling in rocket_job_run()

This commit fixes multiple issues in the error paths of rocket_job_run(). It
resolves a dma_fence reference leak, correctly balances the runtime PM
usage counter on failure by switching to pm_runtime_resume_and_get(), and
returns ERR_PTR(ret) instead of an unsignaled fence to avoid DRM scheduler
warnings.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index 2f1861f960ccb..cf9f61682dd8b 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
> @@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)

[ ... ]

>  	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
> -	if (ret < 0)
> -		return fence;
> +	if (ret < 0) {
> +		pm_runtime_put(core->dev);
> +		dma_fence_put(job->done_fence);
> +		job->done_fence = NULL;
> +		dma_fence_put(fence);
> +		return ERR_PTR(ret);
> +	}

[Severity: High]
This is a pre-existing issue, but can core->iommu_group be NULL here?

If the device does not have an IOMMU group (e.g., IOMMU is disabled or not
present in the device tree), iommu_group_get() returns NULL. Calling
iommu_attach_group() unconditionally with a NULL group parameter will cause a
kernel crash when it attempts to dereference the pointer.

Should there be a check for a valid iommu_group before attaching?

[ ... ]

I also noticed a couple of pre-existing issues in this driver:

[Severity: High]
This isn't a bug introduced by this patch, but is it safe to call
drm_sched_job_cleanup() after drm_sched_job_arm() has already been called?

In drivers/accel/rocket/rocket_job.c:rocket_job_push(), if
rocket_acquire_object_fences() fails:

	drm_sched_job_arm(&job->base);
	...
	ret = rocket_acquire_object_fences(job->in_bos, job->in_bo_count,
					   &job->base, false);
	if (ret)
		goto err_unlock;

The error path returns to rocket_ioctl_submit_job(), which then calls:

out_cleanup_job:
	if (ret)
		drm_sched_job_cleanup(&rjob->base);

The DRM scheduler API contract states that drm_sched_job_arm() is a point of
no return that initializes fences and sequence numbers. Calling
drm_sched_job_cleanup() after this breaks the dma_fence sequence numbers,
potentially causing any context waiting on the aborted sequence number to hang
indefinitely.

[Severity: High]
This is a pre-existing issue, but does the runtime suspend callback need to
synchronize the shared IRQ before powering down?

In drivers/accel/rocket/rocket_drv.c:rocket_device_runtime_suspend():

	clk_bulk_disable_unprepare(ARRAY_SIZE(rdev->cores[core].clks),
				   rdev->cores[core].clks);

The Rocket driver registers its IRQ handler with IRQF_SHARED. According to
the power management subsystem rules, drivers using shared IRQs must call
synchronize_irq() in their runtime suspend callback before disabling clocks.

Without this, could the hardware be powered off while the IRQ handler is
mid-flight and reading registers, potentially leading to system hangs or
spurious wakeups?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260609093346.380396-1-zhaojinming@uniontech.com?part=1

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-06-09  9:51 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-08  8:38 [PATCH] accel/rocket: Fix dma_fence reference leak on error paths in rocket_job_run() ZhaoJinming
2026-06-08  8:50 ` sashiko-bot
2026-06-08 10:00   ` [PATCH v2] accel/rocket: Fix fence leak, runtime PM leak and WARN in rocket_job_run() error paths ZhaoJinming
2026-06-08 10:38     ` sashiko-bot
2026-06-09  9:33       ` [PATCH v3 1/2] accel/rocket: Fix error path handling in rocket_job_run() ZhaoJinming
2026-06-09  9:33         ` [PATCH v3 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access ZhaoJinming
2026-06-09  9:47           ` sashiko-bot
2026-06-09  9:50         ` [PATCH v3 1/2] accel/rocket: Fix error path handling in rocket_job_run() sashiko-bot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.