Re: [RFC PATCH v4 4/9] accel: rocket: Reset the NPU before detaching the IOMMU on timeout

Devicetree
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: MidG971 <midgy971@gmail.com>
Cc: dri-devel@lists.freedesktop.org, conor+dt@kernel.org,
	robh@kernel.org, devicetree@vger.kernel.org
Subject: Re: [RFC PATCH v4 4/9] accel: rocket: Reset the NPU before detaching the IOMMU on timeout
Date: Sat, 13 Jun 2026 07:15:54 +0000	[thread overview]
Message-ID: <20260613071555.3E4A11F000E9@smtp.kernel.org> (raw)
In-Reply-To: <20260613070116.438906-5-midgy971@gmail.com>

Thank you for your contribution! Sashiko AI review found 7 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Double free of `rjob->tasks` when `rocket_copy_tasks` fails.
- [High] The shared IRQ handler accesses hardware registers without checking if the device is active via pm_runtime_get_if_active().
- [High] The runtime suspend callback powers down the hardware without synchronizing the shared IRQ.
- [High] Timeout recovery drops the PM usage count early via `pm_runtime_put_noidle()`, stranding the device in an active power state and introducing a race condition.
- [High] PM runtime usage counter leaks on error paths during job submission.
- [High] Driver removal cancels pending PM transitions without synchronous suspend.
- [High] Missing NULL check on `iommu_group_get()` leads to crash if device lacks an IOMMU.
--

commit 54a038a17307496e6d47b23eab270cebaf07629e
Author: Midgy BALON <midgy971@gmail.com>

accel: rocket: Reset the NPU before detaching the IOMMU on timeout

This commit fixes a timeout issue where detaching the IOMMU before resetting 
the wedged NPU AXI master causes a stall. By asserting the core reset first, 
the AXI master is quiesced and the IOMMU detach completes cleanly.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index ac51bff39833f..e25234261536b 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
> @@ -364,14 +364,20 @@ rocket_reset(struct rocket_core *core, struct drm_sched_job *bad)
>  		if (core->in_flight_job)
>  			pm_runtime_put_noidle(core->dev);

[Severity: High]
This is a pre-existing issue, but does using pm_runtime_put_noidle() here
leave the device stranded in an active power state?

Without restarting the autosuspend timer or scheduling an idle callback, the
usage counter is dropped while leaving the device active. 

Could this also create a race condition if an idle check is triggered
concurrently (such as via a sysfs write), causing the device to begin
suspending while the subsequent rocket_core_reset() is toggling reset lines?

>  
> -		iommu_detach_group(NULL, core->iommu_group);
> -
>  		core->in_flight_job = NULL;
>  	}
>  
> -	/* Proceed with reset now. */
> +	/*
> +	 * Reset the NPU hardware before detaching the IOMMU. A timed-out job
> +	 * leaves the NPU AXI master wedged; detaching the IOMMU then issues a
> +	 * stall request that never drains and times out (warning in the IOMMU
> +	 * core). Asserting the core reset first quiesces the master so the
> +	 * detach completes cleanly.
> +	 */
>  	rocket_core_reset(core);
>  
> +	iommu_detach_group(NULL, core->iommu_group);
> +
>  	/* NPU has been reset, we can clear the reset pending bit. */
>  	atomic_set(&core->reset.pending, 0);
>  

[Severity: High]
This isn't a bug introduced by this patch, but looking at the shared IRQ
handler:

drivers/accel/rocket/rocket_job.c:rocket_job_irq_handler() {
    ...
    u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
    ...
}

Is it possible that this accesses hardware registers without verifying if the
device is active?

If the NPU is suspended and another device triggers the shared interrupt line,
could this read from powered-off hardware and return invalid data
(like 0xffffffff)?

Might this be misinterpreted as a valid interrupt, leading to spurious
threaded IRQ execution and potential bus crashes?

Is it necessary to use pm_runtime_get_if_active() to verify the power state
first?

[Severity: High]
This isn't a bug introduced by this patch, but does the runtime suspend
callback power down hardware without synchronizing the shared IRQ?

drivers/accel/rocket/rocket_drv.c:rocket_device_runtime_suspend() {
    ...
    clk_bulk_disable_unprepare(ARRAY_SIZE(rdev->cores[core].clks),
                               rdev->cores[core].clks);
    ...
}

Since the driver uses a shared IRQ (IRQF_SHARED), could the handler execute 
mid-flight on another CPU while the hardware is actively being shut down?

Should this call synchronize_irq() before disabling hardware clocks to prevent
bus errors or system crashes?

[Severity: High]
This is a pre-existing issue, but does rocket_job_run() leak the PM usage
counter on error paths?

drivers/accel/rocket/rocket_job.c:rocket_job_run() {
    ...
    ret = pm_runtime_get_sync(core->dev);
    if (ret < 0)
        return fence;
    ...
    ret = iommu_attach_group(job->domain->domain, core->iommu_group);
    if (ret < 0)
        return fence;
    ...
}

If either pm_runtime_get_sync() or iommu_attach_group() fails, would returning
an unsignaled fence early without balancing the usage counter via
pm_runtime_put_noidle() permanently leak the PM reference?

Might this prevent the device from ever entering runtime suspend, while also
leaving the DRM scheduler stalled on a dead fence until it times out?

[Severity: High]
This is a pre-existing issue, but does driver removal cancel pending PM
transitions without a synchronous suspend?

drivers/accel/rocket/rocket_core.c:rocket_core_fini() {
    ...
    pm_runtime_disable(core->dev);
    ...
}

Because pm_runtime_disable() cancels pending asynchronous transitions (such as
the autosuspend timer), wouldn't the NPU hardware clocks be left running
indefinitely after the driver is removed and the device is unbound?

Is a call to pm_runtime_put_sync() needed before disabling runtime PM?

[Severity: High]
This isn't a bug introduced by this patch, but is there a missing NULL check 
when fetching the IOMMU group?

drivers/accel/rocket/rocket_core.c:rocket_core_init() {
    ...
    core->iommu_group = iommu_group_get(dev);
    ...
}

If the device lacks an IOMMU (for example, if it is disabled by the user or
failed to probe), will iommu_group_get() return NULL?

If so, when rocket_job_run() later executes:

drivers/accel/rocket/rocket_job.c:rocket_job_run() {
    ...
    ret = iommu_attach_group(job->domain->domain, core->iommu_group);
    ...
}

Would this unconditionally dereference the group pointer and cause a kernel
panic on the first job submission?

[Severity: Critical]
This is a pre-existing issue, but does a failure in rocket_copy_tasks() lead 
to a double free?

drivers/accel/rocket/rocket_job.c:rocket_copy_tasks() {
    ...
fail:
    kvfree(rjob->tasks);
    return ret;
}

If this fails (such as from an invalid pointer during copy_from_user()), the
tasks array is freed, but is rjob->tasks left as a dangling pointer?

When the caller jumps to out_cleanup_job and invokes rocket_job_put(rjob),
it reaches:

drivers/accel/rocket/rocket_job.c:rocket_job_cleanup() {
    ...
    kvfree(job->tasks);
    ...
}

Will this result in freeing the dangling pointer a second time?

Can this double free corrupt kernel memory or cause a panic, potentially
being exploitable by an unprivileged user?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260613070116.438906-1-midgy971@gmail.com?part=4

next prev parent reply	other threads:[~2026-06-13  7:15 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-13  7:01 [RFC PATCH v4 0/9] accel: rocket: Add RK3568 NPU support MidG971
2026-06-13  7:01 ` [RFC PATCH v4 1/9] accel: rocket: Introduce per-SoC rocket_soc_data MidG971
2026-06-13  7:10   ` sashiko-bot
2026-06-13  7:01 ` [RFC PATCH v4 2/9] accel: rocket: Derive DMA width and core count from match data MidG971
2026-06-13  7:09   ` sashiko-bot
2026-06-13  7:01 ` [RFC PATCH v4 3/9] accel: rocket: Add RK3568 SoC support MidG971
2026-06-13  7:11   ` sashiko-bot
2026-06-13  7:01 ` [RFC PATCH v4 4/9] accel: rocket: Reset the NPU before detaching the IOMMU on timeout MidG971
2026-06-13  7:15   ` sashiko-bot [this message]
2026-06-13  7:01 ` [RFC PATCH v4 5/9] accel: rocket: Keep the IOMMU domain attached across jobs MidG971
2026-06-13  7:12   ` sashiko-bot
2026-06-13  7:01 ` [RFC PATCH v4 6/9] dt-bindings: npu: rockchip,rk3588-rknn-core: Add RK3568 MidG971
2026-06-13  7:11   ` sashiko-bot
2026-06-13  7:01 ` [RFC PATCH v4 7/9] arm64: dts: rockchip: rk356x: Add the NPU and its IOMMU MidG971
2026-06-13  7:09   ` sashiko-bot
2026-06-13  8:18   ` Jonas Karlman
2026-06-13  7:01 ` [RFC PATCH v4 8/9] arm64: dts: rockchip: rk3568-rock-3b: Enable the NPU MidG971
2026-06-13  7:40   ` Jonas Karlman
2026-06-13  7:01 ` [RFC PATCH v4 9/9] pmdomain: rockchip: Add a regulator to the RK3568 NPU power domain MidG971
2026-06-13  7:21   ` sashiko-bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260613071555.3E4A11F000E9@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=conor+dt@kernel.org \
    --cc=devicetree@vger.kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=midgy971@gmail.com \
    --cc=robh@kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox