Re: [RFC PATCH v3 4/9] accel: rocket: Reset the NPU before detaching the IOMMU on timeout

Devicetree
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Midgy BALON" <midgy971@gmail.com>
Cc: conor+dt@kernel.org, devicetree@vger.kernel.org, robh@kernel.org,
	dri-devel@lists.freedesktop.org
Subject: Re: [RFC PATCH v3 4/9] accel: rocket: Reset the NPU before detaching the IOMMU on timeout
Date: Thu, 04 Jun 2026 14:10:50 +0000	[thread overview]
Message-ID: <20260604141050.D38211F00893@smtp.kernel.org> (raw)
In-Reply-To: <20260604135255.62682-5-midgy971@gmail.com>

Thank you for your contribution! Sashiko AI review found 9 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Double free of rjob->tasks array when copy_from_user() fails.
- [High] The shared IRQ handler unconditionally accesses hardware registers without verifying the device's PM active state.
- [High] Missing PM reference drop and unsignaled fence return on error paths in `rocket_job_run()`.
- [High] Concurrent hardware reset and IRQ handler execution can cause an external abort.
- [High] Unbalanced teardown order in rocket_core_fini() causes UAF and data races.
- [High] NULL pointer dereference of job->domain in rocket_job_cleanup() on early error paths.
- [High] Missing `synchronize_irq()` in runtime suspend callback.
- [Medium] The TDR reset path permanently leaks power by preventing the device from entering runtime suspend.
- [Medium] The hardware reset is performed after dropping the PM runtime reference, potentially targeting an unpowered device.
--

commit 323afa3b038052ef6c02ee0548658a65126910a0
Author: Midgy BALON <midgy971@gmail.com>

accel: rocket: Reset the NPU before detaching the IOMMU on timeout

On a job timeout the NPU AXI master can be left wedged with outstanding
transactions. rocket_reset() detached the IOMMU group before resetting
the hardware, causing the IOMMU core to stall. Assert the core reset
first to quiesce the master so the detach completes cleanly.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
> @@ -364,14 +364,20 @@ rocket_reset(struct rocket_core *core, struct drm_sched_job *bad)
>  		if (core->in_flight_job)
>  			pm_runtime_put_noidle(core->dev);

[Severity: Medium]
This is a pre-existing issue, but does this TDR reset path permanently leak
power by preventing the device from entering runtime suspend?

It appears pm_runtime_put_noidle() drops the reference without triggering an
idle power-down. If the timed-out job was the last active job, the usage count
reaches zero but the device remains in the RPM_ACTIVE state indefinitely,
staying powered on until another job is submitted and correctly triggers
autosuspend.

> -		iommu_detach_group(NULL, core->iommu_group);
> -
>  		core->in_flight_job = NULL;
>  	}
>  
> -	/* Proceed with reset now. */
> +	/*
> +	 * Reset the NPU hardware before detaching the IOMMU. A timed-out job
> +	 * leaves the NPU AXI master wedged; detaching the IOMMU then issues a
> +	 * stall request that never drains and times out (warning in the IOMMU
> +	 * core). Asserting the core reset first quiesces the master so the
> +	 * detach completes cleanly.
> +	 */
>  	rocket_core_reset(core);

[Severity: High]
This isn't a bug introduced by this patch, but can a concurrent hardware reset
and IRQ handler execution cause an external abort?

rocket_core_reset() asserts a hardware reset without first disabling device
interrupts. If an interrupt fires just before or during the reset,
rocket_job_irq_handler() will execute and perform MMIO reads and writes to the
bus while the hardware is actively in reset.

[Severity: Medium]
This is also a pre-existing issue, but is it safe to perform the hardware
reset after dropping the PM runtime reference?

Dropping the PM reference via pm_runtime_put_noidle() earlier decrements the
usage count. A concurrently scheduled autosuspend timer could expire precisely
in this window and power down the device before or during the reset operations.

Asserting resets on a suspended device can lead to system bus faults. Should
the PM reference be dropped after the hardware reset is complete?

> +
> +	iommu_detach_group(NULL, core->iommu_group);
> +
>  	/* NPU has been reset, we can clear the reset pending bit. */
>  	atomic_set(&core->reset.pending, 0);

[Severity: High]
This isn't a bug introduced by this patch, but does the shared IRQ handler
unconditionally access hardware registers without verifying the device's PM
active state?

drivers/accel/rocket/rocket_job.c:rocket_job_irq_handler() {
    ...
    u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
    ...
}

The driver registers rocket_job_irq_handler() with the IRQF_SHARED flag.
Without first calling pm_runtime_get_if_active(), if the NPU is runtime
suspended, this read accesses an unpowered register domain. This typically
returns 0xffffffff (causing spurious handling) or triggers a synchronous
external abort.

[Severity: High]
This is a pre-existing issue, but does rocket_job_run() leak the PM reference
and return an unsignaled fence on error paths?

drivers/accel/rocket/rocket_job.c:rocket_job_run() {
    ...
    ret = pm_runtime_get_sync(core->dev);
    if (ret < 0)
        return fence;
    ...
}

If pm_runtime_get_sync() or iommu_attach_group() fails, the function returns
a newly created, un-signaled fence instead of an error pointer, and leaves
the PM usage counter incremented. Because the hardware submission step is
skipped, the hardware will never signal this fence, causing the DRM scheduler
to hang indefinitely.

[Severity: High]
This is a pre-existing issue, but does the teardown order in rocket_core_fini()
cause a use-after-free and data races?

drivers/accel/rocket/rocket_core.c:rocket_core_fini() {
    ...
    iommu_group_put(core->iommu_group);
    core->iommu_group = NULL;
    rocket_job_fini(core);
    ...
}

Resources are torn down in the wrong order. The iommu_group and attached_domain
members are freed, and pm_runtime_disable() is called before rocket_job_fini().

If a scheduled job is processed by the DRM scheduler during this window,
rocket_job_run() will concurrently access the now-NULL iommu_group without
locking. Should rocket_job_fini() be called first to stop the scheduler and
wait for jobs?

[Severity: Critical]
This isn't a bug introduced by this patch, but is there a double free of the
rjob->tasks array when copy_from_user() fails?

drivers/accel/rocket/rocket_job.c:rocket_copy_tasks() {
    ...
fail:
    kvfree(rjob->tasks);
    return ret;
}

If copy_from_user() fails, it frees rjob->tasks but leaves the dangling pointer.

The function returns an error, prompting rocket_ioctl_submit_job() to jump to
out_cleanup_job and call rocket_job_put(rjob). This invokes
rocket_job_cleanup(), which unconditionally calls kvfree(job->tasks),
resulting in a double-free.

[Severity: High]
This is a pre-existing issue, but does rocket_job_cleanup() dereference a NULL
job->domain on early error paths?

drivers/accel/rocket/rocket_job.c:rocket_job_cleanup() {
    ...
    rocket_iommu_domain_put(job->domain);
    ...
}

The job->domain structure is initially NULL and only populated late in the
function. If an earlier step fails (such as rocket_copy_tasks()), the cleanup
unconditionally calls rocket_iommu_domain_put(job->domain).

This wrapper dereferences the domain struct to access its kref without a
NULL check:

drivers/accel/rocket/rocket_drv.c:rocket_iommu_domain_put() {
    kref_put(&domain->kref, rocket_iommu_domain_destroy);
}

[Severity: High]
This isn't a bug introduced by this patch, but is there a missing
synchronize_irq() in the runtime suspend callback?

drivers/accel/rocket/rocket_drv.c:rocket_device_runtime_suspend() {
    ...
    clk_bulk_disable_unprepare(ARRAY_SIZE(rdev->cores[core].clks),
                               rdev->cores[core].clks);
    ...
}

The driver registers a shared interrupt but fails to call synchronize_irq().
Disabling the clocks while the IRQ handler is executing concurrently on another
CPU could cause an AXI bus fault or a hard system crash.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260604135255.62682-1-midgy971@gmail.com?part=4

next prev parent reply	other threads:[~2026-06-04 14:10 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-04 13:52 [RFC PATCH v3 0/9] accel: rocket: Add RK3568 NPU support Midgy BALON
2026-06-04 13:52 ` [RFC PATCH v3 1/9] accel: rocket: Introduce per-SoC rocket_soc_data Midgy BALON
2026-06-04 14:08   ` sashiko-bot
2026-06-04 13:52 ` [RFC PATCH v3 2/9] accel: rocket: Derive DMA width and core count from match data Midgy BALON
2026-06-04 14:05   ` sashiko-bot
2026-06-04 13:52 ` [RFC PATCH v3 3/9] accel: rocket: Add RK3568 SoC support Midgy BALON
2026-06-04 14:05   ` sashiko-bot
2026-06-04 13:52 ` [RFC PATCH v3 4/9] accel: rocket: Reset the NPU before detaching the IOMMU on timeout Midgy BALON
2026-06-04 14:10   ` sashiko-bot [this message]
2026-06-04 13:52 ` [RFC PATCH v3 5/9] accel: rocket: Keep the IOMMU domain attached across jobs Midgy BALON
2026-06-04 14:08   ` sashiko-bot
2026-06-04 13:52 ` [RFC PATCH v3 6/9] iommu/rockchip: Clear AUTO_GATING bit 1 on the RK356x v1 IOMMU Midgy BALON
2026-06-04 14:04   ` sashiko-bot
2026-06-04 14:20   ` Tomeu Vizoso
2026-06-04 13:52 ` [RFC PATCH v3 7/9] dt-bindings: npu: rockchip,rk3588-rknn-core: Add RK3568 Midgy BALON
2026-06-04 14:08   ` sashiko-bot
2026-06-04 16:55     ` Conor Dooley
2026-06-04 13:52 ` [RFC PATCH v3 8/9] arm64: dts: rockchip: rk356x: Add the NPU and its IOMMU Midgy BALON
2026-06-04 14:11   ` sashiko-bot
2026-06-04 13:52 ` [RFC PATCH v3 9/9] arm64: dts: rockchip: rk3568-rock-3b: Enable the NPU Midgy BALON

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260604141050.D38211F00893@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=conor+dt@kernel.org \
    --cc=devicetree@vger.kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=midgy971@gmail.com \
    --cc=robh@kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox