From: sashiko-bot@kernel.org
To: "Midgy BALON" <midgy971@gmail.com>
Cc: conor+dt@kernel.org, robh@kernel.org, devicetree@vger.kernel.org,
dri-devel@lists.freedesktop.org
Subject: Re: [RFC PATCH v3 5/9] accel: rocket: Keep the IOMMU domain attached across jobs
Date: Thu, 04 Jun 2026 14:08:02 +0000 [thread overview]
Message-ID: <20260604140803.150401F00893@smtp.kernel.org> (raw)
In-Reply-To: <20260604135255.62682-6-midgy971@gmail.com>
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
New issues:
- [High] The IOMMU hardware registers are left uninitialized after a job timeout and hardware reset, causing subsequent jobs to immediately fault.
Pre-existing issues:
- [High] Incorrect teardown ordering in `rocket_core_fini()` destroys IOMMU and PM resources before stopping the DRM scheduler, leading to data races, use-after-free, and potential hardware faults.
- [High] PM runtime usage counts and unsignaled fences are leaked on error paths in `rocket_job_run()`.
--
commit 49c877f1ba7baae4a28268850b55e2ac3ab4a8cc
Author: Midgy BALON <midgy971@gmail.com>
accel: rocket: Keep the IOMMU domain attached across jobs
rocket attached the job's IOMMU domain in rocket_job_run() and
detached it again on every completion and reset. Each attach/detach
toggles the rk_iommu stall/force-reset/paging handshake, and on
RK3568 the NPU MMU is idle between jobs, so that handshake times out
and logs a burst of "stall/paging request timed out" errors for
every job.
Attach the per-context domain once and keep it: track the attached
domain in the core, swap it only when a job from a different context
runs, and detach it at core teardown. A reference on the attached
domain is held so it outlives the job that first attached it and is
released on swap/teardown.
> diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c
> index a8de876365873..634f78dfe2887 100644
> --- a/drivers/accel/rocket/rocket_core.c
> +++ b/drivers/accel/rocket/rocket_core.c
[ ... ]
> @@ -112,6 +113,11 @@ void rocket_core_fini(struct rocket_core *core)
> {
> pm_runtime_dont_use_autosuspend(core->dev);
> pm_runtime_disable(core->dev);
> + if (core->attached_domain) {
> + iommu_detach_group(NULL, core->iommu_group);
> + rocket_iommu_domain_put(core->attached_domain);
> + core->attached_domain = NULL;
> + }
> iommu_group_put(core->iommu_group);
> core->iommu_group = NULL;
[Severity: High]
This is a pre-existing issue, but IOMMU resources are being torn down before
the DRM scheduler is stopped in rocket_job_fini(core).
Because the scheduler might still be active, couldn't an in-flight job be
executing concurrently? This could lead to a data race on core->attached_domain,
the IOMMU domain being forcefully unmapped from a running job causing
AXI bus faults, or a use-after-free when rocket_job_run() accesses
core->iommu_group.
Should rocket_job_fini(core) be called before tearing down the IOMMU state to
ensure the hardware is idle and the scheduler is stopped?
> rocket_job_fini(core);
> }
[ ... ]
> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index e25234261536b..b248371be8a1e 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
[ ... ]
> @@ -314,9 +315,26 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
> if (ret < 0)
> return fence;
[Severity: High]
This isn't a bug introduced by this patch, but if pm_runtime_get_sync()
fails, it returns a negative error code but still increments the PM usage counter.
The error path returns without calling pm_runtime_put_noidle() or signaling
the fence, which leaks the PM counter and causes the scheduler to wait indefinitely.
> - ret = iommu_attach_group(job->domain->domain, core->iommu_group);
> - if (ret < 0)
> - return fence;
> + /*
> + * Attach the job's IOMMU domain only when it differs from the one
> + * already attached. Re-attaching per job toggles the rk_iommu
> + * stall/reset handshake on an idle NPU MMU, which is slow and
> + * noisy; keep the domain attached across jobs instead.
> + */
> + if (core->attached_domain != job->domain) {
> + if (core->attached_domain) {
> + iommu_detach_group(NULL, core->iommu_group);
> + rocket_iommu_domain_put(core->attached_domain);
> + core->attached_domain = NULL;
> + }
> +
> + ret = iommu_attach_group(job->domain->domain, core->iommu_group);
> + if (ret < 0)
> + return fence;
[Severity: High]
Also a pre-existing issue, but if iommu_attach_group() fails, it returns
the fence early without dropping the successfully acquired PM reference.
Should this path include a pm_runtime_put_noidle() call to avoid leaking
the reference?
> +
> + kref_get(&job->domain->kref);
> + core->attached_domain = job->domain;
> + }
>
> scoped_guard(mutex, &core->job_lock) {
> core->in_flight_job = job;
[ ... ]
> @@ -376,8 +393,6 @@ rocket_reset(struct rocket_core *core, struct drm_sched_job *bad)
> */
> rocket_core_reset(core);
>
> - iommu_detach_group(NULL, core->iommu_group);
> -
> /* NPU has been reset, we can clear the reset pending bit. */
> atomic_set(&core->reset.pending, 0);
[Severity: High]
When the NPU is hard-reset by rocket_core_reset(), the IOMMU's hardware registers
(like page table base addresses) are wiped.
Since core->attached_domain is not cleared here, when the scheduler restarts
and executes the next job, rocket_job_run() will see core->attached_domain
matches the job's domain and skip calling iommu_attach_group().
Will this leave the IOMMU hardware unconfigured and cause subsequent jobs to
immediately throw DMA faults? Should core->attached_domain be cleared here
to force a re-attach and reprogramming of the IOMMU?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260604135255.62682-1-midgy971@gmail.com?part=5
next prev parent reply other threads:[~2026-06-04 14:08 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-04 13:52 [RFC PATCH v3 0/9] accel: rocket: Add RK3568 NPU support Midgy BALON
2026-06-04 13:52 ` [RFC PATCH v3 1/9] accel: rocket: Introduce per-SoC rocket_soc_data Midgy BALON
2026-06-04 14:08 ` sashiko-bot
2026-06-04 13:52 ` [RFC PATCH v3 2/9] accel: rocket: Derive DMA width and core count from match data Midgy BALON
2026-06-04 14:05 ` sashiko-bot
2026-06-04 13:52 ` [RFC PATCH v3 3/9] accel: rocket: Add RK3568 SoC support Midgy BALON
2026-06-04 14:05 ` sashiko-bot
2026-06-04 13:52 ` [RFC PATCH v3 4/9] accel: rocket: Reset the NPU before detaching the IOMMU on timeout Midgy BALON
2026-06-04 14:10 ` sashiko-bot
2026-06-04 13:52 ` [RFC PATCH v3 5/9] accel: rocket: Keep the IOMMU domain attached across jobs Midgy BALON
2026-06-04 14:08 ` sashiko-bot [this message]
2026-06-04 13:52 ` [RFC PATCH v3 6/9] iommu/rockchip: Clear AUTO_GATING bit 1 on the RK356x v1 IOMMU Midgy BALON
2026-06-04 14:04 ` sashiko-bot
2026-06-04 14:20 ` Tomeu Vizoso
2026-06-04 13:52 ` [RFC PATCH v3 7/9] dt-bindings: npu: rockchip,rk3588-rknn-core: Add RK3568 Midgy BALON
2026-06-04 14:08 ` sashiko-bot
2026-06-04 16:55 ` Conor Dooley
2026-06-04 13:52 ` [RFC PATCH v3 8/9] arm64: dts: rockchip: rk356x: Add the NPU and its IOMMU Midgy BALON
2026-06-04 14:11 ` sashiko-bot
2026-06-04 13:52 ` [RFC PATCH v3 9/9] arm64: dts: rockchip: rk3568-rock-3b: Enable the NPU Midgy BALON
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260604140803.150401F00893@smtp.kernel.org \
--to=sashiko-bot@kernel.org \
--cc=conor+dt@kernel.org \
--cc=devicetree@vger.kernel.org \
--cc=dri-devel@lists.freedesktop.org \
--cc=midgy971@gmail.com \
--cc=robh@kernel.org \
--cc=sashiko-reviews@lists.linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox