Re: [RFC PATCH v4 5/9] accel: rocket: Keep the IOMMU domain attached across jobs

Devicetree
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: MidG971 <midgy971@gmail.com>
Cc: conor+dt@kernel.org, dri-devel@lists.freedesktop.org,
	devicetree@vger.kernel.org, robh@kernel.org
Subject: Re: [RFC PATCH v4 5/9] accel: rocket: Keep the IOMMU domain attached across jobs
Date: Sat, 13 Jun 2026 07:12:01 +0000	[thread overview]
Message-ID: <20260613071201.C64E11F000E9@smtp.kernel.org> (raw)
In-Reply-To: <20260613070116.438906-6-midgy971@gmail.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:

New issues:
- [High] rocket_core_fini() tears down the IOMMU domain without quiescing the hardware, causing IOMMU faults and wedged hardware.

Pre-existing issues:
- [High] Error paths in rocket_job_run() leak the PM runtime usage counter and return an unsignaled fence, causing permanent power drain and spurious timeouts.
- [High] Job timeout recovery leaves the NPU permanently powered on due to pm_runtime_put_noidle().
--

commit 629253998abe2060bd2b43f469323132b5743d36
Author: Midgy BALON <midgy971@gmail.com>
Subject: accel: rocket: Keep the IOMMU domain attached across jobs

This commit attaches the IOMMU domain once per context instead of toggling
it on every job to avoid hardware stall timeouts. The domain is released on
context swap, core reset, or core teardown.

> diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c
> index 779e951596a15..6c128f585cff4 100644
> --- a/drivers/accel/rocket/rocket_core.c
> +++ b/drivers/accel/rocket/rocket_core.c
[ ... ]
> @@ -112,9 +113,20 @@ void rocket_core_fini(struct rocket_core *core)
>  {
>  	pm_runtime_dont_use_autosuspend(core->dev);
>  	pm_runtime_disable(core->dev);
> +
> +	/*
> +	 * Stop the scheduler before tearing down the IOMMU so an in-flight
> +	 * job can no longer touch the (about to be detached) domain.
> +	 */
> +	rocket_job_fini(core);
> +
> +	if (core->attached_domain) {
> +		iommu_detach_group(NULL, core->iommu_group);

[Severity: High]
Does this detach the IOMMU page tables while the hardware is still active?

Calling drm_sched_fini() inside rocket_job_fini() only stops the software
scheduling thread, but it doesn't wait for the currently executing job on
the hardware to complete or reset the AXI master. Detaching the group while
the NPU is still executing could cause memory translation faults and wedge
the hardware.

Should the hardware be explicitly quiesced, for example by calling
rocket_core_reset(), before the IOMMU is detached here?

> +		rocket_iommu_domain_put(core->attached_domain);
> +		core->attached_domain = NULL;
> +	}
>  	iommu_group_put(core->iommu_group);
>  	core->iommu_group = NULL;
> -	rocket_job_fini(core);
>  }
>  
>  void rocket_core_reset(struct rocket_core *core)
> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index e25234261536b..368b2ebead1b3 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
[ ... ]
> @@ -314,9 +315,26 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
>  	if (ret < 0)
>  		return fence;
>  
> -	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
> -	if (ret < 0)
> -		return fence;

[Severity: High]
This isn't a bug introduced by this patch, but does the pm_runtime_get_sync
error path just above this leak the PM runtime usage counter? 

In rocket_job_run(), the code does:

	ret = pm_runtime_get_sync(core->dev);
	if (ret < 0)
		return fence;

If pm_runtime_get_sync() fails, returning without calling
pm_runtime_put_noidle() leaves the counter elevated.

> +	/*
> +	 * Attach the job's IOMMU domain only when it differs from the one
> +	 * already attached. Re-attaching per job toggles the rk_iommu
> +	 * stall/reset handshake on an idle NPU MMU, which is slow and
> +	 * noisy; keep the domain attached across jobs instead.
> +	 */
> +	if (core->attached_domain != job->domain) {
> +		if (core->attached_domain) {
> +			iommu_detach_group(NULL, core->iommu_group);
> +			rocket_iommu_domain_put(core->attached_domain);
> +			core->attached_domain = NULL;
> +		}
> +
> +		ret = iommu_attach_group(job->domain->domain, core->iommu_group);
> +		if (ret < 0)
> +			return fence;

[Severity: High]
This is a pre-existing issue, but does this error path leak the PM runtime
reference acquired earlier in this function? 

Since the in_flight_job isn't set yet when returning the fence here, the
subsequent timeout handler won't clean this up, preventing the device from
ever entering low power states.

> +
> +		kref_get(&job->domain->kref);
> +		core->attached_domain = job->domain;
> +	}
>  
>  	scoped_guard(mutex, &core->job_lock) {
>  		core->in_flight_job = job;
[ ... ]
> @@ -376,7 +393,15 @@ rocket_reset(struct rocket_core *core, struct drm_sched_job *bad)

[Severity: High]
This isn't a bug introduced by this patch, but does job timeout recovery
leave the NPU permanently powered on?

Earlier in rocket_reset(), the usage counter is decremented:

	scoped_guard(mutex, &core->job_lock) {
		if (core->in_flight_job)
			pm_runtime_put_noidle(core->dev);

		core->in_flight_job = NULL;
	}

Because pm_runtime_put_noidle() does not invoke an idle check or queue an
auto-suspend, doesn't the device remain in the RPM_ACTIVE state and stay
physically powered on indefinitely? Should pm_runtime_put_autosuspend() be
used here instead?

>  	 */
>  	rocket_core_reset(core);
>  
> -	iommu_detach_group(NULL, core->iommu_group);
> +	/*
> +	 * The reset wipes the IOMMU page-table base, so drop the attached
> +	 * domain to force the next job to re-attach and reprogram it.
> +	 */
> +	if (core->attached_domain) {
> +		iommu_detach_group(NULL, core->iommu_group);
> +		rocket_iommu_domain_put(core->attached_domain);
> +		core->attached_domain = NULL;
> +	}
>  
>  	/* NPU has been reset, we can clear the reset pending bit. */
>  	atomic_set(&core->reset.pending, 0);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260613070116.438906-1-midgy971@gmail.com?part=5

next prev parent reply	other threads:[~2026-06-13  7:12 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-13  7:01 [RFC PATCH v4 0/9] accel: rocket: Add RK3568 NPU support MidG971
2026-06-13  7:01 ` [RFC PATCH v4 1/9] accel: rocket: Introduce per-SoC rocket_soc_data MidG971
2026-06-13  7:10   ` sashiko-bot
2026-06-13  7:01 ` [RFC PATCH v4 2/9] accel: rocket: Derive DMA width and core count from match data MidG971
2026-06-13  7:09   ` sashiko-bot
2026-06-13  7:01 ` [RFC PATCH v4 3/9] accel: rocket: Add RK3568 SoC support MidG971
2026-06-13  7:11   ` sashiko-bot
2026-06-13  7:01 ` [RFC PATCH v4 4/9] accel: rocket: Reset the NPU before detaching the IOMMU on timeout MidG971
2026-06-13  7:15   ` sashiko-bot
2026-06-13  7:01 ` [RFC PATCH v4 5/9] accel: rocket: Keep the IOMMU domain attached across jobs MidG971
2026-06-13  7:12   ` sashiko-bot [this message]
2026-06-13  7:01 ` [RFC PATCH v4 6/9] dt-bindings: npu: rockchip,rk3588-rknn-core: Add RK3568 MidG971
2026-06-13  7:11   ` sashiko-bot
2026-06-13  7:01 ` [RFC PATCH v4 7/9] arm64: dts: rockchip: rk356x: Add the NPU and its IOMMU MidG971
2026-06-13  7:09   ` sashiko-bot
2026-06-13  8:18   ` Jonas Karlman
2026-06-13  7:01 ` [RFC PATCH v4 8/9] arm64: dts: rockchip: rk3568-rock-3b: Enable the NPU MidG971
2026-06-13  7:40   ` Jonas Karlman
2026-06-13  7:01 ` [RFC PATCH v4 9/9] pmdomain: rockchip: Add a regulator to the RK3568 NPU power domain MidG971
2026-06-13  7:21   ` sashiko-bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260613071201.C64E11F000E9@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=conor+dt@kernel.org \
    --cc=devicetree@vger.kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=midgy971@gmail.com \
    --cc=robh@kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox