From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id DC270344D88
	for <devicetree@vger.kernel.org>; Sat, 13 Jun 2026 07:15:55 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1781334957; cv=none; b=jJ3LMghXhtz1rkw4CMrJ7WO7UkXCM+y1KuTG1ZgwCum27Mh9clfi30XeqwQd/unD4uL5YXyAeVQp1ePRZ6KwHfF5EBDdw5n+mXxbzD+HstPyUYYEfI9c+aGzpBGlCSL1+nkgg6kPHpmaS3E+xAVXWYMzIuTrQNzqwEbM2SuvgnM=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1781334957; c=relaxed/simple;
	bh=jO43Tt5vAqBuV/f1FeLBTerCGz9p8FpcGDYzhLaJ+0U=;
	h=From:Subject:To:Cc:In-Reply-To:References:Content-Type:Date:
	 Message-Id; b=GhntkZAHAOj41yCNQdfoBxjTcPx8VN/jNOk37g2aKQ7P7nrfzS8Q+sRKBOkIkwu4pBtZYEgsPrM/ns/PXXXasv9mIc0iRpn/bVqc3+MQmMzAG4N2qRL+CUXl/4ae7WLaoMevDO66mkVErIJ21cJLumsSholHbYT5vsRiw+pNwi0=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=Czk0Pbia; arc=none smtp.client-ip=100.103.45.18
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="Czk0Pbia"
Received: by smtp.kernel.org (Postfix) with ESMTPSA id 3E4A11F000E9;
	Sat, 13 Jun 2026 07:15:55 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org;
	s=k20260515; t=1781334955;
	bh=OR9er5a0qh/SaSMqPI4wPpGfPB6L2mMsLIi2N2hTHYE=;
	h=From:Subject:Reply-To:To:Cc:In-Reply-To:References:Date;
	b=Czk0PbiavbuNGywdutZfh3aB961yzUOAFeutuvCvg91vyHMV4f0hF5LlL1v7foF5r
	 3f9Px0dwutfBdcrF/SMD7tOWK906gQNa83txM/cxxIBEuezgCSrUTgEnYVUCrnh89c
	 8vfJnrbiJhEtGZIeH9Xh0+V2PoCZpl4Jiy55AAcsmh/GZV99Hcq1zOf185jl2kdTKO
	 qK+UcJplVNfGiy1LhKWVzbloSGJztSprz6gWTQI/sLaof2AJZMqxUiAbzAmtMlI1Mz
	 uGs0qJbMcUsm0giEhgpiX/Ch1eiwjEF/tzJAdxmdet47sqJNmS+KgwLvYfVhGQSGeK
	 jDup2kGB/1sNA==
From: sashiko-bot@kernel.org
Subject: Re: [RFC PATCH v4 4/9] accel: rocket: Reset the NPU before
 detaching the IOMMU on timeout
Reply-To: sashiko-reviews@lists.linux.dev
To: MidG971 <midgy971@gmail.com>
Cc: dri-devel@lists.freedesktop.org, conor+dt@kernel.org, robh@kernel.org, devicetree@vger.kernel.org
In-Reply-To: <20260613070116.438906-5-midgy971@gmail.com>
References: <20260613070116.438906-5-midgy971@gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Date: Sat, 13 Jun 2026 07:15:54 +0000
Message-Id: <20260613071555.3E4A11F000E9@smtp.kernel.org>
Precedence: bulk
X-Mailing-List: devicetree@vger.kernel.org
List-Id: <devicetree.vger.kernel.org>
List-Subscribe: <mailto:devicetree+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:devicetree+unsubscribe@vger.kernel.org>

Thank you for your contribution! Sashiko AI review found 7 potential issue(=
s) to consider:

Pre-existing issues:
- [Critical] Double free of `rjob->tasks` when `rocket_copy_tasks` fails.
- [High] The shared IRQ handler accesses hardware registers without checkin=
g if the device is active via pm_runtime_get_if_active().
- [High] The runtime suspend callback powers down the hardware without sync=
hronizing the shared IRQ.
- [High] Timeout recovery drops the PM usage count early via `pm_runtime_pu=
t_noidle()`, stranding the device in an active power state and introducing =
a race condition.
- [High] PM runtime usage counter leaks on error paths during job submissio=
n.
- [High] Driver removal cancels pending PM transitions without synchronous =
suspend.
- [High] Missing NULL check on `iommu_group_get()` leads to crash if device=
 lacks an IOMMU.
--

commit 54a038a17307496e6d47b23eab270cebaf07629e
Author: Midgy BALON <midgy971@gmail.com>

accel: rocket: Reset the NPU before detaching the IOMMU on timeout

This commit fixes a timeout issue where detaching the IOMMU before resettin=
g=20
the wedged NPU AXI master causes a stall. By asserting the core reset first=
,=20
the AXI master is quiesced and the IOMMU detach completes cleanly.

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/roc=
ket_job.c
> index ac51bff39833f..e25234261536b 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
> @@ -364,14 +364,20 @@ rocket_reset(struct rocket_core *core, struct drm_s=
ched_job *bad)
>  		if (core->in_flight_job)
>  			pm_runtime_put_noidle(core->dev);

[Severity: High]
This is a pre-existing issue, but does using pm_runtime_put_noidle() here
leave the device stranded in an active power state?

Without restarting the autosuspend timer or scheduling an idle callback, the
usage counter is dropped while leaving the device active.=20

Could this also create a race condition if an idle check is triggered
concurrently (such as via a sysfs write), causing the device to begin
suspending while the subsequent rocket_core_reset() is toggling reset lines?

> =20
> -		iommu_detach_group(NULL, core->iommu_group);
> -
>  		core->in_flight_job =3D NULL;
>  	}
> =20
> -	/* Proceed with reset now. */
> +	/*
> +	 * Reset the NPU hardware before detaching the IOMMU. A timed-out job
> +	 * leaves the NPU AXI master wedged; detaching the IOMMU then issues a
> +	 * stall request that never drains and times out (warning in the IOMMU
> +	 * core). Asserting the core reset first quiesces the master so the
> +	 * detach completes cleanly.
> +	 */
>  	rocket_core_reset(core);
> =20
> +	iommu_detach_group(NULL, core->iommu_group);
> +
>  	/* NPU has been reset, we can clear the reset pending bit. */
>  	atomic_set(&core->reset.pending, 0);
> =20

[Severity: High]
This isn't a bug introduced by this patch, but looking at the shared IRQ
handler:

drivers/accel/rocket/rocket_job.c:rocket_job_irq_handler() {
    ...
    u32 raw_status =3D rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
    ...
}

Is it possible that this accesses hardware registers without verifying if t=
he
device is active?

If the NPU is suspended and another device triggers the shared interrupt li=
ne,
could this read from powered-off hardware and return invalid data
(like 0xffffffff)?

Might this be misinterpreted as a valid interrupt, leading to spurious
threaded IRQ execution and potential bus crashes?

Is it necessary to use pm_runtime_get_if_active() to verify the power state
first?

[Severity: High]
This isn't a bug introduced by this patch, but does the runtime suspend
callback power down hardware without synchronizing the shared IRQ?

drivers/accel/rocket/rocket_drv.c:rocket_device_runtime_suspend() {
    ...
    clk_bulk_disable_unprepare(ARRAY_SIZE(rdev->cores[core].clks),
                               rdev->cores[core].clks);
    ...
}

Since the driver uses a shared IRQ (IRQF_SHARED), could the handler execute=
=20
mid-flight on another CPU while the hardware is actively being shut down?

Should this call synchronize_irq() before disabling hardware clocks to prev=
ent
bus errors or system crashes?

[Severity: High]
This is a pre-existing issue, but does rocket_job_run() leak the PM usage
counter on error paths?

drivers/accel/rocket/rocket_job.c:rocket_job_run() {
    ...
    ret =3D pm_runtime_get_sync(core->dev);
    if (ret < 0)
        return fence;
    ...
    ret =3D iommu_attach_group(job->domain->domain, core->iommu_group);
    if (ret < 0)
        return fence;
    ...
}

If either pm_runtime_get_sync() or iommu_attach_group() fails, would return=
ing
an unsignaled fence early without balancing the usage counter via
pm_runtime_put_noidle() permanently leak the PM reference?

Might this prevent the device from ever entering runtime suspend, while also
leaving the DRM scheduler stalled on a dead fence until it times out?

[Severity: High]
This is a pre-existing issue, but does driver removal cancel pending PM
transitions without a synchronous suspend?

drivers/accel/rocket/rocket_core.c:rocket_core_fini() {
    ...
    pm_runtime_disable(core->dev);
    ...
}

Because pm_runtime_disable() cancels pending asynchronous transitions (such=
 as
the autosuspend timer), wouldn't the NPU hardware clocks be left running
indefinitely after the driver is removed and the device is unbound?

Is a call to pm_runtime_put_sync() needed before disabling runtime PM?

[Severity: High]
This isn't a bug introduced by this patch, but is there a missing NULL chec=
k=20
when fetching the IOMMU group?

drivers/accel/rocket/rocket_core.c:rocket_core_init() {
    ...
    core->iommu_group =3D iommu_group_get(dev);
    ...
}

If the device lacks an IOMMU (for example, if it is disabled by the user or
failed to probe), will iommu_group_get() return NULL?

If so, when rocket_job_run() later executes:

drivers/accel/rocket/rocket_job.c:rocket_job_run() {
    ...
    ret =3D iommu_attach_group(job->domain->domain, core->iommu_group);
    ...
}

Would this unconditionally dereference the group pointer and cause a kernel
panic on the first job submission?

[Severity: Critical]
This is a pre-existing issue, but does a failure in rocket_copy_tasks() lea=
d=20
to a double free?

drivers/accel/rocket/rocket_job.c:rocket_copy_tasks() {
    ...
fail:
    kvfree(rjob->tasks);
    return ret;
}

If this fails (such as from an invalid pointer during copy_from_user()), the
tasks array is freed, but is rjob->tasks left as a dangling pointer?

When the caller jumps to out_cleanup_job and invokes rocket_job_put(rjob),
it reaches:

drivers/accel/rocket/rocket_job.c:rocket_job_cleanup() {
    ...
    kvfree(job->tasks);
    ...
}

Will this result in freeing the dangling pointer a second time?

Can this double free corrupt kernel memory or cause a panic, potentially
being exploitable by an unprivileged user?

--=20
Sashiko AI review =C2=B7 https://sashiko.dev/#/patchset/20260613070116.4389=
06-1-midgy971@gmail.com?part=3D4