Re: [RFC PATCH v3 1/9] accel: rocket: Introduce per-SoC rocket_soc_data

Devicetree
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Midgy BALON" <midgy971@gmail.com>
Cc: devicetree@vger.kernel.org, robh@kernel.org,
	dri-devel@lists.freedesktop.org, conor+dt@kernel.org
Subject: Re: [RFC PATCH v3 1/9] accel: rocket: Introduce per-SoC rocket_soc_data
Date: Thu, 04 Jun 2026 14:08:51 +0000	[thread overview]
Message-ID: <20260604140851.85DB71F0089A@smtp.kernel.org> (raw)
In-Reply-To: <20260604135255.62682-2-midgy971@gmail.com>

Thank you for your contribution! Sashiko AI review found 11 potential issue(s) to consider:

Pre-existing issues:
- [Critical] kref_put() dereferences a NULL pointer when job initialization fails early or a BO is imported.
- [High] Error path in BO creation frees an object that is already owned by the handle table.
- [High] Device is permanently left in an active power state upon unbind.
- [High] Array mismanagement during core removal prevents multi-core unbind.
- [High] Memory and DRM device leak on probe failure or deferral.
- [High] Asynchronous reset work is canceled after the DRM scheduler is destroyed.
- [High] The DRM device is registered and exposed to userspace before the NPU cores are initialized.
- [High] The global `rdev` pointer and `rdev->num_cores` are accessed and modified locklessly during probe and removal.
- [High] The shared interrupt handler accesses hardware registers without checking the PM state, and suspend executes without synchronizing IRQs.
- [High] A failure in `rocket_job_run` leaks the PM usage counter and returns an unsignaled hardware fence, hanging the DRM scheduler.
- [High] Registers are accessed during initialization before resets are deasserted.
--

commit e6cc2e46cba8fcba41d70a8b1e9422607b444db8
Author: Midgy BALON <midgy971@gmail.com>

accel: rocket: Introduce per-SoC rocket_soc_data

Add a per-SoC data structure carried in the OF match table, currently
holding only the NPU AXI address width, and use it for the per-core DMA
mask instead of a hardcoded 40-bit value.

> diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c
> index b3b2fa9ba645a..09c445af7de73 100644
> --- a/drivers/accel/rocket/rocket_core.c
> +++ b/drivers/accel/rocket/rocket_core.c
> @@ -21,6 +22,10 @@ int rocket_core_init(struct rocket_core *core)
>  	u32 version;
>  	int err = 0;
>  
> +	core->soc_data = of_device_get_match_data(dev);
> +	if (!core->soc_data)
> +		return dev_err_probe(dev, -EINVAL, "missing SoC match data\n");
> +

[Severity: High]
This is a pre-existing issue, but I noticed that later in this function,
registers are accessed before resets are deasserted:

drivers/accel/rocket/rocket_core.c:rocket_core_init() {
	...
	core->resets[0].id = "srst_a";
	core->resets[1].id = "srst_h";
	err = devm_reset_control_bulk_get_exclusive(&pdev->dev, ARRAY_SIZE(core->resets),
						    core->resets);
	...
	version = rocket_pc_readl(core, VERSION);
	version += rocket_pc_readl(core, VERSION_NUM) & 0xffff;
	...
}

Will this cause a bus stall and trigger an asynchronous external abort if the
hardware or bootloader leaves these resets asserted?

[Severity: High]
This is a pre-existing issue, but does rocket_core_fini() leave the device
permanently powered on with its clocks running after the driver has been
detached?

drivers/accel/rocket/rocket_core.c:rocket_core_fini() {
	pm_runtime_dont_use_autosuspend(core->dev);
	pm_runtime_disable(core->dev);
	...
}

Since rocket_core_init() drops its Runtime PM reference asynchronously via
pm_runtime_put_autosuspend(dev), if the driver is unbound before the
autosuspend delay elapses, pm_runtime_disable() forcefully cancels the
pending asynchronous suspend work without first synchronously powering down
the hardware (e.g., via pm_runtime_put_sync()).

>  	core->resets[0].id = "srst_a";
>  	core->resets[1].id = "srst_h";
>  	err = devm_reset_control_bulk_get_exclusive(&pdev->dev, ARRAY_SIZE(core->resets),
[ ... ]
> @@ -52,7 +57,7 @@ int rocket_core_init(struct rocket_core *core)
>  
>  	dma_set_max_seg_size(dev, UINT_MAX);
>  
> -	err = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(40));
> +	err = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(core->soc_data->dma_bits));
>  	if (err)
>  		return err;
>  
> diff --git a/drivers/accel/rocket/rocket_drv.c b/drivers/accel/rocket/rocket_drv.c
> index 8bbbce594883e..384c38e13acce 100644
> --- a/drivers/accel/rocket/rocket_drv.c
> +++ b/drivers/accel/rocket/rocket_drv.c
> @@ -213,8 +213,12 @@ static void rocket_remove(struct platform_device *pdev)
>  	}
>  }
>  

[Severity: High]
This is a pre-existing issue, but can this array mismanagement during core
removal prevent multi-core unbind?

drivers/accel/rocket/rocket_drv.c:rocket_remove() {
	...
	rocket_core_fini(&rdev->cores[core]);
	rdev->num_cores--;
	...
}

drivers/accel/rocket/rocket_drv.c:find_core_for_dev() {
	for (unsigned int core = 0; core < rdev->num_cores; core++) {
		if (dev == rdev->cores[core].dev)
			return core;
	}
	...
}

When a core is unbound, rdev->num_cores is decremented, but the unbound core's
slot in the static rdev->cores array is left unmodified, and the remaining
elements are not shifted.

Because find_core_for_dev() unconditionally loops only rdev->num_cores times,
any core residing at an array index greater than or equal to the decremented
num_cores becomes invisible.

Will subsequent attempts to unbind those remaining cores fail, silently
exiting early and completely leaking the cores' resources and power state?

[Severity: High]
This is a pre-existing issue, but could rocket_probe() leak memory and the DRM
device on probe failure or deferral?

drivers/accel/rocket/rocket_drv.c:rocket_probe() {
	if (rdev == NULL) {
		/* First core probing, initialize DRM device. */
		rdev = rocket_device_init(drm_dev, &rocket_drm_driver, soc_data);
	...
}

drivers/accel/rocket/rocket_device.c:rocket_device_init() {
	rdev = devm_drm_dev_alloc(dev, rocket_drm_driver, struct rocket_device, ddev);
	...
	rdev->cores = devm_kcalloc(dev, soc_data->num_cores, sizeof(*rdev->cores), GFP_KERNEL);
	...
}

The global facade device rdev is dynamically initialized via rocket_device_init()
using the global, virtually registered drm_dev platform device. Allocations like
devm_drm_dev_alloc() and devm_kcalloc() are bound to the drm_dev lifecycle
rather than the physical pdev currently being probed.

If rocket_core_init() subsequently fails (e.g., due to -EPROBE_DEFER),
rocket_probe() cleans up by calling rocket_device_fini() and resetting
rdev = NULL. However, rocket_device_fini() only unregisters the DRM device.

Because drm_dev remains globally alive until module unload, none of the devm_
allocations are freed. On the next probe attempt, a new rdev is allocated on
drm_dev. Can this create a permanent memory leak?

[Severity: High]
This is a pre-existing issue, but might the DRM device be registered and
exposed to userspace before the NPU cores are initialized?

drivers/accel/rocket/rocket_device.c:rocket_device_init() {
	...
	err = drm_dev_register(ddev, 0);
	if (err)
		return ERR_PTR(err);

	return rdev;
}

drivers/accel/rocket/rocket_drv.c:rocket_probe() {
	...
		rdev = rocket_device_init(drm_dev, &rocket_drm_driver, soc_data);
	...
	unsigned int core = rdev->num_cores;
	dev_set_drvdata(&pdev->dev, rdev);

	rdev->cores[core].rdev = rdev;
	rdev->cores[core].dev = &pdev->dev;
	...
}

When the first core is probed, it calls rocket_device_init(), which ends with
drm_dev_register(). At this moment, the DRM character device becomes visible
and openable by userspace.

However, the core hardware, IRQs, and core pointers (e.g. rdev->cores[core].dev)
are not assigned or initialized until immediately afterward, when
rocket_core_init() runs.

If userspace opens the device in this narrow window, will rocket_open()
evaluate the uninitialized rdev->cores[0].dev (which is NULL) and pass it to
rocket_iommu_domain_create(), causing an immediate NULL pointer dereference?

[Severity: High]
This is a pre-existing issue, but is it possible that the global rdev pointer
and rdev->num_cores are accessed and modified locklessly during probe and
removal?

drivers/accel/rocket/rocket_drv.c:rocket_probe() {
	if (rdev == NULL) {
		/* First core probing, initialize DRM device. */
		rdev = rocket_device_init(drm_dev, &rocket_drm_driver, soc_data);
	...
	rdev->num_cores++;
	...
}

If multiple cores are probed concurrently (e.g., via sysfs asynchronous
binding), multiple threads will evaluate `if (rdev == NULL)` simultaneously
without a lock.

Can this create a data race where concurrent threads allocate redundant DRM
devices, overwrite the rdev singleton (causing memory leaks), and corrupt
the shared rdev->num_cores counter and rdev->cores array?

[Severity: Critical]
This is a pre-existing issue, but will kref_put() dereference a NULL pointer
when job initialization fails early or a BO is imported?

drivers/accel/rocket/rocket_job.c:rocket_ioctl_submit_job() {
	...
	goto out_cleanup_job;
	...
out_cleanup_job:
	if (ret)
		drm_sched_job_cleanup(&rjob->base);
out_put_job:
	rocket_job_put(rjob);
}

drivers/accel/rocket/rocket_drv.c:rocket_iommu_domain_put() {
	kref_put(&domain->kref, rocket_iommu_domain_destroy);
}

If an early failure occurs in rocket_ioctl_submit_job() (e.g., in
rocket_copy_tasks() or drm_gem_objects_lookup()), the error path calls
rocket_job_put(), which invokes rocket_job_cleanup(). At this point,
job->domain is still NULL.

rocket_job_cleanup() unconditionally passes this to rocket_iommu_domain_put(),
which immediately evaluates &domain->kref and causes a NULL pointer
dereference.

The same panic appears to occur when a BO is imported via PRIME
(rocket_gem_create_object() leaves domain NULL), and rocket_gem_bo_free()
subsequently accesses bo->domain->domain and bo->domain.

[Severity: High]
This is a pre-existing issue, but could the error path in BO creation free an
object that is already owned by the handle table?

drivers/accel/rocket/rocket_gem.c:rocket_ioctl_create_bo() {
	...
	ret = drm_gem_handle_create(file, gem_obj, &args->handle);
	drm_gem_object_put(gem_obj);
	if (ret)
		goto err;
	...
err:
	drm_gem_shmem_object_free(gem_obj);
}

drm_gem_handle_create() successfully exposes gem_obj to userspace and
increments its handle refcount, after which the local reference is correctly
dropped via drm_gem_object_put().

If a subsequent step fails (such as drm_gem_shmem_get_pages_sgt()), the error
path jumps to err: and manually calls drm_gem_shmem_object_free(gem_obj).

This explicitly frees the backing memory while the object is still actively
referenced by the handle table. When userspace later closes the handle, will
the GEM core double-free the object?

[Severity: High]
This is a pre-existing issue, but should the asynchronous reset work be
canceled before the DRM scheduler is destroyed?

drivers/accel/rocket/rocket_job.c:rocket_job_fini() {
	drm_sched_fini(&core->sched);

	cancel_work_sync(&core->reset.work);
	destroy_workqueue(core->reset.wq);
}

drm_sched_fini(&core->sched) is called before cancel_work_sync(&core->reset.work).

If the background reset task rocket_reset_work() is running concurrently,
it will execute rocket_reset(), which attempts to call drm_sched_stop() and
drm_sched_start() on the core->sched instance that has already been torn down
or freed.

[Severity: High]
This is a pre-existing issue, but could the shared interrupt handler access
hardware registers without checking the PM state, and does suspend execute
without synchronizing IRQs?

drivers/accel/rocket/rocket_job.c:rocket_job_irq_handler() {
	struct rocket_core *core = data;
	u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
	...
}

drivers/accel/rocket/rocket_drv.c:rocket_device_runtime_suspend() {
	...
	clk_bulk_disable_unprepare(ARRAY_SIZE(rdev->cores[core].clks), rdev->cores[core].clks);

	return 0;
}

The driver requests its interrupt using IRQF_SHARED, meaning
rocket_job_irq_handler() can execute at any time if another device on the same
line interrupts the CPU.

When called, the handler immediately reads hardware registers
(INTERRUPT_RAW_STATUS) without using pm_runtime_get_if_active() to verify the
device is powered. If the device was suspended, will this read return garbage
or cause a bus fault?

Additionally, the rocket_device_runtime_suspend() callback disables the
device's clocks without invoking synchronize_irq(), which might allow
hardware to lose power while an interrupt handler is executing mid-flight.

[Severity: High]
This is a pre-existing issue, but can a failure in rocket_job_run() leak
the PM usage counter and return an unsignaled hardware fence, hanging the
DRM scheduler?

drivers/accel/rocket/rocket_job.c:rocket_job_run() {
	...
	ret = pm_runtime_get_sync(core->dev);
	if (ret < 0)
		return fence;
	...
		ret = iommu_attach_group(job->domain->domain, core->iommu_group);
		if (ret < 0)
			return fence;
	...
}

drivers/accel/rocket/rocket_job.c:rocket_reset() {
	scoped_guard(mutex, &core->job_lock) {
		if (core->in_flight_job)
			pm_runtime_put_noidle(core->dev);

		core->in_flight_job = NULL;
	}
}

If pm_runtime_get_sync() returns an error, it returns the created fence
directly. Since pm_runtime_get_sync() increments the usage counter even upon
failure, this leaks the PM reference (missing pm_runtime_put_noidle()).

Similarly, if iommu_attach_group() fails shortly after, the function again
returns the fence without dropping the successfully acquired PM reference.

Because the hardware submission is skipped in both error paths, the returned
fence is never signaled. The DRM scheduler will block indefinitely waiting for
it, triggering a job timeout.

Will the timeout reset logic (rocket_reset()) fail to clean up these leaked
PM references because core->in_flight_job is evaluated as NULL (it was never
assigned before the early returns)?

> +static const struct rocket_soc_data rk3588_soc_data = {
> +	.dma_bits = 40,
> +};
> +
>  static const struct of_device_id dt_match[] = {
> -	{ .compatible = "rockchip,rk3588-rknn-core" },
> +	{ .compatible = "rockchip,rk3588-rknn-core", .data = &rk3588_soc_data },
>  	{}
>  };
>  MODULE_DEVICE_TABLE(of, dt_match);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260604135255.62682-1-midgy971@gmail.com?part=1

next prev parent reply	other threads:[~2026-06-04 14:08 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-04 13:52 [RFC PATCH v3 0/9] accel: rocket: Add RK3568 NPU support Midgy BALON
2026-06-04 13:52 ` [RFC PATCH v3 1/9] accel: rocket: Introduce per-SoC rocket_soc_data Midgy BALON
2026-06-04 14:08   ` sashiko-bot [this message]
2026-06-04 13:52 ` [RFC PATCH v3 2/9] accel: rocket: Derive DMA width and core count from match data Midgy BALON
2026-06-04 14:05   ` sashiko-bot
2026-06-04 13:52 ` [RFC PATCH v3 3/9] accel: rocket: Add RK3568 SoC support Midgy BALON
2026-06-04 14:05   ` sashiko-bot
2026-06-04 13:52 ` [RFC PATCH v3 4/9] accel: rocket: Reset the NPU before detaching the IOMMU on timeout Midgy BALON
2026-06-04 14:10   ` sashiko-bot
2026-06-04 13:52 ` [RFC PATCH v3 5/9] accel: rocket: Keep the IOMMU domain attached across jobs Midgy BALON
2026-06-04 14:08   ` sashiko-bot
2026-06-04 13:52 ` [RFC PATCH v3 6/9] iommu/rockchip: Clear AUTO_GATING bit 1 on the RK356x v1 IOMMU Midgy BALON
2026-06-04 14:04   ` sashiko-bot
2026-06-04 14:20   ` Tomeu Vizoso
2026-06-04 13:52 ` [RFC PATCH v3 7/9] dt-bindings: npu: rockchip,rk3588-rknn-core: Add RK3568 Midgy BALON
2026-06-04 14:08   ` sashiko-bot
2026-06-04 16:55     ` Conor Dooley
2026-06-04 13:52 ` [RFC PATCH v3 8/9] arm64: dts: rockchip: rk356x: Add the NPU and its IOMMU Midgy BALON
2026-06-04 14:11   ` sashiko-bot
2026-06-04 13:52 ` [RFC PATCH v3 9/9] arm64: dts: rockchip: rk3568-rock-3b: Enable the NPU Midgy BALON

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260604140851.85DB71F0089A@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=conor+dt@kernel.org \
    --cc=devicetree@vger.kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=midgy971@gmail.com \
    --cc=robh@kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox