* Issues with DRM_ACCEL_ROCKET (Oops, power-domain, probe-ordering)
@ 2025-11-28 18:41 Quentin Schulz
2025-12-01 6:24 ` Chaoyi Chen
0 siblings, 1 reply; 9+ messages in thread
From: Quentin Schulz @ 2025-11-28 18:41 UTC (permalink / raw)
To: Tomeu Vizoso, Sebastian Reichel, Heiko Stuebner
Cc: linux-arm-kernel, linux-rockchip
Hi all,
I've finally found the time to test the RK3588 NPU (aka Rocket,
upstream) on our products and I've been hit with Oopses when loosely
following the instructions from https://docs.mesa3d.org/teflon.html.
After debugging for a while this is often (not always!) reproducible
when DRM_ACCEL_ROCKET=y (but not when built as a module!). The fact it's
"not always" is a dead giveaway timing or probe order is involved, more
later on that.
I don't think the Oops is important as I believe it's a symptom of other
issues, so I won't post it.
I'll list the issue I've identified so far, some seem easier to fix than
others (the biggest issue of course being the hardest to fix).
1.
https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/accel/rocket/rocket_core.c#L79
The error path after a failed pm_runtime_resume_and_get() probably
should be calling rocket_core_fini(core) and not rocket_job_fini() (the
former calls the latter) as to unwind everything that's been done in
rocket_core_init() before this failed.
The PM core complains about unbalanced something when the core fails to
init and is retried later.
I suggest the following:
"""
diff --git a/drivers/accel/rocket/rocket_core.c
b/drivers/accel/rocket/rocket_core.c
index abe7719c1db46..95a732ab418c3 100644
--- a/drivers/accel/rocket/rocket_core.c
+++ b/drivers/accel/rocket/rocket_core.c
@@ -76,7 +76,7 @@ int rocket_core_init(struct rocket_core *core)
err = pm_runtime_resume_and_get(dev);
if (err) {
- rocket_job_fini(core);
+ rocket_core_fini(core);
return err;
}
"""
I can send a patch for that.
2. rocket_probe()
(https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/accel/rocket/rocket_drv.c#L159)
can result in out-of-bounds access when rocket_core_init() fails as rdev
is a global variable (so rdev->num_cores persists between tries).
Considering rdev->cores is allocated based on the number of available
nodes in the DTB with the rknn compatible, if rocket_probe() is called
more than the number of available nodes, rdev->cores[core] will result
in out-of-bounds access.
I believe we could do:
"""
diff --git a/drivers/accel/rocket/rocket_drv.c
b/drivers/accel/rocket/rocket_drv.c
index 5c0b63f0a8f00..ad602ebd17c44 100644
--- a/drivers/accel/rocket/rocket_drv.c
+++ b/drivers/accel/rocket/rocket_drv.c
@@ -158,6 +158,8 @@ static const struct drm_driver rocket_drm_driver = {
static int rocket_probe(struct platform_device *pdev)
{
+ int ret;
+
if (rdev == NULL) {
/* First core probing, initialize DRM device. */
rdev = rocket_device_init(drm_dev, &rocket_drm_driver);
@@ -177,7 +179,15 @@ static int rocket_probe(struct platform_device *pdev)
rdev->num_cores++;
- return rocket_core_init(&rdev->cores[core]);
+ ret = rocket_core_init(&rdev->cores[core]);
+ if (ret) {
+ rdev->num_cores--;
+ if (rdev->num_cores == 0) {
+ rocket_device_fini(rdev);
+ }
+ }
+
+ return ret;
}
static void rocket_remove(struct platform_device *pdev)
"""
I can send a patch for that.
3. I believe (but I may be wrong) that core 0 is special (do we need to
schedule through core 0 maybe?). There's a hardcoded fetch of the device
associated with core 0 via
rocket_iommu_domain_create(rdev->cores[0].dev);
(https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/accel/rocket/rocket_drv.c#L88).
The issue is that this assumes that core 0 is always rdev->cores[0],
which may not be true as it depends on the probe order (which is not
guaranteed), which can be exacerbated by core 0 (HW index) probe being
deferred for some reason. Is it necessary for core 0 to be the one
initializing the DRM device or could we simply use any core for that
(just the first to probe, whichever it actually is HW-wise)?
If there's a need to know which of the three cores is core 0, we can
have a vendor property for it or maybe aliases for the cores? We can
also do a check on the base register address of the core in DT and then
do a mapping in the driver. Any other suggestion?
4. The actual issue I'm unable to suggest something for comes now. The
following is all happening in the context of DRM_ACCEL_ROCKET being
built-in (that is DRM_ACCEL_ROCKET=y). You can use aarch64's default
defconfig, then DRM=y, DRM_ACCEL=y and DRM_ACCEL_ROCKET=y via
menuconfig. Remember that this is not 100% reproducible (more often
happening than not).
Points 1, 2 and 3 only are an issue if you hit some error-path during
the probe of one of the cores, which is triggered by this point 4 here.
rknn_core_0 depends on rknn_mmu_0 as highlighted by the iommus phandle
property. rknn_mmu_0 depends on RK3588_PD_NPUTOP power-domain. The
parent domain (RK3588_PD_NPU) of RK3588_PD_NPUTOP power-domain has a
domain-supply on a regulator on i2c (i2c0@42 for Jaguar, patches on the
ML at
https://lore.kernel.org/linux-rockchip/20250812085213.1071106-2-heiko@sntech.de/).
When the rocket driver calls platform_driver_register() in its
module_init() function, the kernel core will try to probe the associated
device, via driver_probe_device(), then __driver_probe_device() and more
specifically pm_runtime_get_suppliers(dev) in there
(https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/base/dd.c#L793).
Then this calls pm_runtime_get_sync() on all links, c.f.
https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/base/power/runtime.c#L1914.
Note that pm_runtime_get_suppliers() returns void. One of the links is
the rknn_mmu_0.
Follow the next series of function calls, pm_runtime_get_sync() on
rknn_mmu_0, rpm_resume(), rpm_callback(), __rpm_callback(). This then
calls cb(). cb() is set via RPM_GET_CALLBACK(dev, runtime_resume). Due
to rknn_mmu_0 device being probed with dev_pm_domain_attach() attaching
the power-domain device to iommu via the pm_domain member,
genpd_runtime_resume() (which goes to rockchip_pd_power_on()) is
actually called. This then calls rockchip_pd_regulator_enable() which
returns -EPROBE_DEFER. Going up in the call stack up to rpm_callback()
you can see that retval = __rpm_callback(cb, dev); now is EPROBE_DEFER
and thus, dev->power.runtime_error = EPROBE_DEFER at the end of the
function.
Then once pm_runtime_resume_and_get(dev); is called in
rocket_core_init() during probe of the rocket device, it'll call into
rpm_get_suppliers() in __rpm_callback() which in turn will call
pm_runtime_get_sync() on rknn_mmu_0 which will go up to rpm_resume()
where dev->power.runtime_error will be checked and if non-zero will
return -EINVAL, effectively eating -EPROBE_DEFER and failing the probe
of the rocket device instead of deferring the probe.
Applying the below patch:
"""
diff --git a/drivers/accel/rocket/rocket_core.c
b/drivers/accel/rocket/rocket_core.c
index abe7719c1db46..c4692189bbfc6 100644
--- a/drivers/accel/rocket/rocket_core.c
+++ b/drivers/accel/rocket/rocket_core.c
@@ -72,22 +72,9 @@ int rocket_core_init(struct rocket_core *core)
*/
pm_runtime_set_autosuspend_delay(dev, 50);
+ pm_runtime_set_suspended(dev);
pm_runtime_enable(dev);
- err = pm_runtime_resume_and_get(dev);
- if (err) {
- rocket_job_fini(core);
- return err;
- }
-
- version = rocket_pc_readl(core, VERSION);
- version += rocket_pc_readl(core, VERSION_NUM) & 0xffff;
-
- pm_runtime_mark_last_busy(dev);
- pm_runtime_put_autosuspend(dev);
-
- dev_info(dev, "Rockchip NPU core %d version: %d\n", core->index, version);
-
return 0;
}
"""
seems to make it possible to run mesa's example. If
pm_runtime_set_suspended(dev); isn't added, we'll only ever get job
timeouts.
Note that I'm pretty sure this isn't the proper way to do it but wanted
to give some of my debugging attempts at that.
I'm not sure how much further I can go and specifically I have no clue
how to handle that properly. Please also verify all the claims I'm
making, I'm not entirely sure I got every call stack right, it is quite
convoluted and I don't have experience with those frameworks.
Does anyone have a clue how we can deal with that?
The driver built as a module doesn't have this issue because the
regulator has enough time to probe so that PD doesn't return
EPROBE_DEFER when the rocket device gets probed once reaching userspace
(where the module is). I'm not entirely sure if this could be reproduced
by having the regulator driver as a module too for example?
Cheers,
Quentin
_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: Issues with DRM_ACCEL_ROCKET (Oops, power-domain, probe-ordering)
2025-11-28 18:41 Issues with DRM_ACCEL_ROCKET (Oops, power-domain, probe-ordering) Quentin Schulz
@ 2025-12-01 6:24 ` Chaoyi Chen
2025-12-02 7:00 ` Chaoyi Chen
0 siblings, 1 reply; 9+ messages in thread
From: Chaoyi Chen @ 2025-12-01 6:24 UTC (permalink / raw)
To: Quentin Schulz, Tomeu Vizoso, Sebastian Reichel, Heiko Stuebner
Cc: linux-arm-kernel, linux-rockchip
Hi Quentin,
On 11/29/2025 2:41 AM, Quentin Schulz wrote:
> Hi all,
>
> I've finally found the time to test the RK3588 NPU (aka Rocket, upstream) on our products and I've been hit with Oopses when loosely following the instructions from https://docs.mesa3d.org/teflon.html. After debugging for a while this is often (not always!) reproducible when DRM_ACCEL_ROCKET=y (but not when built as a module!). The fact it's "not always" is a dead giveaway timing or probe order is involved, more later on that.
>
This looks like a power domain problem. What would happen if you tried
enabling the regulator for the NPU power domain in U-Boot?
Or did you properly configure the corresponding regulator supply in
the power domain?
> I don't think the Oops is important as I believe it's a symptom of other issues, so I won't post it.
>
> I'll list the issue I've identified so far, some seem easier to fix than others (the biggest issue of course being the hardest to fix).
>
> 1. https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/accel/rocket/rocket_core.c#L79
>
> The error path after a failed pm_runtime_resume_and_get() probably should be calling rocket_core_fini(core) and not rocket_job_fini() (the former calls the latter) as to unwind everything that's been done in rocket_core_init() before this failed.
>
> The PM core complains about unbalanced something when the core fails to init and is retried later.
>
> I suggest the following:
>
> """
> diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c
> index abe7719c1db46..95a732ab418c3 100644
> --- a/drivers/accel/rocket/rocket_core.c
> +++ b/drivers/accel/rocket/rocket_core.c
> @@ -76,7 +76,7 @@ int rocket_core_init(struct rocket_core *core)
>
> err = pm_runtime_resume_and_get(dev);
> if (err) {
> - rocket_job_fini(core);
> + rocket_core_fini(core);
> return err;
> }
>
> """
>
> I can send a patch for that.
>
> 2. rocket_probe() (https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/accel/rocket/rocket_drv.c#L159) can result in out-of-bounds access when rocket_core_init() fails as rdev is a global variable (so rdev->num_cores persists between tries). Considering rdev->cores is allocated based on the number of available nodes in the DTB with the rknn compatible, if rocket_probe() is called more than the number of available nodes, rdev->cores[core] will result in out-of-bounds access.
>
> I believe we could do:
>
> """
> diff --git a/drivers/accel/rocket/rocket_drv.c b/drivers/accel/rocket/rocket_drv.c
> index 5c0b63f0a8f00..ad602ebd17c44 100644
> --- a/drivers/accel/rocket/rocket_drv.c
> +++ b/drivers/accel/rocket/rocket_drv.c
> @@ -158,6 +158,8 @@ static const struct drm_driver rocket_drm_driver = {
>
> static int rocket_probe(struct platform_device *pdev)
> {
> + int ret;
> +
> if (rdev == NULL) {
> /* First core probing, initialize DRM device. */
> rdev = rocket_device_init(drm_dev, &rocket_drm_driver);
> @@ -177,7 +179,15 @@ static int rocket_probe(struct platform_device *pdev)
>
> rdev->num_cores++;
>
> - return rocket_core_init(&rdev->cores[core]);
> + ret = rocket_core_init(&rdev->cores[core]);
> + if (ret) {
> + rdev->num_cores--;
> + if (rdev->num_cores == 0) {
> + rocket_device_fini(rdev);
> + }
> + }
> +
> + return ret;
> }
>
> static void rocket_remove(struct platform_device *pdev)
>
> """
>
It seems hacky. I'm not sure whether this would be better
implemented as a component driver.
> I can send a patch for that.
>
> 3. I believe (but I may be wrong) that core 0 is special (do we need to schedule through core 0 maybe?). There's a hardcoded fetch of the device associated with core 0 via rocket_iommu_domain_create(rdev->cores[0].dev); (https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/accel/rocket/rocket_drv.c#L88). The issue is that this assumes that core 0 is always rdev->cores[0], which may not be true as it depends on the probe order (which is not guaranteed), which can be exacerbated by core 0 (HW index) probe being deferred for some reason. Is it necessary for core 0 to be the one initializing the DRM device or could we simply use any core for that (just the first to probe, whichever it actually is HW-wise)?
I think this is because multiple cores share the same IOMMU device.
>
> If there's a need to know which of the three cores is core 0, we can have a vendor property for it or maybe aliases for the cores? We can also do a check on the base register address of the core in DT and then do a mapping in the driver. Any other suggestion?
>
> 4. The actual issue I'm unable to suggest something for comes now. The following is all happening in the context of DRM_ACCEL_ROCKET being built-in (that is DRM_ACCEL_ROCKET=y). You can use aarch64's default defconfig, then DRM=y, DRM_ACCEL=y and DRM_ACCEL_ROCKET=y via menuconfig. Remember that this is not 100% reproducible (more often happening than not).
>
> Points 1, 2 and 3 only are an issue if you hit some error-path during the probe of one of the cores, which is triggered by this point 4 here.
>
> rknn_core_0 depends on rknn_mmu_0 as highlighted by the iommus phandle property. rknn_mmu_0 depends on RK3588_PD_NPUTOP power-domain. The parent domain (RK3588_PD_NPU) of RK3588_PD_NPUTOP power-domain has a domain-supply on a regulator on i2c (i2c0@42 for Jaguar, patches on the ML at https://lore.kernel.org/linux-rockchip/20250812085213.1071106-2-heiko@sntech.de/). When the rocket driver calls platform_driver_register() in its module_init() function, the kernel core will try to probe the associated device, via driver_probe_device(), then __driver_probe_device() and more specifically pm_runtime_get_suppliers(dev) in there (https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/base/dd.c#L793). Then this calls pm_runtime_get_sync() on all links, c.f. https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/base/power/runtime.c#L1914. Note that pm_runtime_get_suppliers() returns void. One of the links is the rknn_mmu_0.
>
> Follow the next series of function calls, pm_runtime_get_sync() on rknn_mmu_0, rpm_resume(), rpm_callback(), __rpm_callback(). This then calls cb(). cb() is set via RPM_GET_CALLBACK(dev, runtime_resume). Due to rknn_mmu_0 device being probed with dev_pm_domain_attach() attaching the power-domain device to iommu via the pm_domain member, genpd_runtime_resume() (which goes to rockchip_pd_power_on()) is actually called. This then calls rockchip_pd_regulator_enable() which returns -EPROBE_DEFER. Going up in the call stack up to rpm_callback() you can see that retval = __rpm_callback(cb, dev); now is EPROBE_DEFER and thus, dev->power.runtime_error = EPROBE_DEFER at the end of the function.
>
> Then once pm_runtime_resume_and_get(dev); is called in rocket_core_init() during probe of the rocket device, it'll call into rpm_get_suppliers() in __rpm_callback() which in turn will call pm_runtime_get_sync() on rknn_mmu_0 which will go up to rpm_resume() where dev->power.runtime_error will be checked and if non-zero will return -EINVAL, effectively eating -EPROBE_DEFER and failing the probe of the rocket device instead of deferring the probe.
>
> Applying the below patch:
>
> """
> diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c
> index abe7719c1db46..c4692189bbfc6 100644
> --- a/drivers/accel/rocket/rocket_core.c
> +++ b/drivers/accel/rocket/rocket_core.c
> @@ -72,22 +72,9 @@ int rocket_core_init(struct rocket_core *core)
> */
> pm_runtime_set_autosuspend_delay(dev, 50);
>
> + pm_runtime_set_suspended(dev);
> pm_runtime_enable(dev);
>
> - err = pm_runtime_resume_and_get(dev);
> - if (err) {
> - rocket_job_fini(core);
> - return err;
> - }
> -
> - version = rocket_pc_readl(core, VERSION);
> - version += rocket_pc_readl(core, VERSION_NUM) & 0xffff;
> -
> - pm_runtime_mark_last_busy(dev);
> - pm_runtime_put_autosuspend(dev);
> -
> - dev_info(dev, "Rockchip NPU core %d version: %d\n", core->index, version);
> -
> return 0;
> }
>
> """
>
> seems to make it possible to run mesa's example. If pm_runtime_set_suspended(dev); isn't added, we'll only ever get job timeouts.
>
> Note that I'm pretty sure this isn't the proper way to do it but wanted to give some of my debugging attempts at that.
>
> I'm not sure how much further I can go and specifically I have no clue how to handle that properly. Please also verify all the claims I'm making, I'm not entirely sure I got every call stack right, it is quite convoluted and I don't have experience with those frameworks.
>
> Does anyone have a clue how we can deal with that?
>
> The driver built as a module doesn't have this issue because the regulator has enough time to probe so that PD doesn't return EPROBE_DEFER when the rocket device gets probed once reaching userspace (where the module is). I'm not entirely sure if this could be reproduced by having the regulator driver as a module too for example?
>
> Cheers,
> Quentin
>
> _______________________________________________
> Linux-rockchip mailing list
> Linux-rockchip@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-rockchip
--
Best,
Chaoyi
_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Issues with DRM_ACCEL_ROCKET (Oops, power-domain, probe-ordering)
2025-12-01 6:24 ` Chaoyi Chen
@ 2025-12-02 7:00 ` Chaoyi Chen
2025-12-09 6:55 ` Tomeu Vizoso
0 siblings, 1 reply; 9+ messages in thread
From: Chaoyi Chen @ 2025-12-02 7:00 UTC (permalink / raw)
To: Quentin Schulz, Tomeu Vizoso, Sebastian Reichel, Heiko Stuebner
Cc: linux-arm-kernel, linux-rockchip
Hi Quentin,
I finally had time to look into this today, and I think I've run into
the same issue you did. Please see the comment below :)
On 12/1/2025 2:24 PM, Chaoyi Chen wrote:
> Hi Quentin,
>
> On 11/29/2025 2:41 AM, Quentin Schulz wrote:
>> Hi all,
>>
>> I've finally found the time to test the RK3588 NPU (aka Rocket, upstream) on our products and I've been hit with Oopses when loosely following the instructions from https://docs.mesa3d.org/teflon.html. After debugging for a while this is often (not always!) reproducible when DRM_ACCEL_ROCKET=y (but not when built as a module!). The fact it's "not always" is a dead giveaway timing or probe order is involved, more later on that.
>>
>
> This looks like a power domain problem. What would happen if you tried
> enabling the regulator for the NPU power domain in U-Boot?
>
> Or did you properly configure the corresponding regulator supply in
> the power domain?
>
>
>> I don't think the Oops is important as I believe it's a symptom of other issues, so I won't post it.
>>
>> I'll list the issue I've identified so far, some seem easier to fix than others (the biggest issue of course being the hardest to fix).
>>
>> 1. https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/accel/rocket/rocket_core.c#L79
>>
>> The error path after a failed pm_runtime_resume_and_get() probably should be calling rocket_core_fini(core) and not rocket_job_fini() (the former calls the latter) as to unwind everything that's been done in rocket_core_init() before this failed.
>>
>> The PM core complains about unbalanced something when the core fails to init and is retried later.
>>
>> I suggest the following:
>>
>> """
>> diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c
>> index abe7719c1db46..95a732ab418c3 100644
>> --- a/drivers/accel/rocket/rocket_core.c
>> +++ b/drivers/accel/rocket/rocket_core.c
>> @@ -76,7 +76,7 @@ int rocket_core_init(struct rocket_core *core)
>>
>> err = pm_runtime_resume_and_get(dev);
>> if (err) {
>> - rocket_job_fini(core);
>> + rocket_core_fini(core);
>> return err;
>> }
>>
pm_runtime_resume_and_get() will fail here because the pmdomain
isn't ready yet.
I think we should return -EPROBE_DEFER in this case.
>> """
>>
>> I can send a patch for that.
>>
>> 2. rocket_probe() (https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/accel/rocket/rocket_drv.c#L159) can result in out-of-bounds access when rocket_core_init() fails as rdev is a global variable (so rdev->num_cores persists between tries). Considering rdev->cores is allocated based on the number of available nodes in the DTB with the rknn compatible, if rocket_probe() is called more than the number of available nodes, rdev->cores[core] will result in out-of-bounds access.
>>
>> I believe we could do:
>>
>> """
>> diff --git a/drivers/accel/rocket/rocket_drv.c b/drivers/accel/rocket/rocket_drv.c
>> index 5c0b63f0a8f00..ad602ebd17c44 100644
>> --- a/drivers/accel/rocket/rocket_drv.c
>> +++ b/drivers/accel/rocket/rocket_drv.c
>> @@ -158,6 +158,8 @@ static const struct drm_driver rocket_drm_driver = {
>>
>> static int rocket_probe(struct platform_device *pdev)
>> {
>> + int ret;
>> +
>> if (rdev == NULL) {
>> /* First core probing, initialize DRM device. */
>> rdev = rocket_device_init(drm_dev, &rocket_drm_driver);
>> @@ -177,7 +179,15 @@ static int rocket_probe(struct platform_device *pdev)
>>
>> rdev->num_cores++;
>>
>> - return rocket_core_init(&rdev->cores[core]);
>> + ret = rocket_core_init(&rdev->cores[core]);
>> + if (ret) {
>> + rdev->num_cores--;
>> + if (rdev->num_cores == 0) {
>> + rocket_device_fini(rdev);
>> + }
>> + }
>> +
>> + return ret;
>> }
>>
>> static void rocket_remove(struct platform_device *pdev)
>>
>> """
>>
>
> It seems hacky. I'm not sure whether this would be better
> implemented as a component driver.
>
When I was looking at this code, it hadn't been merged yet,
and there is a core0 device called "rknn_top". Likewise, there's
a component over there. But in the latest version, it's gone :(
And yes, core 0 is special [0]. So I think this change is make sense.
But that's still not enough, because we may need to defer the probe.
We have to make sure that when we probe next time, core 0 still gets
index 0 here.
[0] https://lore.kernel.org/all/20250721-6-10-rocket-v9-8-77ebd484941e@tomeuvizoso.net/
>> I can send a patch for that.
>>
>> 3. I believe (but I may be wrong) that core 0 is special (do we need to schedule through core 0 maybe?). There's a hardcoded fetch of the device associated with core 0 via rocket_iommu_domain_create(rdev->cores[0].dev); (https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/accel/rocket/rocket_drv.c#L88). The issue is that this assumes that core 0 is always rdev->cores[0], which may not be true as it depends on the probe order (which is not guaranteed), which can be exacerbated by core 0 (HW index) probe being deferred for some reason. Is it necessary for core 0 to be the one initializing the DRM device or could we simply use any core for that (just the first to probe, whichever it actually is HW-wise)?
I agree with you about the probe order problem. As I mentioned,
in earlier versions core 0 was still "rknn_top". It looks like this
would address many of the problem you raised. I don't know why
that was dropped in the merged version.
>
> I think this is because multiple cores share the same IOMMU device.
>
>>
>> If there's a need to know which of the three cores is core 0, we can have a vendor property for it or maybe aliases for the cores? We can also do a check on the base register address of the core in DT and then do a mapping in the driver. Any other suggestion?
>>
>> 4. The actual issue I'm unable to suggest something for comes now. The following is all happening in the context of DRM_ACCEL_ROCKET being built-in (that is DRM_ACCEL_ROCKET=y). You can use aarch64's default defconfig, then DRM=y, DRM_ACCEL=y and DRM_ACCEL_ROCKET=y via menuconfig. Remember that this is not 100% reproducible (more often happening than not).
>>
>> Points 1, 2 and 3 only are an issue if you hit some error-path during the probe of one of the cores, which is triggered by this point 4 here.
>>
>> rknn_core_0 depends on rknn_mmu_0 as highlighted by the iommus phandle property. rknn_mmu_0 depends on RK3588_PD_NPUTOP power-domain. The parent domain (RK3588_PD_NPU) of RK3588_PD_NPUTOP power-domain has a domain-supply on a regulator on i2c (i2c0@42 for Jaguar, patches on the ML at https://lore.kernel.org/linux-rockchip/20250812085213.1071106-2-heiko@sntech.de/). When the rocket driver calls platform_driver_register() in its module_init() function, the kernel core will try to probe the associated device, via driver_probe_device(), then __driver_probe_device() and more specifically pm_runtime_get_suppliers(dev) in there (https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/base/dd.c#L793). Then this calls pm_runtime_get_sync() on all links, c.f. https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/base/power/runtime.c#L1914. Note that pm_runtime_get_suppliers() returns void. One of the links is the rknn_mmu_0.
>>
>> Follow the next series of function calls, pm_runtime_get_sync() on rknn_mmu_0, rpm_resume(), rpm_callback(), __rpm_callback(). This then calls cb(). cb() is set via RPM_GET_CALLBACK(dev, runtime_resume). Due to rknn_mmu_0 device being probed with dev_pm_domain_attach() attaching the power-domain device to iommu via the pm_domain member, genpd_runtime_resume() (which goes to rockchip_pd_power_on()) is actually called. This then calls rockchip_pd_regulator_enable() which returns -EPROBE_DEFER. Going up in the call stack up to rpm_callback() you can see that retval = __rpm_callback(cb, dev); now is EPROBE_DEFER and thus, dev->power.runtime_error = EPROBE_DEFER at the end of the function.
>>
>> Then once pm_runtime_resume_and_get(dev); is called in rocket_core_init() during probe of the rocket device, it'll call into rpm_get_suppliers() in __rpm_callback() which in turn will call pm_runtime_get_sync() on rknn_mmu_0 which will go up to rpm_resume() where dev->power.runtime_error will be checked and if non-zero will return -EINVAL, effectively eating -EPROBE_DEFER and failing the probe of the rocket device instead of deferring the probe.
>>
>> Applying the below patch:
>>
>> """
>> diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c
>> index abe7719c1db46..c4692189bbfc6 100644
>> --- a/drivers/accel/rocket/rocket_core.c
>> +++ b/drivers/accel/rocket/rocket_core.c
>> @@ -72,22 +72,9 @@ int rocket_core_init(struct rocket_core *core)
>> */
>> pm_runtime_set_autosuspend_delay(dev, 50);
>>
>> + pm_runtime_set_suspended(dev);
>> pm_runtime_enable(dev);
>>
>> - err = pm_runtime_resume_and_get(dev);
>> - if (err) {
>> - rocket_job_fini(core);
>> - return err;
>> - }
>> -
>> - version = rocket_pc_readl(core, VERSION);
>> - version += rocket_pc_readl(core, VERSION_NUM) & 0xffff;
>> -
>> - pm_runtime_mark_last_busy(dev);
>> - pm_runtime_put_autosuspend(dev);
>> -
>> - dev_info(dev, "Rockchip NPU core %d version: %d\n", core->index, version);
>> -
>> return 0;
>> }
>>
>> """
>>
I believe that once the problems above are resolved, the complexity
around point 4 will disappear as well.
>> seems to make it possible to run mesa's example. If pm_runtime_set_suspended(dev); isn't added, we'll only ever get job timeouts.
>>
>> Note that I'm pretty sure this isn't the proper way to do it but wanted to give some of my debugging attempts at that.
>>
>> I'm not sure how much further I can go and specifically I have no clue how to handle that properly. Please also verify all the claims I'm making, I'm not entirely sure I got every call stack right, it is quite convoluted and I don't have experience with those frameworks.
>>
>> Does anyone have a clue how we can deal with that?
>>
>> The driver built as a module doesn't have this issue because the regulator has enough time to probe so that PD doesn't return EPROBE_DEFER when the rocket device gets probed once reaching userspace (where the module is). I'm not entirely sure if this could be reproduced by having the regulator driver as a module too for example?
>>
>> Cheers,
>> Quentin
>>
>> _______________________________________________
>> Linux-rockchip mailing list
>> Linux-rockchip@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-rockchip
>
--
Best,
Chaoyi
_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Issues with DRM_ACCEL_ROCKET (Oops, power-domain, probe-ordering)
2025-12-02 7:00 ` Chaoyi Chen
@ 2025-12-09 6:55 ` Tomeu Vizoso
2025-12-09 9:31 ` Chaoyi Chen
0 siblings, 1 reply; 9+ messages in thread
From: Tomeu Vizoso @ 2025-12-09 6:55 UTC (permalink / raw)
To: Chaoyi Chen
Cc: Quentin Schulz, Sebastian Reichel, Heiko Stuebner,
linux-arm-kernel, linux-rockchip
On Tue, Dec 2, 2025 at 8:00 AM Chaoyi Chen <chaoyi.chen@rock-chips.com> wrote:
>
> Hi Quentin,
>
> I finally had time to look into this today, and I think I've run into
> the same issue you did. Please see the comment below :)
>
> On 12/1/2025 2:24 PM, Chaoyi Chen wrote:
> > Hi Quentin,
> >
> > On 11/29/2025 2:41 AM, Quentin Schulz wrote:
> >> Hi all,
> >>
> >> I've finally found the time to test the RK3588 NPU (aka Rocket, upstream) on our products and I've been hit with Oopses when loosely following the instructions from https://docs.mesa3d.org/teflon.html. After debugging for a while this is often (not always!) reproducible when DRM_ACCEL_ROCKET=y (but not when built as a module!). The fact it's "not always" is a dead giveaway timing or probe order is involved, more later on that.
> >>
> >
> > This looks like a power domain problem. What would happen if you tried
> > enabling the regulator for the NPU power domain in U-Boot?
> >
> > Or did you properly configure the corresponding regulator supply in
> > the power domain?
> >
> >
> >> I don't think the Oops is important as I believe it's a symptom of other issues, so I won't post it.
> >>
> >> I'll list the issue I've identified so far, some seem easier to fix than others (the biggest issue of course being the hardest to fix).
> >>
> >> 1. https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/accel/rocket/rocket_core.c#L79
> >>
> >> The error path after a failed pm_runtime_resume_and_get() probably should be calling rocket_core_fini(core) and not rocket_job_fini() (the former calls the latter) as to unwind everything that's been done in rocket_core_init() before this failed.
> >>
> >> The PM core complains about unbalanced something when the core fails to init and is retried later.
> >>
> >> I suggest the following:
> >>
> >> """
> >> diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c
> >> index abe7719c1db46..95a732ab418c3 100644
> >> --- a/drivers/accel/rocket/rocket_core.c
> >> +++ b/drivers/accel/rocket/rocket_core.c
> >> @@ -76,7 +76,7 @@ int rocket_core_init(struct rocket_core *core)
> >>
> >> err = pm_runtime_resume_and_get(dev);
> >> if (err) {
> >> - rocket_job_fini(core);
> >> + rocket_core_fini(core);
> >> return err;
> >> }
> >>
>
> pm_runtime_resume_and_get() will fail here because the pmdomain
> isn't ready yet.
>
> I think we should return -EPROBE_DEFER in this case.
>
> >> """
> >>
> >> I can send a patch for that.
> >>
> >> 2. rocket_probe() (https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/accel/rocket/rocket_drv.c#L159) can result in out-of-bounds access when rocket_core_init() fails as rdev is a global variable (so rdev->num_cores persists between tries). Considering rdev->cores is allocated based on the number of available nodes in the DTB with the rknn compatible, if rocket_probe() is called more than the number of available nodes, rdev->cores[core] will result in out-of-bounds access.
> >>
> >> I believe we could do:
> >>
> >> """
> >> diff --git a/drivers/accel/rocket/rocket_drv.c b/drivers/accel/rocket/rocket_drv.c
> >> index 5c0b63f0a8f00..ad602ebd17c44 100644
> >> --- a/drivers/accel/rocket/rocket_drv.c
> >> +++ b/drivers/accel/rocket/rocket_drv.c
> >> @@ -158,6 +158,8 @@ static const struct drm_driver rocket_drm_driver = {
> >>
> >> static int rocket_probe(struct platform_device *pdev)
> >> {
> >> + int ret;
> >> +
> >> if (rdev == NULL) {
> >> /* First core probing, initialize DRM device. */
> >> rdev = rocket_device_init(drm_dev, &rocket_drm_driver);
> >> @@ -177,7 +179,15 @@ static int rocket_probe(struct platform_device *pdev)
> >>
> >> rdev->num_cores++;
> >>
> >> - return rocket_core_init(&rdev->cores[core]);
> >> + ret = rocket_core_init(&rdev->cores[core]);
> >> + if (ret) {
> >> + rdev->num_cores--;
> >> + if (rdev->num_cores == 0) {
> >> + rocket_device_fini(rdev);
> >> + }
> >> + }
> >> +
> >> + return ret;
> >> }
> >>
> >> static void rocket_remove(struct platform_device *pdev)
> >>
> >> """
> >>
> >
> > It seems hacky. I'm not sure whether this would be better
> > implemented as a component driver.
> >
>
> When I was looking at this code, it hadn't been merged yet,
> and there is a core0 device called "rknn_top". Likewise, there's
> a component over there. But in the latest version, it's gone :(
>
> And yes, core 0 is special [0]. So I think this change is make sense.
> But that's still not enough, because we may need to defer the probe.
> We have to make sure that when we probe next time, core 0 still gets
> index 0 here.
>
> [0] https://lore.kernel.org/all/20250721-6-10-rocket-v9-8-77ebd484941e@tomeuvizoso.net/
>
> >> I can send a patch for that.
> >>
> >> 3. I believe (but I may be wrong) that core 0 is special (do we need to schedule through core 0 maybe?). There's a hardcoded fetch of the device associated with core 0 via rocket_iommu_domain_create(rdev->cores[0].dev); (https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/accel/rocket/rocket_drv.c#L88). The issue is that this assumes that core 0 is always rdev->cores[0], which may not be true as it depends on the probe order (which is not guaranteed), which can be exacerbated by core 0 (HW index) probe being deferred for some reason. Is it necessary for core 0 to be the one initializing the DRM device or could we simply use any core for that (just the first to probe, whichever it actually is HW-wise)?
>
> I agree with you about the probe order problem. As I mentioned,
> in earlier versions core 0 was still "rknn_top". It looks like this
> would address many of the problem you raised. I don't know why
> that was dropped in the merged version.
This need was questioned during the review process, and during my
stress testing I didn't find a need for core 0 to be treated
specially. So it was considered unneeded complexity and the code was
dropped.
My testing involves running the supported models as a whole, and also
their individual operations as individual tests. I run them in 8
parallel processes, in batches so we have continuous bring up and down
of clients.
I'm a bit surprised that that testing worked stably but we have still
failures in some setups.
Regards,
Tomeu
> >
> > I think this is because multiple cores share the same IOMMU device.
> >
> >>
> >> If there's a need to know which of the three cores is core 0, we can have a vendor property for it or maybe aliases for the cores? We can also do a check on the base register address of the core in DT and then do a mapping in the driver. Any other suggestion?
> >>
> >> 4. The actual issue I'm unable to suggest something for comes now. The following is all happening in the context of DRM_ACCEL_ROCKET being built-in (that is DRM_ACCEL_ROCKET=y). You can use aarch64's default defconfig, then DRM=y, DRM_ACCEL=y and DRM_ACCEL_ROCKET=y via menuconfig. Remember that this is not 100% reproducible (more often happening than not).
> >>
> >> Points 1, 2 and 3 only are an issue if you hit some error-path during the probe of one of the cores, which is triggered by this point 4 here.
> >>
> >> rknn_core_0 depends on rknn_mmu_0 as highlighted by the iommus phandle property. rknn_mmu_0 depends on RK3588_PD_NPUTOP power-domain. The parent domain (RK3588_PD_NPU) of RK3588_PD_NPUTOP power-domain has a domain-supply on a regulator on i2c (i2c0@42 for Jaguar, patches on the ML at https://lore.kernel.org/linux-rockchip/20250812085213.1071106-2-heiko@sntech.de/). When the rocket driver calls platform_driver_register() in its module_init() function, the kernel core will try to probe the associated device, via driver_probe_device(), then __driver_probe_device() and more specifically pm_runtime_get_suppliers(dev) in there (https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/base/dd.c#L793). Then this calls pm_runtime_get_sync() on all links, c.f. https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/base/power/runtime.c#L1914. Note that pm_runtime_get_suppliers() returns void. One of the links is the rknn_mmu_0.
> >>
> >> Follow the next series of function calls, pm_runtime_get_sync() on rknn_mmu_0, rpm_resume(), rpm_callback(), __rpm_callback(). This then calls cb(). cb() is set via RPM_GET_CALLBACK(dev, runtime_resume). Due to rknn_mmu_0 device being probed with dev_pm_domain_attach() attaching the power-domain device to iommu via the pm_domain member, genpd_runtime_resume() (which goes to rockchip_pd_power_on()) is actually called. This then calls rockchip_pd_regulator_enable() which returns -EPROBE_DEFER. Going up in the call stack up to rpm_callback() you can see that retval = __rpm_callback(cb, dev); now is EPROBE_DEFER and thus, dev->power.runtime_error = EPROBE_DEFER at the end of the function.
> >>
> >> Then once pm_runtime_resume_and_get(dev); is called in rocket_core_init() during probe of the rocket device, it'll call into rpm_get_suppliers() in __rpm_callback() which in turn will call pm_runtime_get_sync() on rknn_mmu_0 which will go up to rpm_resume() where dev->power.runtime_error will be checked and if non-zero will return -EINVAL, effectively eating -EPROBE_DEFER and failing the probe of the rocket device instead of deferring the probe.
> >>
> >> Applying the below patch:
> >>
> >> """
> >> diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c
> >> index abe7719c1db46..c4692189bbfc6 100644
> >> --- a/drivers/accel/rocket/rocket_core.c
> >> +++ b/drivers/accel/rocket/rocket_core.c
> >> @@ -72,22 +72,9 @@ int rocket_core_init(struct rocket_core *core)
> >> */
> >> pm_runtime_set_autosuspend_delay(dev, 50);
> >>
> >> + pm_runtime_set_suspended(dev);
> >> pm_runtime_enable(dev);
> >>
> >> - err = pm_runtime_resume_and_get(dev);
> >> - if (err) {
> >> - rocket_job_fini(core);
> >> - return err;
> >> - }
> >> -
> >> - version = rocket_pc_readl(core, VERSION);
> >> - version += rocket_pc_readl(core, VERSION_NUM) & 0xffff;
> >> -
> >> - pm_runtime_mark_last_busy(dev);
> >> - pm_runtime_put_autosuspend(dev);
> >> -
> >> - dev_info(dev, "Rockchip NPU core %d version: %d\n", core->index, version);
> >> -
> >> return 0;
> >> }
> >>
> >> """
> >>
>
> I believe that once the problems above are resolved, the complexity
> around point 4 will disappear as well.
>
> >> seems to make it possible to run mesa's example. If pm_runtime_set_suspended(dev); isn't added, we'll only ever get job timeouts.
> >>
> >> Note that I'm pretty sure this isn't the proper way to do it but wanted to give some of my debugging attempts at that.
> >>
> >> I'm not sure how much further I can go and specifically I have no clue how to handle that properly. Please also verify all the claims I'm making, I'm not entirely sure I got every call stack right, it is quite convoluted and I don't have experience with those frameworks.
> >>
> >> Does anyone have a clue how we can deal with that?
> >>
> >> The driver built as a module doesn't have this issue because the regulator has enough time to probe so that PD doesn't return EPROBE_DEFER when the rocket device gets probed once reaching userspace (where the module is). I'm not entirely sure if this could be reproduced by having the regulator driver as a module too for example?
> >>
> >> Cheers,
> >> Quentin
> >>
> >> _______________________________________________
> >> Linux-rockchip mailing list
> >> Linux-rockchip@lists.infradead.org
> >> http://lists.infradead.org/mailman/listinfo/linux-rockchip
> >
>
> --
> Best,
> Chaoyi
_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Issues with DRM_ACCEL_ROCKET (Oops, power-domain, probe-ordering)
2025-12-09 6:55 ` Tomeu Vizoso
@ 2025-12-09 9:31 ` Chaoyi Chen
2025-12-10 6:50 ` Tomeu Vizoso
0 siblings, 1 reply; 9+ messages in thread
From: Chaoyi Chen @ 2025-12-09 9:31 UTC (permalink / raw)
To: Tomeu Vizoso
Cc: Quentin Schulz, Sebastian Reichel, Heiko Stuebner,
linux-arm-kernel, linux-rockchip
Hello Tomeu,
On 12/9/2025 2:55 PM, Tomeu Vizoso wrote:
> On Tue, Dec 2, 2025 at 8:00 AM Chaoyi Chen <chaoyi.chen@rock-chips.com> wrote:
[...]
>
> This need was questioned during the review process, and during my
> stress testing I didn't find a need for core 0 to be treated
> specially. So it was considered unneeded complexity and the code was
> dropped.
>
Oh, I just noticed that the commit comment in the devicetree still
contains a relevant description.
> My testing involves running the supported models as a whole, and also
> their individual operations as individual tests. I run them in 8
> parallel processes, in batches so we have continuous bring up and down
> of clients.
>
> I'm a bit surprised that that testing worked stably but we have still
> failures in some setups.
Could you please describe the failures that still exist? Thank you.
--
Best,
Chaoyi
_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Issues with DRM_ACCEL_ROCKET (Oops, power-domain, probe-ordering)
2025-12-09 9:31 ` Chaoyi Chen
@ 2025-12-10 6:50 ` Tomeu Vizoso
2025-12-10 10:47 ` Chaoyi Chen
0 siblings, 1 reply; 9+ messages in thread
From: Tomeu Vizoso @ 2025-12-10 6:50 UTC (permalink / raw)
To: Chaoyi Chen
Cc: Quentin Schulz, Sebastian Reichel, Heiko Stuebner,
linux-arm-kernel, linux-rockchip
On Tue, Dec 9, 2025 at 10:31 AM Chaoyi Chen <chaoyi.chen@rock-chips.com> wrote:
>
> Hello Tomeu,
>
> On 12/9/2025 2:55 PM, Tomeu Vizoso wrote:
> > On Tue, Dec 2, 2025 at 8:00 AM Chaoyi Chen <chaoyi.chen@rock-chips.com> wrote:
>
> [...]
>
> >
> > This need was questioned during the review process, and during my
> > stress testing I didn't find a need for core 0 to be treated
> > specially. So it was considered unneeded complexity and the code was
> > dropped.
> >
>
> Oh, I just noticed that the commit comment in the devicetree still
> contains a relevant description.
>
> > My testing involves running the supported models as a whole, and also
> > their individual operations as individual tests. I run them in 8
> > parallel processes, in batches so we have continuous bring up and down
> > of clients.
> >
> > I'm a bit surprised that that testing worked stably but we have still
> > failures in some setups.
>
> Could you please describe the failures that still exist? Thank you.
I was referring to the ones that Quentin reported when starting this thread.
Regards,
Tomeu
_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Issues with DRM_ACCEL_ROCKET (Oops, power-domain, probe-ordering)
2025-12-10 6:50 ` Tomeu Vizoso
@ 2025-12-10 10:47 ` Chaoyi Chen
2025-12-10 16:39 ` Tomeu Vizoso
0 siblings, 1 reply; 9+ messages in thread
From: Chaoyi Chen @ 2025-12-10 10:47 UTC (permalink / raw)
To: Tomeu Vizoso
Cc: Quentin Schulz, Sebastian Reichel, Heiko Stuebner,
linux-arm-kernel, linux-rockchip
Hello Tomeu,
On 12/10/2025 2:50 PM, Tomeu Vizoso wrote:
> On Tue, Dec 9, 2025 at 10:31 AM Chaoyi Chen <chaoyi.chen@rock-chips.com> wrote:
>>
>> Hello Tomeu,
>>
>> On 12/9/2025 2:55 PM, Tomeu Vizoso wrote:
>>> On Tue, Dec 2, 2025 at 8:00 AM Chaoyi Chen <chaoyi.chen@rock-chips.com> wrote:
>>
>> [...]
>>
>>>
>>> This need was questioned during the review process, and during my
>>> stress testing I didn't find a need for core 0 to be treated
>>> specially. So it was considered unneeded complexity and the code was
>>> dropped.
>>>
>>
>> Oh, I just noticed that the commit comment in the devicetree still
>> contains a relevant description.
>>
>>> My testing involves running the supported models as a whole, and also
>>> their individual operations as individual tests. I run them in 8
>>> parallel processes, in batches so we have continuous bring up and down
>>> of clients.
>>>
>>> I'm a bit surprised that that testing worked stably but we have still
>>> failures in some setups.
>>
>> Could you please describe the failures that still exist? Thank you.
>
> I was referring to the ones that Quentin reported when starting this thread.
>
I believe this pmdomain patch can alleviate some of the problems [0].
However, the problem Quentin mentioned still exists. For example, if
there is an invalid device on core0, while core1 and core2 are valid,
accessing the relevant data on core0 will cause problems. Quentin's
patch can fix this, but he doesn't seem to have time for it lately...
[0] https://lore.kernel.org/all/20251205064739.20270-1-rmxpzlb@gmail.com/
> Regards,
>
> Tomeu
>
>
--
Best,
Chaoyi
_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Issues with DRM_ACCEL_ROCKET (Oops, power-domain, probe-ordering)
2025-12-10 10:47 ` Chaoyi Chen
@ 2025-12-10 16:39 ` Tomeu Vizoso
2025-12-12 10:25 ` Quentin Schulz
0 siblings, 1 reply; 9+ messages in thread
From: Tomeu Vizoso @ 2025-12-10 16:39 UTC (permalink / raw)
To: Chaoyi Chen
Cc: Quentin Schulz, Sebastian Reichel, Heiko Stuebner,
linux-arm-kernel, linux-rockchip
On Wed, Dec 10, 2025 at 11:47 AM Chaoyi Chen <chaoyi.chen@rock-chips.com> wrote:
>
> Hello Tomeu,
>
> On 12/10/2025 2:50 PM, Tomeu Vizoso wrote:
> > On Tue, Dec 9, 2025 at 10:31 AM Chaoyi Chen <chaoyi.chen@rock-chips.com> wrote:
> >>
> >> Hello Tomeu,
> >>
> >> On 12/9/2025 2:55 PM, Tomeu Vizoso wrote:
> >>> On Tue, Dec 2, 2025 at 8:00 AM Chaoyi Chen <chaoyi.chen@rock-chips.com> wrote:
> >>
> >> [...]
> >>
> >>>
> >>> This need was questioned during the review process, and during my
> >>> stress testing I didn't find a need for core 0 to be treated
> >>> specially. So it was considered unneeded complexity and the code was
> >>> dropped.
> >>>
> >>
> >> Oh, I just noticed that the commit comment in the devicetree still
> >> contains a relevant description.
> >>
> >>> My testing involves running the supported models as a whole, and also
> >>> their individual operations as individual tests. I run them in 8
> >>> parallel processes, in batches so we have continuous bring up and down
> >>> of clients.
> >>>
> >>> I'm a bit surprised that that testing worked stably but we have still
> >>> failures in some setups.
> >>
> >> Could you please describe the failures that still exist? Thank you.
> >
> > I was referring to the ones that Quentin reported when starting this thread.
> >
>
> I believe this pmdomain patch can alleviate some of the problems [0].
>
> However, the problem Quentin mentioned still exists. For example, if
> there is an invalid device on core0, while core1 and core2 are valid,
> accessing the relevant data on core0 will cause problems. Quentin's
> patch can fix this, but he doesn't seem to have time for it lately...
Has anybody reproduced this problem with the pmdomain fix applied?
Thanks,
Tomeu
> [0] https://lore.kernel.org/all/20251205064739.20270-1-rmxpzlb@gmail.com/
>
> > Regards,
> >
> > Tomeu
> >
> >
>
> --
> Best,
> Chaoyi
_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Issues with DRM_ACCEL_ROCKET (Oops, power-domain, probe-ordering)
2025-12-10 16:39 ` Tomeu Vizoso
@ 2025-12-12 10:25 ` Quentin Schulz
0 siblings, 0 replies; 9+ messages in thread
From: Quentin Schulz @ 2025-12-12 10:25 UTC (permalink / raw)
To: Tomeu Vizoso, Chaoyi Chen
Cc: Sebastian Reichel, Heiko Stuebner, linux-arm-kernel,
linux-rockchip
Hi Tomeu,
On 12/10/25 5:39 PM, Tomeu Vizoso wrote:
> [You don't often get email from tomeu@tomeuvizoso.net. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>
> On Wed, Dec 10, 2025 at 11:47 AM Chaoyi Chen <chaoyi.chen@rock-chips.com> wrote:
>>
>> Hello Tomeu,
>>
>> On 12/10/2025 2:50 PM, Tomeu Vizoso wrote:
>>> On Tue, Dec 9, 2025 at 10:31 AM Chaoyi Chen <chaoyi.chen@rock-chips.com> wrote:
>>>>
>>>> Hello Tomeu,
>>>>
>>>> On 12/9/2025 2:55 PM, Tomeu Vizoso wrote:
>>>>> On Tue, Dec 2, 2025 at 8:00 AM Chaoyi Chen <chaoyi.chen@rock-chips.com> wrote:
>>>>
>>>> [...]
>>>>
>>>>>
>>>>> This need was questioned during the review process, and during my
>>>>> stress testing I didn't find a need for core 0 to be treated
>>>>> specially. So it was considered unneeded complexity and the code was
>>>>> dropped.
>>>>>
>>>>
>>>> Oh, I just noticed that the commit comment in the devicetree still
>>>> contains a relevant description.
>>>>
>>>>> My testing involves running the supported models as a whole, and also
>>>>> their individual operations as individual tests. I run them in 8
>>>>> parallel processes, in batches so we have continuous bring up and down
>>>>> of clients.
>>>>>
>>>>> I'm a bit surprised that that testing worked stably but we have still
>>>>> failures in some setups.
>>>>
>>>> Could you please describe the failures that still exist? Thank you.
>>>
>>> I was referring to the ones that Quentin reported when starting this thread.
>>>
>>
>> I believe this pmdomain patch can alleviate some of the problems [0].
>>
>> However, the problem Quentin mentioned still exists. For example, if
>> there is an invalid device on core0, while core1 and core2 are valid,
>> accessing the relevant data on core0 will cause problems. Quentin's
>> patch can fix this, but he doesn't seem to have time for it lately...
>
> Has anybody reproduced this problem with the pmdomain fix applied?
>
If I've written my reboot-loop script properly, 715 successful boots and
0 failed, so I guess this works.
The script:
"""
#!/usr/bin/env bash
set -ux
myreboot() {
/bin/sync
echo rebooting
echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger
}
mount -t proc none /proc
mount -t sysfs none /sys
FAIL_TXT=/root/fail.txt
SUCCESS_TXT=/root/success.txt
[ -d /sys/bus/platform/drivers/rocket/fdab0000.npu/ ] && [ -d
/sys/bus/platform/drivers/rocket/fdac0000.npu/ ] && [ -d
/sys/bus/platform/drivers/rocket/fdad0000.npu/ ]
RES=$?
if [ $RES -ne 0 ]; then
VAL=$(cat $FAIL_TXT)
VAL=$(( VAL + 1 ))
echo -n $VAL > $FAIL_TXT
echo "FAILED!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
myreboot
fi
TEFLON_DEBUG=verbose /root/mesa/venv/bin/python3
/root/mesa/src/gallium/frontends/teflon/tests/classification.py -i
/root/mesa/grace_hopper.bmp -m
/root/mesa/src/gallium/targets/teflon/tests/models/mobilenetv1/mobilenet_v1_1_224_quant.tflite
-l
/root/mesa/src/gallium/frontends/teflon/tests/labels_mobilenet_quant_v1_224.txt
-e /root/mesa/builddir/src/gallium/targets/teflon/libteflon.so
RES=$?
if [ $RES -ne 0 ]; then
VAL=$(cat $FAIL_TXT)
VAL=$(( VAL + 1 ))
echo -n $VAL > $FAIL_TXT
echo "FAILED!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
else
VAL=$(cat $SUCCESS_TXT)
VAL=$(( VAL + 1 ))
echo -n $VAL > $SUCCESS_TXT
fi
myreboot
"""
Seems like the cores are always listed in
/sys/bus/platform/drivers/rocket/ even when one fails to probe, but then
classification.py triggers an Oops and it seems the exit code of that
script is not 0 when it happens. This info was gathered from the one
boot I tested *without* the patch applied.
Cheers,
Quentin
_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2025-12-12 10:26 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-28 18:41 Issues with DRM_ACCEL_ROCKET (Oops, power-domain, probe-ordering) Quentin Schulz
2025-12-01 6:24 ` Chaoyi Chen
2025-12-02 7:00 ` Chaoyi Chen
2025-12-09 6:55 ` Tomeu Vizoso
2025-12-09 9:31 ` Chaoyi Chen
2025-12-10 6:50 ` Tomeu Vizoso
2025-12-10 10:47 ` Chaoyi Chen
2025-12-10 16:39 ` Tomeu Vizoso
2025-12-12 10:25 ` Quentin Schulz
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox