* [PATCH 0/2] remoteproc: improve robustness for rproc_attach fail cases
@ 2026-04-09 8:46 Jingyi Wang
2026-04-09 8:46 ` [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path Jingyi Wang
2026-04-09 8:46 ` [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop() Jingyi Wang
0 siblings, 2 replies; 9+ messages in thread
From: Jingyi Wang @ 2026-04-09 8:46 UTC (permalink / raw)
To: Bjorn Andersson, Mathieu Poirier
Cc: aiqun.yu, tingwei.zhang, trilok.soni, yijie.yang,
linux-remoteproc, linux-kernel, linux-arm-msm, Jingyi Wang
Failure in rproc_attach() path can cause issues like NULL pointer
dereference when doing recovery and rproc_add() fail, improve the
robustness for rproc_attach fail cases.
Previous discussion:
https://lore.kernel.org/all/9bdc6b6d-ddf0-47af-b1ed-8d1e75bf30c2@oss.qualcomm.com/
Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
---
Jingyi Wang (2):
remoteproc: core: Attach rproc asynchronously in rproc_add() path
remoteproc: qcom: Check glink->edge in glink_subdev_stop()
drivers/remoteproc/qcom_common.c | 3 +++
drivers/remoteproc/remoteproc_core.c | 20 ++++++++++++--------
include/linux/remoteproc.h | 1 +
3 files changed, 16 insertions(+), 8 deletions(-)
---
base-commit: db7efce4ae23ad5e42f5f55428f529ff62b86fab
change-id: 20260409-rproc-attach-issue-2e3290d6d013
Best regards,
--
Jingyi Wang <jingyi.wang@oss.qualcomm.com>
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path
2026-04-09 8:46 [PATCH 0/2] remoteproc: improve robustness for rproc_attach fail cases Jingyi Wang
@ 2026-04-09 8:46 ` Jingyi Wang
2026-04-10 14:28 ` Stephan Gerhold
2026-04-09 8:46 ` [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop() Jingyi Wang
1 sibling, 1 reply; 9+ messages in thread
From: Jingyi Wang @ 2026-04-09 8:46 UTC (permalink / raw)
To: Bjorn Andersson, Mathieu Poirier
Cc: aiqun.yu, tingwei.zhang, trilok.soni, yijie.yang,
linux-remoteproc, linux-kernel, linux-arm-msm, Jingyi Wang
For rproc with state RPROC_DETACHED and auto_boot enabled, the attach
callback will be called in the rproc_add()->rproc_trigger_auto_boot()->
rproc_boot() path, the failure in this path will cause the rproc_add()
fail and the resource release, which will cause issue like rproc recovery
or falling back to firmware load fail. Add attach_work for rproc and call
it asynchronously in rproc_add() path like what rproc_start() do.
Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
---
drivers/remoteproc/remoteproc_core.c | 20 ++++++++++++--------
include/linux/remoteproc.h | 1 +
2 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index b087ed21858a..f02db1113fae 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -1673,18 +1673,21 @@ static void rproc_auto_boot_callback(const struct firmware *fw, void *context)
release_firmware(fw);
}
+static void rproc_attach_work(struct work_struct *work)
+{
+ struct rproc *rproc = container_of(work, struct rproc, attach_work);
+
+ rproc_boot(rproc);
+}
+
static int rproc_trigger_auto_boot(struct rproc *rproc)
{
int ret;
- /*
- * Since the remote processor is in a detached state, it has already
- * been booted by another entity. As such there is no point in waiting
- * for a firmware image to be loaded, we can simply initiate the process
- * of attaching to it immediately.
- */
- if (rproc->state == RPROC_DETACHED)
- return rproc_boot(rproc);
+ if (rproc->state == RPROC_DETACHED) {
+ schedule_work(&rproc->attach_work);
+ return 0;
+ }
/*
* We're initiating an asynchronous firmware loading, so we can
@@ -2512,6 +2515,7 @@ struct rproc *rproc_alloc(struct device *dev, const char *name,
INIT_LIST_HEAD(&rproc->dump_segments);
INIT_WORK(&rproc->crash_handler, rproc_crash_handler_work);
+ INIT_WORK(&rproc->attach_work, rproc_attach_work);
rproc->state = RPROC_OFFLINE;
diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
index b4795698d8c2..9b3f748e9eca 100644
--- a/include/linux/remoteproc.h
+++ b/include/linux/remoteproc.h
@@ -568,6 +568,7 @@ struct rproc {
struct list_head subdevs;
struct idr notifyids;
int index;
+ struct work_struct attach_work;
struct work_struct crash_handler;
unsigned int crash_cnt;
bool recovery_disabled;
--
2.34.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop()
2026-04-09 8:46 [PATCH 0/2] remoteproc: improve robustness for rproc_attach fail cases Jingyi Wang
2026-04-09 8:46 ` [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path Jingyi Wang
@ 2026-04-09 8:46 ` Jingyi Wang
2026-04-10 14:15 ` Stephan Gerhold
1 sibling, 1 reply; 9+ messages in thread
From: Jingyi Wang @ 2026-04-09 8:46 UTC (permalink / raw)
To: Bjorn Andersson, Mathieu Poirier
Cc: aiqun.yu, tingwei.zhang, trilok.soni, yijie.yang,
linux-remoteproc, linux-kernel, linux-arm-msm, Jingyi Wang
For rproc that doing attach, glink_subdev_start() is called only when
attach successfully. If rproc_report_crash() is called in the attach
function, rproc_boot_recovery()->rproc_stop()->glink_subdev_stop() could
be called and cause NULL pointer dereference:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
Mem abort info:
...
pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
...
Call trace:
qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] (P)
glink_subdev_stop+0x1c/0x30 [qcom_common]
rproc_stop+0x58/0x17c
rproc_trigger_recovery+0xb0/0x150
rproc_crash_handler_work+0xa4/0xc4
process_scheduled_works+0x18c/0x2d8
worker_thread+0x144/0x280
kthread+0x124/0x138
ret_from_fork+0x10/0x20
Code: a9be7bfd 910003fd a90153f3 aa0003f3 (b9430000)
---[ end trace 0000000000000000 ]---
Add NULL pointer check in the glink_subdev_stop() to make sure
qcom_glink_smem_unregister() will not be called if glink_subdev_start()
is not called.
Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
---
drivers/remoteproc/qcom_common.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/remoteproc/qcom_common.c b/drivers/remoteproc/qcom_common.c
index fd2b6824ad26..79d9d45e0b81 100644
--- a/drivers/remoteproc/qcom_common.c
+++ b/drivers/remoteproc/qcom_common.c
@@ -220,6 +220,9 @@ static void glink_subdev_stop(struct rproc_subdev *subdev, bool crashed)
{
struct qcom_rproc_glink *glink = to_glink_subdev(subdev);
+ if (!glink->edge)
+ return;
+
qcom_glink_smem_unregister(glink->edge);
glink->edge = NULL;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop()
2026-04-09 8:46 ` [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop() Jingyi Wang
@ 2026-04-10 14:15 ` Stephan Gerhold
2026-04-14 3:23 ` Jingyi Wang
0 siblings, 1 reply; 9+ messages in thread
From: Stephan Gerhold @ 2026-04-10 14:15 UTC (permalink / raw)
To: Jingyi Wang
Cc: Bjorn Andersson, Mathieu Poirier, aiqun.yu, tingwei.zhang,
trilok.soni, yijie.yang, linux-remoteproc, linux-kernel,
linux-arm-msm
On Thu, Apr 09, 2026 at 01:46:22AM -0700, Jingyi Wang wrote:
> For rproc that doing attach, glink_subdev_start() is called only when
> attach successfully. If rproc_report_crash() is called in the attach
> function, rproc_boot_recovery()->rproc_stop()->glink_subdev_stop() could
> be called and cause NULL pointer dereference:
>
> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
> Mem abort info:
> ...
> pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
> lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
> ...
> Call trace:
> qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] (P)
> glink_subdev_stop+0x1c/0x30 [qcom_common]
> rproc_stop+0x58/0x17c
> rproc_trigger_recovery+0xb0/0x150
> rproc_crash_handler_work+0xa4/0xc4
> process_scheduled_works+0x18c/0x2d8
> worker_thread+0x144/0x280
> kthread+0x124/0x138
> ret_from_fork+0x10/0x20
> Code: a9be7bfd 910003fd a90153f3 aa0003f3 (b9430000)
> ---[ end trace 0000000000000000 ]---
>
> Add NULL pointer check in the glink_subdev_stop() to make sure
> qcom_glink_smem_unregister() will not be called if glink_subdev_start()
> is not called.
>
You mention the actual root problem here: Why is glink_subdev_stop()
called if glink_subdev_start() wasn't called?
The call to rproc_start_subdevices() in __rproc_attach() makes sure that
all subdevices are in consistent state when exiting the function (either
prepared+started or stopped+unprepared). Only if all subdevices were
started successfully, the rproc->state is changed to RPROC_ATTACHED.
In your case, attaching the rproc failed so the rproc->state should be
still RPROC_DETACHED. All subdevices should be stopped+unprepared. We
shouldn't stop/unprepare any subdevices again in this state, they all
might crash like glink does here.
We know that subdevices are already stopped+unprepared in RPROC_DETACHED
state, so I think you just need to skip rproc_stop_subdevices() and
rproc_unprepare_subdevices() inside rproc_stop() in this case, see diff
below.
Thanks,
Stephan
@@ -1708,8 +1709,9 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
if (!rproc->ops->stop)
return -EINVAL;
- /* Stop any subdevices for the remote processor */
- rproc_stop_subdevices(rproc, crashed);
+ /* Stop any subdevices for the remote processor if it was attached */
+ if (rproc->state != RPROC_DETACHED)
+ rproc_stop_subdevices(rproc, crashed);
/* the installed resource table is no longer accessible */
ret = rproc_reset_rsc_table_on_stop(rproc);
@@ -1726,7 +1728,8 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
return ret;
}
- rproc_unprepare_subdevices(rproc);
+ if (rproc->state != RPROC_DETACHED)
+ rproc_unprepare_subdevices(rproc);
rproc->state = RPROC_OFFLINE;
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path
2026-04-09 8:46 ` [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path Jingyi Wang
@ 2026-04-10 14:28 ` Stephan Gerhold
2026-04-14 3:41 ` Jingyi Wang
0 siblings, 1 reply; 9+ messages in thread
From: Stephan Gerhold @ 2026-04-10 14:28 UTC (permalink / raw)
To: Jingyi Wang
Cc: Bjorn Andersson, Mathieu Poirier, aiqun.yu, tingwei.zhang,
trilok.soni, yijie.yang, linux-remoteproc, linux-kernel,
linux-arm-msm, Bartosz Golaszewski, Dmitry Baryshkov
+Cc Bartosz, Dmitry
On Thu, Apr 09, 2026 at 01:46:21AM -0700, Jingyi Wang wrote:
> For rproc with state RPROC_DETACHED and auto_boot enabled, the attach
> callback will be called in the rproc_add()->rproc_trigger_auto_boot()->
> rproc_boot() path, the failure in this path will cause the rproc_add()
> fail and the resource release, which will cause issue like rproc recovery
> or falling back to firmware load fail. Add attach_work for rproc and call
> it asynchronously in rproc_add() path like what rproc_start() do.
>
> Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
> ---
> drivers/remoteproc/remoteproc_core.c | 20 ++++++++++++--------
> include/linux/remoteproc.h | 1 +
> 2 files changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index b087ed21858a..f02db1113fae 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -1673,18 +1673,21 @@ static void rproc_auto_boot_callback(const struct firmware *fw, void *context)
> release_firmware(fw);
> }
>
> +static void rproc_attach_work(struct work_struct *work)
> +{
> + struct rproc *rproc = container_of(work, struct rproc, attach_work);
> +
> + rproc_boot(rproc);
> +}
> +
> static int rproc_trigger_auto_boot(struct rproc *rproc)
> {
> int ret;
>
> - /*
> - * Since the remote processor is in a detached state, it has already
> - * been booted by another entity. As such there is no point in waiting
> - * for a firmware image to be loaded, we can simply initiate the process
> - * of attaching to it immediately.
> - */
> - if (rproc->state == RPROC_DETACHED)
> - return rproc_boot(rproc);
> + if (rproc->state == RPROC_DETACHED) {
> + schedule_work(&rproc->attach_work);
> + return 0;
> + }
I think the change itself is reasonable to make "auto-attach" behavior
consistent with "auto-boot". The commit message is a bit misleading
though:
- You're really doing two separate functional changes here:
(1) Ignore the return value of rproc_boot() during auto-boot attach,
to keep the remoteproc registered and available in sysfs even if
attaching fails.
(2) Run the rproc_boot() in the background using schedule_work().
[To improve boot performance? To work around some locking issues?]
- The actual issue you are seeing sounds like a use-after-free in the
remoteproc core error cleanup path. I think this one is still
present, we should really have a call to
cancel_work_sync(&rproc->crash_handler) as Dmitry wrote in the
previous discussion [1].
Thanks,
Stephan
[1]: https://lore.kernel.org/all/ce24a2sgg4b6wymoxwgl2ve6np2nxn2wuxfqxfpmvqqrpvgouf@xihd6ziqwu4m/
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop()
2026-04-10 14:15 ` Stephan Gerhold
@ 2026-04-14 3:23 ` Jingyi Wang
2026-04-14 8:27 ` Stephan Gerhold
0 siblings, 1 reply; 9+ messages in thread
From: Jingyi Wang @ 2026-04-14 3:23 UTC (permalink / raw)
To: Stephan Gerhold
Cc: Bjorn Andersson, Mathieu Poirier, aiqun.yu, tingwei.zhang,
trilok.soni, yijie.yang, linux-remoteproc, linux-kernel,
linux-arm-msm
On 4/10/2026 10:15 PM, Stephan Gerhold wrote:
> On Thu, Apr 09, 2026 at 01:46:22AM -0700, Jingyi Wang wrote:
>> For rproc that doing attach, glink_subdev_start() is called only when
>> attach successfully. If rproc_report_crash() is called in the attach
>> function, rproc_boot_recovery()->rproc_stop()->glink_subdev_stop() could
>> be called and cause NULL pointer dereference:
>>
>> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
>> Mem abort info:
>> ...
>> pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
>> lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
>> ...
>> Call trace:
>> qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] (P)
>> glink_subdev_stop+0x1c/0x30 [qcom_common]
>> rproc_stop+0x58/0x17c
>> rproc_trigger_recovery+0xb0/0x150
>> rproc_crash_handler_work+0xa4/0xc4
>> process_scheduled_works+0x18c/0x2d8
>> worker_thread+0x144/0x280
>> kthread+0x124/0x138
>> ret_from_fork+0x10/0x20
>> Code: a9be7bfd 910003fd a90153f3 aa0003f3 (b9430000)
>> ---[ end trace 0000000000000000 ]---
>>
>> Add NULL pointer check in the glink_subdev_stop() to make sure
>> qcom_glink_smem_unregister() will not be called if glink_subdev_start()
>> is not called.
>>
>
> You mention the actual root problem here: Why is glink_subdev_stop()
> called if glink_subdev_start() wasn't called?
>
> The call to rproc_start_subdevices() in __rproc_attach() makes sure that
> all subdevices are in consistent state when exiting the function (either
> prepared+started or stopped+unprepared). Only if all subdevices were
> started successfully, the rproc->state is changed to RPROC_ATTACHED.
>
> In your case, attaching the rproc failed so the rproc->state should be
> still RPROC_DETACHED. All subdevices should be stopped+unprepared. We
> shouldn't stop/unprepare any subdevices again in this state, they all
> might crash like glink does here.
>
> We know that subdevices are already stopped+unprepared in RPROC_DETACHED
> state, so I think you just need to skip rproc_stop_subdevices() and
> rproc_unprepare_subdevices() inside rproc_stop() in this case, see diff
> below.
>
> Thanks,
> Stephan
>
> @@ -1708,8 +1709,9 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
> if (!rproc->ops->stop)
> return -EINVAL;
>
> - /* Stop any subdevices for the remote processor */
> - rproc_stop_subdevices(rproc, crashed);
> + /* Stop any subdevices for the remote processor if it was attached */
> + if (rproc->state != RPROC_DETACHED)
> + rproc_stop_subdevices(rproc, crashed);
>
> /* the installed resource table is no longer accessible */
> ret = rproc_reset_rsc_table_on_stop(rproc);
> @@ -1726,7 +1728,8 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
> return ret;
> }
>
> - rproc_unprepare_subdevices(rproc);
> + if (rproc->state != RPROC_DETACHED)
> + rproc_unprepare_subdevices(rproc);
>
> rproc->state = RPROC_OFFLINE;
>
Hi Stephen,
In this case, rproc_crash_handler_work()->rproc_trigger_recovery()->
rproc_boot_recovery()->rproc_stop()->glink_subdev_stop() is called,
"rproc->state = RPROC_CRASHED" is set in the rproc_crash_handler_work
before rproc_boot_recovery is called, so checking RPROC_DETACHED can
not work for this case.
Thanks,
Jingyi
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path
2026-04-10 14:28 ` Stephan Gerhold
@ 2026-04-14 3:41 ` Jingyi Wang
2026-04-14 8:13 ` Stephan Gerhold
0 siblings, 1 reply; 9+ messages in thread
From: Jingyi Wang @ 2026-04-14 3:41 UTC (permalink / raw)
To: Stephan Gerhold
Cc: Bjorn Andersson, Mathieu Poirier, aiqun.yu, tingwei.zhang,
trilok.soni, yijie.yang, linux-remoteproc, linux-kernel,
linux-arm-msm, Bartosz Golaszewski, Dmitry Baryshkov
On 4/10/2026 10:28 PM, Stephan Gerhold wrote:
> +Cc Bartosz, Dmitry
>
> On Thu, Apr 09, 2026 at 01:46:21AM -0700, Jingyi Wang wrote:
>> For rproc with state RPROC_DETACHED and auto_boot enabled, the attach
>> callback will be called in the rproc_add()->rproc_trigger_auto_boot()->
>> rproc_boot() path, the failure in this path will cause the rproc_add()
>> fail and the resource release, which will cause issue like rproc recovery
>> or falling back to firmware load fail. Add attach_work for rproc and call
>> it asynchronously in rproc_add() path like what rproc_start() do.
>>
>> Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
>> ---
>> drivers/remoteproc/remoteproc_core.c | 20 ++++++++++++--------
>> include/linux/remoteproc.h | 1 +
>> 2 files changed, 13 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
>> index b087ed21858a..f02db1113fae 100644
>> --- a/drivers/remoteproc/remoteproc_core.c
>> +++ b/drivers/remoteproc/remoteproc_core.c
>> @@ -1673,18 +1673,21 @@ static void rproc_auto_boot_callback(const struct firmware *fw, void *context)
>> release_firmware(fw);
>> }
>>
>> +static void rproc_attach_work(struct work_struct *work)
>> +{
>> + struct rproc *rproc = container_of(work, struct rproc, attach_work);
>> +
>> + rproc_boot(rproc);
>> +}
>> +
>> static int rproc_trigger_auto_boot(struct rproc *rproc)
>> {
>> int ret;
>>
>> - /*
>> - * Since the remote processor is in a detached state, it has already
>> - * been booted by another entity. As such there is no point in waiting
>> - * for a firmware image to be loaded, we can simply initiate the process
>> - * of attaching to it immediately.
>> - */
>> - if (rproc->state == RPROC_DETACHED)
>> - return rproc_boot(rproc);
>> + if (rproc->state == RPROC_DETACHED) {
>> + schedule_work(&rproc->attach_work);
>> + return 0;
>> + }
>
> I think the change itself is reasonable to make "auto-attach" behavior
> consistent with "auto-boot". The commit message is a bit misleading
> though:
>
> - You're really doing two separate functional changes here:
>
> (1) Ignore the return value of rproc_boot() during auto-boot attach,
> to keep the remoteproc registered and available in sysfs even if
> attaching fails.
> (2) Run the rproc_boot() in the background using schedule_work().
> [To improve boot performance? To work around some locking issues?]
>
> - The actual issue you are seeing sounds like a use-after-free in the
> remoteproc core error cleanup path. I think this one is still
> present, we should really have a call to
> cancel_work_sync(&rproc->crash_handler) as Dmitry wrote in the
> previous discussion [1].
>
> Thanks,
> Stephan
>
> [1]: https://lore.kernel.org/all/ce24a2sgg4b6wymoxwgl2ve6np2nxn2wuxfqxfpmvqqrpvgouf@xihd6ziqwu4m/
Hi Stephan,
Exactly as you say, what this change do is allowing rproc_attach return false.
It should be okay to keep this change and describe it more clear in commit msg
in next version?
And the use-after-free issue is what we want to resolve in the patch2
in this series, I think cancel_work_sync() is a reasonable change
but it cannot resolve this issue as the worker could be executing when
we call this(and this is what it behaves when I did local test) and
the use-after-free issue still exists. Shall we send a separate patch
for this cancel_work_sync?
Thanks,
Jingyi
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path
2026-04-14 3:41 ` Jingyi Wang
@ 2026-04-14 8:13 ` Stephan Gerhold
0 siblings, 0 replies; 9+ messages in thread
From: Stephan Gerhold @ 2026-04-14 8:13 UTC (permalink / raw)
To: Jingyi Wang
Cc: Bjorn Andersson, Mathieu Poirier, aiqun.yu, tingwei.zhang,
trilok.soni, yijie.yang, linux-remoteproc, linux-kernel,
linux-arm-msm, Bartosz Golaszewski, Dmitry Baryshkov
On Tue, Apr 14, 2026 at 11:41:39AM +0800, Jingyi Wang wrote:
>
>
> On 4/10/2026 10:28 PM, Stephan Gerhold wrote:
> > +Cc Bartosz, Dmitry
> >
> > On Thu, Apr 09, 2026 at 01:46:21AM -0700, Jingyi Wang wrote:
> > > For rproc with state RPROC_DETACHED and auto_boot enabled, the attach
> > > callback will be called in the rproc_add()->rproc_trigger_auto_boot()->
> > > rproc_boot() path, the failure in this path will cause the rproc_add()
> > > fail and the resource release, which will cause issue like rproc recovery
> > > or falling back to firmware load fail. Add attach_work for rproc and call
> > > it asynchronously in rproc_add() path like what rproc_start() do.
> > >
> > > Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
> > > ---
> > > drivers/remoteproc/remoteproc_core.c | 20 ++++++++++++--------
> > > include/linux/remoteproc.h | 1 +
> > > 2 files changed, 13 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > > index b087ed21858a..f02db1113fae 100644
> > > --- a/drivers/remoteproc/remoteproc_core.c
> > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > @@ -1673,18 +1673,21 @@ static void rproc_auto_boot_callback(const struct firmware *fw, void *context)
> > > release_firmware(fw);
> > > }
> > > +static void rproc_attach_work(struct work_struct *work)
> > > +{
> > > + struct rproc *rproc = container_of(work, struct rproc, attach_work);
> > > +
> > > + rproc_boot(rproc);
> > > +}
> > > +
> > > static int rproc_trigger_auto_boot(struct rproc *rproc)
> > > {
> > > int ret;
> > > - /*
> > > - * Since the remote processor is in a detached state, it has already
> > > - * been booted by another entity. As such there is no point in waiting
> > > - * for a firmware image to be loaded, we can simply initiate the process
> > > - * of attaching to it immediately.
> > > - */
> > > - if (rproc->state == RPROC_DETACHED)
> > > - return rproc_boot(rproc);
> > > + if (rproc->state == RPROC_DETACHED) {
> > > + schedule_work(&rproc->attach_work);
> > > + return 0;
> > > + }
> >
> > I think the change itself is reasonable to make "auto-attach" behavior
> > consistent with "auto-boot". The commit message is a bit misleading
> > though:
> >
> > - You're really doing two separate functional changes here:
> >
> > (1) Ignore the return value of rproc_boot() during auto-boot attach,
> > to keep the remoteproc registered and available in sysfs even if
> > attaching fails.
> > (2) Run the rproc_boot() in the background using schedule_work().
> > [To improve boot performance? To work around some locking issues?]
> >
> > - The actual issue you are seeing sounds like a use-after-free in the
> > remoteproc core error cleanup path. I think this one is still
> > present, we should really have a call to
> > cancel_work_sync(&rproc->crash_handler) as Dmitry wrote in the
> > previous discussion [1].
> >
> > Thanks,
> > Stephan
> >
> > [1]: https://lore.kernel.org/all/ce24a2sgg4b6wymoxwgl2ve6np2nxn2wuxfqxfpmvqqrpvgouf@xihd6ziqwu4m/
>
> Hi Stephan,
>
> Exactly as you say, what this change do is allowing rproc_attach return false.
> It should be okay to keep this change and describe it more clear in commit msg
> in next version?
>
That's fine for me.
> And the use-after-free issue is what we want to resolve in the patch2
> in this series, I think cancel_work_sync() is a reasonable change
> but it cannot resolve this issue as the worker could be executing when
> we call this(and this is what it behaves when I did local test) and
> the use-after-free issue still exists. Shall we send a separate patch
> for this cancel_work_sync?
>
cancel_work_sync() should wait until the worker execution has finished.
If you call it before freeing the resources (= deleting the remoteproc),
I would expect it should work as expected.
It makes sense to have separate patches for this, one is about fixing
the use-after-free issue, the other is more about the behavior when the
initial auto-boot fails.
Thanks,
Stephan
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop()
2026-04-14 3:23 ` Jingyi Wang
@ 2026-04-14 8:27 ` Stephan Gerhold
0 siblings, 0 replies; 9+ messages in thread
From: Stephan Gerhold @ 2026-04-14 8:27 UTC (permalink / raw)
To: Jingyi Wang
Cc: Bjorn Andersson, Mathieu Poirier, aiqun.yu, tingwei.zhang,
trilok.soni, yijie.yang, linux-remoteproc, linux-kernel,
linux-arm-msm
On Tue, Apr 14, 2026 at 11:23:50AM +0800, Jingyi Wang wrote:
> On 4/10/2026 10:15 PM, Stephan Gerhold wrote:
> > On Thu, Apr 09, 2026 at 01:46:22AM -0700, Jingyi Wang wrote:
> > > For rproc that doing attach, glink_subdev_start() is called only when
> > > attach successfully. If rproc_report_crash() is called in the attach
> > > function, rproc_boot_recovery()->rproc_stop()->glink_subdev_stop() could
> > > be called and cause NULL pointer dereference:
> > >
> > > Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
> > > Mem abort info:
> > > ...
> > > pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
> > > lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
> > > ...
> > > Call trace:
> > > qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] (P)
> > > glink_subdev_stop+0x1c/0x30 [qcom_common]
> > > rproc_stop+0x58/0x17c
> > > rproc_trigger_recovery+0xb0/0x150
> > > rproc_crash_handler_work+0xa4/0xc4
> > > process_scheduled_works+0x18c/0x2d8
> > > worker_thread+0x144/0x280
> > > kthread+0x124/0x138
> > > ret_from_fork+0x10/0x20
> > > Code: a9be7bfd 910003fd a90153f3 aa0003f3 (b9430000)
> > > ---[ end trace 0000000000000000 ]---
> > >
> > > Add NULL pointer check in the glink_subdev_stop() to make sure
> > > qcom_glink_smem_unregister() will not be called if glink_subdev_start()
> > > is not called.
> > >
> >
> > You mention the actual root problem here: Why is glink_subdev_stop()
> > called if glink_subdev_start() wasn't called?
> >
> > The call to rproc_start_subdevices() in __rproc_attach() makes sure that
> > all subdevices are in consistent state when exiting the function (either
> > prepared+started or stopped+unprepared). Only if all subdevices were
> > started successfully, the rproc->state is changed to RPROC_ATTACHED.
> >
> > In your case, attaching the rproc failed so the rproc->state should be
> > still RPROC_DETACHED. All subdevices should be stopped+unprepared. We
> > shouldn't stop/unprepare any subdevices again in this state, they all
> > might crash like glink does here.
> >
> > We know that subdevices are already stopped+unprepared in RPROC_DETACHED
> > state, so I think you just need to skip rproc_stop_subdevices() and
> > rproc_unprepare_subdevices() inside rproc_stop() in this case, see diff
> > below.
> >
> > @@ -1708,8 +1709,9 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
> > if (!rproc->ops->stop)
> > return -EINVAL;
> > - /* Stop any subdevices for the remote processor */
> > - rproc_stop_subdevices(rproc, crashed);
> > + /* Stop any subdevices for the remote processor if it was attached */
> > + if (rproc->state != RPROC_DETACHED)
> > + rproc_stop_subdevices(rproc, crashed);
> > /* the installed resource table is no longer accessible */
> > ret = rproc_reset_rsc_table_on_stop(rproc);
> > @@ -1726,7 +1728,8 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
> > return ret;
> > }
> > - rproc_unprepare_subdevices(rproc);
> > + if (rproc->state != RPROC_DETACHED)
> > + rproc_unprepare_subdevices(rproc);
> > rproc->state = RPROC_OFFLINE;
>
> In this case, rproc_crash_handler_work()->rproc_trigger_recovery()->
> rproc_boot_recovery()->rproc_stop()->glink_subdev_stop() is called,
> "rproc->state = RPROC_CRASHED" is set in the rproc_crash_handler_work
> before rproc_boot_recovery is called, so checking RPROC_DETACHED can
> not work for this case.
>
You're right, I forgot about that. I think we need a more generic
solution for this though. rproc_stop_subdevices() should not be called
without a prior call to rproc_start_subdevices().
I think there are a couple of options for this:
- Add a bool "subdevs_started" to struct rproc and manage that
separately from the rproc->state.
- Track the rproc state before the crash separately (something like
rproc->state_before_crash) and check that in the stop path.
- Add a new state RPROC_CRASHED_DETACHED to make sure the states are
unique.
- ...
Does the same issue also exist in qcom_pas_stop() of "[PATCH v5 4/5]
remoteproc: qcom: pas: Add late attach support for subsystems" [1]?
There you check for pas->rproc->state != RPROC_ATTACHED, wouldn't this
also fail for the RPROC_CRASHED case?
Thanks,
Stephan
[1]: https://lore.kernel.org/linux-arm-msm/20260409-knp-soccp-v5-4-805a492124da@oss.qualcomm.com/
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2026-04-14 8:27 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-09 8:46 [PATCH 0/2] remoteproc: improve robustness for rproc_attach fail cases Jingyi Wang
2026-04-09 8:46 ` [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path Jingyi Wang
2026-04-10 14:28 ` Stephan Gerhold
2026-04-14 3:41 ` Jingyi Wang
2026-04-14 8:13 ` Stephan Gerhold
2026-04-09 8:46 ` [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop() Jingyi Wang
2026-04-10 14:15 ` Stephan Gerhold
2026-04-14 3:23 ` Jingyi Wang
2026-04-14 8:27 ` Stephan Gerhold
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox