[PATCH 0/2] remoteproc: improve robustness for rproc

public inbox for linux-remoteproc@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] remoteproc: improve robustness for rproc_attach fail cases
@ 2026-04-09  8:46 Jingyi Wang
  2026-04-09  8:46 ` [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path Jingyi Wang
  2026-04-09  8:46 ` [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop() Jingyi Wang
  0 siblings, 2 replies; 9+ messages in thread
From: Jingyi Wang @ 2026-04-09  8:46 UTC (permalink / raw)
  To: Bjorn Andersson, Mathieu Poirier
  Cc: aiqun.yu, tingwei.zhang, trilok.soni, yijie.yang,
	linux-remoteproc, linux-kernel, linux-arm-msm, Jingyi Wang

Failure in rproc_attach() path can cause issues like NULL pointer
dereference when doing recovery and rproc_add() fail, improve the
robustness for rproc_attach fail cases.

Previous discussion:
https://lore.kernel.org/all/9bdc6b6d-ddf0-47af-b1ed-8d1e75bf30c2@oss.qualcomm.com/

Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
---
Jingyi Wang (2):
      remoteproc: core: Attach rproc asynchronously in rproc_add() path
      remoteproc: qcom: Check glink->edge in glink_subdev_stop()

 drivers/remoteproc/qcom_common.c     |  3 +++
 drivers/remoteproc/remoteproc_core.c | 20 ++++++++++++--------
 include/linux/remoteproc.h           |  1 +
 3 files changed, 16 insertions(+), 8 deletions(-)
---
base-commit: db7efce4ae23ad5e42f5f55428f529ff62b86fab
change-id: 20260409-rproc-attach-issue-2e3290d6d013

Best regards,
-- 
Jingyi Wang <jingyi.wang@oss.qualcomm.com>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path
  2026-04-09  8:46 [PATCH 0/2] remoteproc: improve robustness for rproc_attach fail cases Jingyi Wang
@ 2026-04-09  8:46 ` Jingyi Wang
  2026-04-10 14:28   ` Stephan Gerhold
  2026-04-09  8:46 ` [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop() Jingyi Wang
  1 sibling, 1 reply; 9+ messages in thread
From: Jingyi Wang @ 2026-04-09  8:46 UTC (permalink / raw)
  To: Bjorn Andersson, Mathieu Poirier
  Cc: aiqun.yu, tingwei.zhang, trilok.soni, yijie.yang,
	linux-remoteproc, linux-kernel, linux-arm-msm, Jingyi Wang

For rproc with state RPROC_DETACHED and auto_boot enabled, the attach
callback will be called in the rproc_add()->rproc_trigger_auto_boot()->
rproc_boot() path, the failure in this path will cause the rproc_add()
fail and the resource release, which will cause issue like rproc recovery
or falling back to firmware load fail. Add attach_work for rproc and call
it asynchronously in rproc_add() path like what rproc_start() do.

Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
---
 drivers/remoteproc/remoteproc_core.c | 20 ++++++++++++--------
 include/linux/remoteproc.h           |  1 +
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index b087ed21858a..f02db1113fae 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -1673,18 +1673,21 @@ static void rproc_auto_boot_callback(const struct firmware *fw, void *context)
 	release_firmware(fw);
 }
 
+static void rproc_attach_work(struct work_struct *work)
+{
+	struct rproc *rproc = container_of(work, struct rproc, attach_work);
+
+	rproc_boot(rproc);
+}
+
 static int rproc_trigger_auto_boot(struct rproc *rproc)
 {
 	int ret;
 
-	/*
-	 * Since the remote processor is in a detached state, it has already
-	 * been booted by another entity.  As such there is no point in waiting
-	 * for a firmware image to be loaded, we can simply initiate the process
-	 * of attaching to it immediately.
-	 */
-	if (rproc->state == RPROC_DETACHED)
-		return rproc_boot(rproc);
+	if (rproc->state == RPROC_DETACHED) {
+		schedule_work(&rproc->attach_work);
+		return 0;
+	}
 
 	/*
 	 * We're initiating an asynchronous firmware loading, so we can
@@ -2512,6 +2515,7 @@ struct rproc *rproc_alloc(struct device *dev, const char *name,
 	INIT_LIST_HEAD(&rproc->dump_segments);
 
 	INIT_WORK(&rproc->crash_handler, rproc_crash_handler_work);
+	INIT_WORK(&rproc->attach_work, rproc_attach_work);
 
 	rproc->state = RPROC_OFFLINE;
 
diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
index b4795698d8c2..9b3f748e9eca 100644
--- a/include/linux/remoteproc.h
+++ b/include/linux/remoteproc.h
@@ -568,6 +568,7 @@ struct rproc {
 	struct list_head subdevs;
 	struct idr notifyids;
 	int index;
+	struct work_struct attach_work;
 	struct work_struct crash_handler;
 	unsigned int crash_cnt;
 	bool recovery_disabled;

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop()
  2026-04-09  8:46 [PATCH 0/2] remoteproc: improve robustness for rproc_attach fail cases Jingyi Wang
  2026-04-09  8:46 ` [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path Jingyi Wang
@ 2026-04-09  8:46 ` Jingyi Wang
  2026-04-10 14:15   ` Stephan Gerhold
  1 sibling, 1 reply; 9+ messages in thread
From: Jingyi Wang @ 2026-04-09  8:46 UTC (permalink / raw)
  To: Bjorn Andersson, Mathieu Poirier
  Cc: aiqun.yu, tingwei.zhang, trilok.soni, yijie.yang,
	linux-remoteproc, linux-kernel, linux-arm-msm, Jingyi Wang

For rproc that doing attach, glink_subdev_start() is called only when
attach successfully. If rproc_report_crash() is called in the attach
function, rproc_boot_recovery()->rproc_stop()->glink_subdev_stop() could
be called and cause NULL pointer dereference:

 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
 Mem abort info:
 ...
 pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
 lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
 ...
 Call trace:
  qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] (P)
  glink_subdev_stop+0x1c/0x30 [qcom_common]
  rproc_stop+0x58/0x17c
  rproc_trigger_recovery+0xb0/0x150
  rproc_crash_handler_work+0xa4/0xc4
  process_scheduled_works+0x18c/0x2d8
  worker_thread+0x144/0x280
  kthread+0x124/0x138
  ret_from_fork+0x10/0x20
 Code: a9be7bfd 910003fd a90153f3 aa0003f3 (b9430000)
 ---[ end trace 0000000000000000 ]---

Add NULL pointer check in the glink_subdev_stop() to make sure
qcom_glink_smem_unregister() will not be called if glink_subdev_start()
is not called.

Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
---
 drivers/remoteproc/qcom_common.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/remoteproc/qcom_common.c b/drivers/remoteproc/qcom_common.c
index fd2b6824ad26..79d9d45e0b81 100644
--- a/drivers/remoteproc/qcom_common.c
+++ b/drivers/remoteproc/qcom_common.c
@@ -220,6 +220,9 @@ static void glink_subdev_stop(struct rproc_subdev *subdev, bool crashed)
 {
 	struct qcom_rproc_glink *glink = to_glink_subdev(subdev);
 
+	if (!glink->edge)
+		return;
+
 	qcom_glink_smem_unregister(glink->edge);
 	glink->edge = NULL;
 }

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop()
  2026-04-09  8:46 ` [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop() Jingyi Wang
@ 2026-04-10 14:15   ` Stephan Gerhold
  2026-04-14  3:23     ` Jingyi Wang
  0 siblings, 1 reply; 9+ messages in thread
From: Stephan Gerhold @ 2026-04-10 14:15 UTC (permalink / raw)
  To: Jingyi Wang
  Cc: Bjorn Andersson, Mathieu Poirier, aiqun.yu, tingwei.zhang,
	trilok.soni, yijie.yang, linux-remoteproc, linux-kernel,
	linux-arm-msm

On Thu, Apr 09, 2026 at 01:46:22AM -0700, Jingyi Wang wrote:
> For rproc that doing attach, glink_subdev_start() is called only when
> attach successfully. If rproc_report_crash() is called in the attach
> function, rproc_boot_recovery()->rproc_stop()->glink_subdev_stop() could
> be called and cause NULL pointer dereference:
> 
>  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
>  Mem abort info:
>  ...
>  pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
>  lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
>  ...
>  Call trace:
>   qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] (P)
>   glink_subdev_stop+0x1c/0x30 [qcom_common]
>   rproc_stop+0x58/0x17c
>   rproc_trigger_recovery+0xb0/0x150
>   rproc_crash_handler_work+0xa4/0xc4
>   process_scheduled_works+0x18c/0x2d8
>   worker_thread+0x144/0x280
>   kthread+0x124/0x138
>   ret_from_fork+0x10/0x20
>  Code: a9be7bfd 910003fd a90153f3 aa0003f3 (b9430000)
>  ---[ end trace 0000000000000000 ]---
> 
> Add NULL pointer check in the glink_subdev_stop() to make sure
> qcom_glink_smem_unregister() will not be called if glink_subdev_start()
> is not called.
> 

You mention the actual root problem here: Why is glink_subdev_stop()
called if glink_subdev_start() wasn't called?

The call to rproc_start_subdevices() in __rproc_attach() makes sure that
all subdevices are in consistent state when exiting the function (either
prepared+started or stopped+unprepared). Only if all subdevices were
started successfully, the rproc->state is changed to RPROC_ATTACHED.

In your case, attaching the rproc failed so the rproc->state should be
still RPROC_DETACHED. All subdevices should be stopped+unprepared. We
shouldn't stop/unprepare any subdevices again in this state, they all
might crash like glink does here.

We know that subdevices are already stopped+unprepared in RPROC_DETACHED
state, so I think you just need to skip rproc_stop_subdevices() and
rproc_unprepare_subdevices() inside rproc_stop() in this case, see diff
below.

Thanks,
Stephan

@@ -1708,8 +1709,9 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
 	if (!rproc->ops->stop)
 		return -EINVAL;
 
-	/* Stop any subdevices for the remote processor */
-	rproc_stop_subdevices(rproc, crashed);
+	/* Stop any subdevices for the remote processor if it was attached */
+	if (rproc->state != RPROC_DETACHED)
+		rproc_stop_subdevices(rproc, crashed);
 
 	/* the installed resource table is no longer accessible */
 	ret = rproc_reset_rsc_table_on_stop(rproc);
@@ -1726,7 +1728,8 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
 		return ret;
 	}
 
-	rproc_unprepare_subdevices(rproc);
+	if (rproc->state != RPROC_DETACHED)
+		rproc_unprepare_subdevices(rproc);
 
 	rproc->state = RPROC_OFFLINE;
 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path
  2026-04-09  8:46 ` [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path Jingyi Wang
@ 2026-04-10 14:28   ` Stephan Gerhold
  2026-04-14  3:41     ` Jingyi Wang
  0 siblings, 1 reply; 9+ messages in thread
From: Stephan Gerhold @ 2026-04-10 14:28 UTC (permalink / raw)
  To: Jingyi Wang
  Cc: Bjorn Andersson, Mathieu Poirier, aiqun.yu, tingwei.zhang,
	trilok.soni, yijie.yang, linux-remoteproc, linux-kernel,
	linux-arm-msm, Bartosz Golaszewski, Dmitry Baryshkov

+Cc Bartosz, Dmitry

On Thu, Apr 09, 2026 at 01:46:21AM -0700, Jingyi Wang wrote:
> For rproc with state RPROC_DETACHED and auto_boot enabled, the attach
> callback will be called in the rproc_add()->rproc_trigger_auto_boot()->
> rproc_boot() path, the failure in this path will cause the rproc_add()
> fail and the resource release, which will cause issue like rproc recovery
> or falling back to firmware load fail. Add attach_work for rproc and call
> it asynchronously in rproc_add() path like what rproc_start() do.
> 
> Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
> ---
>  drivers/remoteproc/remoteproc_core.c | 20 ++++++++++++--------
>  include/linux/remoteproc.h           |  1 +
>  2 files changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index b087ed21858a..f02db1113fae 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -1673,18 +1673,21 @@ static void rproc_auto_boot_callback(const struct firmware *fw, void *context)
>  	release_firmware(fw);
>  }
>  
> +static void rproc_attach_work(struct work_struct *work)
> +{
> +	struct rproc *rproc = container_of(work, struct rproc, attach_work);
> +
> +	rproc_boot(rproc);
> +}
> +
>  static int rproc_trigger_auto_boot(struct rproc *rproc)
>  {
>  	int ret;
>  
> -	/*
> -	 * Since the remote processor is in a detached state, it has already
> -	 * been booted by another entity.  As such there is no point in waiting
> -	 * for a firmware image to be loaded, we can simply initiate the process
> -	 * of attaching to it immediately.
> -	 */
> -	if (rproc->state == RPROC_DETACHED)
> -		return rproc_boot(rproc);
> +	if (rproc->state == RPROC_DETACHED) {
> +		schedule_work(&rproc->attach_work);
> +		return 0;
> +	}

I think the change itself is reasonable to make "auto-attach" behavior
consistent with "auto-boot". The commit message is a bit misleading
though:

 - You're really doing two separate functional changes here:

   (1) Ignore the return value of rproc_boot() during auto-boot attach,
       to keep the remoteproc registered and available in sysfs even if
       attaching fails.
   (2) Run the rproc_boot() in the background using schedule_work().
       [To improve boot performance? To work around some locking issues?]

 - The actual issue you are seeing sounds like a use-after-free in the
   remoteproc core error cleanup path. I think this one is still
   present, we should really have a call to
   cancel_work_sync(&rproc->crash_handler) as Dmitry wrote in the
   previous discussion [1].

Thanks,
Stephan

[1]: https://lore.kernel.org/all/ce24a2sgg4b6wymoxwgl2ve6np2nxn2wuxfqxfpmvqqrpvgouf@xihd6ziqwu4m/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop()
  2026-04-10 14:15   ` Stephan Gerhold
@ 2026-04-14  3:23     ` Jingyi Wang
  2026-04-14  8:27       ` Stephan Gerhold
  0 siblings, 1 reply; 9+ messages in thread
From: Jingyi Wang @ 2026-04-14  3:23 UTC (permalink / raw)
  To: Stephan Gerhold
  Cc: Bjorn Andersson, Mathieu Poirier, aiqun.yu, tingwei.zhang,
	trilok.soni, yijie.yang, linux-remoteproc, linux-kernel,
	linux-arm-msm



On 4/10/2026 10:15 PM, Stephan Gerhold wrote:
> On Thu, Apr 09, 2026 at 01:46:22AM -0700, Jingyi Wang wrote:
>> For rproc that doing attach, glink_subdev_start() is called only when
>> attach successfully. If rproc_report_crash() is called in the attach
>> function, rproc_boot_recovery()->rproc_stop()->glink_subdev_stop() could
>> be called and cause NULL pointer dereference:
>>
>>   Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
>>   Mem abort info:
>>   ...
>>   pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
>>   lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
>>   ...
>>   Call trace:
>>    qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] (P)
>>    glink_subdev_stop+0x1c/0x30 [qcom_common]
>>    rproc_stop+0x58/0x17c
>>    rproc_trigger_recovery+0xb0/0x150
>>    rproc_crash_handler_work+0xa4/0xc4
>>    process_scheduled_works+0x18c/0x2d8
>>    worker_thread+0x144/0x280
>>    kthread+0x124/0x138
>>    ret_from_fork+0x10/0x20
>>   Code: a9be7bfd 910003fd a90153f3 aa0003f3 (b9430000)
>>   ---[ end trace 0000000000000000 ]---
>>
>> Add NULL pointer check in the glink_subdev_stop() to make sure
>> qcom_glink_smem_unregister() will not be called if glink_subdev_start()
>> is not called.
>>
> 
> You mention the actual root problem here: Why is glink_subdev_stop()
> called if glink_subdev_start() wasn't called?
> 
> The call to rproc_start_subdevices() in __rproc_attach() makes sure that
> all subdevices are in consistent state when exiting the function (either
> prepared+started or stopped+unprepared). Only if all subdevices were
> started successfully, the rproc->state is changed to RPROC_ATTACHED.
> 
> In your case, attaching the rproc failed so the rproc->state should be
> still RPROC_DETACHED. All subdevices should be stopped+unprepared. We
> shouldn't stop/unprepare any subdevices again in this state, they all
> might crash like glink does here.
> 
> We know that subdevices are already stopped+unprepared in RPROC_DETACHED
> state, so I think you just need to skip rproc_stop_subdevices() and
> rproc_unprepare_subdevices() inside rproc_stop() in this case, see diff
> below.
> 
> Thanks,
> Stephan
> 
> @@ -1708,8 +1709,9 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
>   	if (!rproc->ops->stop)
>   		return -EINVAL;
>   
> -	/* Stop any subdevices for the remote processor */
> -	rproc_stop_subdevices(rproc, crashed);
> +	/* Stop any subdevices for the remote processor if it was attached */
> +	if (rproc->state != RPROC_DETACHED)
> +		rproc_stop_subdevices(rproc, crashed);
>   
>   	/* the installed resource table is no longer accessible */
>   	ret = rproc_reset_rsc_table_on_stop(rproc);
> @@ -1726,7 +1728,8 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
>   		return ret;
>   	}
>   
> -	rproc_unprepare_subdevices(rproc);
> +	if (rproc->state != RPROC_DETACHED)
> +		rproc_unprepare_subdevices(rproc);
>   
>   	rproc->state = RPROC_OFFLINE;
>   

Hi Stephen,

In this case, rproc_crash_handler_work()->rproc_trigger_recovery()->
rproc_boot_recovery()->rproc_stop()->glink_subdev_stop() is called,
"rproc->state = RPROC_CRASHED" is set in the rproc_crash_handler_work
before rproc_boot_recovery is called, so checking RPROC_DETACHED can
not work for this case.

Thanks,
Jingyi



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path
  2026-04-10 14:28   ` Stephan Gerhold
@ 2026-04-14  3:41     ` Jingyi Wang
  2026-04-14  8:13       ` Stephan Gerhold
  0 siblings, 1 reply; 9+ messages in thread
From: Jingyi Wang @ 2026-04-14  3:41 UTC (permalink / raw)
  To: Stephan Gerhold
  Cc: Bjorn Andersson, Mathieu Poirier, aiqun.yu, tingwei.zhang,
	trilok.soni, yijie.yang, linux-remoteproc, linux-kernel,
	linux-arm-msm, Bartosz Golaszewski, Dmitry Baryshkov



On 4/10/2026 10:28 PM, Stephan Gerhold wrote:
> +Cc Bartosz, Dmitry
> 
> On Thu, Apr 09, 2026 at 01:46:21AM -0700, Jingyi Wang wrote:
>> For rproc with state RPROC_DETACHED and auto_boot enabled, the attach
>> callback will be called in the rproc_add()->rproc_trigger_auto_boot()->
>> rproc_boot() path, the failure in this path will cause the rproc_add()
>> fail and the resource release, which will cause issue like rproc recovery
>> or falling back to firmware load fail. Add attach_work for rproc and call
>> it asynchronously in rproc_add() path like what rproc_start() do.
>>
>> Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
>> ---
>>   drivers/remoteproc/remoteproc_core.c | 20 ++++++++++++--------
>>   include/linux/remoteproc.h           |  1 +
>>   2 files changed, 13 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
>> index b087ed21858a..f02db1113fae 100644
>> --- a/drivers/remoteproc/remoteproc_core.c
>> +++ b/drivers/remoteproc/remoteproc_core.c
>> @@ -1673,18 +1673,21 @@ static void rproc_auto_boot_callback(const struct firmware *fw, void *context)
>>   	release_firmware(fw);
>>   }
>>   
>> +static void rproc_attach_work(struct work_struct *work)
>> +{
>> +	struct rproc *rproc = container_of(work, struct rproc, attach_work);
>> +
>> +	rproc_boot(rproc);
>> +}
>> +
>>   static int rproc_trigger_auto_boot(struct rproc *rproc)
>>   {
>>   	int ret;
>>   
>> -	/*
>> -	 * Since the remote processor is in a detached state, it has already
>> -	 * been booted by another entity.  As such there is no point in waiting
>> -	 * for a firmware image to be loaded, we can simply initiate the process
>> -	 * of attaching to it immediately.
>> -	 */
>> -	if (rproc->state == RPROC_DETACHED)
>> -		return rproc_boot(rproc);
>> +	if (rproc->state == RPROC_DETACHED) {
>> +		schedule_work(&rproc->attach_work);
>> +		return 0;
>> +	}
> 
> I think the change itself is reasonable to make "auto-attach" behavior
> consistent with "auto-boot". The commit message is a bit misleading
> though:
> 
>   - You're really doing two separate functional changes here:
> 
>     (1) Ignore the return value of rproc_boot() during auto-boot attach,
>         to keep the remoteproc registered and available in sysfs even if
>         attaching fails.
>     (2) Run the rproc_boot() in the background using schedule_work().
>         [To improve boot performance? To work around some locking issues?]
> 
>   - The actual issue you are seeing sounds like a use-after-free in the
>     remoteproc core error cleanup path. I think this one is still
>     present, we should really have a call to
>     cancel_work_sync(&rproc->crash_handler) as Dmitry wrote in the
>     previous discussion [1].
> 
> Thanks,
> Stephan
> 
> [1]: https://lore.kernel.org/all/ce24a2sgg4b6wymoxwgl2ve6np2nxn2wuxfqxfpmvqqrpvgouf@xihd6ziqwu4m/

Hi Stephan,

Exactly as you say, what this change do is allowing rproc_attach return false.
It should be okay to keep this change and describe it more clear in commit msg
in next version?

And the use-after-free issue is what we want to resolve in the patch2
in this series, I think cancel_work_sync() is a reasonable change
but it cannot resolve this issue as the worker could be executing when
we call this(and this is what it behaves when I did local test) and
the use-after-free issue still exists. Shall we send a separate patch
for this cancel_work_sync?

Thanks,
Jingyi





^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path
  2026-04-14  3:41     ` Jingyi Wang
@ 2026-04-14  8:13       ` Stephan Gerhold
  0 siblings, 0 replies; 9+ messages in thread
From: Stephan Gerhold @ 2026-04-14  8:13 UTC (permalink / raw)
  To: Jingyi Wang
  Cc: Bjorn Andersson, Mathieu Poirier, aiqun.yu, tingwei.zhang,
	trilok.soni, yijie.yang, linux-remoteproc, linux-kernel,
	linux-arm-msm, Bartosz Golaszewski, Dmitry Baryshkov

On Tue, Apr 14, 2026 at 11:41:39AM +0800, Jingyi Wang wrote:
> 
> 
> On 4/10/2026 10:28 PM, Stephan Gerhold wrote:
> > +Cc Bartosz, Dmitry
> > 
> > On Thu, Apr 09, 2026 at 01:46:21AM -0700, Jingyi Wang wrote:
> > > For rproc with state RPROC_DETACHED and auto_boot enabled, the attach
> > > callback will be called in the rproc_add()->rproc_trigger_auto_boot()->
> > > rproc_boot() path, the failure in this path will cause the rproc_add()
> > > fail and the resource release, which will cause issue like rproc recovery
> > > or falling back to firmware load fail. Add attach_work for rproc and call
> > > it asynchronously in rproc_add() path like what rproc_start() do.
> > > 
> > > Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
> > > ---
> > >   drivers/remoteproc/remoteproc_core.c | 20 ++++++++++++--------
> > >   include/linux/remoteproc.h           |  1 +
> > >   2 files changed, 13 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > > index b087ed21858a..f02db1113fae 100644
> > > --- a/drivers/remoteproc/remoteproc_core.c
> > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > @@ -1673,18 +1673,21 @@ static void rproc_auto_boot_callback(const struct firmware *fw, void *context)
> > >   	release_firmware(fw);
> > >   }
> > > +static void rproc_attach_work(struct work_struct *work)
> > > +{
> > > +	struct rproc *rproc = container_of(work, struct rproc, attach_work);
> > > +
> > > +	rproc_boot(rproc);
> > > +}
> > > +
> > >   static int rproc_trigger_auto_boot(struct rproc *rproc)
> > >   {
> > >   	int ret;
> > > -	/*
> > > -	 * Since the remote processor is in a detached state, it has already
> > > -	 * been booted by another entity.  As such there is no point in waiting
> > > -	 * for a firmware image to be loaded, we can simply initiate the process
> > > -	 * of attaching to it immediately.
> > > -	 */
> > > -	if (rproc->state == RPROC_DETACHED)
> > > -		return rproc_boot(rproc);
> > > +	if (rproc->state == RPROC_DETACHED) {
> > > +		schedule_work(&rproc->attach_work);
> > > +		return 0;
> > > +	}
> > 
> > I think the change itself is reasonable to make "auto-attach" behavior
> > consistent with "auto-boot". The commit message is a bit misleading
> > though:
> > 
> >   - You're really doing two separate functional changes here:
> > 
> >     (1) Ignore the return value of rproc_boot() during auto-boot attach,
> >         to keep the remoteproc registered and available in sysfs even if
> >         attaching fails.
> >     (2) Run the rproc_boot() in the background using schedule_work().
> >         [To improve boot performance? To work around some locking issues?]
> > 
> >   - The actual issue you are seeing sounds like a use-after-free in the
> >     remoteproc core error cleanup path. I think this one is still
> >     present, we should really have a call to
> >     cancel_work_sync(&rproc->crash_handler) as Dmitry wrote in the
> >     previous discussion [1].
> > 
> > Thanks,
> > Stephan
> > 
> > [1]: https://lore.kernel.org/all/ce24a2sgg4b6wymoxwgl2ve6np2nxn2wuxfqxfpmvqqrpvgouf@xihd6ziqwu4m/
> 
> Hi Stephan,
> 
> Exactly as you say, what this change do is allowing rproc_attach return false.
> It should be okay to keep this change and describe it more clear in commit msg
> in next version?
> 

That's fine for me.

> And the use-after-free issue is what we want to resolve in the patch2
> in this series, I think cancel_work_sync() is a reasonable change
> but it cannot resolve this issue as the worker could be executing when
> we call this(and this is what it behaves when I did local test) and
> the use-after-free issue still exists. Shall we send a separate patch
> for this cancel_work_sync?
> 

cancel_work_sync() should wait until the worker execution has finished.
If you call it before freeing the resources (= deleting the remoteproc),
I would expect it should work as expected.

It makes sense to have separate patches for this, one is about fixing
the use-after-free issue, the other is more about the behavior when the
initial auto-boot fails.

Thanks,
Stephan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop()
  2026-04-14  3:23     ` Jingyi Wang
@ 2026-04-14  8:27       ` Stephan Gerhold
  0 siblings, 0 replies; 9+ messages in thread
From: Stephan Gerhold @ 2026-04-14  8:27 UTC (permalink / raw)
  To: Jingyi Wang
  Cc: Bjorn Andersson, Mathieu Poirier, aiqun.yu, tingwei.zhang,
	trilok.soni, yijie.yang, linux-remoteproc, linux-kernel,
	linux-arm-msm

On Tue, Apr 14, 2026 at 11:23:50AM +0800, Jingyi Wang wrote:
> On 4/10/2026 10:15 PM, Stephan Gerhold wrote:
> > On Thu, Apr 09, 2026 at 01:46:22AM -0700, Jingyi Wang wrote:
> > > For rproc that doing attach, glink_subdev_start() is called only when
> > > attach successfully. If rproc_report_crash() is called in the attach
> > > function, rproc_boot_recovery()->rproc_stop()->glink_subdev_stop() could
> > > be called and cause NULL pointer dereference:
> > > 
> > >   Unable to handle kernel NULL pointer dereference at virtual address 0000000000000300
> > >   Mem abort info:
> > >   ...
> > >   pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
> > >   lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
> > >   ...
> > >   Call trace:
> > >    qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] (P)
> > >    glink_subdev_stop+0x1c/0x30 [qcom_common]
> > >    rproc_stop+0x58/0x17c
> > >    rproc_trigger_recovery+0xb0/0x150
> > >    rproc_crash_handler_work+0xa4/0xc4
> > >    process_scheduled_works+0x18c/0x2d8
> > >    worker_thread+0x144/0x280
> > >    kthread+0x124/0x138
> > >    ret_from_fork+0x10/0x20
> > >   Code: a9be7bfd 910003fd a90153f3 aa0003f3 (b9430000)
> > >   ---[ end trace 0000000000000000 ]---
> > > 
> > > Add NULL pointer check in the glink_subdev_stop() to make sure
> > > qcom_glink_smem_unregister() will not be called if glink_subdev_start()
> > > is not called.
> > > 
> > 
> > You mention the actual root problem here: Why is glink_subdev_stop()
> > called if glink_subdev_start() wasn't called?
> > 
> > The call to rproc_start_subdevices() in __rproc_attach() makes sure that
> > all subdevices are in consistent state when exiting the function (either
> > prepared+started or stopped+unprepared). Only if all subdevices were
> > started successfully, the rproc->state is changed to RPROC_ATTACHED.
> > 
> > In your case, attaching the rproc failed so the rproc->state should be
> > still RPROC_DETACHED. All subdevices should be stopped+unprepared. We
> > shouldn't stop/unprepare any subdevices again in this state, they all
> > might crash like glink does here.
> > 
> > We know that subdevices are already stopped+unprepared in RPROC_DETACHED
> > state, so I think you just need to skip rproc_stop_subdevices() and
> > rproc_unprepare_subdevices() inside rproc_stop() in this case, see diff
> > below.
> > 
> > @@ -1708,8 +1709,9 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
> >   	if (!rproc->ops->stop)
> >   		return -EINVAL;
> > -	/* Stop any subdevices for the remote processor */
> > -	rproc_stop_subdevices(rproc, crashed);
> > +	/* Stop any subdevices for the remote processor if it was attached */
> > +	if (rproc->state != RPROC_DETACHED)
> > +		rproc_stop_subdevices(rproc, crashed);
> >   	/* the installed resource table is no longer accessible */
> >   	ret = rproc_reset_rsc_table_on_stop(rproc);
> > @@ -1726,7 +1728,8 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
> >   		return ret;
> >   	}
> > -	rproc_unprepare_subdevices(rproc);
> > +	if (rproc->state != RPROC_DETACHED)
> > +		rproc_unprepare_subdevices(rproc);
> >   	rproc->state = RPROC_OFFLINE;
> 
> In this case, rproc_crash_handler_work()->rproc_trigger_recovery()->
> rproc_boot_recovery()->rproc_stop()->glink_subdev_stop() is called,
> "rproc->state = RPROC_CRASHED" is set in the rproc_crash_handler_work
> before rproc_boot_recovery is called, so checking RPROC_DETACHED can
> not work for this case.
> 

You're right, I forgot about that. I think we need a more generic
solution for this though. rproc_stop_subdevices() should not be called
without a prior call to rproc_start_subdevices().

I think there are a couple of options for this:

 - Add a bool "subdevs_started" to struct rproc and manage that
   separately from the rproc->state.

 - Track the rproc state before the crash separately (something like
   rproc->state_before_crash) and check that in the stop path.

 - Add a new state RPROC_CRASHED_DETACHED to make sure the states are
   unique.

 - ...

Does the same issue also exist in qcom_pas_stop() of "[PATCH v5 4/5]
remoteproc: qcom: pas: Add late attach support for subsystems" [1]?
There you check for pas->rproc->state != RPROC_ATTACHED, wouldn't this
also fail for the RPROC_CRASHED case?

Thanks,
Stephan

[1]: https://lore.kernel.org/linux-arm-msm/20260409-knp-soccp-v5-4-805a492124da@oss.qualcomm.com/

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-04-14  8:27 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-09  8:46 [PATCH 0/2] remoteproc: improve robustness for rproc_attach fail cases Jingyi Wang
2026-04-09  8:46 ` [PATCH 1/2] remoteproc: core: Attach rproc asynchronously in rproc_add() path Jingyi Wang
2026-04-10 14:28   ` Stephan Gerhold
2026-04-14  3:41     ` Jingyi Wang
2026-04-14  8:13       ` Stephan Gerhold
2026-04-09  8:46 ` [PATCH 2/2] remoteproc: qcom: Check glink->edge in glink_subdev_stop() Jingyi Wang
2026-04-10 14:15   ` Stephan Gerhold
2026-04-14  3:23     ` Jingyi Wang
2026-04-14  8:27       ` Stephan Gerhold

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox