[PATCH 0/3] remoteproc: xlnx: remote crash recovery

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] remoteproc: xlnx: remote crash recovery
@ 2025-10-28  4:57 Tanmay Shah
  2025-10-28  4:57 ` [PATCH 1/3] remoteproc: xlnx: enable boot recovery Tanmay Shah
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Tanmay Shah @ 2025-10-28  4:57 UTC (permalink / raw)
  To: andersson, mathieu.poirier; +Cc: linux-remoteproc, linux-kernel, Tanmay Shah

Remote processor can crash or hang during normal execution. Linux
remoteproc framework supports different mechanisms to recover the
remote processor and re-establish the RPMsg communication in such case.

Crash reporting:

1) Using debugfs node

User can report the crash to the core framework via debugfs node using
following command:

echo 1 > /sys/kernel/debug/remoteproc/remoteproc0/crash

2) Remoteproc notify to the host about crash state and crash reason
via the resource table

This is a platform specific method where the remote firmware contains
vendor specific resource to update the crash state and the crash
reason. Then the remote notifies the crash to the host via mailbox
notification. The host then will check this resource on every mbox
notification and reports the crash to the core framework if needed.

Crash recovery mechanism:

There are two mechanisms available to recover the remote processor from
the crash. 1) boot recovery, 2) attach on recovery

Remoteproc core framework will choose proper mechanism based on the
rproc features set by the platform driver.

1) Boot recovery

This is the default mechanism to recover the remote processor.
In this method core framework will first stop the remote processor,
load the firmware again and then starts the remote processor. On
AMD-Xilinx platforms this method is supported. The coredump callback in
the platform driver isn't implemented so far, but that shouldn't cause
the recovery failure.

2) Attach on recovery

If RPROC_ATTACH_ON_RECOVERY feature is enabled by the platform driver,
then the core framework will choose this method for recovery.

On zynqmp platform following is the sequence of events expected during
remoteproc crash and attach on recovery:

a) rproc attach/detach flow is working, and RPMsg comm is established
b) Remote processor (RPU) crashed (crash not reported yet)
c) Platform management controller stops and reloads elf on inactive
   remote processor before reboot
d) platform management controller reboots the remote processor
e) Remote processor boots again, and detects previous crash (platform
   specific mechanism to detect the crash)
f) Remote processor Reports crash to the Linux (Host) and wait for
   the recovery.
g) Linux performs full detach and reattach to remote processor.
h) Normal RPMsg communication is established.

It is required to destroy all RPMsg related resource and re-create them
during recovery to establish successful RPMsg communication. To achieve
this complete rproc_detach followed by rproc_attach calls are needed.

Tanmay Shah (3):
  remoteproc: xlnx: enable boot recovery
  remoteproc: core: full attach detach during recovery
  remoteproc: xlnx: add crash detection mechanism

 drivers/remoteproc/remoteproc_core.c    | 22 +++++----
 drivers/remoteproc/xlnx_r5_remoteproc.c | 66 ++++++++++++++++++++++++-
 2 files changed, 77 insertions(+), 11 deletions(-)

base-commit: 5f4d69f0ef4f09cd926d193fe0df0c84d53e8252
-- 
2.34.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/3] remoteproc: xlnx: enable boot recovery
  2025-10-28  4:57 [PATCH 0/3] remoteproc: xlnx: remote crash recovery Tanmay Shah
@ 2025-10-28  4:57 ` Tanmay Shah
  2025-10-28  4:57 ` [PATCH 2/3] remoteproc: core: full attach detach during recovery Tanmay Shah
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Tanmay Shah @ 2025-10-28  4:57 UTC (permalink / raw)
  To: andersson, mathieu.poirier; +Cc: linux-remoteproc, linux-kernel, Tanmay Shah

This is the default method to recover the remote processor from crash.
During this recovery the Linux will stop the remote, load the same
firmware again and start the remote processor. As of now, coredump
callback doest not contain any useful implementation, but this can be
changed as required

Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
---
 drivers/remoteproc/xlnx_r5_remoteproc.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c b/drivers/remoteproc/xlnx_r5_remoteproc.c
index 0b7b173d0d26..8677b732ad14 100644
--- a/drivers/remoteproc/xlnx_r5_remoteproc.c
+++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
@@ -891,6 +891,11 @@ static int zynqmp_r5_detach(struct rproc *rproc)
 	return 0;
 }
 
+static void zynqmp_r5_coredump(struct rproc *rproc)
+{
+	(void)rproc;
+}
+
 static const struct rproc_ops zynqmp_r5_rproc_ops = {
 	.prepare	= zynqmp_r5_rproc_prepare,
 	.unprepare	= zynqmp_r5_rproc_unprepare,
@@ -905,6 +910,7 @@ static const struct rproc_ops zynqmp_r5_rproc_ops = {
 	.get_loaded_rsc_table = zynqmp_r5_get_loaded_rsc_table,
 	.attach		= zynqmp_r5_attach,
 	.detach		= zynqmp_r5_detach,
+	.coredump	= zynqmp_r5_coredump,
 };
 
 /**
@@ -938,7 +944,7 @@ static struct zynqmp_r5_core *zynqmp_r5_add_rproc_core(struct device *cdev)
 
 	rproc_coredump_set_elf_info(r5_rproc, ELFCLASS32, EM_ARM);
 
-	r5_rproc->recovery_disabled = true;
+	r5_rproc->recovery_disabled = false;
 	r5_rproc->has_iommu = false;
 	r5_rproc->auto_boot = false;
 	r5_core = r5_rproc->priv;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/3] remoteproc: core: full attach detach during recovery
  2025-10-28  4:57 [PATCH 0/3] remoteproc: xlnx: remote crash recovery Tanmay Shah
  2025-10-28  4:57 ` [PATCH 1/3] remoteproc: xlnx: enable boot recovery Tanmay Shah
@ 2025-10-28  4:57 ` Tanmay Shah
  2025-10-29 22:49   ` Iuliana Prodan
  2025-11-02  8:54   ` Zhongqiu Han
  2025-10-28  4:57 ` [PATCH 3/3] remoteproc: xlnx: add crash detection mechanism Tanmay Shah
  2025-10-29  3:24 ` [PATCH 0/3] remoteproc: xlnx: remote crash recovery Peng Fan
  3 siblings, 2 replies; 17+ messages in thread
From: Tanmay Shah @ 2025-10-28  4:57 UTC (permalink / raw)
  To: andersson, mathieu.poirier; +Cc: linux-remoteproc, linux-kernel, Tanmay Shah

Current attach on recovery mechanism loads the clean resource table
during recovery, but doesn't re-allocate the resources. RPMsg
communication will fail after recovery due to this. Fix this
incorrect behavior by doing the full detach and attach of remote
processor during the recovery. This will load the clean resource table
and re-allocate all the resources, which will set up correct vring
information in the resource table.

Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
---
 drivers/remoteproc/remoteproc_core.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index aada2780b343..f5b078fe056a 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -1777,11 +1777,11 @@ static int rproc_attach_recovery(struct rproc *rproc)
 {
 	int ret;
 
-	ret = __rproc_detach(rproc);
+	ret = rproc_detach(rproc);
 	if (ret)
 		return ret;
 
-	return __rproc_attach(rproc);
+	return rproc_attach(rproc);
 }
 
 static int rproc_boot_recovery(struct rproc *rproc)
@@ -1829,6 +1829,9 @@ int rproc_trigger_recovery(struct rproc *rproc)
 	struct device *dev = &rproc->dev;
 	int ret;
 
+	if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
+		return rproc_attach_recovery(rproc);
+
 	ret = mutex_lock_interruptible(&rproc->lock);
 	if (ret)
 		return ret;
@@ -1839,10 +1842,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
 
 	dev_err(dev, "recovering %s\n", rproc->name);
 
-	if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
-		ret = rproc_attach_recovery(rproc);
-	else
-		ret = rproc_boot_recovery(rproc);
+	ret = rproc_boot_recovery(rproc);
 
 unlock_mutex:
 	mutex_unlock(&rproc->lock);
@@ -1860,6 +1860,7 @@ static void rproc_crash_handler_work(struct work_struct *work)
 {
 	struct rproc *rproc = container_of(work, struct rproc, crash_handler);
 	struct device *dev = &rproc->dev;
+	int ret;
 
 	dev_dbg(dev, "enter %s\n", __func__);
 
@@ -1883,8 +1884,11 @@ static void rproc_crash_handler_work(struct work_struct *work)
 
 	mutex_unlock(&rproc->lock);
 
-	if (!rproc->recovery_disabled)
-		rproc_trigger_recovery(rproc);
+	if (!rproc->recovery_disabled) {
+		ret = rproc_trigger_recovery(rproc);
+		if (ret)
+			dev_warn(dev, "rproc recovery failed, err %d\n", ret);
+	}
 
 out:
 	pm_relax(rproc->dev.parent);
@@ -2057,7 +2061,7 @@ int rproc_detach(struct rproc *rproc)
 		return ret;
 	}
 
-	if (rproc->state != RPROC_ATTACHED) {
+	if (rproc->state != RPROC_ATTACHED && rproc->state != RPROC_CRASHED) {
 		ret = -EINVAL;
 		goto out;
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/3] remoteproc: xlnx: add crash detection mechanism
  2025-10-28  4:57 [PATCH 0/3] remoteproc: xlnx: remote crash recovery Tanmay Shah
  2025-10-28  4:57 ` [PATCH 1/3] remoteproc: xlnx: enable boot recovery Tanmay Shah
  2025-10-28  4:57 ` [PATCH 2/3] remoteproc: core: full attach detach during recovery Tanmay Shah
@ 2025-10-28  4:57 ` Tanmay Shah
  2025-10-29  3:24 ` [PATCH 0/3] remoteproc: xlnx: remote crash recovery Peng Fan
  3 siblings, 0 replies; 17+ messages in thread
From: Tanmay Shah @ 2025-10-28  4:57 UTC (permalink / raw)
  To: andersson, mathieu.poirier; +Cc: linux-remoteproc, linux-kernel, Tanmay Shah

Remote processor will report the crash reason via the resource table
and notify the host via kick. The host checks this crash reason on
every kick notification from the remote and report to the core
framework. Then the rproc core framework will start the recovery
process.

Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
---
 drivers/remoteproc/xlnx_r5_remoteproc.c | 58 ++++++++++++++++++++++++-
 1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c b/drivers/remoteproc/xlnx_r5_remoteproc.c
index 8677b732ad14..9c6b5f751824 100644
--- a/drivers/remoteproc/xlnx_r5_remoteproc.c
+++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
@@ -108,6 +108,10 @@ struct rsc_tbl_data {
 	const uintptr_t rsc_tbl;
 } __packed;
 
+enum fw_vendor_rsc {
+	FW_RSC_VENDOR_CRASH_REASON = RSC_VENDOR_START,
+};
+
 /*
  * Hardcoded TCM bank values. This will stay in driver to maintain backward
  * compatibility with device-tree that does not have TCM information.
@@ -127,9 +131,21 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
 	{0xffe30000UL, 0x30000, 0x10000UL, PD_R5_1_BTCM, "btcm1"},
 };
 
+/**
+ * struct xlnx_rproc_crash_report - resource to know crash status and reason
+ *
+ * @crash_state: if true, the rproc is notifying crash, time to recover
+ * @crash_reason: reason of crash
+ */
+struct xlnx_rproc_crash_report {
+	u32 crash_state;
+	u32 crash_reason;
+} __packed;
+
 /**
  * struct zynqmp_r5_core - remoteproc core's internal data
  *
+ * @crash_report: rproc crash state and reason
  * @rsc_tbl_va: resource table virtual address
  * @sram: Array of sram memories assigned to this core
  * @num_sram: number of sram for this core
@@ -143,6 +159,7 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
  * @ipi: pointer to mailbox information
  */
 struct zynqmp_r5_core {
+	struct xlnx_rproc_crash_report *crash_report;
 	void __iomem *rsc_tbl_va;
 	struct zynqmp_sram_bank *sram;
 	int num_sram;
@@ -227,10 +244,14 @@ static void handle_event_notified(struct work_struct *work)
 static void zynqmp_r5_mb_rx_cb(struct mbox_client *cl, void *msg)
 {
 	struct zynqmp_ipi_message *ipi_msg, *buf_msg;
+	struct zynqmp_r5_core *r5_core;
+	struct rproc *rproc;
 	struct mbox_info *ipi;
 	size_t len;
 
 	ipi = container_of(cl, struct mbox_info, mbox_cl);
+	r5_core = ipi->r5_core;
+	rproc = r5_core->rproc;
 
 	/* copy data from ipi buffer to r5_core */
 	ipi_msg = (struct zynqmp_ipi_message *)msg;
@@ -244,6 +265,13 @@ static void zynqmp_r5_mb_rx_cb(struct mbox_client *cl, void *msg)
 	buf_msg->len = len;
 	memcpy(buf_msg->data, ipi_msg->data, len);
 
+	/* Check for crash only if rproc crash is expected */
+	if (rproc->state == RPROC_ATTACHED || rproc->state == RPROC_RUNNING) {
+		if (r5_core->crash_report->crash_state)
+			rproc_report_crash(rproc,
+					   r5_core->crash_report->crash_reason);
+	}
+
 	/* received and processed interrupt ack */
 	if (mbox_send_message(ipi->rx_chan, NULL) < 0)
 		dev_err(cl->dev, "ack failed to mbox rx_chan\n");
@@ -394,9 +422,14 @@ static int zynqmp_r5_rproc_start(struct rproc *rproc)
 
 	ret = zynqmp_pm_request_wake(r5_core->pm_domain_id, 1,
 				     bootmem, ZYNQMP_PM_REQUEST_ACK_NO);
-	if (ret)
+	if (ret) {
 		dev_err(r5_core->dev,
 			"failed to start RPU = 0x%x\n", r5_core->pm_domain_id);
+		return ret;
+	}
+
+	clear_bit(RPROC_FEAT_ATTACH_ON_RECOVERY, rproc->features);
+
 	return ret;
 }
 
@@ -874,6 +907,8 @@ static int zynqmp_r5_get_rsc_table_va(struct zynqmp_r5_core *r5_core)
 
 static int zynqmp_r5_attach(struct rproc *rproc)
 {
+	rproc_set_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY);
+
 	dev_dbg(&rproc->dev, "rproc %d attached\n", rproc->index);
 
 	return 0;
@@ -896,6 +931,26 @@ static void zynqmp_r5_coredump(struct rproc *rproc)
 	(void)rproc;
 }
 
+static int zynqmp_r5_handle_crash_rsc(struct rproc *rproc, void *rsc,
+				      int offset, int avail)
+{
+	struct zynqmp_r5_core *r5_core = rproc->priv;
+
+	r5_core->crash_report =
+		(struct xlnx_rproc_crash_report *)(r5_core->rsc_tbl_va + offset);
+
+	return RSC_HANDLED;
+}
+
+static int zynqmp_r5_handle_rsc(struct rproc *rproc, u32 rsc_type, void *rsc,
+				int offset, int avail)
+{
+	if (rsc_type == FW_RSC_VENDOR_CRASH_REASON)
+		return zynqmp_r5_handle_crash_rsc(rproc, rsc, offset, avail);
+
+	return RSC_IGNORED;
+}
+
 static const struct rproc_ops zynqmp_r5_rproc_ops = {
 	.prepare	= zynqmp_r5_rproc_prepare,
 	.unprepare	= zynqmp_r5_rproc_unprepare,
@@ -911,6 +966,7 @@ static const struct rproc_ops zynqmp_r5_rproc_ops = {
 	.attach		= zynqmp_r5_attach,
 	.detach		= zynqmp_r5_detach,
 	.coredump	= zynqmp_r5_coredump,
+	.handle_rsc	= zynqmp_r5_handle_rsc,
 };
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] remoteproc: xlnx: remote crash recovery
  2025-10-28  4:57 [PATCH 0/3] remoteproc: xlnx: remote crash recovery Tanmay Shah
                   ` (2 preceding siblings ...)
  2025-10-28  4:57 ` [PATCH 3/3] remoteproc: xlnx: add crash detection mechanism Tanmay Shah
@ 2025-10-29  3:24 ` Peng Fan
  2025-10-29  4:15   ` Tanmay Shah
  3 siblings, 1 reply; 17+ messages in thread
From: Peng Fan @ 2025-10-29  3:24 UTC (permalink / raw)
  To: Tanmay Shah; +Cc: andersson, mathieu.poirier, linux-remoteproc, linux-kernel

Hi Tanmay,

On Mon, Oct 27, 2025 at 09:57:28PM -0700, Tanmay Shah wrote:
>Remote processor can crash or hang during normal execution. Linux
>remoteproc framework supports different mechanisms to recover the
>remote processor and re-establish the RPMsg communication in such case.
>
>Crash reporting:
>
>1) Using debugfs node
>
>User can report the crash to the core framework via debugfs node using
>following command:
>
>echo 1 > /sys/kernel/debug/remoteproc/remoteproc0/crash
>
>2) Remoteproc notify to the host about crash state and crash reason
>via the resource table
>
>This is a platform specific method where the remote firmware contains
>vendor specific resource to update the crash state and the crash
>reason. Then the remote notifies the crash to the host via mailbox
>notification. The host then will check this resource on every mbox
>notification and reports the crash to the core framework if needed.
>
>Crash recovery mechanism:
>
>There are two mechanisms available to recover the remote processor from
>the crash. 1) boot recovery, 2) attach on recovery
>
>Remoteproc core framework will choose proper mechanism based on the
>rproc features set by the platform driver.
>
>1) Boot recovery
>
>This is the default mechanism to recover the remote processor.
>In this method core framework will first stop the remote processor,
>load the firmware again and then starts the remote processor. On
>AMD-Xilinx platforms this method is supported. The coredump callback in
>the platform driver isn't implemented so far, but that shouldn't cause
>the recovery failure.
>
>2) Attach on recovery
>
>If RPROC_ATTACH_ON_RECOVERY feature is enabled by the platform driver,
>then the core framework will choose this method for recovery.
>
>On zynqmp platform following is the sequence of events expected during
>remoteproc crash and attach on recovery:
>
>a) rproc attach/detach flow is working, and RPMsg comm is established
>b) Remote processor (RPU) crashed (crash not reported yet)
>c) Platform management controller stops and reloads elf on inactive
>   remote processor before reboot
>d) platform management controller reboots the remote processor
>e) Remote processor boots again, and detects previous crash (platform
>   specific mechanism to detect the crash)
>f) Remote processor Reports crash to the Linux (Host) and wait for
>   the recovery.
>g) Linux performs full detach and reattach to remote processor.
>h) Normal RPMsg communication is established.
>
>It is required to destroy all RPMsg related resource and re-create them
>during recovery to establish successful RPMsg communication. To achieve
>this complete rproc_detach followed by rproc_attach calls are needed.
>
>
>Tanmay Shah (3):
>  remoteproc: xlnx: enable boot recovery
>  remoteproc: core: full attach detach during recovery
>  remoteproc: xlnx: add crash detection mechanism
>

I gave a test on i.MX8QM-MEK, there are failures, 1st test pass, 2nd test fail.
Without this patch, I not see failures.
root@imx8qmmek:~#
remoteproc remoteproc0: crash detected in imx-rproc: type watchdog
Partition3 reset!
remoteproc remoteproc0: handling crash #1 in imx-rproc
remoteproc remoteproc0: detached remote processor imx-rproc
rproc-virtio rproc-virtio.1.auto: assigned reserved memory node vdevbuffer@90400000
virtio_rpmsg_bus virtio0: rpmsg host is online
rproc-virtio rproc-virtio.1.auto: registered virtio0 (type 7)
rproc-virtio rproc-virtio.2.auto: assigned reserved memory node vdevbuffer@90400000
virtio_rpmsg_bus virtio1: rpmsg host is online
rproc-virtio rproc-virtio.2.auto: registered virtio1 (type 7)
remoteproc remoteproc0: remote processor imx-rproc is now attached
virtio_rpmsg_bus virtio1: creating channel rpmsg-openamp-demo-channel addr 0x1e

remoteproc remoteproc0: crash detected in imx-rproc: type watchdog
Partition3 reset!
remoteproc remoteproc0: handling crash #2 in imx-rproc
rproc-virtio rproc-virtio.1.auto: assigned reserved memory node vdevbuffer@90400000
virtio_rpmsg_bus virtio4: probe with driver virtio_rpmsg_bus failed with error -12
rproc-virtio rproc-virtio.1.auto: registered virtio4 (type 7)
rproc-virtio rproc-virtio.2.auto: assigned reserved memory node vdevbuffer@90400000
virtio_rpmsg_bus virtio5: probe with driver virtio_rpmsg_bus failed with error -12
rproc-virtio rproc-virtio.2.auto: registered virtio5 (type 7)
rproc-virtio rproc-virtio.5.auto: assigned reserved memory node vdevbuffer@90400000
virtio_rpmsg_bus virtio6: probe with driver virtio_rpmsg_bus failed with error -12
rproc-virtio rproc-virtio.5.auto: registered virtio6 (type 7)
rproc-virtio rproc-virtio.6.auto: assigned reserved memory node vdevbuffer@90400000
virtio_rpmsg_bus virtio7: probe with driver virtio_rpmsg_bus failed with error -12
rproc-virtio rproc-virtio.6.auto: registered virtio7 (type 7)
remoteproc remoteproc0: remote processor imx-rproc is now attached

Thanks,
Peng

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] remoteproc: xlnx: remote crash recovery
  2025-10-29  3:24 ` [PATCH 0/3] remoteproc: xlnx: remote crash recovery Peng Fan
@ 2025-10-29  4:15   ` Tanmay Shah
  2025-10-29 23:51     ` Tanmay Shah
  0 siblings, 1 reply; 17+ messages in thread
From: Tanmay Shah @ 2025-10-29  4:15 UTC (permalink / raw)
  To: Peng Fan; +Cc: andersson, mathieu.poirier, linux-remoteproc, linux-kernel



On 10/28/25 10:24 PM, Peng Fan wrote:
> Hi Tanmay,
> 
> On Mon, Oct 27, 2025 at 09:57:28PM -0700, Tanmay Shah wrote:
>> Remote processor can crash or hang during normal execution. Linux
>> remoteproc framework supports different mechanisms to recover the
>> remote processor and re-establish the RPMsg communication in such case.
>>
>> Crash reporting:
>>
>> 1) Using debugfs node
>>
>> User can report the crash to the core framework via debugfs node using
>> following command:
>>
>> echo 1 > /sys/kernel/debug/remoteproc/remoteproc0/crash
>>
>> 2) Remoteproc notify to the host about crash state and crash reason
>> via the resource table
>>
>> This is a platform specific method where the remote firmware contains
>> vendor specific resource to update the crash state and the crash
>> reason. Then the remote notifies the crash to the host via mailbox
>> notification. The host then will check this resource on every mbox
>> notification and reports the crash to the core framework if needed.
>>
>> Crash recovery mechanism:
>>
>> There are two mechanisms available to recover the remote processor from
>> the crash. 1) boot recovery, 2) attach on recovery
>>
>> Remoteproc core framework will choose proper mechanism based on the
>> rproc features set by the platform driver.
>>
>> 1) Boot recovery
>>
>> This is the default mechanism to recover the remote processor.
>> In this method core framework will first stop the remote processor,
>> load the firmware again and then starts the remote processor. On
>> AMD-Xilinx platforms this method is supported. The coredump callback in
>> the platform driver isn't implemented so far, but that shouldn't cause
>> the recovery failure.
>>
>> 2) Attach on recovery
>>
>> If RPROC_ATTACH_ON_RECOVERY feature is enabled by the platform driver,
>> then the core framework will choose this method for recovery.
>>
>> On zynqmp platform following is the sequence of events expected during
>> remoteproc crash and attach on recovery:
>>
>> a) rproc attach/detach flow is working, and RPMsg comm is established
>> b) Remote processor (RPU) crashed (crash not reported yet)
>> c) Platform management controller stops and reloads elf on inactive
>>    remote processor before reboot
>> d) platform management controller reboots the remote processor
>> e) Remote processor boots again, and detects previous crash (platform
>>    specific mechanism to detect the crash)
>> f) Remote processor Reports crash to the Linux (Host) and wait for
>>    the recovery.
>> g) Linux performs full detach and reattach to remote processor.
>> h) Normal RPMsg communication is established.
>>
>> It is required to destroy all RPMsg related resource and re-create them
>> during recovery to establish successful RPMsg communication. To achieve
>> this complete rproc_detach followed by rproc_attach calls are needed.
>>
>>
>> Tanmay Shah (3):
>>   remoteproc: xlnx: enable boot recovery
>>   remoteproc: core: full attach detach during recovery
>>   remoteproc: xlnx: add crash detection mechanism
>>
> 
> I gave a test on i.MX8QM-MEK, there are failures, 1st test pass, 2nd test fail.
> Without this patch, I not see failures.
> root@imx8qmmek:~#
> remoteproc remoteproc0: crash detected in imx-rproc: type watchdog
> Partition3 reset!
> remoteproc remoteproc0: handling crash #1 in imx-rproc
> remoteproc remoteproc0: detached remote processor imx-rproc
> rproc-virtio rproc-virtio.1.auto: assigned reserved memory node vdevbuffer@90400000
> virtio_rpmsg_bus virtio0: rpmsg host is online
> rproc-virtio rproc-virtio.1.auto: registered virtio0 (type 7)
> rproc-virtio rproc-virtio.2.auto: assigned reserved memory node vdevbuffer@90400000
> virtio_rpmsg_bus virtio1: rpmsg host is online
> rproc-virtio rproc-virtio.2.auto: registered virtio1 (type 7)
> remoteproc remoteproc0: remote processor imx-rproc is now attached
> virtio_rpmsg_bus virtio1: creating channel rpmsg-openamp-demo-channel addr 0x1e
> 
> remoteproc remoteproc0: crash detected in imx-rproc: type watchdog
> Partition3 reset!
> remoteproc remoteproc0: handling crash #2 in imx-rproc
> rproc-virtio rproc-virtio.1.auto: assigned reserved memory node vdevbuffer@90400000
> virtio_rpmsg_bus virtio4: probe with driver virtio_rpmsg_bus failed with error -12
> rproc-virtio rproc-virtio.1.auto: registered virtio4 (type 7)
> rproc-virtio rproc-virtio.2.auto: assigned reserved memory node vdevbuffer@90400000
> virtio_rpmsg_bus virtio5: probe with driver virtio_rpmsg_bus failed with error -12
> rproc-virtio rproc-virtio.2.auto: registered virtio5 (type 7)
> rproc-virtio rproc-virtio.5.auto: assigned reserved memory node vdevbuffer@90400000
> virtio_rpmsg_bus virtio6: probe with driver virtio_rpmsg_bus failed with error -12
> rproc-virtio rproc-virtio.5.auto: registered virtio6 (type 7)
> rproc-virtio rproc-virtio.6.auto: assigned reserved memory node vdevbuffer@90400000
> virtio_rpmsg_bus virtio7: probe with driver virtio_rpmsg_bus failed with error -12
> rproc-virtio rproc-virtio.6.auto: registered virtio7 (type 7)
> remoteproc remoteproc0: remote processor imx-rproc is now attached
> 

Hi Peng,

I don't understand why it should fail. The patch simply implements 
rproc_detach() -> rproc_attach() sequence.

In your case, when you do detach -> attach via sysfs that sequence 
works? If that works, then crash recovery should work as well.

Could you give steps how do you generate the crash?

Thanks,
Tanmay

> Thanks,
> Peng


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] remoteproc: core: full attach detach during recovery
  2025-10-28  4:57 ` [PATCH 2/3] remoteproc: core: full attach detach during recovery Tanmay Shah
@ 2025-10-29 22:49   ` Iuliana Prodan
  2025-10-29 23:41     ` Tanmay Shah
  2025-11-02  8:54   ` Zhongqiu Han
  1 sibling, 1 reply; 17+ messages in thread
From: Iuliana Prodan @ 2025-10-29 22:49 UTC (permalink / raw)
  To: Tanmay Shah, andersson, mathieu.poirier; +Cc: linux-remoteproc, linux-kernel

Hi Tanmay,

On 10/28/2025 6:57 AM, Tanmay Shah wrote:
> Current attach on recovery mechanism loads the clean resource table
> during recovery, but doesn't re-allocate the resources. RPMsg
> communication will fail after recovery due to this. Fix this
> incorrect behavior by doing the full detach and attach of remote
> processor during the recovery. This will load the clean resource table
> and re-allocate all the resources, which will set up correct vring
> information in the resource table.
> 
> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
> ---
>   drivers/remoteproc/remoteproc_core.c | 22 +++++++++++++---------
>   1 file changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index aada2780b343..f5b078fe056a 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -1777,11 +1777,11 @@ static int rproc_attach_recovery(struct rproc *rproc)
>   {
>   	int ret;
>   
> -	ret = __rproc_detach(rproc);
> +	ret = rproc_detach(rproc);
>   	if (ret)
>   		return ret;
>   
> -	return __rproc_attach(rproc);
> +	return rproc_attach(rproc);
>   }
>   
>   static int rproc_boot_recovery(struct rproc *rproc)
> @@ -1829,6 +1829,9 @@ int rproc_trigger_recovery(struct rproc *rproc)
>   	struct device *dev = &rproc->dev;
>   	int ret;
>   
> +	if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
> +		return rproc_attach_recovery(rproc);
> +
>   	ret = mutex_lock_interruptible(&rproc->lock);
>   	if (ret)
>   		return ret;
> @@ -1839,10 +1842,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
>   
>   	dev_err(dev, "recovering %s\n", rproc->name);

Please move the log message above the new early return so both paths log 
recovery.
>   
> -	if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
> -		ret = rproc_attach_recovery(rproc);
> -	else
> -		ret = rproc_boot_recovery(rproc);
> +	ret = rproc_boot_recovery(rproc);
>   
>   unlock_mutex:
>   	mutex_unlock(&rproc->lock);
> @@ -1860,6 +1860,7 @@ static void rproc_crash_handler_work(struct work_struct *work)
>   {
>   	struct rproc *rproc = container_of(work, struct rproc, crash_handler);
>   	struct device *dev = &rproc->dev;
> +	int ret;
>   
>   	dev_dbg(dev, "enter %s\n", __func__);
>   
> @@ -1883,8 +1884,11 @@ static void rproc_crash_handler_work(struct work_struct *work)
>   
>   	mutex_unlock(&rproc->lock);
>   
> -	if (!rproc->recovery_disabled)
> -		rproc_trigger_recovery(rproc);
> +	if (!rproc->recovery_disabled) {
> +		ret = rproc_trigger_recovery(rproc);
> +		if (ret)
> +			dev_warn(dev, "rproc recovery failed, err %d\n", ret);
> +	}
>   
>   out:
>   	pm_relax(rproc->dev.parent);
> @@ -2057,7 +2061,7 @@ int rproc_detach(struct rproc *rproc)
>   		return ret;
>   	}
>   
> -	if (rproc->state != RPROC_ATTACHED) {
> +	if (rproc->state != RPROC_ATTACHED && rproc->state != RPROC_CRASHED) {
>   		ret = -EINVAL;
>   		goto out;
>   	}
Tested this on i.MX8M Plus using the imx_dsp_rproc driver, which 
supports recovery.
Everything looks good, but on imx_dsp_rproc we use rproc_boot_recovery, 
not rproc_attach_recovery, where most of the changes happened.

Iulia

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] remoteproc: core: full attach detach during recovery
  2025-10-29 22:49   ` Iuliana Prodan
@ 2025-10-29 23:41     ` Tanmay Shah
  0 siblings, 0 replies; 17+ messages in thread
From: Tanmay Shah @ 2025-10-29 23:41 UTC (permalink / raw)
  To: Iuliana Prodan, andersson, mathieu.poirier; +Cc: linux-remoteproc, linux-kernel



On 10/29/25 5:49 PM, Iuliana Prodan wrote:
> Hi Tanmay,
> 
> On 10/28/2025 6:57 AM, Tanmay Shah wrote:
>> Current attach on recovery mechanism loads the clean resource table
>> during recovery, but doesn't re-allocate the resources. RPMsg
>> communication will fail after recovery due to this. Fix this
>> incorrect behavior by doing the full detach and attach of remote
>> processor during the recovery. This will load the clean resource table
>> and re-allocate all the resources, which will set up correct vring
>> information in the resource table.
>>
>> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
>> ---
>>   drivers/remoteproc/remoteproc_core.c | 22 +++++++++++++---------
>>   1 file changed, 13 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/ 
>> remoteproc/remoteproc_core.c
>> index aada2780b343..f5b078fe056a 100644
>> --- a/drivers/remoteproc/remoteproc_core.c
>> +++ b/drivers/remoteproc/remoteproc_core.c
>> @@ -1777,11 +1777,11 @@ static int rproc_attach_recovery(struct rproc 
>> *rproc)
>>   {
>>       int ret;
>> -    ret = __rproc_detach(rproc);
>> +    ret = rproc_detach(rproc);
>>       if (ret)
>>           return ret;
>> -    return __rproc_attach(rproc);
>> +    return rproc_attach(rproc);
>>   }
>>   static int rproc_boot_recovery(struct rproc *rproc)
>> @@ -1829,6 +1829,9 @@ int rproc_trigger_recovery(struct rproc *rproc)
>>       struct device *dev = &rproc->dev;
>>       int ret;
>> +    if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
>> +        return rproc_attach_recovery(rproc);
>> +
>>       ret = mutex_lock_interruptible(&rproc->lock);
>>       if (ret)
>>           return ret;
>> @@ -1839,10 +1842,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
>>       dev_err(dev, "recovering %s\n", rproc->name);
> 
> Please move the log message above the new early return so both paths log 
> recovery.
>> -    if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
>> -        ret = rproc_attach_recovery(rproc);
>> -    else
>> -        ret = rproc_boot_recovery(rproc);
>> +    ret = rproc_boot_recovery(rproc);
>>   unlock_mutex:
>>       mutex_unlock(&rproc->lock);
>> @@ -1860,6 +1860,7 @@ static void rproc_crash_handler_work(struct 
>> work_struct *work)
>>   {
>>       struct rproc *rproc = container_of(work, struct rproc, 
>> crash_handler);
>>       struct device *dev = &rproc->dev;
>> +    int ret;
>>       dev_dbg(dev, "enter %s\n", __func__);
>> @@ -1883,8 +1884,11 @@ static void rproc_crash_handler_work(struct 
>> work_struct *work)
>>       mutex_unlock(&rproc->lock);
>> -    if (!rproc->recovery_disabled)
>> -        rproc_trigger_recovery(rproc);
>> +    if (!rproc->recovery_disabled) {
>> +        ret = rproc_trigger_recovery(rproc);
>> +        if (ret)
>> +            dev_warn(dev, "rproc recovery failed, err %d\n", ret);
>> +    }
>>   out:
>>       pm_relax(rproc->dev.parent);
>> @@ -2057,7 +2061,7 @@ int rproc_detach(struct rproc *rproc)
>>           return ret;
>>       }
>> -    if (rproc->state != RPROC_ATTACHED) {
>> +    if (rproc->state != RPROC_ATTACHED && rproc->state != 
>> RPROC_CRASHED) {
>>           ret = -EINVAL;
>>           goto out;
>>       }
> Tested this on i.MX8M Plus using the imx_dsp_rproc driver, which 
> supports recovery.
> Everything looks good, but on imx_dsp_rproc we use rproc_boot_recovery, 
> not rproc_attach_recovery, where most of the changes happened.
> 

Hello,

Thanks for testing the patch. Correct, if attach recovery is not used 
then the patch shouldn't affect functionality of any platform driver.

Thanks,
Tanmay

> Iulia


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] remoteproc: xlnx: remote crash recovery
  2025-10-29  4:15   ` Tanmay Shah
@ 2025-10-29 23:51     ` Tanmay Shah
  2025-10-30  4:21       ` Peng Fan
  0 siblings, 1 reply; 17+ messages in thread
From: Tanmay Shah @ 2025-10-29 23:51 UTC (permalink / raw)
  To: Peng Fan; +Cc: andersson, mathieu.poirier, linux-remoteproc, linux-kernel



On 10/28/25 11:15 PM, Tanmay Shah wrote:
> 
> 
> On 10/28/25 10:24 PM, Peng Fan wrote:
>> Hi Tanmay,
>>
>> On Mon, Oct 27, 2025 at 09:57:28PM -0700, Tanmay Shah wrote:
>>> Remote processor can crash or hang during normal execution. Linux
>>> remoteproc framework supports different mechanisms to recover the
>>> remote processor and re-establish the RPMsg communication in such case.
>>>
>>> Crash reporting:
>>>
>>> 1) Using debugfs node
>>>
>>> User can report the crash to the core framework via debugfs node using
>>> following command:
>>>
>>> echo 1 > /sys/kernel/debug/remoteproc/remoteproc0/crash
>>>
>>> 2) Remoteproc notify to the host about crash state and crash reason
>>> via the resource table
>>>
>>> This is a platform specific method where the remote firmware contains
>>> vendor specific resource to update the crash state and the crash
>>> reason. Then the remote notifies the crash to the host via mailbox
>>> notification. The host then will check this resource on every mbox
>>> notification and reports the crash to the core framework if needed.
>>>
>>> Crash recovery mechanism:
>>>
>>> There are two mechanisms available to recover the remote processor from
>>> the crash. 1) boot recovery, 2) attach on recovery
>>>
>>> Remoteproc core framework will choose proper mechanism based on the
>>> rproc features set by the platform driver.
>>>
>>> 1) Boot recovery
>>>
>>> This is the default mechanism to recover the remote processor.
>>> In this method core framework will first stop the remote processor,
>>> load the firmware again and then starts the remote processor. On
>>> AMD-Xilinx platforms this method is supported. The coredump callback in
>>> the platform driver isn't implemented so far, but that shouldn't cause
>>> the recovery failure.
>>>
>>> 2) Attach on recovery
>>>
>>> If RPROC_ATTACH_ON_RECOVERY feature is enabled by the platform driver,
>>> then the core framework will choose this method for recovery.
>>>
>>> On zynqmp platform following is the sequence of events expected during
>>> remoteproc crash and attach on recovery:
>>>
>>> a) rproc attach/detach flow is working, and RPMsg comm is established
>>> b) Remote processor (RPU) crashed (crash not reported yet)
>>> c) Platform management controller stops and reloads elf on inactive
>>>    remote processor before reboot
>>> d) platform management controller reboots the remote processor
>>> e) Remote processor boots again, and detects previous crash (platform
>>>    specific mechanism to detect the crash)
>>> f) Remote processor Reports crash to the Linux (Host) and wait for
>>>    the recovery.
>>> g) Linux performs full detach and reattach to remote processor.
>>> h) Normal RPMsg communication is established.
>>>
>>> It is required to destroy all RPMsg related resource and re-create them
>>> during recovery to establish successful RPMsg communication. To achieve
>>> this complete rproc_detach followed by rproc_attach calls are needed.
>>>
>>>
>>> Tanmay Shah (3):
>>>   remoteproc: xlnx: enable boot recovery
>>>   remoteproc: core: full attach detach during recovery
>>>   remoteproc: xlnx: add crash detection mechanism
>>>
>>
>> I gave a test on i.MX8QM-MEK, there are failures, 1st test pass, 2nd 
>> test fail.
>> Without this patch, I not see failures.
>> root@imx8qmmek:~#
>> remoteproc remoteproc0: crash detected in imx-rproc: type watchdog
>> Partition3 reset!
>> remoteproc remoteproc0: handling crash #1 in imx-rproc
>> remoteproc remoteproc0: detached remote processor imx-rproc
>> rproc-virtio rproc-virtio.1.auto: assigned reserved memory node 
>> vdevbuffer@90400000
>> virtio_rpmsg_bus virtio0: rpmsg host is online
>> rproc-virtio rproc-virtio.1.auto: registered virtio0 (type 7)
>> rproc-virtio rproc-virtio.2.auto: assigned reserved memory node 
>> vdevbuffer@90400000
>> virtio_rpmsg_bus virtio1: rpmsg host is online
>> rproc-virtio rproc-virtio.2.auto: registered virtio1 (type 7)
>> remoteproc remoteproc0: remote processor imx-rproc is now attached
>> virtio_rpmsg_bus virtio1: creating channel rpmsg-openamp-demo-channel 
>> addr 0x1e
>>
>> remoteproc remoteproc0: crash detected in imx-rproc: type watchdog
>> Partition3 reset!
>> remoteproc remoteproc0: handling crash #2 in imx-rproc
>> rproc-virtio rproc-virtio.1.auto: assigned reserved memory node 
>> vdevbuffer@90400000
>> virtio_rpmsg_bus virtio4: probe with driver virtio_rpmsg_bus failed 
>> with error -12
>> rproc-virtio rproc-virtio.1.auto: registered virtio4 (type 7)
>> rproc-virtio rproc-virtio.2.auto: assigned reserved memory node 
>> vdevbuffer@90400000
>> virtio_rpmsg_bus virtio5: probe with driver virtio_rpmsg_bus failed 
>> with error -12
>> rproc-virtio rproc-virtio.2.auto: registered virtio5 (type 7)
>> rproc-virtio rproc-virtio.5.auto: assigned reserved memory node 
>> vdevbuffer@90400000
>> virtio_rpmsg_bus virtio6: probe with driver virtio_rpmsg_bus failed 
>> with error -12
>> rproc-virtio rproc-virtio.5.auto: registered virtio6 (type 7)
>> rproc-virtio rproc-virtio.6.auto: assigned reserved memory node 
>> vdevbuffer@90400000
>> virtio_rpmsg_bus virtio7: probe with driver virtio_rpmsg_bus failed 
>> with error -12
>> rproc-virtio rproc-virtio.6.auto: registered virtio7 (type 7)
>> remoteproc remoteproc0: remote processor imx-rproc is now attached
>>
> 
> Hi Peng,
> 
> I don't understand why it should fail. The patch simply implements 
> rproc_detach() -> rproc_attach() sequence.
> 

Hi Peng,

Thanks for testing the patch. I appreciate your quick response. I think 
rproc_boot() should be used instead of rproc_attach(). That should 
probably solve the issue you are facing. I will send v2 with this change 
for you to try.

Thanks,
Tanmay

> In your case, when you do detach -> attach via sysfs that sequence 
> works? If that works, then crash recovery should work as well.
> 
> Could you give steps how do you generate the crash?
> 
> Thanks,
> Tanmay
> 
>> Thanks,
>> Peng
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] remoteproc: xlnx: remote crash recovery
  2025-10-29 23:51     ` Tanmay Shah
@ 2025-10-30  4:21       ` Peng Fan
  2025-11-10 18:03         ` Mathieu Poirier
  0 siblings, 1 reply; 17+ messages in thread
From: Peng Fan @ 2025-10-30  4:21 UTC (permalink / raw)
  To: Tanmay Shah; +Cc: andersson, mathieu.poirier, linux-remoteproc, linux-kernel

Hi Tanmay,

On Wed, Oct 29, 2025 at 06:51:51PM -0500, Tanmay Shah wrote:
...
>> > 
>> 
>> Hi Peng,
>> 
>> I don't understand why it should fail. The patch simply implements
>> rproc_detach() -> rproc_attach() sequence.
>> 
>
>Hi Peng,
>
>Thanks for testing the patch. I appreciate your quick response. I think
>rproc_boot() should be used instead of rproc_attach(). That should probably
>solve the issue you are facing. I will send v2 with this change for you to
>try.
>
>Thanks,
>Tanmay
>
>> In your case, when you do detach -> attach via sysfs that sequence works?
>> If that works, then crash recovery should work as well.

sysfs does not have attach option, only start/stop/detach are there.

>> 
>> Could you give steps how do you generate the crash?

I have not look into the details on why it fails at my side for the 2nd time.

On my board, the M4 core use watchdog to reset itself and notify Linux, then
linux side imx_rproc driver will do
"rproc_report_crash(priv->rproc, RPROC_WATCHDOG);"

I will give a debug on the failures in a few days.

Thanks,
Peng

>> 
>> Thanks,
>> Tanmay
>> 
>> > Thanks,
>> > Peng
>> 
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] remoteproc: core: full attach detach during recovery
  2025-10-28  4:57 ` [PATCH 2/3] remoteproc: core: full attach detach during recovery Tanmay Shah
  2025-10-29 22:49   ` Iuliana Prodan
@ 2025-11-02  8:54   ` Zhongqiu Han
  2025-11-03 17:22     ` Tanmay Shah
  1 sibling, 1 reply; 17+ messages in thread
From: Zhongqiu Han @ 2025-11-02  8:54 UTC (permalink / raw)
  To: Tanmay Shah, andersson, mathieu.poirier
  Cc: linux-remoteproc, linux-kernel, zhongqiu.han

On 10/28/2025 12:57 PM, Tanmay Shah wrote:
> Current attach on recovery mechanism loads the clean resource table
> during recovery, but doesn't re-allocate the resources. RPMsg
> communication will fail after recovery due to this. Fix this
> incorrect behavior by doing the full detach and attach of remote
> processor during the recovery. This will load the clean resource table
> and re-allocate all the resources, which will set up correct vring
> information in the resource table.
> 
> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
> ---
>   drivers/remoteproc/remoteproc_core.c | 22 +++++++++++++---------
>   1 file changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index aada2780b343..f5b078fe056a 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -1777,11 +1777,11 @@ static int rproc_attach_recovery(struct rproc *rproc)
>   {
>   	int ret;
>   
> -	ret = __rproc_detach(rproc);
> +	ret = rproc_detach(rproc);
>   	if (ret)
>   		return ret;
>   
> -	return __rproc_attach(rproc);
> +	return rproc_attach(rproc);
>   }
>   
>   static int rproc_boot_recovery(struct rproc *rproc)
> @@ -1829,6 +1829,9 @@ int rproc_trigger_recovery(struct rproc *rproc)
>   	struct device *dev = &rproc->dev;
>   	int ret;
>   
> +	if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
> +		return rproc_attach_recovery(rproc);
> +
>   	ret = mutex_lock_interruptible(&rproc->lock);

Hi Tanmay,

I have a concern about this patch, specifically regarding the locking
behavior and potential race conditions introduced by the early return in
rproc_trigger_recovery(), by calling rproc_attach_recovery() directly
and bypassing the original mutex_lock_interruptible() in
rproc_trigger_recovery(), the recovery flow now executes rproc_attach()
without holding rproc->lock.

This could potentially lead to race conditions if other threads are
accessing or modifying shared resources concurrently.For example, one
possible scenario is:


state_store/rproc_trigger_auto_boot
-->rproc_boot
    -->ret = mutex_lock_interruptible(&rproc->lock);    <--(4)
    -->if (rproc->state == RPROC_DETACHED) {
        -->ret = rproc_attach(rproc);                   <--(5)
       }
    -->mutex_unlock(&rproc->lock);
	

rproc_trigger_recovery
-->rproc_attach_recovery
    -->rproc_detach
       -->ret = mutex_lock_interruptible(&rproc->lock); <--(1)
       -->ret = __rproc_detach(rproc);
       -->rproc->state = RPROC_DETACHED;                <--(2)
       -->mutex_unlock(&rproc->lock);                   <--(3)
   -->return rproc_attach(rproc);                       <--(6)

As shown in stack (5) and (6), two threads may simultaneously
execute the rproc_attach() function, which could lead to a race
condition.

Please feel free to correct me, thanks~


>   	if (ret)
>   		return ret;
> @@ -1839,10 +1842,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
>   
>   	dev_err(dev, "recovering %s\n", rproc->name);
>   
> -	if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
> -		ret = rproc_attach_recovery(rproc);
> -	else
> -		ret = rproc_boot_recovery(rproc);
> +	ret = rproc_boot_recovery(rproc);
>   
>   unlock_mutex:
>   	mutex_unlock(&rproc->lock);
> @@ -1860,6 +1860,7 @@ static void rproc_crash_handler_work(struct work_struct *work)
>   {
>   	struct rproc *rproc = container_of(work, struct rproc, crash_handler);
>   	struct device *dev = &rproc->dev;
> +	int ret;
>   
>   	dev_dbg(dev, "enter %s\n", __func__);
>   
> @@ -1883,8 +1884,11 @@ static void rproc_crash_handler_work(struct work_struct *work)
>   
>   	mutex_unlock(&rproc->lock);
>   
> -	if (!rproc->recovery_disabled)
> -		rproc_trigger_recovery(rproc);
> +	if (!rproc->recovery_disabled) {
> +		ret = rproc_trigger_recovery(rproc);
> +		if (ret)
> +			dev_warn(dev, "rproc recovery failed, err %d\n", ret);
> +	}
>   
>   out:
>   	pm_relax(rproc->dev.parent);
> @@ -2057,7 +2061,7 @@ int rproc_detach(struct rproc *rproc)
>   		return ret;
>   	}
>   
> -	if (rproc->state != RPROC_ATTACHED) {
> +	if (rproc->state != RPROC_ATTACHED && rproc->state != RPROC_CRASHED) {
>   		ret = -EINVAL;
>   		goto out;
>   	}


-- 
Thx and BRs,
Zhongqiu Han

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] remoteproc: core: full attach detach during recovery
  2025-11-02  8:54   ` Zhongqiu Han
@ 2025-11-03 17:22     ` Tanmay Shah
  0 siblings, 0 replies; 17+ messages in thread
From: Tanmay Shah @ 2025-11-03 17:22 UTC (permalink / raw)
  To: Zhongqiu Han, andersson, mathieu.poirier; +Cc: linux-remoteproc, linux-kernel



On 11/2/25 2:54 AM, Zhongqiu Han wrote:
> On 10/28/2025 12:57 PM, Tanmay Shah wrote:
>> Current attach on recovery mechanism loads the clean resource table
>> during recovery, but doesn't re-allocate the resources. RPMsg
>> communication will fail after recovery due to this. Fix this
>> incorrect behavior by doing the full detach and attach of remote
>> processor during the recovery. This will load the clean resource table
>> and re-allocate all the resources, which will set up correct vring
>> information in the resource table.
>>
>> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
>> ---
>>   drivers/remoteproc/remoteproc_core.c | 22 +++++++++++++---------
>>   1 file changed, 13 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/ 
>> remoteproc/remoteproc_core.c
>> index aada2780b343..f5b078fe056a 100644
>> --- a/drivers/remoteproc/remoteproc_core.c
>> +++ b/drivers/remoteproc/remoteproc_core.c
>> @@ -1777,11 +1777,11 @@ static int rproc_attach_recovery(struct rproc 
>> *rproc)
>>   {
>>       int ret;
>> -    ret = __rproc_detach(rproc);
>> +    ret = rproc_detach(rproc);
>>       if (ret)
>>           return ret;
>> -    return __rproc_attach(rproc);
>> +    return rproc_attach(rproc);
>>   }
>>   static int rproc_boot_recovery(struct rproc *rproc)
>> @@ -1829,6 +1829,9 @@ int rproc_trigger_recovery(struct rproc *rproc)
>>       struct device *dev = &rproc->dev;
>>       int ret;
>> +    if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
>> +        return rproc_attach_recovery(rproc);
>> +
>>       ret = mutex_lock_interruptible(&rproc->lock);
> 
> Hi Tanmay,
> 
> I have a concern about this patch, specifically regarding the locking
> behavior and potential race conditions introduced by the early return in
> rproc_trigger_recovery(), by calling rproc_attach_recovery() directly
> and bypassing the original mutex_lock_interruptible() in
> rproc_trigger_recovery(), the recovery flow now executes rproc_attach()
> without holding rproc->lock.
> 
> This could potentially lead to race conditions if other threads are
> accessing or modifying shared resources concurrently.For example, one
> possible scenario is:
> 
> 
> state_store/rproc_trigger_auto_boot
> -->rproc_boot
>     -->ret = mutex_lock_interruptible(&rproc->lock);    <--(4)
>     -->if (rproc->state == RPROC_DETACHED) {
>         -->ret = rproc_attach(rproc);                   <--(5)
>        }
>     -->mutex_unlock(&rproc->lock);
> 
> 
> rproc_trigger_recovery
> -->rproc_attach_recovery
>     -->rproc_detach
>        -->ret = mutex_lock_interruptible(&rproc->lock); <--(1)
>        -->ret = __rproc_detach(rproc);
>        -->rproc->state = RPROC_DETACHED;                <--(2)
>        -->mutex_unlock(&rproc->lock);                   <--(3)
>    -->return rproc_attach(rproc);                       <--(6)
> 
> As shown in stack (5) and (6), two threads may simultaneously
> execute the rproc_attach() function, which could lead to a race
> condition.
> 
> Please feel free to correct me, thanks~

Hello,

Thanks for your reviews.

I did not face this problem during verification on board, but
your assessment is correct. I have been thinking to use rproc_boot 
instead of rproc_attach. It provides locking mechanism, and then also 
increases power refcount.

https://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux.git/tree/drivers/remoteproc/remoteproc_core.c?h=for-next#n1904

I hope that will address your concern.

Thanks,
Tanmay


> 
> 
>>       if (ret)
>>           return ret;
>> @@ -1839,10 +1842,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
>>       dev_err(dev, "recovering %s\n", rproc->name);
>> -    if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
>> -        ret = rproc_attach_recovery(rproc);
>> -    else
>> -        ret = rproc_boot_recovery(rproc);
>> +    ret = rproc_boot_recovery(rproc);
>>   unlock_mutex:
>>       mutex_unlock(&rproc->lock);
>> @@ -1860,6 +1860,7 @@ static void rproc_crash_handler_work(struct 
>> work_struct *work)
>>   {
>>       struct rproc *rproc = container_of(work, struct rproc, 
>> crash_handler);
>>       struct device *dev = &rproc->dev;
>> +    int ret;
>>       dev_dbg(dev, "enter %s\n", __func__);
>> @@ -1883,8 +1884,11 @@ static void rproc_crash_handler_work(struct 
>> work_struct *work)
>>       mutex_unlock(&rproc->lock);
>> -    if (!rproc->recovery_disabled)
>> -        rproc_trigger_recovery(rproc);
>> +    if (!rproc->recovery_disabled) {
>> +        ret = rproc_trigger_recovery(rproc);
>> +        if (ret)
>> +            dev_warn(dev, "rproc recovery failed, err %d\n", ret);
>> +    }
>>   out:
>>       pm_relax(rproc->dev.parent);
>> @@ -2057,7 +2061,7 @@ int rproc_detach(struct rproc *rproc)
>>           return ret;
>>       }
>> -    if (rproc->state != RPROC_ATTACHED) {
>> +    if (rproc->state != RPROC_ATTACHED && rproc->state != 
>> RPROC_CRASHED) {
>>           ret = -EINVAL;
>>           goto out;
>>       }
> 
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] remoteproc: xlnx: remote crash recovery
  2025-10-30  4:21       ` Peng Fan
@ 2025-11-10 18:03         ` Mathieu Poirier
  2025-11-10 18:39           ` Tanmay Shah
  2025-11-11  7:12           ` Peng Fan
  0 siblings, 2 replies; 17+ messages in thread
From: Mathieu Poirier @ 2025-11-10 18:03 UTC (permalink / raw)
  To: Peng Fan; +Cc: Tanmay Shah, andersson, linux-remoteproc, linux-kernel

On Thu, Oct 30, 2025 at 12:21:24PM +0800, Peng Fan wrote:
> Hi Tanmay,
> 
> On Wed, Oct 29, 2025 at 06:51:51PM -0500, Tanmay Shah wrote:
> ...
> >> > 
> >> 
> >> Hi Peng,
> >> 
> >> I don't understand why it should fail. The patch simply implements
> >> rproc_detach() -> rproc_attach() sequence.
> >> 
> >
> >Hi Peng,
> >
> >Thanks for testing the patch. I appreciate your quick response. I think
> >rproc_boot() should be used instead of rproc_attach(). That should probably
> >solve the issue you are facing. I will send v2 with this change for you to
> >try.
> >
> >Thanks,
> >Tanmay
> >
> >> In your case, when you do detach -> attach via sysfs that sequence works?
> >> If that works, then crash recovery should work as well.
> 
> sysfs does not have attach option, only start/stop/detach are there.
> 
> >> 
> >> Could you give steps how do you generate the crash?
> 
> I have not look into the details on why it fails at my side for the 2nd time.
> 
> On my board, the M4 core use watchdog to reset itself and notify Linux, then
> linux side imx_rproc driver will do
> "rproc_report_crash(priv->rproc, RPROC_WATCHDOG);"
> 
> I will give a debug on the failures in a few days.
>

So what is happening here - Peng, do you plan on providing more debugging
information? Tanmay - are you planning on sending a second revision?
 
> Thanks,
> Peng
> 
> >> 
> >> Thanks,
> >> Tanmay
> >> 
> >> > Thanks,
> >> > Peng
> >> 
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] remoteproc: xlnx: remote crash recovery
  2025-11-10 18:03         ` Mathieu Poirier
@ 2025-11-10 18:39           ` Tanmay Shah
  2025-11-11  7:12           ` Peng Fan
  1 sibling, 0 replies; 17+ messages in thread
From: Tanmay Shah @ 2025-11-10 18:39 UTC (permalink / raw)
  To: Mathieu Poirier, Peng Fan; +Cc: andersson, linux-remoteproc, linux-kernel



On 11/10/25 12:03 PM, Mathieu Poirier wrote:
> On Thu, Oct 30, 2025 at 12:21:24PM +0800, Peng Fan wrote:
>> Hi Tanmay,
>>
>> On Wed, Oct 29, 2025 at 06:51:51PM -0500, Tanmay Shah wrote:
>> ...
>>>>>
>>>>
>>>> Hi Peng,
>>>>
>>>> I don't understand why it should fail. The patch simply implements
>>>> rproc_detach() -> rproc_attach() sequence.
>>>>
>>>
>>> Hi Peng,
>>>
>>> Thanks for testing the patch. I appreciate your quick response. I think
>>> rproc_boot() should be used instead of rproc_attach(). That should probably
>>> solve the issue you are facing. I will send v2 with this change for you to
>>> try.
>>>
>>> Thanks,
>>> Tanmay
>>>
>>>> In your case, when you do detach -> attach via sysfs that sequence works?
>>>> If that works, then crash recovery should work as well.
>>
>> sysfs does not have attach option, only start/stop/detach are there.
>>
>>>>
>>>> Could you give steps how do you generate the crash?
>>
>> I have not look into the details on why it fails at my side for the 2nd time.
>>
>> On my board, the M4 core use watchdog to reset itself and notify Linux, then
>> linux side imx_rproc driver will do
>> "rproc_report_crash(priv->rproc, RPROC_WATCHDOG);"
>>
>> I will give a debug on the failures in a few days.
>>
> 
> So what is happening here - Peng, do you plan on providing more debugging
> information? Tanmay - are you planning on sending a second revision?
>   

Mathieu,

I will be providing the v2, that will replace rproc_attach with 
rproc_boot. I am testing it, so far have not seen any issues.

I hope that will resolve Peng's problem. V2 will be posted this week 
sometime.


Thanks,
Tanmay

>> Thanks,
>> Peng
>>
>>>>
>>>> Thanks,
>>>> Tanmay
>>>>
>>>>> Thanks,
>>>>> Peng
>>>>
>>>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH 0/3] remoteproc: xlnx: remote crash recovery
  2025-11-10 18:03         ` Mathieu Poirier
  2025-11-10 18:39           ` Tanmay Shah
@ 2025-11-11  7:12           ` Peng Fan
  2025-11-11 16:47             ` Tanmay Shah
  1 sibling, 1 reply; 17+ messages in thread
From: Peng Fan @ 2025-11-11  7:12 UTC (permalink / raw)
  To: Mathieu Poirier, Peng Fan (OSS)
  Cc: Tanmay Shah, andersson@kernel.org,
	linux-remoteproc@vger.kernel.org, linux-kernel@vger.kernel.org

Hi Mathieu, Tanmay

> Subject: Re: [PATCH 0/3] remoteproc: xlnx: remote crash recovery
....
> >
> 
> So what is happening here - Peng, do you plan on providing more
> debugging information? Tanmay - are you planning on sending a
> second revision?
> 

Sorry for delay.  I gave a hack with below changes(at end) and it works.
The below change is just hack code to let me verify the rproc->power value.
The issue with current patchset is that after rproc_attach(),
the rproc->power will be 0. So recovery only works for the 1st time,
when it is triggered again, rproc_detach() will abort early in the if check: 
if (!atomic_dec_and_test(&rproc->power)) {
ret = 0;
return ret;
}

---------------
diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index a92c6cd6df67b..2b69304084d11 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -1786,7 +1786,9 @@ static int rproc_attach_recovery(struct rproc *rproc)
        if (ret)
                return ret;
 
-       return rproc_attach(rproc);
+       ret = rproc_attach(rproc);
+       atomic_set(&rproc->power, 1);
+       return ret;
 }

Thanks,
Peng.

> > Thanks,
> > Peng
> >
> > >>
> > >> Thanks,
> > >> Tanmay
> > >>
> > >> > Thanks,
> > >> > Peng
> > >>
> > >


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] remoteproc: xlnx: remote crash recovery
  2025-11-11  7:12           ` Peng Fan
@ 2025-11-11 16:47             ` Tanmay Shah
  0 siblings, 0 replies; 17+ messages in thread
From: Tanmay Shah @ 2025-11-11 16:47 UTC (permalink / raw)
  To: Peng Fan, Mathieu Poirier, Peng Fan (OSS)
  Cc: andersson@kernel.org, linux-remoteproc@vger.kernel.org,
	linux-kernel@vger.kernel.org



On 11/11/25 1:12 AM, Peng Fan wrote:
> Hi Mathieu, Tanmay
> 
>> Subject: Re: [PATCH 0/3] remoteproc: xlnx: remote crash recovery
> ....
>>>
>>
>> So what is happening here - Peng, do you plan on providing more
>> debugging information? Tanmay - are you planning on sending a
>> second revision?
>>
> 
> Sorry for delay.  I gave a hack with below changes(at end) and it works.
> The below change is just hack code to let me verify the rproc->power value.
> The issue with current patchset is that after rproc_attach(),
> the rproc->power will be 0. So recovery only works for the 1st time,
> when it is triggered again, rproc_detach() will abort early in the if check:
> if (!atomic_dec_and_test(&rproc->power)) {
> ret = 0;
> return ret;
> }
> 
> ---------------
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index a92c6cd6df67b..2b69304084d11 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -1786,7 +1786,9 @@ static int rproc_attach_recovery(struct rproc *rproc)
>          if (ret)
>                  return ret;
>   
> -       return rproc_attach(rproc);
> +       ret = rproc_attach(rproc);
> +       atomic_set(&rproc->power, 1);
> +       return ret;
>   }
> 
> Thanks,
> Peng.
> 

Hi Peng,

This issue should be fixed after using rproc_boot instead of 
rproc_attach. rproc_boot makes sure that rproc_attach is atomic 
operation and it also increases power count as you mentioned above.

Thanks,
Tanmay

>>> Thanks,
>>> Peng
>>>
>>>>>
>>>>> Thanks,
>>>>> Tanmay
>>>>>
>>>>>> Thanks,
>>>>>> Peng
>>>>>
>>>>
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 0/3] remoteproc: xlnx: remote crash recovery
@ 2025-11-13 15:44 Tanmay Shah
  0 siblings, 0 replies; 17+ messages in thread
From: Tanmay Shah @ 2025-11-13 15:44 UTC (permalink / raw)
  To: andersson, mathieu.poirier; +Cc: linux-remoteproc, linux-kernel, Tanmay Shah

Remote processor can crash or hang during normal execution. Linux
remoteproc framework supports different mechanisms to recover the
remote processor and re-establish the RPMsg communication in such case.

Crash reporting:

1) Using debugfs node

User can report the crash to the core framework via debugfs node using
following command:

echo 1 > /sys/kernel/debug/remoteproc/remoteproc0/crash

2) Remoteproc notify to the host about crash state and crash reason
via the resource table

This is a platform specific method where the remote firmware contains
vendor specific resource to update the crash state and the crash
reason. Then the remote notifies the crash to the host via mailbox
notification. The host then will check this resource on every mbox
notification and reports the crash to the core framework if needed.

Crash recovery mechanism:

There are two mechanisms available to recover the remote processor from
the crash. 1) boot recovery, 2) attach on recovery

Remoteproc core framework will choose proper mechanism based on the
rproc features set by the platform driver.

1) Boot recovery

This is the default mechanism to recover the remote processor.
In this method core framework will first stop the remote processor,
load the firmware again and then starts the remote processor. On
AMD-Xilinx platforms this method is supported. The coredump callback in
the platform driver isn't implemented so far, but that shouldn't cause
the recovery failure.

2) Attach on recovery

If RPROC_ATTACH_ON_RECOVERY feature is enabled by the platform driver,
then the core framework will choose this method for recovery.

On zynqmp platform following is the sequence of events expected during
remoteproc crash and attach on recovery:

a) rproc attach/detach flow is working, and RPMsg comm is established
b) Remote processor (RPU) crashed (crash not reported yet)
c) Platform management controller stops and reloads elf on inactive
   remote processor before reboot
d) platform management controller reboots the remote processor
e) Remote processor boots again, and detects previous crash (platform
   specific mechanism to detect the crash)
f) Remote processor Reports crash to the Linux (Host) and wait for
   the recovery.
g) Linux performs full detach and reattach to remote processor.
h) Normal RPMsg communication is established.

It is required to destroy all RPMsg related resource and re-create them
during recovery to establish successful RPMsg communication. To achieve
this complete rproc_detach followed by rproc_boot calls are needed. That
is what this patch-series is fixing along with adding rproc recovery
methods for xlnx platform.

Change log:

Changes in v2:
  - use rproc_boot instead of rproc_attach
  - move debug message early in the function
  - clear attach recovery boot flag during detach and stop ops

Tanmay Shah (3):
  remoteproc: xlnx: enable boot recovery
  remoteproc: core: full attach detach during recovery
  remoteproc: xlnx: add crash detection mechanism

 drivers/remoteproc/remoteproc_core.c    | 26 +++++-----
 drivers/remoteproc/xlnx_r5_remoteproc.c | 64 ++++++++++++++++++++++++-
 2 files changed, 78 insertions(+), 12 deletions(-)

base-commit: f982fbb1a6ca3553c15763ad9eb2beeae78d3684
-- 
2.34.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-11-13 15:44 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-28  4:57 [PATCH 0/3] remoteproc: xlnx: remote crash recovery Tanmay Shah
2025-10-28  4:57 ` [PATCH 1/3] remoteproc: xlnx: enable boot recovery Tanmay Shah
2025-10-28  4:57 ` [PATCH 2/3] remoteproc: core: full attach detach during recovery Tanmay Shah
2025-10-29 22:49   ` Iuliana Prodan
2025-10-29 23:41     ` Tanmay Shah
2025-11-02  8:54   ` Zhongqiu Han
2025-11-03 17:22     ` Tanmay Shah
2025-10-28  4:57 ` [PATCH 3/3] remoteproc: xlnx: add crash detection mechanism Tanmay Shah
2025-10-29  3:24 ` [PATCH 0/3] remoteproc: xlnx: remote crash recovery Peng Fan
2025-10-29  4:15   ` Tanmay Shah
2025-10-29 23:51     ` Tanmay Shah
2025-10-30  4:21       ` Peng Fan
2025-11-10 18:03         ` Mathieu Poirier
2025-11-10 18:39           ` Tanmay Shah
2025-11-11  7:12           ` Peng Fan
2025-11-11 16:47             ` Tanmay Shah
  -- strict thread matches above, loose matches on Subject: below --
2025-11-13 15:44 Tanmay Shah

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.