[PATCH 0/3] remoteproc: xlnx: remote crash recovery

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] remoteproc: xlnx: remote crash recovery
@ 2025-11-13 15:44 Tanmay Shah
  2025-11-13 15:44 ` [PATCH v2 1/3] remoteproc: xlnx: enable boot recovery Tanmay Shah
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Tanmay Shah @ 2025-11-13 15:44 UTC (permalink / raw)
  To: andersson, mathieu.poirier; +Cc: linux-remoteproc, linux-kernel, Tanmay Shah

Remote processor can crash or hang during normal execution. Linux
remoteproc framework supports different mechanisms to recover the
remote processor and re-establish the RPMsg communication in such case.

Crash reporting:

1) Using debugfs node

User can report the crash to the core framework via debugfs node using
following command:

echo 1 > /sys/kernel/debug/remoteproc/remoteproc0/crash

2) Remoteproc notify to the host about crash state and crash reason
via the resource table

This is a platform specific method where the remote firmware contains
vendor specific resource to update the crash state and the crash
reason. Then the remote notifies the crash to the host via mailbox
notification. The host then will check this resource on every mbox
notification and reports the crash to the core framework if needed.

Crash recovery mechanism:

There are two mechanisms available to recover the remote processor from
the crash. 1) boot recovery, 2) attach on recovery

Remoteproc core framework will choose proper mechanism based on the
rproc features set by the platform driver.

1) Boot recovery

This is the default mechanism to recover the remote processor.
In this method core framework will first stop the remote processor,
load the firmware again and then starts the remote processor. On
AMD-Xilinx platforms this method is supported. The coredump callback in
the platform driver isn't implemented so far, but that shouldn't cause
the recovery failure.

2) Attach on recovery

If RPROC_ATTACH_ON_RECOVERY feature is enabled by the platform driver,
then the core framework will choose this method for recovery.

On zynqmp platform following is the sequence of events expected during
remoteproc crash and attach on recovery:

a) rproc attach/detach flow is working, and RPMsg comm is established
b) Remote processor (RPU) crashed (crash not reported yet)
c) Platform management controller stops and reloads elf on inactive
   remote processor before reboot
d) platform management controller reboots the remote processor
e) Remote processor boots again, and detects previous crash (platform
   specific mechanism to detect the crash)
f) Remote processor Reports crash to the Linux (Host) and wait for
   the recovery.
g) Linux performs full detach and reattach to remote processor.
h) Normal RPMsg communication is established.

It is required to destroy all RPMsg related resource and re-create them
during recovery to establish successful RPMsg communication. To achieve
this complete rproc_detach followed by rproc_boot calls are needed. That
is what this patch-series is fixing along with adding rproc recovery
methods for xlnx platform.

Change log:

Changes in v2:
  - use rproc_boot instead of rproc_attach
  - move debug message early in the function
  - clear attach recovery boot flag during detach and stop ops

Tanmay Shah (3):
  remoteproc: xlnx: enable boot recovery
  remoteproc: core: full attach detach during recovery
  remoteproc: xlnx: add crash detection mechanism

 drivers/remoteproc/remoteproc_core.c    | 26 +++++-----
 drivers/remoteproc/xlnx_r5_remoteproc.c | 64 ++++++++++++++++++++++++-
 2 files changed, 78 insertions(+), 12 deletions(-)

base-commit: f982fbb1a6ca3553c15763ad9eb2beeae78d3684
-- 
2.34.1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/3] remoteproc: xlnx: enable boot recovery
  2025-11-13 15:44 [PATCH 0/3] remoteproc: xlnx: remote crash recovery Tanmay Shah
@ 2025-11-13 15:44 ` Tanmay Shah
  2025-11-20 17:50   ` Mathieu Poirier
  2025-11-13 15:44 ` [PATCH v2 2/3] remoteproc: core: full attach detach during recovery Tanmay Shah
  2025-11-13 15:44 ` [PATCH v2 3/3] remoteproc: xlnx: add crash detection mechanism Tanmay Shah
  2 siblings, 1 reply; 11+ messages in thread
From: Tanmay Shah @ 2025-11-13 15:44 UTC (permalink / raw)
  To: andersson, mathieu.poirier; +Cc: linux-remoteproc, linux-kernel, Tanmay Shah

This is the default method to recover the remote processor from crash.
During this recovery the Linux will stop the remote, load the same
firmware again and start the remote processor. As of now, coredump
callback does not contain any useful implementation, but this can be
changed as required.

Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
---
 drivers/remoteproc/xlnx_r5_remoteproc.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c b/drivers/remoteproc/xlnx_r5_remoteproc.c
index 0b7b173d0d26..8677b732ad14 100644
--- a/drivers/remoteproc/xlnx_r5_remoteproc.c
+++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
@@ -891,6 +891,11 @@ static int zynqmp_r5_detach(struct rproc *rproc)
 	return 0;
 }
 
+static void zynqmp_r5_coredump(struct rproc *rproc)
+{
+	(void)rproc;
+}
+
 static const struct rproc_ops zynqmp_r5_rproc_ops = {
 	.prepare	= zynqmp_r5_rproc_prepare,
 	.unprepare	= zynqmp_r5_rproc_unprepare,
@@ -905,6 +910,7 @@ static const struct rproc_ops zynqmp_r5_rproc_ops = {
 	.get_loaded_rsc_table = zynqmp_r5_get_loaded_rsc_table,
 	.attach		= zynqmp_r5_attach,
 	.detach		= zynqmp_r5_detach,
+	.coredump	= zynqmp_r5_coredump,
 };
 
 /**
@@ -938,7 +944,7 @@ static struct zynqmp_r5_core *zynqmp_r5_add_rproc_core(struct device *cdev)
 
 	rproc_coredump_set_elf_info(r5_rproc, ELFCLASS32, EM_ARM);
 
-	r5_rproc->recovery_disabled = true;
+	r5_rproc->recovery_disabled = false;
 	r5_rproc->has_iommu = false;
 	r5_rproc->auto_boot = false;
 	r5_core = r5_rproc->priv;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 2/3] remoteproc: core: full attach detach during recovery
  2025-11-13 15:44 [PATCH 0/3] remoteproc: xlnx: remote crash recovery Tanmay Shah
  2025-11-13 15:44 ` [PATCH v2 1/3] remoteproc: xlnx: enable boot recovery Tanmay Shah
@ 2025-11-13 15:44 ` Tanmay Shah
  2025-11-20 17:58   ` Mathieu Poirier
  2025-11-13 15:44 ` [PATCH v2 3/3] remoteproc: xlnx: add crash detection mechanism Tanmay Shah
  2 siblings, 1 reply; 11+ messages in thread
From: Tanmay Shah @ 2025-11-13 15:44 UTC (permalink / raw)
  To: andersson, mathieu.poirier; +Cc: linux-remoteproc, linux-kernel, Tanmay Shah

Current attach on recovery mechanism loads the clean resource table
during recovery, but doesn't re-allocate the resources. RPMsg
communication will fail after recovery due to this. Fix this
incorrect behavior by doing the full detach and attach of remote
processor during the recovery. This will load the clean resource table
and re-allocate all the resources, which will set up correct vring
information in the resource table.

Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
---

Changes in v2:
  - use rproc_boot instead of rproc_attach
  - move debug message early in the function

 drivers/remoteproc/remoteproc_core.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index aada2780b343..f65e8bc2d1e1 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -1777,11 +1777,11 @@ static int rproc_attach_recovery(struct rproc *rproc)
 {
 	int ret;
 
-	ret = __rproc_detach(rproc);
+	ret = rproc_detach(rproc);
 	if (ret)
 		return ret;
 
-	return __rproc_attach(rproc);
+	return rproc_boot(rproc);
 }
 
 static int rproc_boot_recovery(struct rproc *rproc)
@@ -1829,6 +1829,11 @@ int rproc_trigger_recovery(struct rproc *rproc)
 	struct device *dev = &rproc->dev;
 	int ret;
 
+	dev_err(dev, "recovering %s\n", rproc->name);
+
+	if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
+		return rproc_attach_recovery(rproc);
+
 	ret = mutex_lock_interruptible(&rproc->lock);
 	if (ret)
 		return ret;
@@ -1837,12 +1842,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
 	if (rproc->state != RPROC_CRASHED)
 		goto unlock_mutex;
 
-	dev_err(dev, "recovering %s\n", rproc->name);
-
-	if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
-		ret = rproc_attach_recovery(rproc);
-	else
-		ret = rproc_boot_recovery(rproc);
+	ret = rproc_boot_recovery(rproc);
 
 unlock_mutex:
 	mutex_unlock(&rproc->lock);
@@ -1860,6 +1860,7 @@ static void rproc_crash_handler_work(struct work_struct *work)
 {
 	struct rproc *rproc = container_of(work, struct rproc, crash_handler);
 	struct device *dev = &rproc->dev;
+	int ret;
 
 	dev_dbg(dev, "enter %s\n", __func__);
 
@@ -1883,8 +1884,11 @@ static void rproc_crash_handler_work(struct work_struct *work)
 
 	mutex_unlock(&rproc->lock);
 
-	if (!rproc->recovery_disabled)
-		rproc_trigger_recovery(rproc);
+	if (!rproc->recovery_disabled) {
+		ret = rproc_trigger_recovery(rproc);
+		if (ret)
+			dev_warn(dev, "rproc recovery failed, err %d\n", ret);
+	}
 
 out:
 	pm_relax(rproc->dev.parent);
@@ -2057,7 +2061,7 @@ int rproc_detach(struct rproc *rproc)
 		return ret;
 	}
 
-	if (rproc->state != RPROC_ATTACHED) {
+	if (rproc->state != RPROC_ATTACHED && rproc->state != RPROC_CRASHED) {
 		ret = -EINVAL;
 		goto out;
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 3/3] remoteproc: xlnx: add crash detection mechanism
  2025-11-13 15:44 [PATCH 0/3] remoteproc: xlnx: remote crash recovery Tanmay Shah
  2025-11-13 15:44 ` [PATCH v2 1/3] remoteproc: xlnx: enable boot recovery Tanmay Shah
  2025-11-13 15:44 ` [PATCH v2 2/3] remoteproc: core: full attach detach during recovery Tanmay Shah
@ 2025-11-13 15:44 ` Tanmay Shah
  2025-11-21 15:37   ` Mathieu Poirier
  2 siblings, 1 reply; 11+ messages in thread
From: Tanmay Shah @ 2025-11-13 15:44 UTC (permalink / raw)
  To: andersson, mathieu.poirier; +Cc: linux-remoteproc, linux-kernel, Tanmay Shah

Remote processor will report the crash reason via the resource table
and notify the host via kick. The host checks this crash reason on
every kick notification from the remote and report to the core
framework. Then the rproc core framework will start the recovery
process.

Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
---

Changes in v2:
  - clear attach recovery boot flag during detach and stop ops

 drivers/remoteproc/xlnx_r5_remoteproc.c | 56 +++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c b/drivers/remoteproc/xlnx_r5_remoteproc.c
index 8677b732ad14..5d04e8c0dc52 100644
--- a/drivers/remoteproc/xlnx_r5_remoteproc.c
+++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
@@ -108,6 +108,10 @@ struct rsc_tbl_data {
 	const uintptr_t rsc_tbl;
 } __packed;
 
+enum fw_vendor_rsc {
+	FW_RSC_VENDOR_CRASH_REASON = RSC_VENDOR_START,
+};
+
 /*
  * Hardcoded TCM bank values. This will stay in driver to maintain backward
  * compatibility with device-tree that does not have TCM information.
@@ -127,9 +131,21 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
 	{0xffe30000UL, 0x30000, 0x10000UL, PD_R5_1_BTCM, "btcm1"},
 };
 
+/**
+ * struct xlnx_rproc_crash_report - resource to know crash status and reason
+ *
+ * @crash_state: if true, the rproc is notifying crash, time to recover
+ * @crash_reason: reason of crash
+ */
+struct xlnx_rproc_crash_report {
+	u32 crash_state;
+	u32 crash_reason;
+} __packed;
+
 /**
  * struct zynqmp_r5_core - remoteproc core's internal data
  *
+ * @crash_report: rproc crash state and reason
  * @rsc_tbl_va: resource table virtual address
  * @sram: Array of sram memories assigned to this core
  * @num_sram: number of sram for this core
@@ -143,6 +159,7 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
  * @ipi: pointer to mailbox information
  */
 struct zynqmp_r5_core {
+	struct xlnx_rproc_crash_report *crash_report;
 	void __iomem *rsc_tbl_va;
 	struct zynqmp_sram_bank *sram;
 	int num_sram;
@@ -227,10 +244,14 @@ static void handle_event_notified(struct work_struct *work)
 static void zynqmp_r5_mb_rx_cb(struct mbox_client *cl, void *msg)
 {
 	struct zynqmp_ipi_message *ipi_msg, *buf_msg;
+	struct zynqmp_r5_core *r5_core;
+	struct rproc *rproc;
 	struct mbox_info *ipi;
 	size_t len;
 
 	ipi = container_of(cl, struct mbox_info, mbox_cl);
+	r5_core = ipi->r5_core;
+	rproc = r5_core->rproc;
 
 	/* copy data from ipi buffer to r5_core */
 	ipi_msg = (struct zynqmp_ipi_message *)msg;
@@ -244,6 +265,13 @@ static void zynqmp_r5_mb_rx_cb(struct mbox_client *cl, void *msg)
 	buf_msg->len = len;
 	memcpy(buf_msg->data, ipi_msg->data, len);
 
+	/* Check for crash only if rproc crash is expected */
+	if (rproc->state == RPROC_ATTACHED || rproc->state == RPROC_RUNNING) {
+		if (r5_core->crash_report->crash_state)
+			rproc_report_crash(rproc,
+					   r5_core->crash_report->crash_reason);
+	}
+
 	/* received and processed interrupt ack */
 	if (mbox_send_message(ipi->rx_chan, NULL) < 0)
 		dev_err(cl->dev, "ack failed to mbox rx_chan\n");
@@ -397,6 +425,7 @@ static int zynqmp_r5_rproc_start(struct rproc *rproc)
 	if (ret)
 		dev_err(r5_core->dev,
 			"failed to start RPU = 0x%x\n", r5_core->pm_domain_id);
+
 	return ret;
 }
 
@@ -438,6 +467,8 @@ static int zynqmp_r5_rproc_stop(struct rproc *rproc)
 	if (ret)
 		dev_err(r5_core->dev, "core force power down failed\n");
 
+	test_and_clear_bit(RPROC_FEAT_ATTACH_ON_RECOVERY, rproc->features);
+
 	return ret;
 }
 
@@ -874,6 +905,8 @@ static int zynqmp_r5_get_rsc_table_va(struct zynqmp_r5_core *r5_core)
 
 static int zynqmp_r5_attach(struct rproc *rproc)
 {
+	rproc_set_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY);
+
 	dev_dbg(&rproc->dev, "rproc %d attached\n", rproc->index);
 
 	return 0;
@@ -888,6 +921,8 @@ static int zynqmp_r5_detach(struct rproc *rproc)
 	 */
 	zynqmp_r5_rproc_kick(rproc, 0);
 
+	clear_bit(RPROC_FEAT_ATTACH_ON_RECOVERY, rproc->features);
+
 	return 0;
 }
 
@@ -896,6 +931,26 @@ static void zynqmp_r5_coredump(struct rproc *rproc)
 	(void)rproc;
 }
 
+static int zynqmp_r5_handle_crash_rsc(struct rproc *rproc, void *rsc,
+				      int offset, int avail)
+{
+	struct zynqmp_r5_core *r5_core = rproc->priv;
+
+	r5_core->crash_report =
+		(struct xlnx_rproc_crash_report *)(r5_core->rsc_tbl_va + offset);
+
+	return RSC_HANDLED;
+}
+
+static int zynqmp_r5_handle_rsc(struct rproc *rproc, u32 rsc_type, void *rsc,
+				int offset, int avail)
+{
+	if (rsc_type == FW_RSC_VENDOR_CRASH_REASON)
+		return zynqmp_r5_handle_crash_rsc(rproc, rsc, offset, avail);
+
+	return RSC_IGNORED;
+}
+
 static const struct rproc_ops zynqmp_r5_rproc_ops = {
 	.prepare	= zynqmp_r5_rproc_prepare,
 	.unprepare	= zynqmp_r5_rproc_unprepare,
@@ -911,6 +966,7 @@ static const struct rproc_ops zynqmp_r5_rproc_ops = {
 	.attach		= zynqmp_r5_attach,
 	.detach		= zynqmp_r5_detach,
 	.coredump	= zynqmp_r5_coredump,
+	.handle_rsc	= zynqmp_r5_handle_rsc,
 };
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/3] remoteproc: xlnx: enable boot recovery
  2025-11-13 15:44 ` [PATCH v2 1/3] remoteproc: xlnx: enable boot recovery Tanmay Shah
@ 2025-11-20 17:50   ` Mathieu Poirier
  2025-12-01 21:04     ` Tanmay Shah
  0 siblings, 1 reply; 11+ messages in thread
From: Mathieu Poirier @ 2025-11-20 17:50 UTC (permalink / raw)
  To: Tanmay Shah; +Cc: andersson, linux-remoteproc, linux-kernel

Good morning,

On Thu, Nov 13, 2025 at 07:44:02AM -0800, Tanmay Shah wrote:
> This is the default method to recover the remote processor from crash.
> During this recovery the Linux will stop the remote, load the same
> firmware again and start the remote processor. As of now, coredump
> callback does not contain any useful implementation, but this can be
> changed as required.
> 
> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
> ---
>  drivers/remoteproc/xlnx_r5_remoteproc.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c b/drivers/remoteproc/xlnx_r5_remoteproc.c
> index 0b7b173d0d26..8677b732ad14 100644
> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> @@ -891,6 +891,11 @@ static int zynqmp_r5_detach(struct rproc *rproc)
>  	return 0;
>  }
>  
> +static void zynqmp_r5_coredump(struct rproc *rproc)
> +{
> +	(void)rproc;
> +}
> +

Function rproc_coredump(), which is set by default in rproc_alloc_ops(), won't
work?  If not please indicate why this is the case, otherwise this patch can be
dropped.

>  static const struct rproc_ops zynqmp_r5_rproc_ops = {
>  	.prepare	= zynqmp_r5_rproc_prepare,
>  	.unprepare	= zynqmp_r5_rproc_unprepare,
> @@ -905,6 +910,7 @@ static const struct rproc_ops zynqmp_r5_rproc_ops = {
>  	.get_loaded_rsc_table = zynqmp_r5_get_loaded_rsc_table,
>  	.attach		= zynqmp_r5_attach,
>  	.detach		= zynqmp_r5_detach,
> +	.coredump	= zynqmp_r5_coredump,
>  };
>  
>  /**
> @@ -938,7 +944,7 @@ static struct zynqmp_r5_core *zynqmp_r5_add_rproc_core(struct device *cdev)
>  
>  	rproc_coredump_set_elf_info(r5_rproc, ELFCLASS32, EM_ARM);
>  
> -	r5_rproc->recovery_disabled = true;
> +	r5_rproc->recovery_disabled = false;
>  	r5_rproc->has_iommu = false;
>  	r5_rproc->auto_boot = false;
>  	r5_core = r5_rproc->priv;
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/3] remoteproc: core: full attach detach during recovery
  2025-11-13 15:44 ` [PATCH v2 2/3] remoteproc: core: full attach detach during recovery Tanmay Shah
@ 2025-11-20 17:58   ` Mathieu Poirier
  2025-12-01 23:05     ` Tanmay Shah
  0 siblings, 1 reply; 11+ messages in thread
From: Mathieu Poirier @ 2025-11-20 17:58 UTC (permalink / raw)
  To: Tanmay Shah; +Cc: andersson, linux-remoteproc, linux-kernel

On Thu, Nov 13, 2025 at 07:44:03AM -0800, Tanmay Shah wrote:
> Current attach on recovery mechanism loads the clean resource table
> during recovery, but doesn't re-allocate the resources. RPMsg
> communication will fail after recovery due to this. Fix this
> incorrect behavior by doing the full detach and attach of remote
> processor during the recovery. This will load the clean resource table
> and re-allocate all the resources, which will set up correct vring
> information in the resource table.
> 
> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
> ---
> 
> Changes in v2:
>   - use rproc_boot instead of rproc_attach
>   - move debug message early in the function
> 
>  drivers/remoteproc/remoteproc_core.c | 26 +++++++++++++++-----------
>  1 file changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index aada2780b343..f65e8bc2d1e1 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -1777,11 +1777,11 @@ static int rproc_attach_recovery(struct rproc *rproc)
>  {
>  	int ret;
>  
> -	ret = __rproc_detach(rproc);
> +	ret = rproc_detach(rproc);
>  	if (ret)
>  		return ret;
>  
> -	return __rproc_attach(rproc);
> +	return rproc_boot(rproc);
>  }
>  
>  static int rproc_boot_recovery(struct rproc *rproc)
> @@ -1829,6 +1829,11 @@ int rproc_trigger_recovery(struct rproc *rproc)
>  	struct device *dev = &rproc->dev;
>  	int ret;
>  
> +	dev_err(dev, "recovering %s\n", rproc->name);
> +
> +	if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
> +		return rproc_attach_recovery(rproc);
> +

Humm... I find this a little messy.  Taking [1] as an example, I suggest moving
the "unlock_mutex" block to line 1846 and add mutex calls to
rproc_boot_recovery().  That way both rproc_attach_recovery() and
rproc_boot_recovery() are called the same way.

[1] https://elixir.bootlin.com/linux/v6.17.8/source/drivers/remoteproc/remoteproc_core.c#L1832

>  	ret = mutex_lock_interruptible(&rproc->lock);
>  	if (ret)
>  		return ret;
> @@ -1837,12 +1842,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
>  	if (rproc->state != RPROC_CRASHED)
>  		goto unlock_mutex;
>  
> -	dev_err(dev, "recovering %s\n", rproc->name);
> -
> -	if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
> -		ret = rproc_attach_recovery(rproc);
> -	else
> -		ret = rproc_boot_recovery(rproc);
> +	ret = rproc_boot_recovery(rproc);
>  
>  unlock_mutex:
>  	mutex_unlock(&rproc->lock);
> @@ -1860,6 +1860,7 @@ static void rproc_crash_handler_work(struct work_struct *work)
>  {
>  	struct rproc *rproc = container_of(work, struct rproc, crash_handler);
>  	struct device *dev = &rproc->dev;
> +	int ret;
>  
>  	dev_dbg(dev, "enter %s\n", __func__);
>  
> @@ -1883,8 +1884,11 @@ static void rproc_crash_handler_work(struct work_struct *work)
>  
>  	mutex_unlock(&rproc->lock);
>  
> -	if (!rproc->recovery_disabled)
> -		rproc_trigger_recovery(rproc);
> +	if (!rproc->recovery_disabled) {
> +		ret = rproc_trigger_recovery(rproc);
> +		if (ret)
> +			dev_warn(dev, "rproc recovery failed, err %d\n", ret);

I would prefer a patch on its own for this one.

I'm running out of time for today, I'll review patch 3/3 tomorrow.

Thanks,
Mathieu

> +	}
>  
>  out:
>  	pm_relax(rproc->dev.parent);
> @@ -2057,7 +2061,7 @@ int rproc_detach(struct rproc *rproc)
>  		return ret;
>  	}
>  
> -	if (rproc->state != RPROC_ATTACHED) {
> +	if (rproc->state != RPROC_ATTACHED && rproc->state != RPROC_CRASHED) {
>  		ret = -EINVAL;
>  		goto out;
>  	}
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 3/3] remoteproc: xlnx: add crash detection mechanism
  2025-11-13 15:44 ` [PATCH v2 3/3] remoteproc: xlnx: add crash detection mechanism Tanmay Shah
@ 2025-11-21 15:37   ` Mathieu Poirier
  2025-12-02  5:04     ` Tanmay Shah
  0 siblings, 1 reply; 11+ messages in thread
From: Mathieu Poirier @ 2025-11-21 15:37 UTC (permalink / raw)
  To: Tanmay Shah; +Cc: andersson, linux-remoteproc, linux-kernel

On Thu, Nov 13, 2025 at 07:44:04AM -0800, Tanmay Shah wrote:
> Remote processor will report the crash reason via the resource table
> and notify the host via kick. The host checks this crash reason on
> every kick notification from the remote and report to the core
> framework. Then the rproc core framework will start the recovery
> process.

Please substitute the word "kick" for "mailbox notification".  I also have to
assume "core framework" and "rproc core framework" are the same.  Pick one and
stick with it.

> 
> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
> ---
> 
> Changes in v2:
>   - clear attach recovery boot flag during detach and stop ops
> 
>  drivers/remoteproc/xlnx_r5_remoteproc.c | 56 +++++++++++++++++++++++++
>  1 file changed, 56 insertions(+)
> 
> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c b/drivers/remoteproc/xlnx_r5_remoteproc.c
> index 8677b732ad14..5d04e8c0dc52 100644
> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> @@ -108,6 +108,10 @@ struct rsc_tbl_data {
>  	const uintptr_t rsc_tbl;
>  } __packed;
>  
> +enum fw_vendor_rsc {
> +	FW_RSC_VENDOR_CRASH_REASON = RSC_VENDOR_START,
> +};
> +
>  /*
>   * Hardcoded TCM bank values. This will stay in driver to maintain backward
>   * compatibility with device-tree that does not have TCM information.
> @@ -127,9 +131,21 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
>  	{0xffe30000UL, 0x30000, 0x10000UL, PD_R5_1_BTCM, "btcm1"},
>  };
>  
> +/**
> + * struct xlnx_rproc_crash_report - resource to know crash status and reason
> + *
> + * @crash_state: if true, the rproc is notifying crash, time to recover
> + * @crash_reason: reason of crash
> + */
> +struct xlnx_rproc_crash_report {
> +	u32 crash_state;
> +	u32 crash_reason;
> +} __packed;
> +
>  /**
>   * struct zynqmp_r5_core - remoteproc core's internal data
>   *
> + * @crash_report: rproc crash state and reason
>   * @rsc_tbl_va: resource table virtual address
>   * @sram: Array of sram memories assigned to this core
>   * @num_sram: number of sram for this core
> @@ -143,6 +159,7 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
>   * @ipi: pointer to mailbox information
>   */
>  struct zynqmp_r5_core {
> +	struct xlnx_rproc_crash_report *crash_report;
>  	void __iomem *rsc_tbl_va;
>  	struct zynqmp_sram_bank *sram;
>  	int num_sram;
> @@ -227,10 +244,14 @@ static void handle_event_notified(struct work_struct *work)
>  static void zynqmp_r5_mb_rx_cb(struct mbox_client *cl, void *msg)
>  {
>  	struct zynqmp_ipi_message *ipi_msg, *buf_msg;
> +	struct zynqmp_r5_core *r5_core;
> +	struct rproc *rproc;
>  	struct mbox_info *ipi;
>  	size_t len;
>  
>  	ipi = container_of(cl, struct mbox_info, mbox_cl);
> +	r5_core = ipi->r5_core;
> +	rproc = r5_core->rproc;
>  
>  	/* copy data from ipi buffer to r5_core */
>  	ipi_msg = (struct zynqmp_ipi_message *)msg;
> @@ -244,6 +265,13 @@ static void zynqmp_r5_mb_rx_cb(struct mbox_client *cl, void *msg)
>  	buf_msg->len = len;
>  	memcpy(buf_msg->data, ipi_msg->data, len);
>  
> +	/* Check for crash only if rproc crash is expected */
> +	if (rproc->state == RPROC_ATTACHED || rproc->state == RPROC_RUNNING) {
> +		if (r5_core->crash_report->crash_state)
> +			rproc_report_crash(rproc,
> +					   r5_core->crash_report->crash_reason);

At this stage ->crash_state indicates that a crash occured, but how is it reset
once the crash has been handle?  How do we make sure the next mailbox
notification won't trigger another crash report?

> +	}
> +
>  	/* received and processed interrupt ack */
>  	if (mbox_send_message(ipi->rx_chan, NULL) < 0)
>  		dev_err(cl->dev, "ack failed to mbox rx_chan\n");
> @@ -397,6 +425,7 @@ static int zynqmp_r5_rproc_start(struct rproc *rproc)
>  	if (ret)
>  		dev_err(r5_core->dev,
>  			"failed to start RPU = 0x%x\n", r5_core->pm_domain_id);
> +

Spurious change

>  	return ret;
>  }
>  
> @@ -438,6 +467,8 @@ static int zynqmp_r5_rproc_stop(struct rproc *rproc)
>  	if (ret)
>  		dev_err(r5_core->dev, "core force power down failed\n");
>  
> +	test_and_clear_bit(RPROC_FEAT_ATTACH_ON_RECOVERY, rproc->features);
> +
>  	return ret;
>  }
>  
> @@ -874,6 +905,8 @@ static int zynqmp_r5_get_rsc_table_va(struct zynqmp_r5_core *r5_core)
>  
>  static int zynqmp_r5_attach(struct rproc *rproc)
>  {
> +	rproc_set_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY);
> +

Why can't this be set in probe() and left alone from thereon?

>  	dev_dbg(&rproc->dev, "rproc %d attached\n", rproc->index);
>  
>  	return 0;
> @@ -888,6 +921,8 @@ static int zynqmp_r5_detach(struct rproc *rproc)
>  	 */
>  	zynqmp_r5_rproc_kick(rproc, 0);
>  
> +	clear_bit(RPROC_FEAT_ATTACH_ON_RECOVERY, rproc->features);
> +

I'm not sure why this needs to be done, same comment for zynqmp_r5_rproc_stop().

>  	return 0;
>  }
>  
> @@ -896,6 +931,26 @@ static void zynqmp_r5_coredump(struct rproc *rproc)
>  	(void)rproc;
>  }
>  
> +static int zynqmp_r5_handle_crash_rsc(struct rproc *rproc, void *rsc,
> +				      int offset, int avail)
> +{
> +	struct zynqmp_r5_core *r5_core = rproc->priv;
> +
> +	r5_core->crash_report =
> +		(struct xlnx_rproc_crash_report *)(r5_core->rsc_tbl_va + offset);
> +

This function is so simple that I would fold it in zynqmp_r5_handle_rsc() below.

Thanks,
Mathieu

> +	return RSC_HANDLED;
> +}
> +
> +static int zynqmp_r5_handle_rsc(struct rproc *rproc, u32 rsc_type, void *rsc,
> +				int offset, int avail)
> +{
> +	if (rsc_type == FW_RSC_VENDOR_CRASH_REASON)
> +		return zynqmp_r5_handle_crash_rsc(rproc, rsc, offset, avail);
> +
> +	return RSC_IGNORED;
> +}
> +
>  static const struct rproc_ops zynqmp_r5_rproc_ops = {
>  	.prepare	= zynqmp_r5_rproc_prepare,
>  	.unprepare	= zynqmp_r5_rproc_unprepare,
> @@ -911,6 +966,7 @@ static const struct rproc_ops zynqmp_r5_rproc_ops = {
>  	.attach		= zynqmp_r5_attach,
>  	.detach		= zynqmp_r5_detach,
>  	.coredump	= zynqmp_r5_coredump,
> +	.handle_rsc	= zynqmp_r5_handle_rsc,
>  };
>  
>  /**
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/3] remoteproc: xlnx: enable boot recovery
  2025-11-20 17:50   ` Mathieu Poirier
@ 2025-12-01 21:04     ` Tanmay Shah
  0 siblings, 0 replies; 11+ messages in thread
From: Tanmay Shah @ 2025-12-01 21:04 UTC (permalink / raw)
  To: Mathieu Poirier; +Cc: andersson, linux-remoteproc, linux-kernel


Hi Mathieu,

Thanks for reviews. Just got back from the vacation hence replying now.


On 11/20/25 11:50 AM, Mathieu Poirier wrote:
> Good morning,
> 
> On Thu, Nov 13, 2025 at 07:44:02AM -0800, Tanmay Shah wrote:
>> This is the default method to recover the remote processor from crash.
>> During this recovery the Linux will stop the remote, load the same
>> firmware again and start the remote processor. As of now, coredump
>> callback does not contain any useful implementation, but this can be
>> changed as required.
>>
>> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
>> ---
>>   drivers/remoteproc/xlnx_r5_remoteproc.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c b/drivers/remoteproc/xlnx_r5_remoteproc.c
>> index 0b7b173d0d26..8677b732ad14 100644
>> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
>> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
>> @@ -891,6 +891,11 @@ static int zynqmp_r5_detach(struct rproc *rproc)
>>   	return 0;
>>   }
>>   
>> +static void zynqmp_r5_coredump(struct rproc *rproc)
>> +{
>> +	(void)rproc;
>> +}
>> +
> 
> Function rproc_coredump(), which is set by default in rproc_alloc_ops(), won't
> work?  If not please indicate why this is the case, otherwise this patch can be
> dropped.

Thanks, I completely missed it. this function isn't needed and will be 
dropped.

> 
>>   static const struct rproc_ops zynqmp_r5_rproc_ops = {
>>   	.prepare	= zynqmp_r5_rproc_prepare,
>>   	.unprepare	= zynqmp_r5_rproc_unprepare,
>> @@ -905,6 +910,7 @@ static const struct rproc_ops zynqmp_r5_rproc_ops = {
>>   	.get_loaded_rsc_table = zynqmp_r5_get_loaded_rsc_table,
>>   	.attach		= zynqmp_r5_attach,
>>   	.detach		= zynqmp_r5_detach,
>> +	.coredump	= zynqmp_r5_coredump,
>>   };
>>   
>>   /**
>> @@ -938,7 +944,7 @@ static struct zynqmp_r5_core *zynqmp_r5_add_rproc_core(struct device *cdev)
>>   
>>   	rproc_coredump_set_elf_info(r5_rproc, ELFCLASS32, EM_ARM);
>>   
>> -	r5_rproc->recovery_disabled = true;
>> +	r5_rproc->recovery_disabled = false;

This part of the patch is still needed, and I will move it to 3/3. So 
this whole patch can be dropped.

Thanks.

>>   	r5_rproc->has_iommu = false;
>>   	r5_rproc->auto_boot = false;
>>   	r5_core = r5_rproc->priv;
>> -- 
>> 2.34.1
>>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/3] remoteproc: core: full attach detach during recovery
  2025-11-20 17:58   ` Mathieu Poirier
@ 2025-12-01 23:05     ` Tanmay Shah
  0 siblings, 0 replies; 11+ messages in thread
From: Tanmay Shah @ 2025-12-01 23:05 UTC (permalink / raw)
  To: Mathieu Poirier; +Cc: andersson, linux-remoteproc, linux-kernel



On 11/20/25 11:58 AM, Mathieu Poirier wrote:
> On Thu, Nov 13, 2025 at 07:44:03AM -0800, Tanmay Shah wrote:
>> Current attach on recovery mechanism loads the clean resource table
>> during recovery, but doesn't re-allocate the resources. RPMsg
>> communication will fail after recovery due to this. Fix this
>> incorrect behavior by doing the full detach and attach of remote
>> processor during the recovery. This will load the clean resource table
>> and re-allocate all the resources, which will set up correct vring
>> information in the resource table.
>>
>> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
>> ---
>>
>> Changes in v2:
>>    - use rproc_boot instead of rproc_attach
>>    - move debug message early in the function
>>
>>   drivers/remoteproc/remoteproc_core.c | 26 +++++++++++++++-----------
>>   1 file changed, 15 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
>> index aada2780b343..f65e8bc2d1e1 100644
>> --- a/drivers/remoteproc/remoteproc_core.c
>> +++ b/drivers/remoteproc/remoteproc_core.c
>> @@ -1777,11 +1777,11 @@ static int rproc_attach_recovery(struct rproc *rproc)
>>   {
>>   	int ret;
>>   
>> -	ret = __rproc_detach(rproc);
>> +	ret = rproc_detach(rproc);
>>   	if (ret)
>>   		return ret;
>>   
>> -	return __rproc_attach(rproc);
>> +	return rproc_boot(rproc);
>>   }
>>   
>>   static int rproc_boot_recovery(struct rproc *rproc)
>> @@ -1829,6 +1829,11 @@ int rproc_trigger_recovery(struct rproc *rproc)
>>   	struct device *dev = &rproc->dev;
>>   	int ret;
>>   
>> +	dev_err(dev, "recovering %s\n", rproc->name);
>> +
>> +	if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
>> +		return rproc_attach_recovery(rproc);
>> +
> 
> Humm... I find this a little messy.  Taking [1] as an example, I suggest moving
> the "unlock_mutex" block to line 1846 and add mutex calls to
> rproc_boot_recovery().  That way both rproc_attach_recovery() and
> rproc_boot_recovery() are called the same way.
> 
> [1] https://elixir.bootlin.com/linux/v6.17.8/source/drivers/remoteproc/remoteproc_core.c#L1832
> 

Sounds good. I will have to test it but I don't see problem with the 
suggestion made.

>>   	ret = mutex_lock_interruptible(&rproc->lock);
>>   	if (ret)
>>   		return ret;
>> @@ -1837,12 +1842,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
>>   	if (rproc->state != RPROC_CRASHED)
>>   		goto unlock_mutex;
>>   
>> -	dev_err(dev, "recovering %s\n", rproc->name);
>> -
>> -	if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
>> -		ret = rproc_attach_recovery(rproc);
>> -	else
>> -		ret = rproc_boot_recovery(rproc);
>> +	ret = rproc_boot_recovery(rproc);
>>   
>>   unlock_mutex:
>>   	mutex_unlock(&rproc->lock);
>> @@ -1860,6 +1860,7 @@ static void rproc_crash_handler_work(struct work_struct *work)
>>   {
>>   	struct rproc *rproc = container_of(work, struct rproc, crash_handler);
>>   	struct device *dev = &rproc->dev;
>> +	int ret;
>>   
>>   	dev_dbg(dev, "enter %s\n", __func__);
>>   
>> @@ -1883,8 +1884,11 @@ static void rproc_crash_handler_work(struct work_struct *work)
>>   
>>   	mutex_unlock(&rproc->lock);
>>   
>> -	if (!rproc->recovery_disabled)
>> -		rproc_trigger_recovery(rproc);
>> +	if (!rproc->recovery_disabled) {
>> +		ret = rproc_trigger_recovery(rproc);
>> +		if (ret)
>> +			dev_warn(dev, "rproc recovery failed, err %d\n", ret);
> 
> I would prefer a patch on its own for this one.
> 

Ack.

> I'm running out of time for today, I'll review patch 3/3 tomorrow.
> 
> Thanks,
> Mathieu
> 
>> +	}
>>   
>>   out:
>>   	pm_relax(rproc->dev.parent);
>> @@ -2057,7 +2061,7 @@ int rproc_detach(struct rproc *rproc)
>>   		return ret;
>>   	}
>>   
>> -	if (rproc->state != RPROC_ATTACHED) {
>> +	if (rproc->state != RPROC_ATTACHED && rproc->state != RPROC_CRASHED) {
>>   		ret = -EINVAL;
>>   		goto out;
>>   	}
>> -- 
>> 2.34.1
>>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 3/3] remoteproc: xlnx: add crash detection mechanism
  2025-11-21 15:37   ` Mathieu Poirier
@ 2025-12-02  5:04     ` Tanmay Shah
  2025-12-02 15:05       ` Mathieu Poirier
  0 siblings, 1 reply; 11+ messages in thread
From: Tanmay Shah @ 2025-12-02  5:04 UTC (permalink / raw)
  To: Mathieu Poirier; +Cc: andersson, linux-remoteproc, linux-kernel



On 11/21/25 9:37 AM, Mathieu Poirier wrote:
> On Thu, Nov 13, 2025 at 07:44:04AM -0800, Tanmay Shah wrote:
>> Remote processor will report the crash reason via the resource table
>> and notify the host via kick. The host checks this crash reason on
>> every kick notification from the remote and report to the core
>> framework. Then the rproc core framework will start the recovery
>> process.
> 
> Please substitute the word "kick" for "mailbox notification".  I also have to
> assume "core framework" and "rproc core framework" are the same.  Pick one and
> stick with it.
> 

Ack.

>>
>> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
>> ---
>>
>> Changes in v2:
>>    - clear attach recovery boot flag during detach and stop ops
>>
>>   drivers/remoteproc/xlnx_r5_remoteproc.c | 56 +++++++++++++++++++++++++
>>   1 file changed, 56 insertions(+)
>>
>> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c b/drivers/remoteproc/xlnx_r5_remoteproc.c
>> index 8677b732ad14..5d04e8c0dc52 100644
>> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
>> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
>> @@ -108,6 +108,10 @@ struct rsc_tbl_data {
>>   	const uintptr_t rsc_tbl;
>>   } __packed;
>>   
>> +enum fw_vendor_rsc {
>> +	FW_RSC_VENDOR_CRASH_REASON = RSC_VENDOR_START,
>> +};
>> +
>>   /*
>>    * Hardcoded TCM bank values. This will stay in driver to maintain backward
>>    * compatibility with device-tree that does not have TCM information.
>> @@ -127,9 +131,21 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
>>   	{0xffe30000UL, 0x30000, 0x10000UL, PD_R5_1_BTCM, "btcm1"},
>>   };
>>   
>> +/**
>> + * struct xlnx_rproc_crash_report - resource to know crash status and reason
>> + *
>> + * @crash_state: if true, the rproc is notifying crash, time to recover
>> + * @crash_reason: reason of crash
>> + */
>> +struct xlnx_rproc_crash_report {
>> +	u32 crash_state;
>> +	u32 crash_reason;
>> +} __packed;
>> +
>>   /**
>>    * struct zynqmp_r5_core - remoteproc core's internal data
>>    *
>> + * @crash_report: rproc crash state and reason
>>    * @rsc_tbl_va: resource table virtual address
>>    * @sram: Array of sram memories assigned to this core
>>    * @num_sram: number of sram for this core
>> @@ -143,6 +159,7 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
>>    * @ipi: pointer to mailbox information
>>    */
>>   struct zynqmp_r5_core {
>> +	struct xlnx_rproc_crash_report *crash_report;
>>   	void __iomem *rsc_tbl_va;
>>   	struct zynqmp_sram_bank *sram;
>>   	int num_sram;
>> @@ -227,10 +244,14 @@ static void handle_event_notified(struct work_struct *work)
>>   static void zynqmp_r5_mb_rx_cb(struct mbox_client *cl, void *msg)
>>   {
>>   	struct zynqmp_ipi_message *ipi_msg, *buf_msg;
>> +	struct zynqmp_r5_core *r5_core;
>> +	struct rproc *rproc;
>>   	struct mbox_info *ipi;
>>   	size_t len;
>>   
>>   	ipi = container_of(cl, struct mbox_info, mbox_cl);
>> +	r5_core = ipi->r5_core;
>> +	rproc = r5_core->rproc;
>>   
>>   	/* copy data from ipi buffer to r5_core */
>>   	ipi_msg = (struct zynqmp_ipi_message *)msg;
>> @@ -244,6 +265,13 @@ static void zynqmp_r5_mb_rx_cb(struct mbox_client *cl, void *msg)
>>   	buf_msg->len = len;
>>   	memcpy(buf_msg->data, ipi_msg->data, len);
>>   
>> +	/* Check for crash only if rproc crash is expected */
>> +	if (rproc->state == RPROC_ATTACHED || rproc->state == RPROC_RUNNING) {
>> +		if (r5_core->crash_report->crash_state)
>> +			rproc_report_crash(rproc,
>> +					   r5_core->crash_report->crash_reason);
> 
> At this stage ->crash_state indicates that a crash occured, but how is it reset
> once the crash has been handle?  How do we make sure the next mailbox
> notification won't trigger another crash report?
> 

I was counting on the remote firmware to reset the crash_state once it 
reboots before sending the next mailbox notification.

If it's not the best idea, I can reset the crash_state field in start() 
callback or attach() callback at the end. That will indicate that remote 
firmware has started successfully.

>> +	}
>> +
>>   	/* received and processed interrupt ack */
>>   	if (mbox_send_message(ipi->rx_chan, NULL) < 0)
>>   		dev_err(cl->dev, "ack failed to mbox rx_chan\n");
>> @@ -397,6 +425,7 @@ static int zynqmp_r5_rproc_start(struct rproc *rproc)
>>   	if (ret)
>>   		dev_err(r5_core->dev,
>>   			"failed to start RPU = 0x%x\n", r5_core->pm_domain_id);
>> +
> 
> Spurious change
> 

Ack will remove it.

>>   	return ret;
>>   }
>>   
>> @@ -438,6 +467,8 @@ static int zynqmp_r5_rproc_stop(struct rproc *rproc)
>>   	if (ret)
>>   		dev_err(r5_core->dev, "core force power down failed\n");
>>   
>> +	test_and_clear_bit(RPROC_FEAT_ATTACH_ON_RECOVERY, rproc->features);
>> +
>>   	return ret;
>>   }
>>   
>> @@ -874,6 +905,8 @@ static int zynqmp_r5_get_rsc_table_va(struct zynqmp_r5_core *r5_core)
>>   
>>   static int zynqmp_r5_attach(struct rproc *rproc)
>>   {
>> +	rproc_set_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY);
>> +
> 
> Why can't this be set in probe() and left alone from thereon?
> 

Right now no specific reason. But I wanted to enable recovery only if 
attach() callback is successful. If execution fails anytime before that, 
then no point in enabling it.

>>   	dev_dbg(&rproc->dev, "rproc %d attached\n", rproc->index);
>>   
>>   	return 0;
>> @@ -888,6 +921,8 @@ static int zynqmp_r5_detach(struct rproc *rproc)
>>   	 */
>>   	zynqmp_r5_rproc_kick(rproc, 0);
>>   
>> +	clear_bit(RPROC_FEAT_ATTACH_ON_RECOVERY, rproc->features);
>> +
> 
> I'm not sure why this needs to be done, same comment for zynqmp_r5_rproc_stop().
> 

I think for detach() may be it's not needed. I added it as a cleanup 
sequence i.e. reverse of what's done in the attach() callback.

For stop it is needed in the following case:

attach() -> stop () -> load fw () -> start ().

In this sequence we need to make sure that if recovery is requested 
after start(), then we execute "boot recovery" and not "attach recovery".


Thanks,
Tanmay



>>   	return 0;
>>   }
>>   
>> @@ -896,6 +931,26 @@ static void zynqmp_r5_coredump(struct rproc *rproc)
>>   	(void)rproc;
>>   }
>>   
>> +static int zynqmp_r5_handle_crash_rsc(struct rproc *rproc, void *rsc,
>> +				      int offset, int avail)
>> +{
>> +	struct zynqmp_r5_core *r5_core = rproc->priv;
>> +
>> +	r5_core->crash_report =
>> +		(struct xlnx_rproc_crash_report *)(r5_core->rsc_tbl_va + offset);
>> +
> 
> This function is so simple that I would fold it in zynqmp_r5_handle_rsc() below.
> 

Ack.

> Thanks,
> Mathieu
> 
>> +	return RSC_HANDLED;
>> +}
>> +
>> +static int zynqmp_r5_handle_rsc(struct rproc *rproc, u32 rsc_type, void *rsc,
>> +				int offset, int avail)
>> +{
>> +	if (rsc_type == FW_RSC_VENDOR_CRASH_REASON)
>> +		return zynqmp_r5_handle_crash_rsc(rproc, rsc, offset, avail);
>> +
>> +	return RSC_IGNORED;
>> +}
>> +
>>   static const struct rproc_ops zynqmp_r5_rproc_ops = {
>>   	.prepare	= zynqmp_r5_rproc_prepare,
>>   	.unprepare	= zynqmp_r5_rproc_unprepare,
>> @@ -911,6 +966,7 @@ static const struct rproc_ops zynqmp_r5_rproc_ops = {
>>   	.attach		= zynqmp_r5_attach,
>>   	.detach		= zynqmp_r5_detach,
>>   	.coredump	= zynqmp_r5_coredump,
>> +	.handle_rsc	= zynqmp_r5_handle_rsc,
>>   };
>>   
>>   /**
>> -- 
>> 2.34.1
>>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 3/3] remoteproc: xlnx: add crash detection mechanism
  2025-12-02  5:04     ` Tanmay Shah
@ 2025-12-02 15:05       ` Mathieu Poirier
  0 siblings, 0 replies; 11+ messages in thread
From: Mathieu Poirier @ 2025-12-02 15:05 UTC (permalink / raw)
  To: tanmay.shah; +Cc: andersson, linux-remoteproc, linux-kernel

On Mon, 1 Dec 2025 at 22:04, Tanmay Shah <tanmay.shah@amd.com> wrote:
>
>
>
> On 11/21/25 9:37 AM, Mathieu Poirier wrote:
> > On Thu, Nov 13, 2025 at 07:44:04AM -0800, Tanmay Shah wrote:
> >> Remote processor will report the crash reason via the resource table
> >> and notify the host via kick. The host checks this crash reason on
> >> every kick notification from the remote and report to the core
> >> framework. Then the rproc core framework will start the recovery
> >> process.
> >
> > Please substitute the word "kick" for "mailbox notification".  I also have to
> > assume "core framework" and "rproc core framework" are the same.  Pick one and
> > stick with it.
> >
>
> Ack.
>
> >>
> >> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
> >> ---
> >>
> >> Changes in v2:
> >>    - clear attach recovery boot flag during detach and stop ops
> >>
> >>   drivers/remoteproc/xlnx_r5_remoteproc.c | 56 +++++++++++++++++++++++++
> >>   1 file changed, 56 insertions(+)
> >>
> >> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c b/drivers/remoteproc/xlnx_r5_remoteproc.c
> >> index 8677b732ad14..5d04e8c0dc52 100644
> >> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> >> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> >> @@ -108,6 +108,10 @@ struct rsc_tbl_data {
> >>      const uintptr_t rsc_tbl;
> >>   } __packed;
> >>
> >> +enum fw_vendor_rsc {
> >> +    FW_RSC_VENDOR_CRASH_REASON = RSC_VENDOR_START,
> >> +};
> >> +
> >>   /*
> >>    * Hardcoded TCM bank values. This will stay in driver to maintain backward
> >>    * compatibility with device-tree that does not have TCM information.
> >> @@ -127,9 +131,21 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
> >>      {0xffe30000UL, 0x30000, 0x10000UL, PD_R5_1_BTCM, "btcm1"},
> >>   };
> >>
> >> +/**
> >> + * struct xlnx_rproc_crash_report - resource to know crash status and reason
> >> + *
> >> + * @crash_state: if true, the rproc is notifying crash, time to recover
> >> + * @crash_reason: reason of crash
> >> + */
> >> +struct xlnx_rproc_crash_report {
> >> +    u32 crash_state;
> >> +    u32 crash_reason;
> >> +} __packed;
> >> +
> >>   /**
> >>    * struct zynqmp_r5_core - remoteproc core's internal data
> >>    *
> >> + * @crash_report: rproc crash state and reason
> >>    * @rsc_tbl_va: resource table virtual address
> >>    * @sram: Array of sram memories assigned to this core
> >>    * @num_sram: number of sram for this core
> >> @@ -143,6 +159,7 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
> >>    * @ipi: pointer to mailbox information
> >>    */
> >>   struct zynqmp_r5_core {
> >> +    struct xlnx_rproc_crash_report *crash_report;
> >>      void __iomem *rsc_tbl_va;
> >>      struct zynqmp_sram_bank *sram;
> >>      int num_sram;
> >> @@ -227,10 +244,14 @@ static void handle_event_notified(struct work_struct *work)
> >>   static void zynqmp_r5_mb_rx_cb(struct mbox_client *cl, void *msg)
> >>   {
> >>      struct zynqmp_ipi_message *ipi_msg, *buf_msg;
> >> +    struct zynqmp_r5_core *r5_core;
> >> +    struct rproc *rproc;
> >>      struct mbox_info *ipi;
> >>      size_t len;
> >>
> >>      ipi = container_of(cl, struct mbox_info, mbox_cl);
> >> +    r5_core = ipi->r5_core;
> >> +    rproc = r5_core->rproc;
> >>
> >>      /* copy data from ipi buffer to r5_core */
> >>      ipi_msg = (struct zynqmp_ipi_message *)msg;
> >> @@ -244,6 +265,13 @@ static void zynqmp_r5_mb_rx_cb(struct mbox_client *cl, void *msg)
> >>      buf_msg->len = len;
> >>      memcpy(buf_msg->data, ipi_msg->data, len);
> >>
> >> +    /* Check for crash only if rproc crash is expected */
> >> +    if (rproc->state == RPROC_ATTACHED || rproc->state == RPROC_RUNNING) {
> >> +            if (r5_core->crash_report->crash_state)
> >> +                    rproc_report_crash(rproc,
> >> +                                       r5_core->crash_report->crash_reason);
> >
> > At this stage ->crash_state indicates that a crash occured, but how is it reset
> > once the crash has been handle?  How do we make sure the next mailbox
> > notification won't trigger another crash report?
> >
>
> I was counting on the remote firmware to reset the crash_state once it
> reboots before sending the next mailbox notification.
>
> If it's not the best idea, I can reset the crash_state field in start()
> callback or attach() callback at the end. That will indicate that remote
> firmware has started successfully.

I think this is a better solution.  That way we don't rely on
something that may or may not happen.

>
> >> +    }
> >> +
> >>      /* received and processed interrupt ack */
> >>      if (mbox_send_message(ipi->rx_chan, NULL) < 0)
> >>              dev_err(cl->dev, "ack failed to mbox rx_chan\n");
> >> @@ -397,6 +425,7 @@ static int zynqmp_r5_rproc_start(struct rproc *rproc)
> >>      if (ret)
> >>              dev_err(r5_core->dev,
> >>                      "failed to start RPU = 0x%x\n", r5_core->pm_domain_id);
> >> +
> >
> > Spurious change
> >
>
> Ack will remove it.
>
> >>      return ret;
> >>   }
> >>
> >> @@ -438,6 +467,8 @@ static int zynqmp_r5_rproc_stop(struct rproc *rproc)
> >>      if (ret)
> >>              dev_err(r5_core->dev, "core force power down failed\n");
> >>
> >> +    test_and_clear_bit(RPROC_FEAT_ATTACH_ON_RECOVERY, rproc->features);
> >> +
> >>      return ret;
> >>   }
> >>
> >> @@ -874,6 +905,8 @@ static int zynqmp_r5_get_rsc_table_va(struct zynqmp_r5_core *r5_core)
> >>
> >>   static int zynqmp_r5_attach(struct rproc *rproc)
> >>   {
> >> +    rproc_set_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY);
> >> +
> >
> > Why can't this be set in probe() and left alone from thereon?
> >
>
> Right now no specific reason. But I wanted to enable recovery only if
> attach() callback is successful. If execution fails anytime before that,
> then no point in enabling it.
>
> >>      dev_dbg(&rproc->dev, "rproc %d attached\n", rproc->index);
> >>
> >>      return 0;
> >> @@ -888,6 +921,8 @@ static int zynqmp_r5_detach(struct rproc *rproc)
> >>       */
> >>      zynqmp_r5_rproc_kick(rproc, 0);
> >>
> >> +    clear_bit(RPROC_FEAT_ATTACH_ON_RECOVERY, rproc->features);
> >> +
> >
> > I'm not sure why this needs to be done, same comment for zynqmp_r5_rproc_stop().
> >
>
> I think for detach() may be it's not needed. I added it as a cleanup
> sequence i.e. reverse of what's done in the attach() callback.
>
> For stop it is needed in the following case:
>
> attach() -> stop () -> load fw () -> start ().
>
> In this sequence we need to make sure that if recovery is requested
> after start(), then we execute "boot recovery" and not "attach recovery".
>

I think this is a valid reason, just make sure it is documented in the
code here and for _attach() above.

>
> Thanks,
> Tanmay
>
>
>
> >>      return 0;
> >>   }
> >>
> >> @@ -896,6 +931,26 @@ static void zynqmp_r5_coredump(struct rproc *rproc)
> >>      (void)rproc;
> >>   }
> >>
> >> +static int zynqmp_r5_handle_crash_rsc(struct rproc *rproc, void *rsc,
> >> +                                  int offset, int avail)
> >> +{
> >> +    struct zynqmp_r5_core *r5_core = rproc->priv;
> >> +
> >> +    r5_core->crash_report =
> >> +            (struct xlnx_rproc_crash_report *)(r5_core->rsc_tbl_va + offset);
> >> +
> >
> > This function is so simple that I would fold it in zynqmp_r5_handle_rsc() below.
> >
>
> Ack.
>
> > Thanks,
> > Mathieu
> >
> >> +    return RSC_HANDLED;
> >> +}
> >> +
> >> +static int zynqmp_r5_handle_rsc(struct rproc *rproc, u32 rsc_type, void *rsc,
> >> +                            int offset, int avail)
> >> +{
> >> +    if (rsc_type == FW_RSC_VENDOR_CRASH_REASON)
> >> +            return zynqmp_r5_handle_crash_rsc(rproc, rsc, offset, avail);
> >> +
> >> +    return RSC_IGNORED;
> >> +}
> >> +
> >>   static const struct rproc_ops zynqmp_r5_rproc_ops = {
> >>      .prepare        = zynqmp_r5_rproc_prepare,
> >>      .unprepare      = zynqmp_r5_rproc_unprepare,
> >> @@ -911,6 +966,7 @@ static const struct rproc_ops zynqmp_r5_rproc_ops = {
> >>      .attach         = zynqmp_r5_attach,
> >>      .detach         = zynqmp_r5_detach,
> >>      .coredump       = zynqmp_r5_coredump,
> >> +    .handle_rsc     = zynqmp_r5_handle_rsc,
> >>   };
> >>
> >>   /**
> >> --
> >> 2.34.1
> >>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-12-02 15:05 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-13 15:44 [PATCH 0/3] remoteproc: xlnx: remote crash recovery Tanmay Shah
2025-11-13 15:44 ` [PATCH v2 1/3] remoteproc: xlnx: enable boot recovery Tanmay Shah
2025-11-20 17:50   ` Mathieu Poirier
2025-12-01 21:04     ` Tanmay Shah
2025-11-13 15:44 ` [PATCH v2 2/3] remoteproc: core: full attach detach during recovery Tanmay Shah
2025-11-20 17:58   ` Mathieu Poirier
2025-12-01 23:05     ` Tanmay Shah
2025-11-13 15:44 ` [PATCH v2 3/3] remoteproc: xlnx: add crash detection mechanism Tanmay Shah
2025-11-21 15:37   ` Mathieu Poirier
2025-12-02  5:04     ` Tanmay Shah
2025-12-02 15:05       ` Mathieu Poirier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox