All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 0/2] remoteproc: xlnx: remote crash recovery
@ 2026-06-10 18:27 Tanmay Shah
  2026-06-10 18:27 ` [PATCH v5 1/2] remoteproc: core: full attach detach during recovery Tanmay Shah
  2026-06-10 18:27 ` [PATCH v5 2/2] remoteproc: xlnx: add crash detection mechanism Tanmay Shah
  0 siblings, 2 replies; 6+ messages in thread
From: Tanmay Shah @ 2026-06-10 18:27 UTC (permalink / raw)
  To: andersson, mathieu.poirier; +Cc: linux-remoteproc, linux-kernel, Tanmay Shah

Remote processor can crash or hang during normal execution. Linux
remoteproc framework supports different mechanisms to recover the
remote processor and re-establish the RPMsg communication in such case.

Crash reporting on AMD-Xilinx platform:

1) Using debugfs node

User can report the crash to the core framework via debugfs node using
following command:

echo 1 > /sys/kernel/debug/remoteproc/remoteproc0/crash

2) Remoteproc notify to the host about crash state and crash reason
via the resource table

This is a platform specific method where the remote firmware contains
vendor specific resource to update the crash state and the crash
reason. Then the remote notifies the crash to the host via mailbox
notification. The host then will check this resource on every mbox
notification and reports the crash to the core framework if needed.

Crash recovery mechanism on AMD-Xilnx platform:

There are two mechanisms available to recover the remote processor from
the crash. 1) boot recovery, 2) attach on recovery

Remoteproc core framework will choose proper mechanism based on the
rproc features set by the platform driver.

1) Boot recovery

This is the default mechanism to recover the remote processor.
In this method core framework will first stop the remote processor,
load the firmware again and then starts the remote processor. On
AMD-Xilinx platforms this method is supported. The default coredump
method is supported.

2) Attach on recovery

If RPROC_ATTACH_ON_RECOVERY feature is enabled by the platform driver,
then the core framework will choose this method for recovery.

On versal and later platforms following is the sequence of events expected
during remoteproc crash and attach on recovery:

a) Remoteproc attach/detach flow is working, and RPMsg comm is established
b) Remote processor (RPU) crashed (crash not reported yet)
c) Platform management controller is instructed to stop and reload elf
   on inactive remote processor before reboot (platform specific method)
d) Platform management controller reboots the remote processor
e) Remote processor boots again, and detects previous crash (platform
   specific mechanism to detect the crash)
f) Remote processor Reports crash to the Linux (Host) and wait for
   the recovery.
g) Linux performs full detach and reattach to remote processor.
h) Normal RPMsg communication is established.

It is required to destroy all RPMsg related resources and recreate them
during recovery to establish successful RPMsg communication. To achieve
this complete rproc_detach followed by rproc_boot calls are needed. That
is what this patch-series is fixing along with adding rproc recovery
methods for AMD-Xilinx platforms.

Change log:

Changes in v5:
  - use local variable to access crash report pointer
  - End crash report string with '\0' without relying on fw

Changes in v4:
  - move pieces needed from rproc_detach() to rproc_attach_recovery() to
    cleanup the resources
  - use rproc_attach() over __rproc_attach() to create resources again
  - keep rproc_trigger_recovery() unmodified
  - Optimize crash resource memory by changing type to u32 to u8
  - Introduce version field in the crash resource
  - Check crash related condition before rproc state related condition
  - Move crash reporting to the bottom half of the isr instead of
    actual irq handler
  - Introduce 16 bytes array in the crash report to store crash reason
    in the string format
  - Remove redundant type cast

Changes in 3: 
  - both rproc_attach_recovery() and
    rproc_boot_recovery() are called the same way.
  - remove unrelated changes
  - %s/kick/mailbox notification/
  - %s/core framework/rproc core framework/
  - fold simple function within zynqmp_r5_handle_rsc().
  - remove spurious change
  - reset crash state after reporting the crash
  - document set and reset of ATTACH_ON_RECOVERY flag
  - set recovery_disabled flag to false
  - check condition rproc->crash_reason != NULL

Changes in v2:
  - use rproc_boot instead of rproc_attach
  - move debug message early in the function
  - clear attach recovery boot flag during detach and stop ops

Tanmay Shah (2):
  remoteproc: core: full attach detach during recovery
  remoteproc: xlnx: add crash detection mechanism

Tanmay Shah (2):
  remoteproc: core: full attach detach during recovery
  remoteproc: xlnx: add crash detection mechanism

 drivers/remoteproc/remoteproc_core.c    | 15 ++++-
 drivers/remoteproc/xlnx_r5_remoteproc.c | 73 ++++++++++++++++++++++++-
 2 files changed, 86 insertions(+), 2 deletions(-)


base-commit: 85842b61f64cac93d28e129d35193e329d463fd1
-- 
2.34.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v5 1/2] remoteproc: core: full attach detach during recovery
  2026-06-10 18:27 [PATCH v5 0/2] remoteproc: xlnx: remote crash recovery Tanmay Shah
@ 2026-06-10 18:27 ` Tanmay Shah
  2026-06-16 15:29   ` Mathieu Poirier
  2026-06-10 18:27 ` [PATCH v5 2/2] remoteproc: xlnx: add crash detection mechanism Tanmay Shah
  1 sibling, 1 reply; 6+ messages in thread
From: Tanmay Shah @ 2026-06-10 18:27 UTC (permalink / raw)
  To: andersson, mathieu.poirier; +Cc: linux-remoteproc, linux-kernel, Tanmay Shah

Current attach on recovery mechanism loads the clean resource table
during recovery, but doesn't re-allocate the resources. RPMsg
communication will fail after recovery due to this. Fix this
incorrect behavior by doing the full detach and attach of remote
processor during the recovery. This will load the clean resource table
and re-allocate all the resources, which will set up correct vring
information in the resource table.

Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
---
 drivers/remoteproc/remoteproc_core.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index f003be006b1b..8393468fa1ee 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -1776,7 +1776,20 @@ static int rproc_attach_recovery(struct rproc *rproc)
 	if (ret)
 		return ret;
 
-	return __rproc_attach(rproc);
+	/* clean up all acquired resources */
+	rproc_resource_cleanup(rproc);
+
+	/* release HW resources if needed */
+	rproc_unprepare_device(rproc);
+
+	rproc_disable_iommu(rproc);
+
+	/* Free the copy of the resource table */
+	kfree(rproc->cached_table);
+	rproc->cached_table = NULL;
+	rproc->table_ptr = NULL;
+
+	return rproc_attach(rproc);
 }
 
 static int rproc_boot_recovery(struct rproc *rproc)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v5 2/2] remoteproc: xlnx: add crash detection mechanism
  2026-06-10 18:27 [PATCH v5 0/2] remoteproc: xlnx: remote crash recovery Tanmay Shah
  2026-06-10 18:27 ` [PATCH v5 1/2] remoteproc: core: full attach detach during recovery Tanmay Shah
@ 2026-06-10 18:27 ` Tanmay Shah
  2026-06-16 16:01   ` Mathieu Poirier
  1 sibling, 1 reply; 6+ messages in thread
From: Tanmay Shah @ 2026-06-10 18:27 UTC (permalink / raw)
  To: andersson, mathieu.poirier; +Cc: linux-remoteproc, linux-kernel, Tanmay Shah

Remote processor will report the crash reason via the resource table
and notify the host via mailbox notification. The host checks this
crash reason on every mailbox notification from the remote and report
to the rproc core framework. Then the rproc core framework will start
the recovery process.

Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
---

changes in v5
  - use local variable to access crash report pointer
  - End crash report string with '\0' without relying on fw

 drivers/remoteproc/xlnx_r5_remoteproc.c | 73 ++++++++++++++++++++++++-
 1 file changed, 72 insertions(+), 1 deletion(-)

diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c b/drivers/remoteproc/xlnx_r5_remoteproc.c
index 3349d1877751..86afff9f3b40 100644
--- a/drivers/remoteproc/xlnx_r5_remoteproc.c
+++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
@@ -112,6 +112,10 @@ struct rsc_tbl_data {
 	const uintptr_t rsc_tbl;
 } __packed;
 
+enum xlnx_rproc_fw_rsc {
+	XLNX_RPROC_FW_CRASH_REASON = RSC_VENDOR_START,
+};
+
 /*
  * Hardcoded TCM bank values. This will stay in driver to maintain backward
  * compatibility with device-tree that does not have TCM information.
@@ -131,9 +135,27 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
 	{0xffe30000UL, 0x30000, 0x10000UL, PD_R5_1_BTCM, "btcm1"},
 };
 
+#define CRASH_REASON_STR_LEN 16
+
+/**
+ * struct xlnx_rproc_crash_report - resource to know crash status and reason
+ *
+ * @version: version of this resource
+ * @crash_state: if true, the rproc is notifying crash, time to recover
+ * @crash_reason: number to describe reason of crash
+ * @crash_reason_str: short string description of crash reason
+ */
+struct xlnx_rproc_crash_report {
+	u8 version;
+	u8 crash_state;
+	u8 crash_reason;
+	char crash_reason_str[CRASH_REASON_STR_LEN];
+} __packed;
+
 /**
  * struct zynqmp_r5_core - remoteproc core's internal data
  *
+ * @crash_report: rproc crash state and reason
  * @rsc_tbl_va: resource table virtual address
  * @sram: Array of sram memories assigned to this core
  * @num_sram: number of sram for this core
@@ -147,6 +169,7 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
  * @ipi: pointer to mailbox information
  */
 struct zynqmp_r5_core {
+	struct xlnx_rproc_crash_report *crash_report;
 	void __iomem *rsc_tbl_va;
 	struct zynqmp_sram_bank *sram;
 	int num_sram;
@@ -204,11 +227,27 @@ static int event_notified_idr_cb(int id, void *ptr, void *data)
  */
 static void handle_event_notified(struct work_struct *work)
 {
+	struct xlnx_rproc_crash_report *report;
+	struct zynqmp_r5_core *r5_core;
 	struct mbox_info *ipi;
 	struct rproc *rproc;
 
 	ipi = container_of(work, struct mbox_info, mbox_work);
 	rproc = ipi->r5_core->rproc;
+	r5_core = ipi->r5_core;
+	report = r5_core->crash_report;
+
+	/* report crash only if expected */
+	if (report && report->crash_state) {
+		if (rproc->state == RPROC_ATTACHED || rproc->state == RPROC_RUNNING) {
+			report->crash_reason_str[CRASH_REASON_STR_LEN - 1] = '\0';
+			dev_warn(&rproc->dev, "crash reason id: %d %s\n",
+				 report->crash_reason, report->crash_reason_str);
+			rproc_report_crash(rproc, RPROC_FATAL_ERROR);
+			report->crash_state = false;
+			return;
+		}
+	}
 
 	/*
 	 * We only use IPI for interrupt. The RPU firmware side may or may
@@ -448,6 +487,13 @@ static int zynqmp_r5_rproc_stop(struct rproc *rproc)
 	if (ret)
 		dev_err(r5_core->dev, "core force power down failed\n");
 
+	/*
+	 * Clear attach on recovery flag during stop operation. The next state
+	 * of the remote processor is expected to be "Running" state. In this
+	 * state boot recovery method must take place over attach on recovery.
+	 */
+	test_and_clear_bit(RPROC_FEAT_ATTACH_ON_RECOVERY, rproc->features);
+
 	return ret;
 }
 
@@ -869,6 +915,9 @@ static int zynqmp_r5_get_rsc_table_va(struct zynqmp_r5_core *r5_core)
 
 static int zynqmp_r5_attach(struct rproc *rproc)
 {
+	/* Enable attach on recovery method. Clear it during rproc stop. */
+	rproc_set_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY);
+
 	dev_dbg(&rproc->dev, "rproc %d attached\n", rproc->index);
 
 	return 0;
@@ -883,9 +932,30 @@ static int zynqmp_r5_detach(struct rproc *rproc)
 	 */
 	zynqmp_r5_rproc_kick(rproc, 0);
 
+	clear_bit(RPROC_FEAT_ATTACH_ON_RECOVERY, rproc->features);
+
 	return 0;
 }
 
+static int zynqmp_r5_handle_rsc(struct rproc *rproc, u32 rsc_type, void *rsc,
+				int offset, int avail)
+{
+	struct zynqmp_r5_core *r5_core = rproc->priv;
+	void *rsc_offset = (r5_core->rsc_tbl_va + offset);
+
+	if (rsc_type == XLNX_RPROC_FW_CRASH_REASON) {
+		r5_core->crash_report = rsc_offset;
+		/* reset all values */
+		r5_core->crash_report->crash_state = false;
+		r5_core->crash_report->crash_reason = 0;
+		r5_core->crash_report->crash_reason_str[0] = '\0';
+	} else {
+		return RSC_IGNORED;
+	}
+
+	return RSC_HANDLED;
+}
+
 static const struct rproc_ops zynqmp_r5_rproc_ops = {
 	.prepare	= zynqmp_r5_rproc_prepare,
 	.unprepare	= zynqmp_r5_rproc_unprepare,
@@ -900,6 +970,7 @@ static const struct rproc_ops zynqmp_r5_rproc_ops = {
 	.get_loaded_rsc_table = zynqmp_r5_get_loaded_rsc_table,
 	.attach		= zynqmp_r5_attach,
 	.detach		= zynqmp_r5_detach,
+	.handle_rsc	= zynqmp_r5_handle_rsc,
 };
 
 /**
@@ -939,7 +1010,7 @@ static struct zynqmp_r5_core *zynqmp_r5_alloc_rproc_core(struct device *cdev)
 
 	rproc_coredump_set_elf_info(r5_rproc, ELFCLASS32, EM_ARM);
 
-	r5_rproc->recovery_disabled = true;
+	r5_rproc->recovery_disabled = false;
 	r5_rproc->has_iommu = false;
 	r5_rproc->auto_boot = false;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v5 1/2] remoteproc: core: full attach detach during recovery
  2026-06-10 18:27 ` [PATCH v5 1/2] remoteproc: core: full attach detach during recovery Tanmay Shah
@ 2026-06-16 15:29   ` Mathieu Poirier
  0 siblings, 0 replies; 6+ messages in thread
From: Mathieu Poirier @ 2026-06-16 15:29 UTC (permalink / raw)
  To: Tanmay Shah; +Cc: andersson, linux-remoteproc, linux-kernel

Hi Tanmay,

On Wed, Jun 10, 2026 at 11:27:10AM -0700, Tanmay Shah wrote:
> Current attach on recovery mechanism loads the clean resource table
> during recovery, but doesn't re-allocate the resources. RPMsg
> communication will fail after recovery due to this. Fix this
> incorrect behavior by doing the full detach and attach of remote
> processor during the recovery. This will load the clean resource table
> and re-allocate all the resources, which will set up correct vring
> information in the resource table.
> 
> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
> ---
>  drivers/remoteproc/remoteproc_core.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index f003be006b1b..8393468fa1ee 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -1776,7 +1776,20 @@ static int rproc_attach_recovery(struct rproc *rproc)
>  	if (ret)
>  		return ret;
>  
> -	return __rproc_attach(rproc);
> +	/* clean up all acquired resources */
> +	rproc_resource_cleanup(rproc);
> +
> +	/* release HW resources if needed */
> +	rproc_unprepare_device(rproc);
> +
> +	rproc_disable_iommu(rproc);
> +
> +	/* Free the copy of the resource table */
> +	kfree(rproc->cached_table);
> +	rproc->cached_table = NULL;
> +	rproc->table_ptr = NULL;
> +
> +	return rproc_attach(rproc);

That part works.

>  }
>  
>  static int rproc_boot_recovery(struct rproc *rproc)
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v5 2/2] remoteproc: xlnx: add crash detection mechanism
  2026-06-10 18:27 ` [PATCH v5 2/2] remoteproc: xlnx: add crash detection mechanism Tanmay Shah
@ 2026-06-16 16:01   ` Mathieu Poirier
  2026-06-16 19:52     ` Shah, Tanmay
  0 siblings, 1 reply; 6+ messages in thread
From: Mathieu Poirier @ 2026-06-16 16:01 UTC (permalink / raw)
  To: Tanmay Shah; +Cc: andersson, linux-remoteproc, linux-kernel

On Wed, Jun 10, 2026 at 11:27:11AM -0700, Tanmay Shah wrote:
> Remote processor will report the crash reason via the resource table
> and notify the host via mailbox notification. The host checks this
> crash reason on every mailbox notification from the remote and report
> to the rproc core framework. Then the rproc core framework will start
> the recovery process.
> 
> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
> ---
> 
> changes in v5
>   - use local variable to access crash report pointer
>   - End crash report string with '\0' without relying on fw
> 
>  drivers/remoteproc/xlnx_r5_remoteproc.c | 73 ++++++++++++++++++++++++-
>  1 file changed, 72 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c b/drivers/remoteproc/xlnx_r5_remoteproc.c
> index 3349d1877751..86afff9f3b40 100644
> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> @@ -112,6 +112,10 @@ struct rsc_tbl_data {
>  	const uintptr_t rsc_tbl;
>  } __packed;
>  
> +enum xlnx_rproc_fw_rsc {
> +	XLNX_RPROC_FW_CRASH_REASON = RSC_VENDOR_START,

I would call this XLNX_RPROC_FW_CRASH_REPORT

> +};
> +
>  /*
>   * Hardcoded TCM bank values. This will stay in driver to maintain backward
>   * compatibility with device-tree that does not have TCM information.
> @@ -131,9 +135,27 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
>  	{0xffe30000UL, 0x30000, 0x10000UL, PD_R5_1_BTCM, "btcm1"},
>  };
>  
> +#define CRASH_REASON_STR_LEN 16
> +
> +/**
> + * struct xlnx_rproc_crash_report - resource to know crash status and reason
> + *
> + * @version: version of this resource
> + * @crash_state: if true, the rproc is notifying crash, time to recover
> + * @crash_reason: number to describe reason of crash
> + * @crash_reason_str: short string description of crash reason
> + */
> +struct xlnx_rproc_crash_report {
> +	u8 version;
> +	u8 crash_state;
> +	u8 crash_reason;

Do you need 2 variables?  Would it be possible for @crash_reason != 0 to
indicate a crash, making @cash_state irrelevant?

> +	char crash_reason_str[CRASH_REASON_STR_LEN];
> +} __packed;
> +
>  /**
>   * struct zynqmp_r5_core - remoteproc core's internal data
>   *
> + * @crash_report: rproc crash state and reason
>   * @rsc_tbl_va: resource table virtual address
>   * @sram: Array of sram memories assigned to this core
>   * @num_sram: number of sram for this core
> @@ -147,6 +169,7 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
>   * @ipi: pointer to mailbox information
>   */
>  struct zynqmp_r5_core {
> +	struct xlnx_rproc_crash_report *crash_report;
>  	void __iomem *rsc_tbl_va;
>  	struct zynqmp_sram_bank *sram;
>  	int num_sram;
> @@ -204,11 +227,27 @@ static int event_notified_idr_cb(int id, void *ptr, void *data)
>   */
>  static void handle_event_notified(struct work_struct *work)
>  {
> +	struct xlnx_rproc_crash_report *report;
> +	struct zynqmp_r5_core *r5_core;
>  	struct mbox_info *ipi;
>  	struct rproc *rproc;
>  
>  	ipi = container_of(work, struct mbox_info, mbox_work);
>  	rproc = ipi->r5_core->rproc;
> +	r5_core = ipi->r5_core;
> +	report = r5_core->crash_report;
> +
> +	/* report crash only if expected */
> +	if (report && report->crash_state) {
> +		if (rproc->state == RPROC_ATTACHED || rproc->state == RPROC_RUNNING) {
> +			report->crash_reason_str[CRASH_REASON_STR_LEN - 1] = '\0';
> +			dev_warn(&rproc->dev, "crash reason id: %d %s\n",
> +				 report->crash_reason, report->crash_reason_str);
> +			rproc_report_crash(rproc, RPROC_FATAL_ERROR);
> +			report->crash_state = false;
> +			return;
> +		}
> +	}
>  
>  	/*
>  	 * We only use IPI for interrupt. The RPU firmware side may or may
> @@ -448,6 +487,13 @@ static int zynqmp_r5_rproc_stop(struct rproc *rproc)
>  	if (ret)
>  		dev_err(r5_core->dev, "core force power down failed\n");
>  
> +	/*
> +	 * Clear attach on recovery flag during stop operation. The next state
> +	 * of the remote processor is expected to be "Running" state. In this
> +	 * state boot recovery method must take place over attach on recovery.
> +	 */
> +	test_and_clear_bit(RPROC_FEAT_ATTACH_ON_RECOVERY, rproc->features);

Why is there a need to play set/clear the RPROC_FEAT_ATTACH_ON_RECOVERY flag
here and in zynqmp_r5_attach() and zynqmp_r5_detach()?  I think we talked about
that before but can't remember the outcome of that conversation.

> +
>  	return ret;
>  }
>  
> @@ -869,6 +915,9 @@ static int zynqmp_r5_get_rsc_table_va(struct zynqmp_r5_core *r5_core)
>  
>  static int zynqmp_r5_attach(struct rproc *rproc)
>  {
> +	/* Enable attach on recovery method. Clear it during rproc stop. */
> +	rproc_set_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY);
> +
>  	dev_dbg(&rproc->dev, "rproc %d attached\n", rproc->index);
>  
>  	return 0;
> @@ -883,9 +932,30 @@ static int zynqmp_r5_detach(struct rproc *rproc)
>  	 */
>  	zynqmp_r5_rproc_kick(rproc, 0);
>  
> +	clear_bit(RPROC_FEAT_ATTACH_ON_RECOVERY, rproc->features);
> +
>  	return 0;
>  }
>  
> +static int zynqmp_r5_handle_rsc(struct rproc *rproc, u32 rsc_type, void *rsc,
> +				int offset, int avail)
> +{
> +	struct zynqmp_r5_core *r5_core = rproc->priv;
> +	void *rsc_offset = (r5_core->rsc_tbl_va + offset);
> +

        if (rsc_type != XLNX_RPROC_FW_CRASH_REASON)
                return RSC_IGNORED;

> +	if (rsc_type == XLNX_RPROC_FW_CRASH_REASON) {
> +		r5_core->crash_report = rsc_offset;
> +		/* reset all values */
> +		r5_core->crash_report->crash_state = false;
> +		r5_core->crash_report->crash_reason = 0;
> +		r5_core->crash_report->crash_reason_str[0] = '\0';
> +	} else {
> +		return RSC_IGNORED;
> +	}
> +
> +	return RSC_HANDLED;
> +}
> +
>  static const struct rproc_ops zynqmp_r5_rproc_ops = {
>  	.prepare	= zynqmp_r5_rproc_prepare,
>  	.unprepare	= zynqmp_r5_rproc_unprepare,
> @@ -900,6 +970,7 @@ static const struct rproc_ops zynqmp_r5_rproc_ops = {
>  	.get_loaded_rsc_table = zynqmp_r5_get_loaded_rsc_table,
>  	.attach		= zynqmp_r5_attach,
>  	.detach		= zynqmp_r5_detach,
> +	.handle_rsc	= zynqmp_r5_handle_rsc,
>  };
>  
>  /**
> @@ -939,7 +1010,7 @@ static struct zynqmp_r5_core *zynqmp_r5_alloc_rproc_core(struct device *cdev)
>  
>  	rproc_coredump_set_elf_info(r5_rproc, ELFCLASS32, EM_ARM);
>  
> -	r5_rproc->recovery_disabled = true;
> +	r5_rproc->recovery_disabled = false;

I'm good with the overall architecture of this set.

Regards,
Mathieu

>  	r5_rproc->has_iommu = false;
>  	r5_rproc->auto_boot = false;
>  
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v5 2/2] remoteproc: xlnx: add crash detection mechanism
  2026-06-16 16:01   ` Mathieu Poirier
@ 2026-06-16 19:52     ` Shah, Tanmay
  0 siblings, 0 replies; 6+ messages in thread
From: Shah, Tanmay @ 2026-06-16 19:52 UTC (permalink / raw)
  To: Mathieu Poirier, Tanmay Shah; +Cc: andersson, linux-remoteproc, linux-kernel

Hello,

Please find my comments below:

On 6/16/2026 11:01 AM, Mathieu Poirier wrote:
> On Wed, Jun 10, 2026 at 11:27:11AM -0700, Tanmay Shah wrote:
>> Remote processor will report the crash reason via the resource table
>> and notify the host via mailbox notification. The host checks this
>> crash reason on every mailbox notification from the remote and report
>> to the rproc core framework. Then the rproc core framework will start
>> the recovery process.
>>
>> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
>> ---
>>
>> changes in v5
>>   - use local variable to access crash report pointer
>>   - End crash report string with '\0' without relying on fw
>>
>>  drivers/remoteproc/xlnx_r5_remoteproc.c | 73 ++++++++++++++++++++++++-
>>  1 file changed, 72 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c b/drivers/remoteproc/xlnx_r5_remoteproc.c
>> index 3349d1877751..86afff9f3b40 100644
>> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
>> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
>> @@ -112,6 +112,10 @@ struct rsc_tbl_data {
>>  	const uintptr_t rsc_tbl;
>>  } __packed;
>>  
>> +enum xlnx_rproc_fw_rsc {
>> +	XLNX_RPROC_FW_CRASH_REASON = RSC_VENDOR_START,
> 
> I would call this XLNX_RPROC_FW_CRASH_REPORT
> 

Ack.

>> +};
>> +
>>  /*
>>   * Hardcoded TCM bank values. This will stay in driver to maintain backward
>>   * compatibility with device-tree that does not have TCM information.
>> @@ -131,9 +135,27 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
>>  	{0xffe30000UL, 0x30000, 0x10000UL, PD_R5_1_BTCM, "btcm1"},
>>  };
>>  
>> +#define CRASH_REASON_STR_LEN 16
>> +
>> +/**
>> + * struct xlnx_rproc_crash_report - resource to know crash status and reason
>> + *
>> + * @version: version of this resource
>> + * @crash_state: if true, the rproc is notifying crash, time to recover
>> + * @crash_reason: number to describe reason of crash
>> + * @crash_reason_str: short string description of crash reason
>> + */
>> +struct xlnx_rproc_crash_report {
>> +	u8 version;
>> +	u8 crash_state;
>> +	u8 crash_reason;
> 
> Do you need 2 variables?  Would it be possible for @crash_reason != 0 to
> indicate a crash, making @cash_state irrelevant?
> 

Actually initially I had defined crash_reason only, but when I looked at
how crash reason/type is defined in the kernel, I got idea to keep
crash_state separate than crash_reason:

https://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux.git/tree/include/linux/remoteproc.h?h=for-next#n183

enum rproc_crash_type {
	RPROC_MMUFAULT,
	RPROC_WATCHDOG,
	RPROC_FATAL_ERROR,
};

So, if crash_reason is defined like above, then it's easy to make
mistake and not define no_crash = 0. Different firmware projects can
treat crash_reason differently. I would like to keep crash_state and
crash_reason separate and not impose policy on firmware projects on how
to define crash_reason.

>> +	char crash_reason_str[CRASH_REASON_STR_LEN];
>> +} __packed;
>> +
>>  /**
>>   * struct zynqmp_r5_core - remoteproc core's internal data
>>   *
>> + * @crash_report: rproc crash state and reason
>>   * @rsc_tbl_va: resource table virtual address
>>   * @sram: Array of sram memories assigned to this core
>>   * @num_sram: number of sram for this core
>> @@ -147,6 +169,7 @@ static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
>>   * @ipi: pointer to mailbox information
>>   */
>>  struct zynqmp_r5_core {
>> +	struct xlnx_rproc_crash_report *crash_report;
>>  	void __iomem *rsc_tbl_va;
>>  	struct zynqmp_sram_bank *sram;
>>  	int num_sram;
>> @@ -204,11 +227,27 @@ static int event_notified_idr_cb(int id, void *ptr, void *data)
>>   */
>>  static void handle_event_notified(struct work_struct *work)
>>  {
>> +	struct xlnx_rproc_crash_report *report;
>> +	struct zynqmp_r5_core *r5_core;
>>  	struct mbox_info *ipi;
>>  	struct rproc *rproc;
>>  
>>  	ipi = container_of(work, struct mbox_info, mbox_work);
>>  	rproc = ipi->r5_core->rproc;
>> +	r5_core = ipi->r5_core;
>> +	report = r5_core->crash_report;
>> +
>> +	/* report crash only if expected */
>> +	if (report && report->crash_state) {
>> +		if (rproc->state == RPROC_ATTACHED || rproc->state == RPROC_RUNNING) {
>> +			report->crash_reason_str[CRASH_REASON_STR_LEN - 1] = '\0';
>> +			dev_warn(&rproc->dev, "crash reason id: %d %s\n",
>> +				 report->crash_reason, report->crash_reason_str);
>> +			rproc_report_crash(rproc, RPROC_FATAL_ERROR);
>> +			report->crash_state = false;
>> +			return;
>> +		}
>> +	}
>>  
>>  	/*
>>  	 * We only use IPI for interrupt. The RPU firmware side may or may
>> @@ -448,6 +487,13 @@ static int zynqmp_r5_rproc_stop(struct rproc *rproc)
>>  	if (ret)
>>  		dev_err(r5_core->dev, "core force power down failed\n");
>>  
>> +	/*
>> +	 * Clear attach on recovery flag during stop operation. The next state
>> +	 * of the remote processor is expected to be "Running" state. In this
>> +	 * state boot recovery method must take place over attach on recovery.
>> +	 */
>> +	test_and_clear_bit(RPROC_FEAT_ATTACH_ON_RECOVERY, rproc->features);
> 
> Why is there a need to play set/clear the RPROC_FEAT_ATTACH_ON_RECOVERY flag
> here and in zynqmp_r5_attach() and zynqmp_r5_detach()?  I think we talked about
> that before but can't remember the outcome of that conversation.
> 

The remote processor can go through following life cycle:

attach() -> stop() -> start(). When this happens, ATTACH_ON_RECOVERY is
not valid anymore. At this point, it should be BOOT_RECOVERY which is
indicated by clearing ATTACH_RECOVERY flag. That is why it is important
to clear this bit on stop().

Now in detach() it is not needed. I am just clearing the flag as part of
a cleanup sequence.

 >> +
>>  	return ret;
>>  }
>>  
>> @@ -869,6 +915,9 @@ static int zynqmp_r5_get_rsc_table_va(struct zynqmp_r5_core *r5_core)
>>  
>>  static int zynqmp_r5_attach(struct rproc *rproc)
>>  {
>> +	/* Enable attach on recovery method. Clear it during rproc stop. */
>> +	rproc_set_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY);
>> +
>>  	dev_dbg(&rproc->dev, "rproc %d attached\n", rproc->index);
>>  
>>  	return 0;
>> @@ -883,9 +932,30 @@ static int zynqmp_r5_detach(struct rproc *rproc)
>>  	 */
>>  	zynqmp_r5_rproc_kick(rproc, 0);
>>  
>> +	clear_bit(RPROC_FEAT_ATTACH_ON_RECOVERY, rproc->features);
>> +
>>  	return 0;
>>  }
>>  
>> +static int zynqmp_r5_handle_rsc(struct rproc *rproc, u32 rsc_type, void *rsc,
>> +				int offset, int avail)
>> +{
>> +	struct zynqmp_r5_core *r5_core = rproc->priv;
>> +	void *rsc_offset = (r5_core->rsc_tbl_va + offset);
>> +
> 
>         if (rsc_type != XLNX_RPROC_FW_CRASH_REASON)
>                 return RSC_IGNORED;
> 

Ack.

>> +	if (rsc_type == XLNX_RPROC_FW_CRASH_REASON) {
>> +		r5_core->crash_report = rsc_offset;
>> +		/* reset all values */
>> +		r5_core->crash_report->crash_state = false;
>> +		r5_core->crash_report->crash_reason = 0;
>> +		r5_core->crash_report->crash_reason_str[0] = '\0';
>> +	} else {
>> +		return RSC_IGNORED;
>> +	}
>> +
>> +	return RSC_HANDLED;
>> +}
>> +
>>  static const struct rproc_ops zynqmp_r5_rproc_ops = {
>>  	.prepare	= zynqmp_r5_rproc_prepare,
>>  	.unprepare	= zynqmp_r5_rproc_unprepare,
>> @@ -900,6 +970,7 @@ static const struct rproc_ops zynqmp_r5_rproc_ops = {
>>  	.get_loaded_rsc_table = zynqmp_r5_get_loaded_rsc_table,
>>  	.attach		= zynqmp_r5_attach,
>>  	.detach		= zynqmp_r5_detach,
>> +	.handle_rsc	= zynqmp_r5_handle_rsc,
>>  };
>>  
>>  /**
>> @@ -939,7 +1010,7 @@ static struct zynqmp_r5_core *zynqmp_r5_alloc_rproc_core(struct device *cdev)
>>  
>>  	rproc_coredump_set_elf_info(r5_rproc, ELFCLASS32, EM_ARM);
>>  
>> -	r5_rproc->recovery_disabled = true;
>> +	r5_rproc->recovery_disabled = false;
> 
> I'm good with the overall architecture of this set.
> 

Okay. If above comments looks good, I will send v6 soon.

> Regards,
> Mathieu
> 
>>  	r5_rproc->has_iommu = false;
>>  	r5_rproc->auto_boot = false;
>>  
>> -- 
>> 2.34.1
>>


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-06-16 19:52 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-10 18:27 [PATCH v5 0/2] remoteproc: xlnx: remote crash recovery Tanmay Shah
2026-06-10 18:27 ` [PATCH v5 1/2] remoteproc: core: full attach detach during recovery Tanmay Shah
2026-06-16 15:29   ` Mathieu Poirier
2026-06-10 18:27 ` [PATCH v5 2/2] remoteproc: xlnx: add crash detection mechanism Tanmay Shah
2026-06-16 16:01   ` Mathieu Poirier
2026-06-16 19:52     ` Shah, Tanmay

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.