[RFC PATCH 0/5] riscv: Handle synchronous hardware error exception

* [RFC PATCH 0/5] riscv: Handle synchronous hardware error exception
@ 2025-09-10  9:33 Ruidong Tian
  2025-09-10  9:33 ` [RFC PATCH 1/5] riscv: Define ioremap_cache for RISC-V Ruidong Tian
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Ruidong Tian @ 2025-09-10  9:33 UTC (permalink / raw)
  To: xueshuai, palmer, paul.walmsley, linux-riscv, linux-kernel,
	linux-acpi
  Cc: james.morse, tony.luck, cleger, hchauhan, tianruidong

Hi all,
This patch series introduces support for handling synchronous hardware errors 
on RISC-V, laying the groundwork for more robust kernel-mode error recovery.

1. Background
Hardware error reporting mechanisms typically fall into two categories: 
asynchronous and synchronous.

- Asynchronous errors (e.g., memory scrubbing errors), reported by an asynchronous
exception or an interrupt, are usually handled by the GHES subsystem. For instance,
ARM uses SDEI, and a similar SSE specification is being proposed for RISC-V.
- Synchronous errors (e.g., reading poisoned data) cause the processor core to
take a precise exception. This is known as a Synchronous External Abort (SEA)
on ARM, a Machine Check Exception (MCE) on x86, and is designated as a trap with
mcause 19 on RISC-V.
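
To make the cause-code designation concrete, here is a minimal user-space
sketch in C of how a trap cause value could be classified. It is not code
from this series: the names CAUSE_HARDWARE_ERROR, CAUSE_IRQ_FLAG,
cause_is_interrupt() and cause_is_hardware_error() are hypothetical, and the
sketch assumes RV64, where the most significant bit of the cause register
marks an interrupt. Only the value 19 is taken from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cause code 19 is the value quoted in the cover letter. */
#define CAUSE_HARDWARE_ERROR	19ULL
/* Assumption: RV64, where bit 63 of the cause register marks an interrupt. */
#define CAUSE_IRQ_FLAG		(1ULL << 63)

/* True if the cause value describes an interrupt rather than an exception. */
static bool cause_is_interrupt(uint64_t cause)
{
	return (cause & CAUSE_IRQ_FLAG) != 0;
}

/* True only for a synchronous exception whose code is 19. */
static bool cause_is_hardware_error(uint64_t cause)
{
	return !cause_is_interrupt(cause) &&
	       (cause & ~CAUSE_IRQ_FLAG) == CAUSE_HARDWARE_ERROR;
}
```

An interrupt that happens to carry number 19 is deliberately rejected, since
only the synchronous exception is of interest here.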

Discussions within the RVI PRS TG have already led to proposals[0] to UEFI for 
standardizing two notification methods, SSE and Hardware Error Exception, 
on RISC-V. 
This series focuses on implementing Hardware Error Exception notification to
handle synchronous errors. Himanshu Chauhan has already started working on SSE[1].

2. Motivation
When a synchronous hardware error occurs in kernel context (e.g., during
get_user, put_user, CoW, etc.), the kernel requires a fixup mechanism (via
extable) to recover from the error and avoid a system panic. However, the
APEI/GHES subsystem, being asynchronous, cannot directly leverage the synchronous
extable fixup path.

By handling the synchronous exception directly, we enable the use of this fixup
mechanism, allowing the kernel to gracefully recover from hardware errors
encountered during kernel execution. This brings RISC-V's error handling
capabilities closer to the robustness found on ARM[2] and x86[3].
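
For readers unfamiliar with extable fixups, the idea can be sketched in
user-space C as follows. This is a simplified illustration, not the kernel's
implementation: struct ex_entry, ex_search() and ex_fixup() are hypothetical
names, and the table stores absolute addresses, whereas the kernel's real
exception table layout differs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified exception table entry. */
struct ex_entry {
	uintptr_t insn;		/* address of an instruction that may fault */
	uintptr_t fixup;	/* address to resume at if it does */
};

/* Binary search over a table sorted by insn; NULL if pc has no entry. */
static const struct ex_entry *ex_search(const struct ex_entry *tbl,
					size_t n, uintptr_t pc)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tbl[mid].insn == pc)
			return &tbl[mid];
		if (tbl[mid].insn < pc)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;
}

/*
 * Returns 1 and redirects *pc to the fixup address if the faulting
 * instruction has an entry; returns 0 otherwise, in which case the
 * caller has no way to recover.
 */
static int ex_fixup(const struct ex_entry *tbl, size_t n, uintptr_t *pc)
{
	const struct ex_entry *e = ex_search(tbl, n, *pc);

	if (!e)
		return 0;
	*pc = e->fixup;
	return 1;
}
```

The point of the sketch is that the lookup is keyed on the faulting program
counter, which is only available when the error is taken as a precise,
synchronous exception.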

3. What This Patch Series Does
This initial series lays the foundational infrastructure. It primarily:
- Introduces a new exception handler for synchronous hardware errors (mcause=19).
- Establishes the core exception path, which is a prerequisite for kernel
  context error recovery.

Please note that this version does not yet implement the full kernel fixup logic
for recovery. That functionality is planned for the next formal version.

Some adaptations for GHES are included, based on the work from Himanshu Chauhan[1].

4. Future Plans
- Implement full kernel fixup support to handle and recover from errors in
  some kernel contexts[2].
- Add support for handling "double trap" scenarios.

5. Testing Methodology

test program: ras-tools: https://kernel.googlesource.com/pub/scm/linux/kernel/git/aegl/ras-tools/
qemu: https://github.com/winterddd/qemu
firmware: official opensbi and edk2

- Run qemu:
qemu-system-riscv64 -M virt,pflash0=pflash0,pflash1=pflash1,acpi=on,aia=aplic-imsic \
 -cpu max -m 64G -smp 64 -device virtio-gpu-pci -full-screen -device qemu-xhci \
 -device usb-kbd -device virtio-rng-pci \
 -blockdev node-name=pflash0,driver=file,read-only=on,filename=RISCV_VIRT_CODE.fd \
 -blockdev node-name=pflash1,driver=file,filename=RISCV_VIRT_VARS.fd \
 -bios fw_dynamic.bin -device virtio-net-device,netdev=net0 \
 -netdev user,id=net0,hostfwd=tcp::2223-:22 \
 -kernel Image -initrd rootfs \
 -append "rdinit=/sbin/init earlycon verbose debug strict_devmem=0 nokaslr" \
 -monitor telnet:127.0.0.1:5557,server,nowait -nographic

- Run ras-tools:
./einj_mem_uc -j -k single &
$ 0: single   vaddr = 0x7fff86ff4400 paddr = 107d11b400

- Inject poison
telnet localhost 5557
poison_enable on
poison_add 0x107d11b400

- Read poison
echo trigger > ./trigger_start
$ triggering ...
$ signal 7 code 3 addr 0x7fff86ff4400
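
For reference, on Linux "signal 7" is SIGBUS on most architectures and
"code 3" is BUS_OBJERR. A minimal sketch of how a user-space program could
capture these fields is shown below. It is not taken from ras-tools; the
names last_signo, last_code, last_addr, bus_code_name(), on_sigbus() and
install_sigbus_handler() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Fields of the most recent SIGBUS; written from signal context. */
static volatile sig_atomic_t last_signo;
static volatile sig_atomic_t last_code;
static void *volatile last_addr;

/* Map a SIGBUS si_code to a readable name. */
static const char *bus_code_name(int code)
{
	switch (code) {
	case BUS_ADRALN: return "BUS_ADRALN";
	case BUS_ADRERR: return "BUS_ADRERR";
	case BUS_OBJERR: return "BUS_OBJERR";
	default:         return "other";
	}
}

/* Record the signal number, si_code and faulting address. */
static void on_sigbus(int signo, siginfo_t *info, void *ucontext)
{
	(void)ucontext;
	last_signo = signo;
	last_code = info->si_code;
	/* si_addr is only meaningful for kernel-generated faults. */
	last_addr = info->si_addr;
}

/* Returns 0 on success and -1 on failure, as sigaction() does. */
static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_sigbus;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

Note that simply returning from the handler after a real poison consumption
would re-execute the faulting access, so a real test program would need to
leave the faulting code path some other way.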

[0]: https://lists.riscv.org/g/tech-prs/topic/risc_v_ras_related_ecrs/113685653 
[1]: https://patchew.org/linux/20250227123628.2931490-1-hchauhan@ventanamicro.com/
[2]: https://lore.kernel.org/lkml/20241209024257.3618492-1-tongtiangen@huawei.com/
[3]: https://github.com/torvalds/linux/blob/9dd1835ecda5b96ac88c166f4a87386f3e727bd9/arch/x86/kernel/cpu/mce/core.c#L1514

Himanshu Chauhan (2):
  riscv: Define ioremap_cache for RISC-V
  riscv: Define arch_apei_get_mem_attribute for RISC-V

Ruidong Tian (3):
  acpi: Introduce SSE and HEE in HEST notification types
  riscv: Introduce HEST HEE notification handlers for APEI
  riscv: Add Hardware Error Exception trap handler

 arch/riscv/Kconfig              |  1 +
 arch/riscv/include/asm/acpi.h   | 22 +++++++++++++
 arch/riscv/include/asm/fixmap.h |  6 ++++
 arch/riscv/include/asm/io.h     |  3 ++
 arch/riscv/kernel/acpi.c        | 55 +++++++++++++++++++++++++++++++
 arch/riscv/kernel/entry.S       |  4 +++
 arch/riscv/kernel/traps.c       | 19 +++++++++++
 drivers/acpi/apei/Kconfig       | 12 +++++++
 drivers/acpi/apei/ghes.c        | 58 +++++++++++++++++++++++++++++++++
 include/acpi/actbl1.h           |  4 ++-
 include/acpi/ghes.h             |  6 ++++
 11 files changed, 189 insertions(+), 1 deletion(-)

-- 
2.43.7


* [RFC PATCH 1/5] riscv: Define ioremap_cache for RISC-V
  2025-09-10  9:33 [RFC PATCH 0/5] riscv: Handle synchronous hardware error exception Ruidong Tian
@ 2025-09-10  9:33 ` Ruidong Tian
  2025-09-17 17:23   ` Paul Walmsley
  2025-09-10  9:33 ` [RFC PATCH 2/5] riscv: Define arch_apei_get_mem_attribute " Ruidong Tian
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 8+ messages in thread
From: Ruidong Tian @ 2025-09-10  9:33 UTC (permalink / raw)
  To: xueshuai, palmer, paul.walmsley, linux-riscv, linux-kernel,
	linux-acpi
  Cc: james.morse, tony.luck, cleger, hchauhan, tianruidong

From: Himanshu Chauhan <hchauhan@ventanamicro.com>

The bert and einj drivers use ioremap_cache() to map entries,
but ioremap_cache() is not defined for RISC-V. Define it.

Signed-off-by: Himanshu Chauhan <hchauhan@ventanamicro.com>
---
 arch/riscv/include/asm/io.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
index a0e51840b9db..56eca6b3031f 100644
--- a/arch/riscv/include/asm/io.h
+++ b/arch/riscv/include/asm/io.h
@@ -30,6 +30,9 @@
 #define PCI_IOBASE		((void __iomem *)PCI_IO_START)
 #endif /* CONFIG_MMU */
 
+#define ioremap_cache(addr, size)					\
+	((__force void *)ioremap_prot((addr), (size), __pgprot(_PAGE_KERNEL)))
+
 /*
  * Emulation routines for the port-mapped IO space used by some PCI drivers.
  * These are defined as being "fully synchronous", but also "not guaranteed to
-- 
2.43.7


* [RFC PATCH 2/5] riscv: Define arch_apei_get_mem_attribute for RISC-V
  2025-09-10  9:33 [RFC PATCH 0/5] riscv: Handle synchronous hardware error exception Ruidong Tian
  2025-09-10  9:33 ` [RFC PATCH 1/5] riscv: Define ioremap_cache for RISC-V Ruidong Tian
@ 2025-09-10  9:33 ` Ruidong Tian
  2025-09-10  9:33 ` [RFC PATCH 3/5] acpi: Introduce SSE and HEE in HEST notification types Ruidong Tian
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Ruidong Tian @ 2025-09-10  9:33 UTC (permalink / raw)
  To: xueshuai, palmer, paul.walmsley, linux-riscv, linux-kernel,
	linux-acpi
  Cc: james.morse, tony.luck, cleger, hchauhan, tianruidong

From: Himanshu Chauhan <hchauhan@ventanamicro.com>

The ghes_map() function uses arch_apei_get_mem_attribute() to get the
protection bits for a given physical address. These protection
bits are then used to map the physical address.

Signed-off-by: Himanshu Chauhan <hchauhan@ventanamicro.com>
---
 arch/riscv/include/asm/acpi.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/riscv/include/asm/acpi.h b/arch/riscv/include/asm/acpi.h
index 6e13695120bc..0c599452ef48 100644
--- a/arch/riscv/include/asm/acpi.h
+++ b/arch/riscv/include/asm/acpi.h
@@ -27,6 +27,26 @@ extern int acpi_disabled;
 extern int acpi_noirq;
 extern int acpi_pci_disabled;
 
+#ifdef	CONFIG_ACPI_APEI
+/*
+ * acpi_disable_cmcff is used in drivers/acpi/apei/hest.c for disabling
+ * IA-32 Architecture Corrected Machine Check (CMC) Firmware-First mode
+ * with a kernel command line parameter "acpi=nocmcff". But we don't
+ * have this IA-32 specific feature on RISC-V, this definition is only
+ * for compatibility.
+ */
+#define acpi_disable_cmcff 1
+static inline pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr)
+{
+	/*
+	 * Until we have a way to look for EFI memory attributes.
+	 */
+	return PAGE_KERNEL;
+}
+#else /* CONFIG_ACPI_APEI */
+#define acpi_disable_cmcff 0
+#endif /* !CONFIG_ACPI_APEI */
+
 static inline void disable_acpi(void)
 {
 	acpi_disabled = 1;
-- 
2.43.7


* [RFC PATCH 3/5] acpi: Introduce SSE and HEE in HEST notification types
  2025-09-10  9:33 [RFC PATCH 0/5] riscv: Handle synchronous hardware error exception Ruidong Tian
  2025-09-10  9:33 ` [RFC PATCH 1/5] riscv: Define ioremap_cache for RISC-V Ruidong Tian
  2025-09-10  9:33 ` [RFC PATCH 2/5] riscv: Define arch_apei_get_mem_attribute " Ruidong Tian
@ 2025-09-10  9:33 ` Ruidong Tian
  2025-09-10  9:33 ` [RFC PATCH 4/5] riscv: Introduce HEST HEE notification handlers for APEI Ruidong Tian
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Ruidong Tian @ 2025-09-10  9:33 UTC (permalink / raw)
  To: xueshuai, palmer, paul.walmsley, linux-riscv, linux-kernel,
	linux-acpi
  Cc: james.morse, tony.luck, cleger, hchauhan, tianruidong

Introduce two new HEST notification types for the RISC-V Hardware
Error Exception and SSE. The GHES entry's notification structure
contains the notification to be used for a given error source.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
---
 include/acpi/actbl1.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h
index 99fd1588ff38..0f04ef10f510 100644
--- a/include/acpi/actbl1.h
+++ b/include/acpi/actbl1.h
@@ -1534,7 +1534,9 @@ enum acpi_hest_notify_types {
 	ACPI_HEST_NOTIFY_SEI = 9,	/* ACPI 6.1 */
 	ACPI_HEST_NOTIFY_GSIV = 10,	/* ACPI 6.1 */
 	ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED = 11,	/* ACPI 6.2 */
-	ACPI_HEST_NOTIFY_RESERVED = 12	/* 12 and greater are reserved */
+	ACPI_HEST_NOTIFY_SSE = 12, /* RISCV SSE */
+	ACPI_HEST_NOTIFY_HEE = 13, /* RISCV Hardware Error Exception */
+	ACPI_HEST_NOTIFY_RESERVED = 14	/* 14 and greater are reserved */
 };
 
 /* Values for config_write_enable bitfield above */
-- 
2.43.7


* [RFC PATCH 4/5] riscv: Introduce HEST HEE notification handlers for APEI
  2025-09-10  9:33 [RFC PATCH 0/5] riscv: Handle synchronous hardware error exception Ruidong Tian
                   ` (2 preceding siblings ...)
  2025-09-10  9:33 ` [RFC PATCH 3/5] acpi: Introduce SSE and HEE in HEST notification types Ruidong Tian
@ 2025-09-10  9:33 ` Ruidong Tian
  2025-09-10  9:33 ` [RFC PATCH 5/5] riscv: Add Hardware Error Exception trap handler Ruidong Tian
  2025-09-10 17:20 ` [RFC PATCH 0/5] riscv: Handle synchronous hardware error exception Anup Patel
  5 siblings, 0 replies; 8+ messages in thread
From: Ruidong Tian @ 2025-09-10  9:33 UTC (permalink / raw)
  To: xueshuai, palmer, paul.walmsley, linux-riscv, linux-kernel,
	linux-acpi
  Cc: james.morse, tony.luck, cleger, hchauhan, tianruidong

Add functions to register a ghes entry with HEE, allowing the OS
to receive hardware error notifications from firmware through
standardized ACPI interfaces.
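
The registration and notification flow can be pictured with the following
user-space sketch. It is a simplified model rather than the code in this
patch: struct err_source, hee_sources, hee_add(), hee_remove() and
hee_notify() are hypothetical stand-ins, and the locking and RCU handling
that the real list needs are omitted.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one firmware error source. */
struct err_source {
	int source_id;
	int has_record;			/* firmware left an error record */
	struct err_source *next;
};

/* Hypothetical stand-in for the list of HEE-notified sources. */
static struct err_source *hee_sources;

/* Register a source so that later notifications poll it. */
static void hee_add(struct err_source *src)
{
	src->next = hee_sources;
	hee_sources = src;
}

/* Unregister a source; does nothing if it was never added. */
static void hee_remove(struct err_source *src)
{
	struct err_source **pp = &hee_sources;

	while (*pp && *pp != src)
		pp = &(*pp)->next;
	if (*pp)
		*pp = src->next;
}

/*
 * Poll every registered source and consume any pending record.
 * Returns 0 if at least one source had a record, -ENOENT otherwise.
 */
static int hee_notify(void)
{
	struct err_source *s;
	int ret = -ENOENT;

	for (s = hee_sources; s; s = s->next) {
		if (s->has_record) {
			s->has_record = 0;
			ret = 0;
		}
	}
	return ret;
}
```

The return convention mirrors the comment on ghes_notify_hee() in this
patch: 0 only if some registered source actually reported a record.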

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
---
 arch/riscv/Kconfig              |  1 +
 arch/riscv/include/asm/fixmap.h |  6 ++++
 drivers/acpi/apei/Kconfig       | 12 +++++++
 drivers/acpi/apei/ghes.c        | 58 +++++++++++++++++++++++++++++++++
 include/acpi/ghes.h             |  6 ++++
 5 files changed, 83 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index a4b233a0659e..b085e172b355 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -23,6 +23,7 @@ config RISCV
 	select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
 	select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+	select HAVE_ACPI_APEI if (ACPI && EFI)
 	select ARCH_HAS_BINFMT_FLAT
 	select ARCH_HAS_CURRENT_STACK_POINTER
 	select ARCH_HAS_DEBUG_VIRTUAL if MMU
diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
index 0a55099bb734..07421edc9daa 100644
--- a/arch/riscv/include/asm/fixmap.h
+++ b/arch/riscv/include/asm/fixmap.h
@@ -38,6 +38,12 @@ enum fixed_addresses {
 	FIX_TEXT_POKE0,
 	FIX_EARLYCON_MEM_BASE,
 
+#ifdef CONFIG_ACPI_APEI_HEE
+	/* Used for GHES mapping from assorted contexts */
+	FIX_APEI_GHES_IRQ,
+	FIX_APEI_GHES_HEE,
+#endif /* CONFIG_ACPI_APEI_HEE */
+
 	__end_of_permanent_fixed_addresses,
 	/*
 	 * Temporary boot-time mappings, used by early_ioremap(),
diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
index 070c07d68dfb..d54a295cfc8d 100644
--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -46,6 +46,18 @@ config ACPI_APEI_SEA
 	depends on ARM64 && ACPI_APEI_GHES
 	default y
 
+config ACPI_APEI_HEE
+	bool "APEI Hardware Error Exception support"
+	depends on RISCV && ACPI_APEI_GHES
+	default y
+	help
+	  Enable support for RISC-V Hardware Error Exception (HEE) notification
+	  in ACPI Platform Error Interface (APEI). This allows firmware
+	  to report hardware errors through RISC-V exception mechanism.
+
+	  Say Y if you want to support firmware-first error handling
+	  on RISC-V platforms with ACPI.
+
 config ACPI_APEI_MEMORY_FAILURE
 	bool "APEI memory error recovering support"
 	depends on ACPI_APEI && MEMORY_FAILURE
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a0d54993edb3..1011e28091dc 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -97,6 +97,11 @@
 #define FIX_APEI_GHES_SDEI_CRITICAL	__end_of_fixed_addresses
 #endif
 
+#if !defined(CONFIG_X86) && !defined(CONFIG_ARM64)
+#define FIX_APEI_GHES_NMI		__end_of_fixed_addresses
+#define FIX_APEI_GHES_SEA		__end_of_fixed_addresses
+#endif
+
 static ATOMIC_NOTIFIER_HEAD(ghes_report_chain);
 
 static inline bool is_hest_type_generic_v2(struct ghes *ghes)
@@ -1415,6 +1420,45 @@ static inline void ghes_sea_add(struct ghes *ghes) { }
 static inline void ghes_sea_remove(struct ghes *ghes) { }
 #endif /* CONFIG_ACPI_APEI_SEA */
 
+#ifdef CONFIG_ACPI_APEI_HEE
+static LIST_HEAD(ghes_hee);
+
+/*
+ * Return 0 only if one of the HEE error sources successfully reported an error
+ * record sent from the firmware.
+ */
+int ghes_notify_hee(void)
+{
+	static DEFINE_RAW_SPINLOCK(ghes_notify_lock_hee);
+	int rv;
+
+	raw_spin_lock(&ghes_notify_lock_hee);
+	rv = ghes_in_nmi_spool_from_list(&ghes_hee, FIX_APEI_GHES_HEE);
+	raw_spin_unlock(&ghes_notify_lock_hee);
+
+	return rv;
+}
+EXPORT_SYMBOL_GPL(ghes_notify_hee);
+
+static void ghes_hee_add(struct ghes *ghes)
+{
+	mutex_lock(&ghes_list_mutex);
+	list_add_rcu(&ghes->list, &ghes_hee);
+	mutex_unlock(&ghes_list_mutex);
+}
+
+static void ghes_hee_remove(struct ghes *ghes)
+{
+	mutex_lock(&ghes_list_mutex);
+	list_del_rcu(&ghes->list);
+	mutex_unlock(&ghes_list_mutex);
+	synchronize_rcu();
+}
+#else /* CONFIG_ACPI_APEI_HEE */
+static inline void ghes_hee_add(struct ghes *ghes) { }
+static inline void ghes_hee_remove(struct ghes *ghes) { }
+#endif /* CONFIG_ACPI_APEI_HEE */
+
 #ifdef CONFIG_HAVE_ACPI_APEI_NMI
 /*
  * NMI may be triggered on any CPU, so ghes_in_nmi is used for
@@ -1558,6 +1602,14 @@ static int ghes_probe(struct platform_device *ghes_dev)
 			goto err;
 		}
 		break;
+	case ACPI_HEST_NOTIFY_HEE:
+		if (!IS_ENABLED(CONFIG_ACPI_APEI_HEE)) {
+			pr_warn(GHES_PFX "Generic hardware error source: %d notified via HEE is not supported\n",
+				generic->header.source_id);
+			rc = -ENOTSUPP;
+			goto err;
+		}
+		break;
 	case ACPI_HEST_NOTIFY_NMI:
 		if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_NMI)) {
 			pr_warn(GHES_PFX "Generic hardware error source: %d notified via NMI interrupt is not supported!\n",
@@ -1631,6 +1683,9 @@ static int ghes_probe(struct platform_device *ghes_dev)
 	case ACPI_HEST_NOTIFY_SEA:
 		ghes_sea_add(ghes);
 		break;
+	case ACPI_HEST_NOTIFY_HEE:
+		ghes_hee_add(ghes);
+		break;
 	case ACPI_HEST_NOTIFY_NMI:
 		ghes_nmi_add(ghes);
 		break;
@@ -1698,6 +1753,9 @@ static void ghes_remove(struct platform_device *ghes_dev)
 	case ACPI_HEST_NOTIFY_SEA:
 		ghes_sea_remove(ghes);
 		break;
+	case ACPI_HEST_NOTIFY_HEE:
+		ghes_hee_remove(ghes);
+		break;
 	case ACPI_HEST_NOTIFY_NMI:
 		ghes_nmi_remove(ghes);
 		break;
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index ebd21b05fe6e..8046e1b30c21 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -127,6 +127,12 @@ int ghes_notify_sea(void);
 static inline int ghes_notify_sea(void) { return -ENOENT; }
 #endif
 
+#ifdef CONFIG_ACPI_APEI_HEE
+int ghes_notify_hee(void);
+#else
+static inline int ghes_notify_hee(void) { return -ENOENT; }
+#endif
+
 struct notifier_block;
 extern void ghes_register_report_chain(struct notifier_block *nb);
 extern void ghes_unregister_report_chain(struct notifier_block *nb);
-- 
2.43.7


* [RFC PATCH 5/5] riscv: Add Hardware Error Exception trap handler
  2025-09-10  9:33 [RFC PATCH 0/5] riscv: Handle synchronous hardware error exception Ruidong Tian
                   ` (3 preceding siblings ...)
  2025-09-10  9:33 ` [RFC PATCH 4/5] riscv: Introduce HEST HEE notification handlers for APEI Ruidong Tian
@ 2025-09-10  9:33 ` Ruidong Tian
  2025-09-10 17:20 ` [RFC PATCH 0/5] riscv: Handle synchronous hardware error exception Anup Patel
  5 siblings, 0 replies; 8+ messages in thread
From: Ruidong Tian @ 2025-09-10  9:33 UTC (permalink / raw)
  To: xueshuai, palmer, paul.walmsley, linux-riscv, linux-kernel,
	linux-acpi
  Cc: james.morse, tony.luck, cleger, hchauhan, tianruidong

Implement the Hardware Error Exception trap handler for RISC-V architecture
synchronous hardware error handling. This enables the OS to receive
hardware error notifications from firmware through the standardized ACPI
HEST (Hardware Error Source Table) interface.

The implementation includes:
- A new exception vector entry for Hardware Error Exception
- A trap handler (do_trap_hardware_error) that processes hardware errors
  in both kernel mode (panics for now) and user mode (SIGBUS)
- Integration with APEI GHES (Generic Hardware Error Source) to report
  hardware errors from firmware

This change enables RISC-V systems with ACPI to handle synchronous
hardware errors in a firmware-first manner.
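
The decision made by the trap handler in this patch can be summarised by the
sketch below. It is a user-space model only: enum hee_outcome and
hee_decide() are hypothetical names, and claim_result stands for the return
value of apei_claim_hee(), where 0 means the firmware-first path claimed the
error.

```c
#include <stdbool.h>

/* Hypothetical outcomes of a Hardware Error Exception. */
enum hee_outcome {
	HEE_CLAIMED,	/* firmware-first path handled the error */
	HEE_SIGBUS,	/* user task is sent SIGBUS */
	HEE_PANIC,	/* kernel-mode error, no recovery yet */
};

/*
 * from_user:    the trap was taken from user mode
 * claim_result: 0 if the error was claimed, non-zero otherwise
 */
static enum hee_outcome hee_decide(bool from_user, int claim_result)
{
	if (!from_user)
		return HEE_PANIC;
	return claim_result == 0 ? HEE_CLAIMED : HEE_SIGBUS;
}
```

Any non-zero claim result from user mode ends in SIGBUS, which matches the
handler's use of the return value as a plain boolean.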

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
---
 arch/riscv/include/asm/acpi.h |  2 ++
 arch/riscv/kernel/acpi.c      | 55 +++++++++++++++++++++++++++++++++++
 arch/riscv/kernel/entry.S     |  4 +++
 arch/riscv/kernel/traps.c     | 19 ++++++++++++
 4 files changed, 80 insertions(+)

diff --git a/arch/riscv/include/asm/acpi.h b/arch/riscv/include/asm/acpi.h
index 0c599452ef48..ae861885b97d 100644
--- a/arch/riscv/include/asm/acpi.h
+++ b/arch/riscv/include/asm/acpi.h
@@ -91,6 +91,7 @@ int acpi_get_riscv_isa(struct acpi_table_header *table,
 
 void acpi_get_cbo_block_size(struct acpi_table_header *table, u32 *cbom_size,
 			     u32 *cboz_size, u32 *cbop_size);
+int apei_claim_hee(struct pt_regs *regs);
 #else
 static inline void acpi_init_rintc_map(void) { }
 static inline struct acpi_madt_rintc *acpi_cpu_get_madt_rintc(int cpu)
@@ -108,6 +109,7 @@ static inline void acpi_get_cbo_block_size(struct acpi_table_header *table,
 					   u32 *cbom_size, u32 *cboz_size,
 					   u32 *cbop_size) { }
 
+static inline int apei_claim_hee(struct pt_regs *regs) { return -ENOENT; }
 #endif /* CONFIG_ACPI */
 
 #ifdef CONFIG_ACPI_NUMA
diff --git a/arch/riscv/kernel/acpi.c b/arch/riscv/kernel/acpi.c
index 3f6d5a6789e8..928f9474bfee 100644
--- a/arch/riscv/kernel/acpi.c
+++ b/arch/riscv/kernel/acpi.c
@@ -20,6 +20,11 @@
 #include <linux/of_fdt.h>
 #include <linux/pci.h>
 #include <linux/serial_core.h>
+#include <linux/efi.h>
+#include <linux/irq_work.h>
+#include <linux/nmi.h>
+#include <acpi/ghes.h>
+#include <asm/csr.h>
 
 int acpi_noirq = 1;		/* skip ACPI IRQ initialization */
 int acpi_disabled = 1;
@@ -334,3 +339,53 @@ int raw_pci_write(unsigned int domain, unsigned int bus,
 }
 
 #endif	/* CONFIG_PCI */
+
+/*
+ * Claim Hardware Error Exception as a firmware first notification.
+ *
+ * Used by RISC-V exception handler for hardware error processing.
+ * @regs may be NULL when called from process context.
+ */
+int apei_claim_hee(struct pt_regs *regs)
+{
+	int err = -ENOENT;
+	bool return_to_irqs_enabled;
+	unsigned long flags;
+
+	if (!IS_ENABLED(CONFIG_ACPI_APEI_GHES))
+		return err;
+
+	/* Mask interrupts and remember whether they were enabled before */
+	local_irq_save(flags);
+	return_to_irqs_enabled = !irqs_disabled_flags(flags);
+
+	if (regs)
+		return_to_irqs_enabled = (regs->status & SR_SIE) != 0;
+
+	/*
+	 * HEE can interrupt other operations, handle as NMI-like context
+	 * to ensure proper APEI processing
+	 */
+	nmi_enter();
+	err = ghes_notify_hee();
+	nmi_exit();
+
+	/*
+	 * APEI NMI-like notifications are deferred to irq_work. Unless
+	 * we interrupted irqs-masked code, we can do that now.
+	 */
+	if (!err) {
+		if (return_to_irqs_enabled) {
+			local_irq_restore(flags);
+			irq_work_run();
+		} else {
+			pr_warn_ratelimited("APEI work queued but not completed\n");
+			err = -EINPROGRESS;
+		}
+	} else {
+		local_irq_restore(flags);
+	}
+
+	return err;
+}
+EXPORT_SYMBOL(apei_claim_hee);
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 3a0ec6fd5956..1cbefe934d84 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -459,6 +459,10 @@ SYM_DATA_START_LOCAL(excp_vect_table)
 	RISCV_PTR do_page_fault   /* load page fault */
 	RISCV_PTR do_trap_unknown
 	RISCV_PTR do_page_fault   /* store page fault */
+	RISCV_PTR do_trap_unknown
+	RISCV_PTR do_trap_unknown
+	RISCV_PTR do_trap_unknown
+	RISCV_PTR do_trap_hardware_error /* Hardware Error */
 SYM_DATA_END_LABEL(excp_vect_table, SYM_L_LOCAL, excp_vect_table_end)
 
 #ifndef CONFIG_MMU
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 80230de167de..48f1ea1e03e6 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -22,6 +22,7 @@
 #include <linux/irq.h>
 #include <linux/kexec.h>
 #include <linux/entry-common.h>
+#include <linux/acpi.h>
 
 #include <asm/asm-prototypes.h>
 #include <asm/bug.h>
@@ -442,3 +443,21 @@ asmlinkage void handle_bad_stack(struct pt_regs *regs)
 		wait_for_interrupt();
 }
 #endif
+
+asmlinkage __visible __trap_section void do_trap_hardware_error(struct pt_regs *regs)
+{
+	if (user_mode(regs)) {
+		irqentry_enter_from_user_mode(regs);
+
+		if (apei_claim_hee(regs))
+			do_trap_error(regs, SIGBUS, BUS_OBJERR, regs->badaddr, "Hardware Error");
+
+		irqentry_exit_to_user_mode(regs);
+	} else {
+		irqentry_state_t state = irqentry_nmi_enter(regs);
+
+		die(regs, "Hardware Error");
+
+		irqentry_nmi_exit(regs, state);
+	}
+}
-- 
2.43.7


* Re: [RFC PATCH 0/5] riscv: Handle synchronous hardware error exception
  2025-09-10  9:33 [RFC PATCH 0/5] riscv: Handle synchronous hardware error exception Ruidong Tian
                   ` (4 preceding siblings ...)
  2025-09-10  9:33 ` [RFC PATCH 5/5] riscv: Add Hardware Error Exception trap handler Ruidong Tian
@ 2025-09-10 17:20 ` Anup Patel
  5 siblings, 0 replies; 8+ messages in thread
From: Anup Patel @ 2025-09-10 17:20 UTC (permalink / raw)
  To: Ruidong Tian
  Cc: xueshuai, palmer, paul.walmsley, linux-riscv, linux-kernel,
	linux-acpi, james.morse, tony.luck, cleger, hchauhan

On Wed, Sep 10, 2025 at 3:04 PM Ruidong Tian
<tianruidong@linux.alibaba.com> wrote:
>
> Hi all,
> This patch series introduces support for handling synchronous hardware errors
> on RISC-V, laying the groundwork for more robust kernel-mode error recovery.
>
> 1. Background
> Hardware error reporting mechanisms typically fall into two categories:
> asynchronous and synchronous.
>
> - Asynchronous errors (e.g., memory scrubbing errors), reported by an asynchronous
> exception or an interrupt, are usually handled by the GHES subsystem. For instance,
> ARM uses SDEI, and a similar SSE specification is being proposed for RISC-V.
> - Synchronous errors (e.g., reading poisoned data) cause the processor core to
> take a precise exception. This is known as a Synchronous External Abort (SEA)
> on ARM, a Machine Check Exception (MCE) on x86, and is designated as trap with
> mcause 19 on RISC-V.
>
> Discussions within the RVI PRS TG have already led to proposals[0] to UEFI for
> standardizing two notification methods, SSE and Hardware Error Exception,
> on RISC-V.
> This series focuses on implementing Hardware Error Exception notification to
> handle synchronous errors. Himanshu Chauhan has already started working on SSE[1].
>
> 2. Motivation
> When a synchronous hardware error occurs in kernel context (e.g., during
> get_user, put_user, CoW, etc.), the kernel requires a fixup mechanism (via
> extable) to recover from the error and avoid a system panic. However, the
> APEI/GHES subsystem, being asynchronous, cannot directly leverage the synchronous
> extable fixup path.
>
> By handling the synchronous exception directly, we enable the use of this fixup
> mechanism, allowing the kernel to gracefully recover from hardware errors
> encountered during kernel execution. This brings RISC-V's error handling
> capabilities closer to the robustness found on ARM[2] and x86[3].
>
> 3. What This Patch Series Does
> This initial series lays the foundational infrastructure. It primarily:
> - Introduces a new exception handler for synchronous hardware errors (mcause=19).
> - Establishes the core exception path, which is a prerequisite for kernel
>   context error recovery.
>
> Please note that this version does not yet implement the full kernel fixup logic
> for recovery. That functionality is planned for the next formal version.
>
> Some adaptations for GHES are included, based on the work from Himanshu Chauhan[1]
>
> 4. Future Plans
> - Implement full kernel fixup support to handle and recover from errors in
>   some kernel context[2].
> - Add support for handling "double trap" scenarios.
>
> 5. Testing Methodology
>
> test program: ras-tools: https://kernel.googlesource.com/pub/scm/linux/kernel/git/aegl/ras-tools/
> qemu: https://github.com/winterddd/qemu
> firmware: official opensbi and edk2
>
> - Run qemu:
> qemu-system-riscv64 -M virt,pflash0=pflash0,pflash1=pflash1,acpi=on,aia=aplic-imsic
>  -cpu max -m 64G -smp 64 -device virtio-gpu-pci -full-screen -device qemu-xhci
>  -device usb-kbd -device virtio-rng-pci
>  -blockdev node-name=pflash0,driver=file,read-only=on,filename=RISCV_VIRT_CODE.fd
>  -blockdev node-name=pflash1,driver=file,filename=RISCV_VIRT_VARS.fd
>  -bios fw_dynamic.bin -device virtio-net-device,netdev=net0
>  -netdev user,id=net0,hostfwd=tcp::2223-:22
>  -kernel Image -initrd rootfs
>  -append "rdinit=/sbin/init earlycon verbose debug strict_devmem=0 nokaslr"
>  -monitor telnet:127.0.0.1:5557,server,nowait -nographic
>
> - Run ras-tools:
> ./einj_mem_uc -j -k single &
> $ 0: single   vaddr = 0x7fff86ff4400 paddr = 107d11b400
>
> - Inject poison
> telnet localhost 5557
> poison_enable on
> poison_add 0x107d11b400
>
> - Read poison
> echo trigger > ./trigger_start
> $ triggering ...
> $ signal 7 code 3 addr 0x7fff86ff4400
>
> [0]: https://lists.riscv.org/g/tech-prs/topic/risc_v_ras_related_ecrs/113685653
> [1]: https://patchew.org/linux/20250227123628.2931490-1-hchauhan@ventanamicro.com/
> [2]: https://lore.kernel.org/lkml/20241209024257.3618492-1-tongtiangen@huawei.com/
> [3]: https://github.com/torvalds/linux/blob/9dd1835ecda5b96ac88c166f4a87386f3e727bd9/arch/x86/kernel/cpu/mce/core.c#L1514
>
> Himanshu Chauhan (2):
>   riscv: Define ioremap_cache for RISC-V
>   riscv: Define arch_apei_get_mem_attribute for RISC-V
>
> Ruidong Tian (3):
>   acpi: Introduce SSE and HEE in HEST notification types
>   riscv: Introduce HEST HEE notification handlers for APEI
>   riscv: Add Hardware Error Exception trap handler
>

Himanshu had already sent out RFC v1 way back in Feb 2025 [1] which
did not receive any comments or feedback.

Instead of sending out a half-baked series, it will be helpful if you
can review Himanshu's series.

Regards,
Anup

[1] https://patchew.org/linux/20250227123628.2931490-1-hchauhan@ventanamicro.com/

* Re: [RFC PATCH 1/5] riscv: Define ioremap_cache for RISC-V
  2025-09-10  9:33 ` [RFC PATCH 1/5] riscv: Define ioremap_cache for RISC-V Ruidong Tian
@ 2025-09-17 17:23   ` Paul Walmsley
  0 siblings, 0 replies; 8+ messages in thread
From: Paul Walmsley @ 2025-09-17 17:23 UTC (permalink / raw)
  To: Ruidong Tian, hchauhan
  Cc: xueshuai, palmer, paul.walmsley, linux-riscv, linux-kernel,
	linux-acpi, james.morse, tony.luck, cleger

On Wed, 10 Sep 2025, Ruidong Tian wrote:

> From: Himanshu Chauhan <hchauhan@ventanamicro.com>
> 
> bert and einj drivers use ioremap_cache for mapping entries
> but ioremap_cache is not defined for RISC-V.
> 
> Signed-off-by: Himanshu Chauhan <hchauhan@ventanamicro.com>

Looks like nothing should be using ioremap_cache() at all:

  https://lore.kernel.org/linux-riscv/YzQ6pqykLhJVeD2p@infradead.org/#t

Probably best just to fix the ACPI drivers?


- Paul

