[RESEND PATCH v18 0/2] ACPI: APEI: handle synchronous errors in task work

* [RESEND PATCH v18 0/2] ACPI: APEI: handle synchronous errors in task work
@ 2025-04-04 11:20 Shuai Xue
  2025-04-04 11:20 ` [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered Shuai Xue
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Shuai Xue @ 2025-04-04 11:20 UTC
  To: catalin.marinas, sudeep.holla, guohanjun, lpieralisi, linux-acpi,
	yazen.ghannam, mark.rutland, mingo, robin.murphy,
	Jonathan.Cameron, bp, rafael, linux-arm-kernel, wangkefeng.wang,
	tanxiaofei, mawupeng1, tony.luck, linmiaohe, naoya.horiguchi,
	james.morse, tongtiangen, gregkh, will, jarkko
  Cc: linux-mm, linux-kernel, akpm, linux-edac, x86, xueshuai,
	justin.he, ardb, ying.huang, ashish.kalra, baolin.wang, tglx,
	dave.hansen, lenb, hpa, robert.moore, lvying6, xiexiuqi,
	zhuo.song

From Catalin:

> James Morse is listed as reviewer of the ACPI APEI code but he's busy
> with resctrl/MPAM. 

These two patches have gone through 18 iterations of review and have
collected 11 'Reviewed-by' tags in total, but they have not yet been merged
into mainline. I am requesting further review and an ack from the arm64
ACPI maintainers: Lorenzo, Sudeep, and Hanjun. Thank you for your attention
and assistance.

No code changes since the last v18 posting:
- drop a mm/hwpoison patch which is merged into mainline

changes since v17:
- rebase to Linux 6.13-rc7 with no functional changes
- add reviewed-by tag for patch 1-3 from Jane Chu
- add reviewed-by tag for patch 3 from Yazen

changes since v16:
- add reviewed-by tag for patch 1 and patch 2 from Yazen
- rewrite warning message for force kill (per Yazen)
- warn with dev_err in ghes (per Jarkko)
- add return value -ENXIO in memory_failure comments  (per Yazen)
- Link: https://lore.kernel.org/lkml/20241104015430.98599-1-xueshuai@linux.alibaba.com/

changes since v15:
- add HW_ERR and GHES_PFX prefix per Yazen

changes since v14:
- add reviewed-by tags from Jarkko and Jonathan
- remove local variable and use twcb->pfn

changes since v13:
- add reviewed-by tag from Jarkko
- rename task_work to ghes_task_work (per Jarkko)

changes since v12:
- tweak error message for force kill (per Jarkko)
- fix comments style (per Jarkko)
- fix commit log typo (per Jarkko)

changes since v11:
- rebase to Linux 6.11-rc6
- fix grammar and typo in commit log (per Borislav)
- remove `sync_` prefix of `sync_task_work` (per Borislav)
- comments flags and description of `task_work`  (per Borislav)

changes since v10:
- rebase to v6.8-rc2

changes since v9:
- split patch 2 to address exactly one issue in one patch (per Borislav)
- rewrite commit log according to template (per Borislav)
- pickup reviewed-by tag of patch 1 from James Morse
- alloc and free twcb through gen_pool_{alloc,free}() (per James)
- rewrite cover letter

changes since v8:
- remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
- remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
- rewrite the return value comments of memory_failure (per Naoya Horiguchi)

changes since v7:
- rebase to Linux v6.6-rc2 (no code changed)
- rewritten the cover letter to explain the motivation of this patchset

changes since v6:
- add more explicit error message suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure() 
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang 

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure error such as invalid PA,
unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/

## Cover Letter

There are two major types of uncorrected recoverable (UCR) errors:

- Synchronous error: The error is detected and raised at the point of
  consumption in the execution flow, e.g. when a CPU tries to access
  a poisoned cache line. The CPU takes a synchronous error exception
  such as Synchronous External Abort (SEA) on Arm64 or Machine Check
  Exception (MCE) on x86. The OS is required to take action (for example,
  offline the failing page or kill the failing thread) to recover from
  this uncorrectable error.

- Asynchronous error: The error is detected outside of processor execution
  context, e.g. when an error is detected by a background scrubber. Some
  data in memory is corrupted, but the data has not been consumed yet. The
  OS may optionally take action to recover from this uncorrectable error.

Currently, both synchronous and asynchronous errors use
memory_failure_queue() to schedule memory_failure() to execute in kworker
context. As a result, when a user-space process accesses poisoned
data, a data abort is taken and memory_failure() is executed in
kworker context, where it:

  - sends the wrong si_code with the SIGBUS signal in early_kill mode, and
  - cannot kill the user-space process in some cases, resulting in a
    synchronous error infinite loop

Issue 1: send wrong si_code in early_kill mode

Since commit a70297d22132 ("ACPI: APEI: set memory failure flags as
MF_ACTION_REQUIRED on synchronous events"), the flag MF_ACTION_REQUIRED
can be used to determine whether a synchronous exception occurred on
the ARM64 platform. When a synchronous exception is detected, the kernel is
expected to terminate the current process, which has accessed the poisoned
page. This is done by sending a SIGBUS signal with the error code
BUS_MCEERR_AR, indicating an action-required machine check error on
read.

However, when kill_proc() is called to terminate the processes that have
the poisoned page mapped, it sends the incorrect SIGBUS error code
BUS_MCEERR_AO because the context in which it operates is not the one
where the error was triggered.

To reproduce this problem:

  # STEP1: enable early kill mode
  #sysctl -w vm.memory_failure_early_kill=1
  vm.memory_failure_early_kill = 1

  # STEP2: inject an UCE error and consume it to trigger a synchronous error
  #einj_mem_uc single
  0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
  injecting ...
  triggering ...
  signal 7 code 5 addr 0xffffb0d75000
  page not present
  Test passed

The si_code (code 5) from einj_mem_uc indicates a BUS_MCEERR_AO error,
which is wrong for a synchronously consumed error.

To fix it, queue memory_failure() as a task_work so that it runs in
the context of the process that is actually consuming the poisoned data.

After this patch set:

  # STEP1: enable early kill mode
  #sysctl -w vm.memory_failure_early_kill=1
  vm.memory_failure_early_kill = 1

  # STEP2: inject an UCE error and consume it to trigger a synchronous error
  #einj_mem_uc single
  0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
  injecting ...
  triggering ...
  signal 7 code 4 addr 0xffffb0d75000
  page not present
  Test passed

The si_code (code 4) from einj_mem_uc indicates a BUS_MCEERR_AR error,
as expected.
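
For readers who want to observe the si_code difference from user space, the
following is a minimal sketch of a SIGBUS handler that decodes it. It is not
part of this patch set and is not the einj_mem_uc source; the helper names
sigbus_code_name() and install_sigbus_handler() are hypothetical. The
fallback values 4 and 5 match the "code 4" and "code 5" lines in the
transcripts above.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Fallbacks for libc headers that lack these; values match the Linux UAPI. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical helper: map a SIGBUS si_code to a readable name. */
static const char *sigbus_code_name(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return "BUS_MCEERR_AR (action required)";
	case BUS_MCEERR_AO:
		return "BUS_MCEERR_AO (action optional)";
	default:
		return "other";
	}
}

/* Report the decoded si_code on stderr, then exit. */
static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	const char *name = sigbus_code_name(info->si_code);

	(void)sig;
	(void)ucontext;
	/* write() is async-signal-safe, unlike printf(). */
	if (write(STDERR_FILENO, name, strlen(name)) < 0 ||
	    write(STDERR_FILENO, "\n", 1) < 0)
		_exit(2);
	_exit(1);
}

/* Hypothetical helper: install the handler; returns 0 on success. */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;	/* needed to receive siginfo_t */
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

A test program would call install_sigbus_handler() before touching the
injected address; with this series applied, a synchronously consumed error
should then be reported as BUS_MCEERR_AR.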

Issue 2: a synchronous error infinite loop due to memory_failure() failure

If a user-space process, e.g. devmem, accesses a poisoned page for which
the HWPoison flag is already set, kill_accessing_process() is called to send
SIGBUS to the current process with error info. Because memory_failure() is
executed in kworker context, it does nothing but return
EFAULT. So devmem accesses the poisoned page and triggers an
exception again, resulting in a synchronous error infinite loop. Such a
loop may cause platform firmware to exceed some threshold and reboot
when Linux could have recovered from this error.

To reproduce this problem:

  # STEP 1: inject an UCE error, and kernel will set HWPoison flag for related page
  #einj_mem_uc single
  0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
  injecting ...
  triggering ...
  signal 7 code 4 addr 0xffffb0d75000
  page not present
  Test passed

  # STEP 2: access the same page and it will trigger a synchronous error infinite loop
  devmem 0x4092d55b400

To fix it, if memory_failure() fails, force kill the current process.
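
The return-value check that memory_failure_cb() performs in patch 2 can be
modelled in user space as below. This is only a sketch of the decision, not
kernel code: needs_force_kill() is a hypothetical name, and the EHWPOISON
fallback of 133 is the asm-generic value, which differs on some
architectures.

```c
#include <errno.h>
#include <stdbool.h>

/* Fallback for libc headers without EHWPOISON (asm-generic value). */
#ifndef EHWPOISON
#define EHWPOISON 133
#endif

/*
 * Hypothetical helper modelling the check in memory_failure_cb():
 * success, an already-poisoned page, and an unsupported page type are
 * treated as handled; any other return value leads to a forced SIGBUS.
 */
bool needs_force_kill(int ret)
{
	if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
		return false;

	return true;
}
```

For example, needs_force_kill(-EFAULT) is true, while needs_force_kill(0)
and needs_force_kill(-EHWPOISON) are false.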

Issue 3: a synchronous error infinite loop due to no memory_failure() queued

No memory_failure() work is queued if any of the following checks bails out:

- `if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))` in ghes_handle_memory_failure()
- `if (flags == -1)` in ghes_handle_memory_failure()
- `if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))` in ghes_do_memory_failure()
- `if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr))` in ghes_do_memory_failure()

If any of these checks bails out, the user-space process will trigger SEA again.
This loop can potentially exceed the platform firmware threshold or even
trigger a kernel hard lockup, leading to a system reboot.

To fix it, if no memory_failure() work is queued, force kill the current process.
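
The combined effect of the checks above and the `sync && !queued` test added
in patch 1 can be sketched as a small user-space model. Everything here is a
stand-in for illustration: struct sync_err_model and both function names are
hypothetical, the two boolean fields replace the real kernel config and pfn
checks, and the local CPER_MEM_VALID_PA value is assumed to mirror the
kernel header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in; assumed to mirror the kernel's CPER definition. */
#define CPER_MEM_VALID_PA 0x0002

/* Hypothetical model of the inputs the GHES checks look at. */
struct sync_err_model {
	uint64_t validation_bits;	/* from the CPER memory error section */
	int flags;			/* -1 models "unexpected severity" */
	bool config_enabled;		/* CONFIG_ACPI_APEI_MEMORY_FAILURE */
	bool pfn_ok;			/* pfn valid or a platform page */
};

/* True only if none of the four early-return checks fires. */
bool memory_failure_would_be_queued(const struct sync_err_model *e)
{
	if (!(e->validation_bits & CPER_MEM_VALID_PA))
		return false;
	if (e->flags == -1)
		return false;
	if (!e->config_enabled)
		return false;
	if (!e->pfn_ok)
		return false;

	return true;
}

/* Models patch 1: force kill only for synchronous errors left unqueued. */
bool must_force_kill(bool sync, const struct sync_err_model *e)
{
	return sync && !memory_failure_would_be_queued(e);
}
```

In this model an asynchronous error never leads to a force kill, which
matches the intent of doing so only for abnormal synchronous errors.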

Handling of memory errors triggered in kernel mode [5] also relies on this
patchset to kill the failing thread.

Lv Ying and XiuQi from Huawei also proposed patches addressing a similar
problem [2][4]. Thanks to them for the discussion.

[1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/20200512030609.19593-1-gengdongjiu@huawei.com/
[2] https://lore.kernel.org/lkml/20221205115111.131568-3-lvying6@huawei.com/
[3] https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com
[4] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@huawei.com/
[5] https://patchwork.kernel.org/project/linux-arm-kernel/cover/20240528085915.1955987-1-tongtiangen@huawei.com/

Shuai Xue (2):
  ACPI: APEI: send SIGBUS to current task if synchronous memory error
    not recovered
  ACPI: APEI: handle synchronous exceptions in task work

 drivers/acpi/apei/ghes.c | 88 +++++++++++++++++++++++++---------------
 include/acpi/ghes.h      |  3 --
 include/linux/mm.h       |  1 -
 mm/memory-failure.c      | 13 ------
 4 files changed, 55 insertions(+), 50 deletions(-)

-- 
2.39.3

* [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-04-04 11:20 [RESEND PATCH v18 0/2] ACPI: APEI: handle synchronous errors in task work Shuai Xue
@ 2025-04-04 11:20 ` Shuai Xue
  2025-04-14 14:37   ` Hanjun Guo
  2025-04-04 11:20 ` [RESEND PATCH v18 2/2] ACPI: APEI: handle synchronous exceptions in task work Shuai Xue
  2025-04-08  2:34 ` [RESEND PATCH v18 0/2] ACPI: APEI: handle synchronous errors " Hanjun Guo
  2 siblings, 1 reply; 22+ messages in thread
From: Shuai Xue @ 2025-04-04 11:20 UTC
  To: catalin.marinas, sudeep.holla, guohanjun, lpieralisi, linux-acpi,
	yazen.ghannam, mark.rutland, mingo, robin.murphy,
	Jonathan.Cameron, bp, rafael, linux-arm-kernel, wangkefeng.wang,
	tanxiaofei, mawupeng1, tony.luck, linmiaohe, naoya.horiguchi,
	james.morse, tongtiangen, gregkh, will, jarkko
  Cc: linux-mm, linux-kernel, akpm, linux-edac, x86, xueshuai,
	justin.he, ardb, ying.huang, ashish.kalra, baolin.wang, tglx,
	dave.hansen, lenb, hpa, robert.moore, lvying6, xiexiuqi,
	zhuo.song

A synchronous error is detected when a user-space process accesses
a 2-bit uncorrected error. The CPU takes a synchronous error exception
such as Synchronous External Abort (SEA) on Arm64. The kernel queues a
memory_failure() work which poisons the related page, unmaps the page, and
then sends a SIGBUS to the process, so that a system wide panic can be
avoided.

However, no memory_failure() work is queued when abnormal synchronous
errors occur. These include situations such as an invalid PA,
unexpected severity, no memory failure config support, an invalid GUID
section, etc. In such cases, the user-space process will trigger SEA again.
This loop can potentially exceed the platform firmware threshold or even
trigger a kernel hard lockup, leading to a system reboot.

Fix it by performing a force kill if no memory_failure() work is queued
for synchronous errors.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Reviewed-by: Jane Chu <jane.chu@oracle.com>
---
 drivers/acpi/apei/ghes.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index b72772494655..50e4d924aa8b 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -799,6 +799,17 @@ static bool ghes_do_proc(struct ghes *ghes,
 		}
 	}

+	/*
+	 * If no memory failure work is queued for abnormal synchronous
+	 * errors, do a force kill.
+	 */
+	if (sync && !queued) {
+		dev_err(ghes->dev,
+			HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
+			current->comm, task_pid_nr(current));
+		force_sig(SIGBUS);
+	}
+
 	return queued;
 }

-- 
2.39.3

* [RESEND PATCH v18 2/2] ACPI: APEI: handle synchronous exceptions in task work
  2025-04-04 11:20 [RESEND PATCH v18 0/2] ACPI: APEI: handle synchronous errors in task work Shuai Xue
  2025-04-04 11:20 ` [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered Shuai Xue
@ 2025-04-04 11:20 ` Shuai Xue
  2025-04-14 14:48   ` Hanjun Guo
  2025-04-08  2:34 ` [RESEND PATCH v18 0/2] ACPI: APEI: handle synchronous errors " Hanjun Guo
  2 siblings, 1 reply; 22+ messages in thread
From: Shuai Xue @ 2025-04-04 11:20 UTC
  To: catalin.marinas, sudeep.holla, guohanjun, lpieralisi, linux-acpi,
	yazen.ghannam, mark.rutland, mingo, robin.murphy,
	Jonathan.Cameron, bp, rafael, linux-arm-kernel, wangkefeng.wang,
	tanxiaofei, mawupeng1, tony.luck, linmiaohe, naoya.horiguchi,
	james.morse, tongtiangen, gregkh, will, jarkko
  Cc: linux-mm, linux-kernel, akpm, linux-edac, x86, xueshuai,
	justin.he, ardb, ying.huang, ashish.kalra, baolin.wang, tglx,
	dave.hansen, lenb, hpa, robert.moore, lvying6, xiexiuqi,
	zhuo.song

An uncorrected memory error can be signaled by an asynchronous interrupt
(specifically, an SPI on the arm64 platform), e.g. when an error is detected
by a background scrubber, or by a synchronous exception
(specifically, a data abort exception on the arm64 platform), e.g. when a CPU
tries to access a poisoned cache line. Currently, both synchronous and
asynchronous errors use memory_failure_queue() to schedule
memory_failure() to execute in a kworker context.

As a result, when a user-space process accesses poisoned data, a
data abort is taken and memory_failure() is executed in the kworker
context, where it:

  - sends the wrong si_code with the SIGBUS signal in early_kill mode, and
  - cannot kill the user-space process in some cases, resulting in a
    synchronous error infinite loop

Issue 1: send wrong si_code in early_kill mode

Since commit a70297d22132 ("ACPI: APEI: set memory failure flags as
MF_ACTION_REQUIRED on synchronous events"), the flag MF_ACTION_REQUIRED
can be used to determine whether a synchronous exception occurred on
the ARM64 platform. When a synchronous exception is detected, the kernel is
expected to terminate the current process, which has accessed the poisoned
page. This is done by sending a SIGBUS signal with the error code
BUS_MCEERR_AR, indicating an action-required machine check error on
read.

However, when kill_proc() is called to terminate the processes that have
the poisoned page mapped, it sends the incorrect SIGBUS error code
BUS_MCEERR_AO because the context in which it operates is not the one
where the error was triggered.

To reproduce this problem:

  # STEP1: enable early kill mode
  #sysctl -w vm.memory_failure_early_kill=1
  vm.memory_failure_early_kill = 1

  # STEP2: inject an UCE error and consume it to trigger a synchronous error
  #einj_mem_uc single
  0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
  injecting ...
  triggering ...
  signal 7 code 5 addr 0xffffb0d75000
  page not present
  Test passed

The si_code (code 5) from einj_mem_uc indicates a BUS_MCEERR_AO error,
which is wrong for a synchronously consumed error.

After this patch:

  # STEP1: enable early kill mode
  #sysctl -w vm.memory_failure_early_kill=1
  vm.memory_failure_early_kill = 1
  # STEP2: inject an UCE error and consume it to trigger a synchronous error
  #einj_mem_uc single
  0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
  injecting ...
  triggering ...
  signal 7 code 4 addr 0xffffb0d75000
  page not present
  Test passed

The si_code (code 4) from einj_mem_uc indicates a BUS_MCEERR_AR error,
as expected.

Issue 2: a synchronous error infinite loop

If a user-space process, e.g. devmem, accesses a poisoned page for which
the HWPoison flag is set, kill_accessing_process() is called to send
SIGBUS to the current process with error info. Because memory_failure()
is executed in the kworker context, it does nothing but return
EFAULT. So devmem accesses the poisoned page and triggers an
exception again, resulting in a synchronous error infinite loop. Such an
exception loop may cause platform firmware to exceed some threshold and
reboot when Linux could have recovered from this error.

To reproduce this problem:

  # STEP 1: inject an UCE error, and kernel will set HWPoison flag for related page
  #einj_mem_uc single
  0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
  injecting ...
  triggering ...
  signal 7 code 4 addr 0xffffb0d75000
  page not present
  Test passed

  # STEP 2: access the same page and it will trigger a synchronous error infinite loop
  devmem 0x4092d55b400

To fix the above two issues, queue memory_failure() as a task_work so that
it runs in the context of the process that is actually consuming the
poisoned data.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Tested-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
 drivers/acpi/apei/ghes.c | 79 +++++++++++++++++++++++-----------------
 include/acpi/ghes.h      |  3 --
 include/linux/mm.h       |  1 -
 mm/memory-failure.c      | 13 -------
 4 files changed, 45 insertions(+), 51 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 50e4d924aa8b..87cf4b373ebe 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -464,28 +464,41 @@ static void ghes_clear_estatus(struct ghes *ghes,
 		ghes_ack_error(ghes->generic_v2);
 }
 
-/*
- * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+/**
+ * struct ghes_task_work - for synchronous RAS event
+ *
+ * @twork:                callback_head for task work
+ * @pfn:                  page frame number of corrupted page
+ * @flags:                work control flags
+ *
+ * Structure to pass task work to be handled before
+ * returning to user-space via task_work_add().
  */
-static void ghes_kick_task_work(struct callback_head *head)
+struct ghes_task_work {
+	struct callback_head twork;
+	u64 pfn;
+	int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
 {
-	struct acpi_hest_generic_status *estatus;
-	struct ghes_estatus_node *estatus_node;
-	u32 node_len;
+	struct ghes_task_work *twcb = container_of(twork, struct ghes_task_work, twork);
+	int ret;
 
-	estatus_node = container_of(head, struct ghes_estatus_node, task_work);
-	if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
-		memory_failure_queue_kick(estatus_node->task_work_cpu);
+	ret = memory_failure(twcb->pfn, twcb->flags);
+	gen_pool_free(ghes_estatus_pool, (unsigned long)twcb, sizeof(*twcb));
 
-	estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
-	node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
-	gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
+	if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
+		return;
+
+	pr_err("%#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
+			twcb->pfn, current->comm, task_pid_nr(current));
+	force_sig(SIGBUS);
 }
 
 static bool ghes_do_memory_failure(u64 physical_addr, int flags)
 {
+	struct ghes_task_work *twcb;
 	unsigned long pfn;
 
 	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
@@ -499,6 +512,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
 		return false;
 	}
 
+	if (flags == MF_ACTION_REQUIRED && current->mm) {
+		twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
+		if (!twcb)
+			return false;
+
+		twcb->pfn = pfn;
+		twcb->flags = flags;
+		init_task_work(&twcb->twork, memory_failure_cb);
+		task_work_add(current, &twcb->twork, TWA_RESUME);
+		return true;
+	}
+
 	memory_failure_queue(pfn, flags);
 	return true;
 }
@@ -743,7 +768,7 @@ int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
 
-static bool ghes_do_proc(struct ghes *ghes,
+static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
 	int sev, sec_sev;
@@ -809,8 +834,6 @@ static bool ghes_do_proc(struct ghes *ghes,
 			current->comm, task_pid_nr(current));
 		force_sig(SIGBUS);
 	}
-
-	return queued;
 }
 
 static void __ghes_print_estatus(const char *pfx,
@@ -1114,9 +1137,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 	struct ghes_estatus_node *estatus_node;
 	struct acpi_hest_generic *generic;
 	struct acpi_hest_generic_status *estatus;
-	bool task_work_pending;
 	u32 len, node_len;
-	int ret;
 
 	llnode = llist_del_all(&ghes_estatus_llist);
 	/*
@@ -1131,25 +1152,16 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
 		len = cper_estatus_len(estatus);
 		node_len = GHES_ESTATUS_NODE_LEN(len);
-		task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+
+		ghes_do_proc(estatus_node->ghes, estatus);
+
 		if (!ghes_estatus_cached(estatus)) {
 			generic = estatus_node->generic;
 			if (ghes_print_estatus(NULL, generic, estatus))
 				ghes_estatus_cache_add(generic, estatus);
 		}
-
-		if (task_work_pending && current->mm) {
-			estatus_node->task_work.func = ghes_kick_task_work;
-			estatus_node->task_work_cpu = smp_processor_id();
-			ret = task_work_add(current, &estatus_node->task_work,
-					    TWA_RESUME);
-			if (ret)
-				estatus_node->task_work.func = NULL;
-		}
-
-		if (!estatus_node->task_work.func)
-			gen_pool_free(ghes_estatus_pool,
-				      (unsigned long)estatus_node, node_len);
+		gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+			      node_len);
 
 		llnode = next;
 	}
@@ -1210,7 +1222,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
 
 	estatus_node->ghes = ghes;
 	estatus_node->generic = ghes->generic;
-	estatus_node->task_work.func = NULL;
 	estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
 
 	if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index be1dd4c1a917..ebd21b05fe6e 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -35,9 +35,6 @@ struct ghes_estatus_node {
 	struct llist_node llnode;
 	struct acpi_hest_generic *generic;
 	struct ghes *ghes;
-
-	int task_work_cpu;
-	struct callback_head task_work;
 };
 
 struct ghes_estatus_cache {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8483e09aeb2c..327517bf2168 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3933,7 +3933,6 @@ enum mf_flags {
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		      unsigned long count, int mf_flags);
 extern int memory_failure(unsigned long pfn, int flags);
-extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
 extern atomic_long_t num_poisoned_pages __read_mostly;
 extern int soft_offline_page(unsigned long pfn, int flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 327e02fdc029..ad07f673608d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2494,19 +2494,6 @@ static void memory_failure_work_func(struct work_struct *work)
 	}
 }
 
-/*
- * Process memory_failure work queued on the specified CPU.
- * Used to avoid return-to-userspace racing with the memory_failure workqueue.
- */
-void memory_failure_queue_kick(int cpu)
-{
-	struct memory_failure_cpu *mf_cpu;
-
-	mf_cpu = &per_cpu(memory_failure_cpu, cpu);
-	cancel_work_sync(&mf_cpu->work);
-	memory_failure_work_func(&mf_cpu->work);
-}
-
 static int __init memory_failure_init(void)
 {
 	struct memory_failure_cpu *mf_cpu;
-- 
2.39.3


* Re: [RESEND PATCH v18 0/2] ACPI: APEI: handle synchronous errors in task work
  2025-04-04 11:20 [RESEND PATCH v18 0/2] ACPI: APEI: handle synchronous errors in task work Shuai Xue
  2025-04-04 11:20 ` [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered Shuai Xue
  2025-04-04 11:20 ` [RESEND PATCH v18 2/2] ACPI: APEI: handle synchronous exceptions in task work Shuai Xue
@ 2025-04-08  2:34 ` Hanjun Guo
  2 siblings, 0 replies; 22+ messages in thread
From: Hanjun Guo @ 2025-04-08  2:34 UTC
  To: Shuai Xue, catalin.marinas, sudeep.holla, lpieralisi, linux-acpi,
	yazen.ghannam, mark.rutland, mingo, robin.murphy,
	Jonathan.Cameron, bp, rafael, linux-arm-kernel, wangkefeng.wang,
	tanxiaofei, mawupeng1, tony.luck, linmiaohe, naoya.horiguchi,
	james.morse, tongtiangen, gregkh, will, jarkko
  Cc: linux-mm, linux-kernel, akpm, linux-edac, x86, justin.he, ardb,
	ying.huang, ashish.kalra, baolin.wang, tglx, dave.hansen, lenb,
	hpa, robert.moore, lvying6, xiexiuqi, zhuo.song

Hi Shuai Xue,

On 2025/4/4 19:20, Shuai Xue wrote:
> From Catalin:
> 
>> James Morse is listed as reviewer of the ACPI APEI code but he's busy
>> with resctrl/MPAM.
> 
> These two patches have undergone 18 iterations of review and have received
> 11 'Reviewed-by' tags in total, but they have not yet been merged into the
> mainline. I am requesting further review and ack from the arm64
> ACPI maintainers: Lorenzo, Sudeep, and Hanjun. Thank you for your attention
> and assistance.

I will do a detailed review this week.

Thanks
Hanjun

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-04-04 11:20 ` [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered Shuai Xue
@ 2025-04-14 14:37   ` Hanjun Guo
  2025-04-14 15:02     ` Shuai Xue
  0 siblings, 1 reply; 22+ messages in thread
From: Hanjun Guo @ 2025-04-14 14:37 UTC
  To: Shuai Xue, catalin.marinas, sudeep.holla, lpieralisi, linux-acpi,
	yazen.ghannam, mark.rutland, mingo, robin.murphy,
	Jonathan.Cameron, bp, rafael, linux-arm-kernel, wangkefeng.wang,
	tanxiaofei, mawupeng1, tony.luck, linmiaohe, naoya.horiguchi,
	james.morse, tongtiangen, gregkh, will, jarkko
  Cc: linux-mm, linux-kernel, akpm, linux-edac, x86, justin.he, ardb,
	ying.huang, ashish.kalra, baolin.wang, tglx, dave.hansen, lenb,
	hpa, robert.moore, lvying6, xiexiuqi, zhuo.song

On 2025/4/4 19:20, Shuai Xue wrote:
> Synchronous error was detected as a result of user-space process accessing
> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> memory_failure() work which poisons the related page, unmaps the page, and
> then sends a SIGBUS to the process, so that a system wide panic can be
> avoided.
> 
> However, no memory_failure() work will be queued when abnormal synchronous
> errors occur. These errors can include situations such as invalid PA,
> unexpected severity, no memory failure config support, invalid GUID
> section, etc. In such case, the user-space process will trigger SEA again.
> This loop can potentially exceed the platform firmware threshold or even
> trigger a kernel hard lockup, leading to a system reboot.
> 
> Fix it by performing a force kill if no memory_failure() work is queued
> for synchronous errors.
> 
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
> Reviewed-by: Jane Chu <jane.chu@oracle.com>
> ---
>   drivers/acpi/apei/ghes.c | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index b72772494655..50e4d924aa8b 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -799,6 +799,17 @@ static bool ghes_do_proc(struct ghes *ghes,
>   		}
>   	}
>   
> +	/*
> +	 * If no memory failure work is queued for abnormal synchronous
> +	 * errors, do a force kill.
> +	 */
> +	if (sync && !queued) {
> +		dev_err(ghes->dev,
> +			HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
> +			current->comm, task_pid_nr(current));
> +		force_sig(SIGBUS);
> +	}

I think it's reasonable to force kill the task when the
synchronous memory error is not recovered.

But I hope this code will not trigger issues with legacy firmware.
Let's be careful here: can we introduce arch-specific
callbacks for this?

Thanks
Hanjun

* Re: [RESEND PATCH v18 2/2] ACPI: APEI: handle synchronous exceptions in task work
  2025-04-04 11:20 ` [RESEND PATCH v18 2/2] ACPI: APEI: handle synchronous exceptions in task work Shuai Xue
@ 2025-04-14 14:48   ` Hanjun Guo
  2025-04-14 14:56     ` Shuai Xue
  0 siblings, 1 reply; 22+ messages in thread
From: Hanjun Guo @ 2025-04-14 14:48 UTC
  To: Shuai Xue, catalin.marinas, sudeep.holla, lpieralisi, linux-acpi,
	yazen.ghannam, mark.rutland, mingo, robin.murphy,
	Jonathan.Cameron, bp, rafael, linux-arm-kernel, wangkefeng.wang,
	tanxiaofei, mawupeng1, tony.luck, linmiaohe, naoya.horiguchi,
	james.morse, tongtiangen, gregkh, will, jarkko
  Cc: linux-mm, linux-kernel, akpm, linux-edac, x86, justin.he, ardb,
	ying.huang, ashish.kalra, baolin.wang, tglx, dave.hansen, lenb,
	hpa, robert.moore, lvying6, xiexiuqi, zhuo.song

On 2025/4/4 19:20, Shuai Xue wrote:
> The memory uncorrected error could be signaled by asynchronous interrupt
> (specifically, SPI in arm64 platform), e.g. when an error is detected by
> a background scrubber, or signaled by synchronous exception
> (specifically, data abort exception in arm64 platform), e.g. when a CPU
> tries to access a poisoned cache line. Currently, both synchronous and
> asynchronous errors use memory_failure_queue() to schedule
> memory_failure() to execute in a kworker context.
> 
> As a result, when a user-space process accesses poisoned data, a
> data abort is taken and memory_failure() is executed in the kworker
> context. In that case, memory_failure():
> 
>    - will send the wrong si_code with the SIGBUS signal in early_kill mode, and
>    - cannot kill the user-space process in some cases, resulting in a
>      synchronous error infinite loop
> 
> Issue 1: send wrong si_code in early_kill mode
> 
> Since commit a70297d22132 ("ACPI: APEI: set memory failure flags as
> MF_ACTION_REQUIRED on synchronous events"), the flag MF_ACTION_REQUIRED
> could be used to determine whether a synchronous exception occurs on
> ARM64 platform.  When a synchronous exception is detected, the kernel is
> expected to terminate the current process which has accessed poisoned
> page. This is done by sending a SIGBUS signal with an error code
> BUS_MCEERR_AR, indicating an action-required machine check error on
> read.
> 
> However, when kill_proc() is called to terminate the processes that have
> the poisoned page mapped, it sends the incorrect SIGBUS error code
> BUS_MCEERR_AO because the context in which it operates is not the one
> where the error was triggered.
> 
> To reproduce this problem:
> 
>    # STEP1: enable early kill mode
>    #sysctl -w vm.memory_failure_early_kill=1
>    vm.memory_failure_early_kill = 1
> 
>    # STEP2: inject an UCE error and consume it to trigger a synchronous error
>    #einj_mem_uc single
>    0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
>    injecting ...
>    triggering ...
>    signal 7 code 5 addr 0xffffb0d75000
>    page not present
>    Test passed
> 
> The si_code (code 5) from einj_mem_uc indicates a BUS_MCEERR_AO error,
> which is incorrect for a synchronous error.
> 
> After this patch:
> 
>    # STEP1: enable early kill mode
>    #sysctl -w vm.memory_failure_early_kill=1
>    vm.memory_failure_early_kill = 1
>    # STEP2: inject an UCE error and consume it to trigger a synchronous error
>    #einj_mem_uc single
>    0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
>    injecting ...
>    triggering ...
>    signal 7 code 4 addr 0xffffb0d75000
>    page not present
>    Test passed
> 
> The si_code (code 4) from einj_mem_uc indicates that it is a BUS_MCEERR_AR
> error as we expected.
> 
> Issue 2: a synchronous error infinite loop
> 
> If a user-space process, e.g. devmem, accesses a poisoned page for which
> the HWPoison flag is set, kill_accessing_process() is called to send
> SIGBUS to current process with error info. Because the memory_failure()
> is executed in the kworker context, it will just do nothing but return
> EFAULT. So, devmem will access the poisoned page and trigger an
> exception again, resulting in a synchronous error infinite loop. Such
> exception loop may cause platform firmware to exceed some threshold and
> reboot when Linux could have recovered from this error.
> 
> To reproduce this problem:
> 
>    # STEP 1: inject an UCE error, and kernel will set HWPoison flag for related page
>    #einj_mem_uc single
>    0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
>    injecting ...
>    triggering ...
>    signal 7 code 4 addr 0xffffb0d75000
>    page not present
>    Test passed
> 
>    # STEP 2: access the same page and it will trigger a synchronous error infinite loop
>    devmem 0x4092d55b400
> 
> To fix the above two issues, queue memory_failure() as a task_work so that
> it runs in the context of the process that is actually consuming the
> poisoned data.
> 
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> Tested-by: Ma Wupeng <mawupeng1@huawei.com>
> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Jane Chu <jane.chu@oracle.com>
> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
> ---
>   drivers/acpi/apei/ghes.c | 79 +++++++++++++++++++++++-----------------
>   include/acpi/ghes.h      |  3 --
>   include/linux/mm.h       |  1 -
>   mm/memory-failure.c      | 13 -------
>   4 files changed, 45 insertions(+), 51 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 50e4d924aa8b..87cf4b373ebe 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -464,28 +464,41 @@ static void ghes_clear_estatus(struct ghes *ghes,
>   		ghes_ack_error(ghes->generic_v2);
>   }
>   
> -/*
> - * Called as task_work before returning to user-space.
> - * Ensure any queued work has been done before we return to the context that
> - * triggered the notification.
> +/**
> + * struct ghes_task_work - for synchronous RAS event
> + *
> + * @twork:                callback_head for task work
> + * @pfn:                  page frame number of corrupted page
> + * @flags:                work control flags
> + *
> + * Structure to pass task work to be handled before
> + * returning to user-space via task_work_add().
>    */
> -static void ghes_kick_task_work(struct callback_head *head)
> +struct ghes_task_work {
> +	struct callback_head twork;
> +	u64 pfn;
> +	int flags;
> +};
> +
> +static void memory_failure_cb(struct callback_head *twork)
>   {
> -	struct acpi_hest_generic_status *estatus;
> -	struct ghes_estatus_node *estatus_node;
> -	u32 node_len;
> +	struct ghes_task_work *twcb = container_of(twork, struct ghes_task_work, twork);
> +	int ret;
>   
> -	estatus_node = container_of(head, struct ghes_estatus_node, task_work);
> -	if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> -		memory_failure_queue_kick(estatus_node->task_work_cpu);
> +	ret = memory_failure(twcb->pfn, twcb->flags);
> +	gen_pool_free(ghes_estatus_pool, (unsigned long)twcb, sizeof(*twcb));
>   
> -	estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> -	node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
> -	gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
> +	if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
> +		return;
> +
> +	pr_err("%#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> +			twcb->pfn, current->comm, task_pid_nr(current));
> +	force_sig(SIGBUS);
>   }
>   
>   static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>   {
> +	struct ghes_task_work *twcb;
>   	unsigned long pfn;
>   
>   	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> @@ -499,6 +512,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>   		return false;
>   	}
>   
> +	if (flags == MF_ACTION_REQUIRED && current->mm) {
> +		twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
> +		if (!twcb)
> +			return false;
> +
> +		twcb->pfn = pfn;
> +		twcb->flags = flags;
> +		init_task_work(&twcb->twork, memory_failure_cb);
> +		task_work_add(current, &twcb->twork, TWA_RESUME);
> +		return true;
> +	}
> +
>   	memory_failure_queue(pfn, flags);
>   	return true;
>   }
> @@ -743,7 +768,7 @@ int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
>   
> -static bool ghes_do_proc(struct ghes *ghes,
> +static void ghes_do_proc(struct ghes *ghes,
>   			 const struct acpi_hest_generic_status *estatus)
>   {
>   	int sev, sec_sev;
> @@ -809,8 +834,6 @@ static bool ghes_do_proc(struct ghes *ghes,
>   			current->comm, task_pid_nr(current));
>   		force_sig(SIGBUS);
>   	}
> -
> -	return queued;
>   }
>   
>   static void __ghes_print_estatus(const char *pfx,
> @@ -1114,9 +1137,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>   	struct ghes_estatus_node *estatus_node;
>   	struct acpi_hest_generic *generic;
>   	struct acpi_hest_generic_status *estatus;
> -	bool task_work_pending;
>   	u32 len, node_len;
> -	int ret;
>   
>   	llnode = llist_del_all(&ghes_estatus_llist);
>   	/*
> @@ -1131,25 +1152,16 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>   		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>   		len = cper_estatus_len(estatus);
>   		node_len = GHES_ESTATUS_NODE_LEN(len);
> -		task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
> +
> +		ghes_do_proc(estatus_node->ghes, estatus);
> +
>   		if (!ghes_estatus_cached(estatus)) {
>   			generic = estatus_node->generic;
>   			if (ghes_print_estatus(NULL, generic, estatus))
>   				ghes_estatus_cache_add(generic, estatus);
>   		}
> -
> -		if (task_work_pending && current->mm) {
> -			estatus_node->task_work.func = ghes_kick_task_work;
> -			estatus_node->task_work_cpu = smp_processor_id();
> -			ret = task_work_add(current, &estatus_node->task_work,
> -					    TWA_RESUME);
> -			if (ret)
> -				estatus_node->task_work.func = NULL;
> -		}
> -
> -		if (!estatus_node->task_work.func)
> -			gen_pool_free(ghes_estatus_pool,
> -				      (unsigned long)estatus_node, node_len);
> +		gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
> +			      node_len);
>   
>   		llnode = next;
>   	}
> @@ -1210,7 +1222,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>   
>   	estatus_node->ghes = ghes;
>   	estatus_node->generic = ghes->generic;
> -	estatus_node->task_work.func = NULL;
>   	estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>   
>   	if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index be1dd4c1a917..ebd21b05fe6e 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
>   	struct llist_node llnode;
>   	struct acpi_hest_generic *generic;
>   	struct ghes *ghes;
> -
> -	int task_work_cpu;
> -	struct callback_head task_work;
>   };
>   
>   struct ghes_estatus_cache {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8483e09aeb2c..327517bf2168 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3933,7 +3933,6 @@ enum mf_flags {
>   int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>   		      unsigned long count, int mf_flags);
>   extern int memory_failure(unsigned long pfn, int flags);
> -extern void memory_failure_queue_kick(int cpu);
>   extern int unpoison_memory(unsigned long pfn);
>   extern atomic_long_t num_poisoned_pages __read_mostly;
>   extern int soft_offline_page(unsigned long pfn, int flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 327e02fdc029..ad07f673608d 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2494,19 +2494,6 @@ static void memory_failure_work_func(struct work_struct *work)
>   	}
>   }
>   
> -/*
> - * Process memory_failure work queued on the specified CPU.
> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
> - */
> -void memory_failure_queue_kick(int cpu)
> -{
> -	struct memory_failure_cpu *mf_cpu;
> -
> -	mf_cpu = &per_cpu(memory_failure_cpu, cpu);
> -	cancel_work_sync(&mf_cpu->work);
> -	memory_failure_work_func(&mf_cpu->work);
> -}
> -
>   static int __init memory_failure_init(void)
>   {
>   	struct memory_failure_cpu *mf_cpu;

Looks good to me,

Reviewed-by: Hanjun Guo <guohanjun@huawei.com>

Thanks
Hanjun

^ permalink raw reply	[flat|nested] 22+ messages in thread
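
For reference, the "signal 7 code N" lines in the reproduction logs above
are SIGBUS (signal 7 on arm64 and x86 Linux) with si_code 4
(BUS_MCEERR_AR) or 5 (BUS_MCEERR_AO). A minimal userspace sketch of how a
test program might decode them is shown here. The helper names
sigbus_code_name(), sigbus_handler() and install_sigbus_handler() are
hypothetical and are not taken from einj_mem_uc or from the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Fallbacks in case the libc headers do not expose these names;
 * the numbers match the Linux UAPI si_code values for SIGBUS. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4	/* action required: error consumed synchronously */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5	/* action optional: error reported asynchronously */
#endif

/* Hypothetical helper: map a SIGBUS si_code to a printable name. */
const char *sigbus_code_name(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return "BUS_MCEERR_AR";
	case BUS_MCEERR_AO:
		return "BUS_MCEERR_AO";
	default:
		return "other";
	}
}

/* Prints the decoded si_code with write() and exits the process. */
void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	const char *name = sigbus_code_name(info->si_code);
	ssize_t n;

	(void)sig;
	(void)ucontext;
	n = write(STDERR_FILENO, name, strlen(name));
	n = write(STDERR_FILENO, "\n", 1);
	(void)n;
	_exit(1);
}

/* Installs the handler with SA_SIGINFO so that si_code is delivered.
 * Returns 0 on success, -1 on failure (as sigaction() does). */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

A program that calls install_sigbus_handler() before touching the injected
address would print BUS_MCEERR_AR when the kernel reports the error as
action-required, which is the behaviour the commit message expects after
the patch.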

* Re: [RESEND PATCH v18 2/2] ACPI: APEI: handle synchronous exceptions in task work
  2025-04-14 14:48   ` Hanjun Guo
@ 2025-04-14 14:56     ` Shuai Xue
  0 siblings, 0 replies; 22+ messages in thread
From: Shuai Xue @ 2025-04-14 14:56 UTC (permalink / raw)
  To: Hanjun Guo, catalin.marinas, sudeep.holla, lpieralisi, linux-acpi,
	yazen.ghannam, mark.rutland, mingo, robin.murphy,
	Jonathan.Cameron, bp, rafael, linux-arm-kernel, wangkefeng.wang,
	tanxiaofei, mawupeng1, tony.luck, linmiaohe, naoya.horiguchi,
	james.morse, tongtiangen, gregkh, will, jarkko
  Cc: linux-mm, linux-kernel, akpm, linux-edac, x86, justin.he, ardb,
	ying.huang, ashish.kalra, baolin.wang, tglx, dave.hansen, lenb,
	hpa, robert.moore, lvying6, xiexiuqi, zhuo.song



On 2025/4/14 22:48, Hanjun Guo wrote:
> On 2025/4/4 19:20, Shuai Xue wrote:
>> The memory uncorrected error could be signaled by asynchronous interrupt
>> (specifically, SPI in arm64 platform), e.g. when an error is detected by
>> a background scrubber, or signaled by synchronous exception
>> (specifically, data abort exception in arm64 platform), e.g. when a CPU
>> tries to access a poisoned cache line. Currently, both synchronous and
>> asynchronous errors use memory_failure_queue() to schedule
>> memory_failure() to execute in a kworker context.
>>
>> As a result, when a user-space process accesses poisoned data, a
>> data abort is taken and memory_failure() is executed in the kworker
>> context. In that case, memory_failure():
>>
>>    - will send the wrong si_code with the SIGBUS signal in early_kill mode, and
>>    - cannot kill the user-space process in some cases, resulting in a
>>      synchronous error infinite loop
>>
>> Issue 1: send wrong si_code in early_kill mode
>>
>> Since commit a70297d22132 ("ACPI: APEI: set memory failure flags as
>> MF_ACTION_REQUIRED on synchronous events"), the flag MF_ACTION_REQUIRED
>> could be used to determine whether a synchronous exception occurs on
>> ARM64 platform.  When a synchronous exception is detected, the kernel is
>> expected to terminate the current process which has accessed poisoned
>> page. This is done by sending a SIGBUS signal with an error code
>> BUS_MCEERR_AR, indicating an action-required machine check error on
>> read.
>>
>> However, when kill_proc() is called to terminate the processes that have
>> the poisoned page mapped, it sends the incorrect SIGBUS error code
>> BUS_MCEERR_AO because the context in which it operates is not the one
>> where the error was triggered.
>>
>> To reproduce this problem:
>>
>>    # STEP1: enable early kill mode
>>    #sysctl -w vm.memory_failure_early_kill=1
>>    vm.memory_failure_early_kill = 1
>>
>>    # STEP2: inject an UCE error and consume it to trigger a synchronous error
>>    #einj_mem_uc single
>>    0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
>>    injecting ...
>>    triggering ...
>>    signal 7 code 5 addr 0xffffb0d75000
>>    page not present
>>    Test passed
>>
>> The si_code (code 5) from einj_mem_uc indicates a BUS_MCEERR_AO error,
>> which is incorrect for a synchronous error.
>>
>> After this patch:
>>
>>    # STEP1: enable early kill mode
>>    #sysctl -w vm.memory_failure_early_kill=1
>>    vm.memory_failure_early_kill = 1
>>    # STEP2: inject an UCE error and consume it to trigger a synchronous error
>>    #einj_mem_uc single
>>    0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
>>    injecting ...
>>    triggering ...
>>    signal 7 code 4 addr 0xffffb0d75000
>>    page not present
>>    Test passed
>>
>> The si_code (code 4) from einj_mem_uc indicates that it is a BUS_MCEERR_AR
>> error as we expected.
>>
>> Issue 2: a synchronous error infinite loop
>>
>> If a user-space process, e.g. devmem, accesses a poisoned page for which
>> the HWPoison flag is set, kill_accessing_process() is called to send
>> SIGBUS to current process with error info. Because the memory_failure()
>> is executed in the kworker context, it will just do nothing but return
>> EFAULT. So, devmem will access the poisoned page and trigger an
>> exception again, resulting in a synchronous error infinite loop. Such
>> exception loop may cause platform firmware to exceed some threshold and
>> reboot when Linux could have recovered from this error.
>>
>> To reproduce this problem:
>>
>>    # STEP 1: inject an UCE error, and kernel will set HWPoison flag for related page
>>    #einj_mem_uc single
>>    0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
>>    injecting ...
>>    triggering ...
>>    signal 7 code 4 addr 0xffffb0d75000
>>    page not present
>>    Test passed
>>
>>    # STEP 2: access the same page and it will trigger a synchronous error infinite loop
>>    devmem 0x4092d55b400
>>
>> To fix the above two issues, queue memory_failure() as a task_work so that
>> it runs in the context of the process that is actually consuming the
>> poisoned data.
>>
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>> Tested-by: Ma Wupeng <mawupeng1@huawei.com>
>> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Reviewed-by: Jane Chu <jane.chu@oracle.com>
>> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
>> ---
>>   drivers/acpi/apei/ghes.c | 79 +++++++++++++++++++++++-----------------
>>   include/acpi/ghes.h      |  3 --
>>   include/linux/mm.h       |  1 -
>>   mm/memory-failure.c      | 13 -------
>>   4 files changed, 45 insertions(+), 51 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 50e4d924aa8b..87cf4b373ebe 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -464,28 +464,41 @@ static void ghes_clear_estatus(struct ghes *ghes,
>>           ghes_ack_error(ghes->generic_v2);
>>   }
>> -/*
>> - * Called as task_work before returning to user-space.
>> - * Ensure any queued work has been done before we return to the context that
>> - * triggered the notification.
>> +/**
>> + * struct ghes_task_work - for synchronous RAS event
>> + *
>> + * @twork:                callback_head for task work
>> + * @pfn:                  page frame number of corrupted page
>> + * @flags:                work control flags
>> + *
>> + * Structure to pass task work to be handled before
>> + * returning to user-space via task_work_add().
>>    */
>> -static void ghes_kick_task_work(struct callback_head *head)
>> +struct ghes_task_work {
>> +    struct callback_head twork;
>> +    u64 pfn;
>> +    int flags;
>> +};
>> +
>> +static void memory_failure_cb(struct callback_head *twork)
>>   {
>> -    struct acpi_hest_generic_status *estatus;
>> -    struct ghes_estatus_node *estatus_node;
>> -    u32 node_len;
>> +    struct ghes_task_work *twcb = container_of(twork, struct ghes_task_work, twork);
>> +    int ret;
>> -    estatus_node = container_of(head, struct ghes_estatus_node, task_work);
>> -    if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>> -        memory_failure_queue_kick(estatus_node->task_work_cpu);
>> +    ret = memory_failure(twcb->pfn, twcb->flags);
>> +    gen_pool_free(ghes_estatus_pool, (unsigned long)twcb, sizeof(*twcb));
>> -    estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> -    node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
>> -    gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
>> +    if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
>> +        return;
>> +
>> +    pr_err("%#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
>> +            twcb->pfn, current->comm, task_pid_nr(current));
>> +    force_sig(SIGBUS);
>>   }
>>   static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>   {
>> +    struct ghes_task_work *twcb;
>>       unsigned long pfn;
>>       if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>> @@ -499,6 +512,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>           return false;
>>       }
>> +    if (flags == MF_ACTION_REQUIRED && current->mm) {
>> +        twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
>> +        if (!twcb)
>> +            return false;
>> +
>> +        twcb->pfn = pfn;
>> +        twcb->flags = flags;
>> +        init_task_work(&twcb->twork, memory_failure_cb);
>> +        task_work_add(current, &twcb->twork, TWA_RESUME);
>> +        return true;
>> +    }
>> +
>>       memory_failure_queue(pfn, flags);
>>       return true;
>>   }
>> @@ -743,7 +768,7 @@ int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
>>   }
>>   EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
>> -static bool ghes_do_proc(struct ghes *ghes,
>> +static void ghes_do_proc(struct ghes *ghes,
>>                const struct acpi_hest_generic_status *estatus)
>>   {
>>       int sev, sec_sev;
>> @@ -809,8 +834,6 @@ static bool ghes_do_proc(struct ghes *ghes,
>>               current->comm, task_pid_nr(current));
>>           force_sig(SIGBUS);
>>       }
>> -
>> -    return queued;
>>   }
>>   static void __ghes_print_estatus(const char *pfx,
>> @@ -1114,9 +1137,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>       struct ghes_estatus_node *estatus_node;
>>       struct acpi_hest_generic *generic;
>>       struct acpi_hest_generic_status *estatus;
>> -    bool task_work_pending;
>>       u32 len, node_len;
>> -    int ret;
>>       llnode = llist_del_all(&ghes_estatus_llist);
>>       /*
>> @@ -1131,25 +1152,16 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>           estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>           len = cper_estatus_len(estatus);
>>           node_len = GHES_ESTATUS_NODE_LEN(len);
>> -        task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>> +
>> +        ghes_do_proc(estatus_node->ghes, estatus);
>> +
>>           if (!ghes_estatus_cached(estatus)) {
>>               generic = estatus_node->generic;
>>               if (ghes_print_estatus(NULL, generic, estatus))
>>                   ghes_estatus_cache_add(generic, estatus);
>>           }
>> -
>> -        if (task_work_pending && current->mm) {
>> -            estatus_node->task_work.func = ghes_kick_task_work;
>> -            estatus_node->task_work_cpu = smp_processor_id();
>> -            ret = task_work_add(current, &estatus_node->task_work,
>> -                        TWA_RESUME);
>> -            if (ret)
>> -                estatus_node->task_work.func = NULL;
>> -        }
>> -
>> -        if (!estatus_node->task_work.func)
>> -            gen_pool_free(ghes_estatus_pool,
>> -                      (unsigned long)estatus_node, node_len);
>> +        gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
>> +                  node_len);
>>           llnode = next;
>>       }
>> @@ -1210,7 +1222,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>>       estatus_node->ghes = ghes;
>>       estatus_node->generic = ghes->generic;
>> -    estatus_node->task_work.func = NULL;
>>       estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>       if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>> index be1dd4c1a917..ebd21b05fe6e 100644
>> --- a/include/acpi/ghes.h
>> +++ b/include/acpi/ghes.h
>> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
>>       struct llist_node llnode;
>>       struct acpi_hest_generic *generic;
>>       struct ghes *ghes;
>> -
>> -    int task_work_cpu;
>> -    struct callback_head task_work;
>>   };
>>   struct ghes_estatus_cache {
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 8483e09aeb2c..327517bf2168 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3933,7 +3933,6 @@ enum mf_flags {
>>   int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>                 unsigned long count, int mf_flags);
>>   extern int memory_failure(unsigned long pfn, int flags);
>> -extern void memory_failure_queue_kick(int cpu);
>>   extern int unpoison_memory(unsigned long pfn);
>>   extern atomic_long_t num_poisoned_pages __read_mostly;
>>   extern int soft_offline_page(unsigned long pfn, int flags);
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 327e02fdc029..ad07f673608d 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2494,19 +2494,6 @@ static void memory_failure_work_func(struct work_struct *work)
>>       }
>>   }
>> -/*
>> - * Process memory_failure work queued on the specified CPU.
>> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
>> - */
>> -void memory_failure_queue_kick(int cpu)
>> -{
>> -    struct memory_failure_cpu *mf_cpu;
>> -
>> -    mf_cpu = &per_cpu(memory_failure_cpu, cpu);
>> -    cancel_work_sync(&mf_cpu->work);
>> -    memory_failure_work_func(&mf_cpu->work);
>> -}
>> -
>>   static int __init memory_failure_init(void)
>>   {
>>       struct memory_failure_cpu *mf_cpu;
> 
> Looks good to me,
> 
> Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
> 
> Thanks
> Hanjun

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 22+ messages in thread
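
The quoted patch embeds a callback_head in struct ghes_task_work,
recovers the enclosing structure with container_of() inside the callback,
and then decides from the memory_failure() return code whether a SIGBUS
is still needed. A userspace sketch of that pattern follows. It is an
illustration under stated assumptions, not kernel code: struct
callback_head and container_of() are local stand-ins, and
demo_task_work, demo_queue(), demo_cb() and needs_force_sigbus() are
hypothetical names.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* common Linux value; fallback for other libcs */
#endif

/* Userspace stand-in for the kernel's struct callback_head. */
struct callback_head {
	void (*func)(struct callback_head *head);
};

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Same layout idea as struct ghes_task_work in the patch. */
struct demo_task_work {
	struct callback_head twork;
	unsigned long long pfn;
	int flags;
};

/* Records what the callback saw so that a caller can inspect it. */
unsigned long long last_pfn_handled;
int last_flags_handled;

/* Mirrors the return-code filter in the patch's memory_failure_cb():
 * 0, -EHWPOISON and -EOPNOTSUPP need no further action there. */
int needs_force_sigbus(int ret)
{
	return !(ret == 0 || ret == -EHWPOISON || ret == -EOPNOTSUPP);
}

/* Deferred callback: recover the enclosing work item, use it, free it. */
void demo_cb(struct callback_head *head)
{
	struct demo_task_work *w =
		container_of(head, struct demo_task_work, twork);

	last_pfn_handled = w->pfn;
	last_flags_handled = w->flags;
	free(w);
}

/* Allocates a work item and returns its embedded callback_head,
 * or NULL if the allocation fails. */
struct callback_head *demo_queue(unsigned long long pfn, int flags)
{
	struct demo_task_work *w = malloc(sizeof(*w));

	if (!w)
		return NULL;
	w->pfn = pfn;
	w->flags = flags;
	w->twork.func = demo_cb;
	return &w->twork;
}
```

Calling h->func(h) on the pointer returned by demo_queue() plays the role
that task-work execution on return to user space plays in the patch: the
callback runs later, with only the callback_head pointer to start from.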

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-04-14 14:37   ` Hanjun Guo
@ 2025-04-14 15:02     ` Shuai Xue
  2025-04-18  7:48       ` Hanjun Guo
  0 siblings, 1 reply; 22+ messages in thread
From: Shuai Xue @ 2025-04-14 15:02 UTC (permalink / raw)
  To: Hanjun Guo, catalin.marinas, sudeep.holla, lpieralisi, linux-acpi,
	yazen.ghannam, mark.rutland, mingo, robin.murphy,
	Jonathan.Cameron, bp, rafael, linux-arm-kernel, wangkefeng.wang,
	tanxiaofei, mawupeng1, tony.luck, linmiaohe, naoya.horiguchi,
	james.morse, tongtiangen, gregkh, will, jarkko
  Cc: linux-mm, linux-kernel, akpm, linux-edac, x86, justin.he, ardb,
	ying.huang, ashish.kalra, baolin.wang, tglx, dave.hansen, lenb,
	hpa, robert.moore, lvying6, xiexiuqi, zhuo.song



On 2025/4/14 22:37, Hanjun Guo wrote:
> On 2025/4/4 19:20, Shuai Xue wrote:
>> A synchronous error is detected when a user-space process accesses
>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>> memory_failure() work which poisons the related page, unmaps the page, and
>> then sends a SIGBUS to the process, so that a system wide panic can be
>> avoided.
>>
>> However, no memory_failure() work will be queued when abnormal synchronous
>> errors occur. These errors can include situations such as invalid PA,
>> unexpected severity, no memory failure config support, invalid GUID
>> section, etc. In such cases, the user-space process will trigger SEA again.
>> This loop can potentially exceed the platform firmware threshold or even
>> trigger a kernel hard lockup, leading to a system reboot.
>>
>> Fix it by performing a force kill if no memory_failure() work is queued
>> for synchronous errors.
>>
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
>> Reviewed-by: Jane Chu <jane.chu@oracle.com>
>> ---
>>   drivers/acpi/apei/ghes.c | 11 +++++++++++
>>   1 file changed, 11 insertions(+)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index b72772494655..50e4d924aa8b 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -799,6 +799,17 @@ static bool ghes_do_proc(struct ghes *ghes,
>>           }
>>       }
>> +    /*
>> +     * If no memory failure work is queued for abnormal synchronous
>> +     * errors, do a force kill.
>> +     */
>> +    if (sync && !queued) {
>> +        dev_err(ghes->dev,
>> +            HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
>> +            current->comm, task_pid_nr(current));
>> +        force_sig(SIGBUS);
>> +    }
> 
> I think it's reasonable to send a force kill to the task when the
> synchronous memory error is not recovered.
> 
> But I hope this code will not trigger legacy firmware issues.
> Let's be careful here: can we introduce arch-specific
> callbacks for this?

Sorry, can you give more details? I am not sure I got your point.

For x86, Tony confirmed in a previous version that ghes will not dispatch
x86 synchronous errors (a.k.a. machine check exceptions).
Sync is only used on the arm64 platform; see is_hest_sync_notify().

> 
> Thanks
> Hanjun

Thanks.
Shuai

^ permalink raw reply	[flat|nested] 22+ messages in thread
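
The condition discussed in this sub-thread, "sync && !queued", can be
sketched in isolation. The sketch assumes, as the reply above suggests,
that only the arm64 SEA notification is treated as synchronous. The enum
values and the demo_* names are local to this illustration and are not
the kernel's definitions.

```c
#include <stdbool.h>

/* Illustrative subset of notification types; values are local to
 * this sketch and do not match the ACPI table encoding. */
enum demo_notify_type {
	DEMO_NOTIFY_SCI,
	DEMO_NOTIFY_NMI,
	DEMO_NOTIFY_SEA,
};

/* Assumption: only SEA counts as a synchronous notification. */
bool demo_is_sync_notify(enum demo_notify_type type)
{
	return type == DEMO_NOTIFY_SEA;
}

/* Force-kill only when the error is synchronous and no
 * memory_failure() work was queued for it. */
bool demo_should_force_kill(enum demo_notify_type type, bool queued)
{
	return demo_is_sync_notify(type) && !queued;
}
```

Under that assumption, an asynchronous notification never reaches the
force-kill path, whether or not memory_failure() work was queued.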

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-04-14 15:02     ` Shuai Xue
@ 2025-04-18  7:48       ` Hanjun Guo
  2025-04-18 12:35         ` Shuai Xue
  0 siblings, 1 reply; 22+ messages in thread
From: Hanjun Guo @ 2025-04-18  7:48 UTC (permalink / raw)
  To: Shuai Xue, catalin.marinas, sudeep.holla, lpieralisi, linux-acpi,
	yazen.ghannam, mark.rutland, mingo, robin.murphy,
	Jonathan.Cameron, bp, rafael, linux-arm-kernel, wangkefeng.wang,
	tanxiaofei, mawupeng1, tony.luck, linmiaohe, naoya.horiguchi,
	james.morse, tongtiangen, gregkh, will, jarkko
  Cc: linux-mm, linux-kernel, akpm, linux-edac, x86, justin.he, ardb,
	ying.huang, ashish.kalra, baolin.wang, tglx, dave.hansen, lenb,
	hpa, robert.moore, lvying6, xiexiuqi, zhuo.song

On 2025/4/14 23:02, Shuai Xue wrote:
> 
> 
> On 2025/4/14 22:37, Hanjun Guo wrote:
>> On 2025/4/4 19:20, Shuai Xue wrote:
>>> A synchronous error is detected when a user-space process accesses
>>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
>>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>>> memory_failure() work which poisons the related page, unmaps the page, and
>>> then sends a SIGBUS to the process, so that a system wide panic can be
>>> avoided.
>>>
>>> However, no memory_failure() work will be queued when abnormal synchronous
>>> errors occur. These errors can include situations such as invalid PA,
>>> unexpected severity, no memory failure config support, invalid GUID
>>> section, etc. In such cases, the user-space process will trigger SEA again.
>>> This loop can potentially exceed the platform firmware threshold or even
>>> trigger a kernel hard lockup, leading to a system reboot.
>>>
>>> Fix it by performing a force kill if no memory_failure() work is queued
>>> for synchronous errors.
>>>
>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
>>> Reviewed-by: Jane Chu <jane.chu@oracle.com>
>>> ---
>>>   drivers/acpi/apei/ghes.c | 11 +++++++++++
>>>   1 file changed, 11 insertions(+)
>>>
>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>> index b72772494655..50e4d924aa8b 100644
>>> --- a/drivers/acpi/apei/ghes.c
>>> +++ b/drivers/acpi/apei/ghes.c
>>> @@ -799,6 +799,17 @@ static bool ghes_do_proc(struct ghes *ghes,
>>>           }
>>>       }
>>> +    /*
>>> +     * If no memory failure work is queued for abnormal synchronous
>>> +     * errors, do a force kill.
>>> +     */
>>> +    if (sync && !queued) {
>>> +        dev_err(ghes->dev,
>>> +            HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
>>> +            current->comm, task_pid_nr(current));
>>> +        force_sig(SIGBUS);
>>> +    }
>>
>> I think it's reasonable to send a force kill to the task when the
>> synchronous memory error is not recovered.
>>
>> But I hope this code will not trigger legacy firmware issues.
>> Let's be careful here: can we introduce arch-specific
>> callbacks for this?
> 
> Sorry, can you give more details? I am not sure I got your point.
> 
> For x86, Tony confirmed in a previous version that ghes will not dispatch
> x86 synchronous errors (a.k.a. machine check exceptions).
> Sync is only used on the arm64 platform; see is_hest_sync_notify().

Sorry for the late reply. From the code I can see that x86 will reuse
ghes_do_proc(); if Tony confirmed that x86 is OK, it's OK with me as well.

Thanks
Hanjun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-04-18  7:48       ` Hanjun Guo
@ 2025-04-18 12:35         ` Shuai Xue
  2025-04-25  1:00           ` Hanjun Guo
  0 siblings, 1 reply; 22+ messages in thread
From: Shuai Xue @ 2025-04-18 12:35 UTC (permalink / raw)
  To: Hanjun Guo, Luck, Tony, rafael, Catalin Marinas
  Cc: linux-mm, linux-kernel, akpm, linux-edac, x86, justin.he, ardb,
	ying.huang, ashish.kalra, baolin.wang, tglx, dave.hansen, lenb,
	hpa, robert.moore, lvying6, xiexiuqi, zhuo.song, Hanjun Guo,
	catalin.marinas, sudeep.holla, lpieralisi, linux-acpi,
	yazen.ghannam, mark.rutland, mingo, robin.murphy,
	Jonathan.Cameron, bp, rafael, linux-arm-kernel, wangkefeng.wang,
	tanxiaofei, mawupeng1, tony.luck, linmiaohe, naoya.horiguchi,
	james.morse, tongtiangen, gregkh, will, jarkko



On 2025/4/18 15:48, Hanjun Guo wrote:
> On 2025/4/14 23:02, Shuai Xue wrote:
>>
>>
>> On 2025/4/14 22:37, Hanjun Guo wrote:
>>> On 2025/4/4 19:20, Shuai Xue wrote:
>>>> Synchronous error was detected as a result of user-space process accessing
>>>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
>>>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>>>> memory_failure() work which poisons the related page, unmaps the page, and
>>>> then sends a SIGBUS to the process, so that a system wide panic can be
>>>> avoided.
>>>>
>>>> However, no memory_failure() work will be queued when abnormal synchronous
>>>> errors occur. These errors can include situations such as invalid PA,
>>>> unexpected severity, no memory failure config support, invalid GUID
>>>> section, etc. In such case, the user-space process will trigger SEA again.
>>>> This loop can potentially exceed the platform firmware threshold or even
>>>> trigger a kernel hard lockup, leading to a system reboot.
>>>>
>>>> Fix it by performing a force kill if no memory_failure() work is queued
>>>> for synchronous errors.
>>>>
>>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>>> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
>>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
>>>> Reviewed-by: Jane Chu <jane.chu@oracle.com>
>>>> ---
>>>>   drivers/acpi/apei/ghes.c | 11 +++++++++++
>>>>   1 file changed, 11 insertions(+)
>>>>
>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>> index b72772494655..50e4d924aa8b 100644
>>>> --- a/drivers/acpi/apei/ghes.c
>>>> +++ b/drivers/acpi/apei/ghes.c
>>>> @@ -799,6 +799,17 @@ static bool ghes_do_proc(struct ghes *ghes,
>>>>           }
>>>>       }
>>>> +    /*
>>>> +     * If no memory failure work is queued for abnormal synchronous
>>>> +     * errors, do a force kill.
>>>> +     */
>>>> +    if (sync && !queued) {
>>>> +        dev_err(ghes->dev,
>>>> +            HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
>>>> +            current->comm, task_pid_nr(current));
>>>> +        force_sig(SIGBUS);
>>>> +    }
>>>
>>> I think it's reasonable to send a force kill to the task when the
>>> synchronous memory error is not recovered.
>>>
>>> But I hope this code will not trigger some legacy firmware issues,
>>> let's be careful for this, so can we just introduce arch specific
>>> callbacks for this?
>>
>> Sorry, can you give more details? I am not sure I got your point.
>>
>> For x86, Tony confirmed on a previous version that ghes will not dispatch
>> x86 synchronous errors (a.k.a. machine check exceptions).
>> Sync is only used on the arm64 platform; see is_hest_sync_notify().
> 
> Sorry for the late reply. From the code I can see that x86 will reuse
> ghes_do_proc(); if Tony confirmed that x86 is OK, it's OK with me as well.

Hi, Hanjun,

Glad to hear that.

Here is the original discussion with Tony, copied from the mailing list [1]:

> On x86 the "action required" cases are signaled by a synchronous machine check
> that is delivered before the instruction that is attempting to consume the uncorrected
> data retires. I.e., it is guaranteed that the uncorrected error has not been propagated
> because it is not visible in any architectural state.

> APEI signaled errors don't fall into that category on x86 ... the uncorrected data
> could have been consumed and propagated long before the signaling used for
> APEI can alert the OS.

I also added a comment in the code:

/*
  * A platform may describe one error source for the handling of synchronous
  * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
  * or External Interrupt). On x86, the HEST notifications are always
  * asynchronous, so only SEA on ARM is delivered as a synchronous
  * notification.
  */
static inline bool is_hest_sync_notify(struct ghes *ghes)
{
	u8 notify_type = ghes->generic->notify.type;

	return notify_type == ACPI_HEST_NOTIFY_SEA;
}
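
To make the x86 argument concrete, here is a rough userspace model of how
this predicate feeds the "sync && !queued" check from patch 1. It is only a
sketch, not kernel code: every model_* name is made up for illustration, and
the enum values are placeholders rather than the real ACPI_HEST_NOTIFY_*
constants.

```c
#include <stdbool.h>

/*
 * Userspace model only. These enumerators are hypothetical stand-ins for the
 * ACPI_HEST_NOTIFY_* constants; their values are not the ACPI-defined ones.
 */
enum model_notify_type {
	MODEL_NOTIFY_SCI,	/* asynchronous, interrupt-style notification */
	MODEL_NOTIFY_NMI,	/* asynchronous notification */
	MODEL_NOTIFY_SEA,	/* synchronous external abort (arm64) */
};

/* What the model decides once all error sections have been processed. */
enum model_action {
	MODEL_ACTION_NONE,		/* async error, nothing queued */
	MODEL_ACTION_RECOVERY_QUEUED,	/* memory_failure() work was queued */
	MODEL_ACTION_FORCE_SIGBUS,	/* sync error, nothing queued: kill task */
};

/* Mirrors is_hest_sync_notify(): only SEA counts as synchronous. */
bool model_is_sync_notify(enum model_notify_type type)
{
	return type == MODEL_NOTIFY_SEA;
}

/*
 * Mirrors the "if (sync && !queued)" check added to ghes_do_proc():
 * a synchronous error with no recovery work queued ends in a forced SIGBUS.
 */
enum model_action model_decide(enum model_notify_type type, bool queued)
{
	if (queued)
		return MODEL_ACTION_RECOVERY_QUEUED;
	if (model_is_sync_notify(type))
		return MODEL_ACTION_FORCE_SIGBUS;
	return MODEL_ACTION_NONE;
}
```

In this model, only an SEA source with nothing queued reaches the forced
SIGBUS. SCI- or NMI-style sources never do, which matches the reasoning that
x86 GHES notifications are unaffected by the new check.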


If you are happy with the code, please explicitly give me your Reviewed-by tag :)


> 
> Thanks
> Hanjun

Thanks.

Best Regards,
Shuai

[1] https://lore.kernel.org/lkml/CAJZ5v0hdgxsDiXqOmeqBQoZUQJ1RssM=3jpYpWt3qzy0n2eyaA@mail.gmail.com/t/#u


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-04-18 12:35         ` Shuai Xue
@ 2025-04-25  1:00           ` Hanjun Guo
  2025-04-25  1:10             ` Shuai Xue
  0 siblings, 1 reply; 22+ messages in thread
From: Hanjun Guo @ 2025-04-25  1:00 UTC (permalink / raw)
  To: Shuai Xue, Luck, Tony, rafael, Catalin Marinas
  Cc: linux-mm, linux-kernel, akpm, linux-edac, x86, justin.he, ardb,
	ying.huang, ashish.kalra, baolin.wang, tglx, dave.hansen, lenb,
	hpa, robert.moore, lvying6, xiexiuqi, zhuo.song, sudeep.holla,
	lpieralisi, linux-acpi, yazen.ghannam, mark.rutland, mingo,
	robin.murphy, Jonathan.Cameron, bp, linux-arm-kernel,
	wangkefeng.wang, tanxiaofei, mawupeng1, linmiaohe,
	naoya.horiguchi, james.morse, tongtiangen, gregkh, will, jarkko

On 2025/4/18 20:35, Shuai Xue wrote:
> 
> 
> On 2025/4/18 15:48, Hanjun Guo wrote:
>> On 2025/4/14 23:02, Shuai Xue wrote:
>>>
>>>
>>> On 2025/4/14 22:37, Hanjun Guo wrote:
>>>> On 2025/4/4 19:20, Shuai Xue wrote:
>>>>> Synchronous error was detected as a result of user-space process accessing
>>>>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
>>>>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>>>>> memory_failure() work which poisons the related page, unmaps the page, and
>>>>> then sends a SIGBUS to the process, so that a system wide panic can be
>>>>> avoided.
>>>>>
>>>>> However, no memory_failure() work will be queued when abnormal synchronous
>>>>> errors occur. These errors can include situations such as invalid PA,
>>>>> unexpected severity, no memory failure config support, invalid GUID
>>>>> section, etc. In such case, the user-space process will trigger SEA again.
>>>>> This loop can potentially exceed the platform firmware threshold or even
>>>>> trigger a kernel hard lockup, leading to a system reboot.
>>>>>
>>>>> Fix it by performing a force kill if no memory_failure() work is queued
>>>>> for synchronous errors.
>>>>>
>>>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>>>> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
>>>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>>> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
>>>>> Reviewed-by: Jane Chu <jane.chu@oracle.com>
>>>>> ---
>>>>>   drivers/acpi/apei/ghes.c | 11 +++++++++++
>>>>>   1 file changed, 11 insertions(+)
>>>>>
>>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>>> index b72772494655..50e4d924aa8b 100644
>>>>> --- a/drivers/acpi/apei/ghes.c
>>>>> +++ b/drivers/acpi/apei/ghes.c
>>>>> @@ -799,6 +799,17 @@ static bool ghes_do_proc(struct ghes *ghes,
>>>>>           }
>>>>>       }
>>>>> +    /*
>>>>> +     * If no memory failure work is queued for abnormal synchronous
>>>>> +     * errors, do a force kill.
>>>>> +     */
>>>>> +    if (sync && !queued) {
>>>>> +        dev_err(ghes->dev,
>>>>> +            HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
>>>>> +            current->comm, task_pid_nr(current));
>>>>> +        force_sig(SIGBUS);
>>>>> +    }
>>>>
>>>> I think it's reasonable to send a force kill to the task when the
>>>> synchronous memory error is not recovered.
>>>>
>>>> But I hope this code will not trigger some legacy firmware issues,
>>>> let's be careful for this, so can we just introduce arch specific
>>>> callbacks for this?
>>>
>>> Sorry, can you give more details? I am not sure I got your point.
>>>
>>> For x86, Tony confirmed on a previous version that ghes will not dispatch
>>> x86 synchronous errors (a.k.a. machine check exceptions).
>>> Sync is only used on the arm64 platform; see is_hest_sync_notify().
>>
>> Sorry for the late reply. From the code I can see that x86 will reuse
>> ghes_do_proc(); if Tony confirmed that x86 is OK, it's OK with me as well.
> 
> Hi, Hanjun,
> 
> Glad to hear that.
> 
> Here is the original discussion with Tony, copied from the mailing list [1]:
> 
>> On x86 the "action required" cases are signaled by a synchronous machine check
>> that is delivered before the instruction that is attempting to consume the uncorrected
>> data retires. I.e., it is guaranteed that the uncorrected error has not been propagated
>> because it is not visible in any architectural state.
> 
>> APEI signaled errors don't fall into that category on x86 ... the uncorrected data
>> could have been consumed and propagated long before the signaling used for
>> APEI can alert the OS.
> 
> I also added a comment in the code:
> 
> /*
>   * A platform may describe one error source for the handling of synchronous
>   * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
>   * or External Interrupt). On x86, the HEST notifications are always
>   * asynchronous, so only SEA on ARM is delivered as a synchronous
>   * notification.
>   */
> static inline bool is_hest_sync_notify(struct ghes *ghes)
> {
>      u8 notify_type = ghes->generic->notify.type;
> 
>      return notify_type == ACPI_HEST_NOTIFY_SEA;
> }
> 
> 
> If you are happy with the code, please explicitly give me your Reviewed-by tag :)

Calling force_sig(SIGBUS) directly in ghes_do_proc() is not my favourite
approach, but I can live with that. Please add

Reviewed-by: Hanjun Guo <guohanjun@huawei.com>

Thanks
Hanjun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-04-25  1:00           ` Hanjun Guo
@ 2025-04-25  1:10             ` Shuai Xue
  2025-04-28 15:23               ` Will Deacon
  0 siblings, 1 reply; 22+ messages in thread
From: Shuai Xue @ 2025-04-25  1:10 UTC (permalink / raw)
  To: Hanjun Guo, Luck, Tony, rafael, Catalin Marinas
  Cc: linux-mm, linux-kernel, akpm, linux-edac, x86, justin.he, ardb,
	ying.huang, ashish.kalra, baolin.wang, tglx, dave.hansen, lenb,
	hpa, robert.moore, lvying6, xiexiuqi, zhuo.song, sudeep.holla,
	lpieralisi, linux-acpi, yazen.ghannam, mark.rutland, mingo,
	robin.murphy, Jonathan.Cameron, bp, linux-arm-kernel,
	wangkefeng.wang, tanxiaofei, mawupeng1, linmiaohe,
	naoya.horiguchi, james.morse, tongtiangen, gregkh, will, jarkko



On 2025/4/25 09:00, Hanjun Guo wrote:
> On 2025/4/18 20:35, Shuai Xue wrote:
>>
>>
>> On 2025/4/18 15:48, Hanjun Guo wrote:
>>> On 2025/4/14 23:02, Shuai Xue wrote:
>>>>
>>>>
>>>> On 2025/4/14 22:37, Hanjun Guo wrote:
>>>>> On 2025/4/4 19:20, Shuai Xue wrote:
>>>>>> Synchronous error was detected as a result of user-space process accessing
>>>>>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
>>>>>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>>>>>> memory_failure() work which poisons the related page, unmaps the page, and
>>>>>> then sends a SIGBUS to the process, so that a system wide panic can be
>>>>>> avoided.
>>>>>>
>>>>>> However, no memory_failure() work will be queued when abnormal synchronous
>>>>>> errors occur. These errors can include situations such as invalid PA,
>>>>>> unexpected severity, no memory failure config support, invalid GUID
>>>>>> section, etc. In such case, the user-space process will trigger SEA again.
>>>>>> This loop can potentially exceed the platform firmware threshold or even
>>>>>> trigger a kernel hard lockup, leading to a system reboot.
>>>>>>
>>>>>> Fix it by performing a force kill if no memory_failure() work is queued
>>>>>> for synchronous errors.
>>>>>>
>>>>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>>>>> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
>>>>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>>>> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
>>>>>> Reviewed-by: Jane Chu <jane.chu@oracle.com>
>>>>>> ---
>>>>>>   drivers/acpi/apei/ghes.c | 11 +++++++++++
>>>>>>   1 file changed, 11 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>>>> index b72772494655..50e4d924aa8b 100644
>>>>>> --- a/drivers/acpi/apei/ghes.c
>>>>>> +++ b/drivers/acpi/apei/ghes.c
>>>>>> @@ -799,6 +799,17 @@ static bool ghes_do_proc(struct ghes *ghes,
>>>>>>           }
>>>>>>       }
>>>>>> +    /*
>>>>>> +     * If no memory failure work is queued for abnormal synchronous
>>>>>> +     * errors, do a force kill.
>>>>>> +     */
>>>>>> +    if (sync && !queued) {
>>>>>> +        dev_err(ghes->dev,
>>>>>> +            HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
>>>>>> +            current->comm, task_pid_nr(current));
>>>>>> +        force_sig(SIGBUS);
>>>>>> +    }
>>>>>
>>>>> I think it's reasonable to send a force kill to the task when the
>>>>> synchronous memory error is not recovered.
>>>>>
>>>>> But I hope this code will not trigger some legacy firmware issues,
>>>>> let's be careful for this, so can we just introduce arch specific
>>>>> callbacks for this?
>>>>
>>>> Sorry, can you give more details? I am not sure I got your point.
>>>>
>>>> For x86, Tony confirmed on a previous version that ghes will not dispatch
>>>> x86 synchronous errors (a.k.a. machine check exceptions).
>>>> Sync is only used on the arm64 platform; see is_hest_sync_notify().
>>>
>>> Sorry for the late reply. From the code I can see that x86 will reuse
>>> ghes_do_proc(); if Tony confirmed that x86 is OK, it's OK with me as well.
>>
>> Hi, Hanjun,
>>
>> Glad to hear that.
>>
>> Here is the original discussion with Tony, copied from the mailing list [1]:
>>
>>> On x86 the "action required" cases are signaled by a synchronous machine check
>>> that is delivered before the instruction that is attempting to consume the uncorrected
>>> data retires. I.e., it is guaranteed that the uncorrected error has not been propagated
>>> because it is not visible in any architectural state.
>>
>>> APEI signaled errors don't fall into that category on x86 ... the uncorrected data
>>> could have been consumed and propagated long before the signaling used for
>>> APEI can alert the OS.
>>
>> I also added a comment in the code:
>>
>> /*
>>   * A platform may describe one error source for the handling of synchronous
>>   * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
>>   * or External Interrupt). On x86, the HEST notifications are always
>>   * asynchronous, so only SEA on ARM is delivered as a synchronous
>>   * notification.
>>   */
>> static inline bool is_hest_sync_notify(struct ghes *ghes)
>> {
>>      u8 notify_type = ghes->generic->notify.type;
>>
>>      return notify_type == ACPI_HEST_NOTIFY_SEA;
>> }
>>
>>
>> If you are happy with the code, please explicitly give me your Reviewed-by tag :)
> 
> Calling force_sig(SIGBUS) directly in ghes_do_proc() is not my favourite
> approach, but I can live with that. Please add
> 
> Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
> 
> Thanks
> Hanjun

Thanks, Hanjun.

@Rafael, @Catalin,

Both patch 1 and patch 2 now have a Reviewed-by tag from Hanjun, one of the
arm64 ACPI maintainers. Are you happy to pick up and queue this patch set via
the ACPI tree or the arm64 tree?

If you need me to send a new version to collect the reviewed-by tag, please
let me know.

Thanks.

Best Regards,
Shuai

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-04-25  1:10             ` Shuai Xue
@ 2025-04-28 15:23               ` Will Deacon
  2025-05-14  1:35                 ` Shuai Xue
  2025-07-01 11:00                 ` Shuai Xue
  0 siblings, 2 replies; 22+ messages in thread
From: Will Deacon @ 2025-04-28 15:23 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Hanjun Guo, Luck, Tony, rafael, Catalin Marinas, linux-mm,
	linux-kernel, akpm, linux-edac, x86, justin.he, ardb, ying.huang,
	ashish.kalra, baolin.wang, tglx, dave.hansen, lenb, hpa,
	robert.moore, lvying6, xiexiuqi, zhuo.song, sudeep.holla,
	lpieralisi, linux-acpi, yazen.ghannam, mark.rutland, mingo,
	robin.murphy, Jonathan.Cameron, bp, linux-arm-kernel,
	wangkefeng.wang, tanxiaofei, mawupeng1, linmiaohe,
	naoya.horiguchi, james.morse, tongtiangen, gregkh, jarkko

On Fri, Apr 25, 2025 at 09:10:09AM +0800, Shuai Xue wrote:
> On 2025/4/25 09:00, Hanjun Guo wrote:
> > Calling force_sig(SIGBUS) directly in ghes_do_proc() is not my favourite
> > approach, but I can live with that. Please add
> > 
> > Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
> > 
> > Thanks
> > Hanjun
> 
> Thanks, Hanjun.
> 
> @Rafael, @Catalin,
> 
> Both patch 1 and patch 2 now have a Reviewed-by tag from Hanjun, one of the
> arm64 ACPI maintainers. Are you happy to pick up and queue this patch set via
> the ACPI tree or the arm64 tree?

Since this primarily touches drivers/acpi/apei/ghes.c, I think it should
go via the ACPI tree and not the arm64 one.

Will

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-04-28 15:23               ` Will Deacon
@ 2025-05-14  1:35                 ` Shuai Xue
  2025-07-01 11:00                 ` Shuai Xue
  1 sibling, 0 replies; 22+ messages in thread
From: Shuai Xue @ 2025-05-14  1:35 UTC (permalink / raw)
  To: Will Deacon, rafael
  Cc: Hanjun Guo, Luck, Tony, rafael, Catalin Marinas, linux-mm,
	linux-kernel, akpm, linux-edac, x86, justin.he, ardb, ying.huang,
	ashish.kalra, baolin.wang, tglx, dave.hansen, lenb, hpa,
	robert.moore, lvying6, xiexiuqi, zhuo.song, sudeep.holla,
	lpieralisi, linux-acpi, yazen.ghannam, mark.rutland, mingo,
	robin.murphy, Jonathan.Cameron, bp, linux-arm-kernel,
	wangkefeng.wang, tanxiaofei, mawupeng1, linmiaohe,
	naoya.horiguchi, james.morse, tongtiangen, gregkh, jarkko



On 2025/4/28 23:23, Will Deacon wrote:
> On Fri, Apr 25, 2025 at 09:10:09AM +0800, Shuai Xue wrote:
>> On 2025/4/25 09:00, Hanjun Guo wrote:
>>> Calling force_sig(SIGBUS) directly in ghes_do_proc() is not my favourite
>>> approach, but I can live with that. Please add
>>>
>>> Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
>>>
>>> Thanks
>>> Hanjun
>>
>> Thanks, Hanjun.
>>
>> @Rafael, @Catalin,
>>
>> Both patch 1 and patch 2 now have a Reviewed-by tag from Hanjun, one of the
>> arm64 ACPI maintainers. Are you happy to pick up and queue this patch set via
>> the ACPI tree or the arm64 tree?
> 
> Since this primarily touches drivers/acpi/apei/ghes.c, I think it should
> go via the ACPI tree and not the arm64 one.
> 
> Will

Hi, Will,

Thank you for your confirmation :)

@Rafael, do you have any further comments on this patch set?

Thank you.

Best Regards,
Shuai

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-04-28 15:23               ` Will Deacon
  2025-05-14  1:35                 ` Shuai Xue
@ 2025-07-01 11:00                 ` Shuai Xue
  2025-07-01 13:56                   ` Rafael J. Wysocki
  1 sibling, 1 reply; 22+ messages in thread
From: Shuai Xue @ 2025-07-01 11:00 UTC (permalink / raw)
  To: Will Deacon, Hanjun Guo, rafael, Catalin Marinas
  Cc: Luck, Tony, linux-mm, linux-kernel, akpm, linux-edac, x86,
	justin.he, ardb, ying.huang, ashish.kalra, baolin.wang, tglx,
	dave.hansen, lenb, hpa, robert.moore, lvying6, xiexiuqi,
	zhuo.song, sudeep.holla, lpieralisi, linux-acpi, yazen.ghannam,
	mark.rutland, mingo, robin.murphy, Jonathan.Cameron, bp,
	linux-arm-kernel, wangkefeng.wang, tanxiaofei, mawupeng1,
	linmiaohe, naoya.horiguchi, james.morse, tongtiangen, gregkh,
	jarkko

> On 2025/4/28 23:23, Will Deacon wrote:
>> On Fri, Apr 25, 2025 at 09:10:09AM +0800, Shuai Xue wrote:
>>> On 2025/4/25 09:00, Hanjun Guo wrote:
>>>> Calling force_sig(SIGBUS) directly in ghes_do_proc() is not my favourite
>>>> approach, but I can live with that. Please add
>>>>
>>>> Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
>>>>
>>>> Thanks
>>>> Hanjun
>>>
>>> Thanks, Hanjun.
>>>
>>> @Rafael, @Catalin,
>>>
>>> Both patch 1 and patch 2 now have a Reviewed-by tag from Hanjun, one of the
>>> arm64 ACPI maintainers. Are you happy to pick up and queue this patch set via
>>> the ACPI tree or the arm64 tree?
>>
>> Since this primarily touches drivers/acpi/apei/ghes.c, I think it should
>> go via the ACPI tree and not the arm64 one.
>>
>> Will
>
> Hi, Will,
>
> Thank you for your confirmation :)
>
> @Rafael, do you have any further comments on this patch set?
>
> Thank you.
>
> Best Regards,
> Shuai

Hi, all,

Gentle ping.

Is the ACPI or APEI tree still active? Looking forward to any response.

Thanks.

Best Regards,
Shuai

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-07-01 11:00                 ` Shuai Xue
@ 2025-07-01 13:56                   ` Rafael J. Wysocki
  2025-07-14 11:54                     ` Shuai Xue
  0 siblings, 1 reply; 22+ messages in thread
From: Rafael J. Wysocki @ 2025-07-01 13:56 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Will Deacon, Hanjun Guo, rafael, Catalin Marinas, Luck, Tony,
	linux-mm, linux-kernel, akpm, linux-edac, x86, justin.he, ardb,
	ying.huang, ashish.kalra, baolin.wang, tglx, dave.hansen, lenb,
	hpa, robert.moore, lvying6, xiexiuqi, zhuo.song, sudeep.holla,
	lpieralisi, linux-acpi, yazen.ghannam, mark.rutland, mingo,
	robin.murphy, Jonathan.Cameron, bp, linux-arm-kernel,
	wangkefeng.wang, tanxiaofei, mawupeng1, linmiaohe,
	naoya.horiguchi, james.morse, tongtiangen, gregkh, jarkko

On Tue, Jul 1, 2025 at 1:00 PM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>
>> On 2025/4/28 23:23, Will Deacon wrote:
>>> On Fri, Apr 25, 2025 at 09:10:09AM +0800, Shuai Xue wrote:
>>>> On 2025/4/25 09:00, Hanjun Guo wrote:
>>>>> Calling force_sig(SIGBUS) directly in ghes_do_proc() is not my favourite
>>>>> approach, but I can live with that. Please add
>>>>>
>>>>> Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
>>>>>
>>>>> Thanks
>>>>> Hanjun
>>>>
>>>> Thanks, Hanjun.
>>>>
>>>> @Rafael, @Catalin,
>>>>
>>>> Both patch 1 and patch 2 now have a Reviewed-by tag from Hanjun, one of the
>>>> arm64 ACPI maintainers. Are you happy to pick up and queue this patch set via
>>>> the ACPI tree or the arm64 tree?
>>>
>>> Since this primarily touches drivers/acpi/apei/ghes.c, I think it should
>>> go via the ACPI tree and not the arm64 one.
>>>
>>> Will
>>
>> Hi, Will,
>>
>> Thank you for your confirmation :)
>>
>> @Rafael, do you have any further comments on this patch set?
>>
>> Thank you.
>>
>> Best Regards,
>> Shuai
>
> Hi, all,
>
> Gentle ping.
>
> Is the ACPI or APEI tree still active? Looking forward to any response.

For APEI changes, you need an ACK from one of the reviewers listed in
the MAINTAINERS entry for APEI.

Thanks!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-07-01 13:56                   ` Rafael J. Wysocki
@ 2025-07-14 11:54                     ` Shuai Xue
  2025-07-14 17:30                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 22+ messages in thread
From: Shuai Xue @ 2025-07-14 11:54 UTC (permalink / raw)
  To: Rafael J. Wysocki, Catalin Marinas
  Cc: Will Deacon, Hanjun Guo, Catalin Marinas, Luck, Tony, linux-mm,
	linux-kernel, akpm, linux-edac, x86, justin.he, ardb, ying.huang,
	ashish.kalra, baolin.wang, tglx, dave.hansen, lenb, hpa,
	robert.moore, lvying6, xiexiuqi, zhuo.song, sudeep.holla,
	lpieralisi, linux-acpi, yazen.ghannam, mark.rutland, mingo,
	robin.murphy, Jonathan.Cameron, bp, linux-arm-kernel,
	wangkefeng.wang, tanxiaofei, mawupeng1, linmiaohe,
	naoya.horiguchi, james.morse, tongtiangen, gregkh, jarkko



On 2025/7/1 21:56, Rafael J. Wysocki wrote:
> On Tue, Jul 1, 2025 at 1:00 PM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>>
>>> On 2025/4/28 23:23, Will Deacon wrote:
>>>> On Fri, Apr 25, 2025 at 09:10:09AM +0800, Shuai Xue wrote:
>>>>> On 2025/4/25 09:00, Hanjun Guo wrote:
>>>>>> Calling force_sig(SIGBUS) directly in ghes_do_proc() is not my favourite
>>>>>> approach, but I can live with that. Please add
>>>>>>
>>>>>> Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
>>>>>>
>>>>>> Thanks
>>>>>> Hanjun
>>>>>
>>>>> Thanks, Hanjun.
>>>>>
>>>>> @Rafael, @Catalin,
>>>>>
>>>>> Both patch 1 and patch 2 now have a Reviewed-by tag from Hanjun, one of the
>>>>> arm64 ACPI maintainers. Are you happy to pick up and queue this patch set via
>>>>> the ACPI tree or the arm64 tree?
>>>>
>>>> Since this primarily touches drivers/acpi/apei/ghes.c, I think it should
>>>> go via the ACPI tree and not the arm64 one.
>>>>
>>>> Will
>>>
>>> Hi, Will,
>>>
>>> Thank you for your confirmation :)
>>>
>>> @Rafael, do you have any further comments on this patch set?
>>>
>>> Thank you.
>>>
>>> Best Regards,
>>> Shuai
>>
>> Hi, all,
>>
>> Gentle ping.
>>
>> Is the ACPI or APEI tree still active? Looking forward to any response.
> 
> For APEI changes, you need an ACK from one of the reviewers listed in
> the MAINTAINERS entry for APEI.
> 
> Thanks!

Hi, Rafael

Sorry, I missed your email, which went to spam :(

Catalin, the arm64 maintainer, pointed out that:

 > James Morse is listed as reviewer of the ACPI APEI code but he's busy
 > with resctrl/MPAM. Adding Lorenzo, Sudeep and Hanjun as arm64 ACPI
 > maintainers, hopefully they can help.

And Hanjun explicitly gave his Reviewed-by tag in this thread. Is that
sufficient for you to merge the series?

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-07-14 11:54                     ` Shuai Xue
@ 2025-07-14 17:30                       ` Rafael J. Wysocki
  2025-07-15  2:03                         ` Shuai Xue
                                           ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Rafael J. Wysocki @ 2025-07-14 17:30 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Rafael J. Wysocki, Catalin Marinas, Will Deacon, Hanjun Guo,
	Luck, Tony, linux-mm, linux-kernel, akpm, linux-edac, x86,
	justin.he, ardb, ying.huang, ashish.kalra, baolin.wang, tglx,
	dave.hansen, lenb, hpa, robert.moore, lvying6, xiexiuqi,
	zhuo.song, sudeep.holla, lpieralisi, linux-acpi, yazen.ghannam,
	mark.rutland, mingo, robin.murphy, Jonathan.Cameron, bp,
	linux-arm-kernel, wangkefeng.wang, tanxiaofei, mawupeng1,
	linmiaohe, naoya.horiguchi, james.morse, tongtiangen, gregkh,
	jarkko

Hi,

On Mon, Jul 14, 2025 at 1:54 PM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>
> On 2025/7/1 21:56, Rafael J. Wysocki wrote:
>> On Tue, Jul 1, 2025 at 1:00 PM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>>>
>>>> On 2025/4/28 23:23, Will Deacon wrote:
>>>>> On Fri, Apr 25, 2025 at 09:10:09AM +0800, Shuai Xue wrote:
>>>>>> On 2025/4/25 09:00, Hanjun Guo wrote:
>>>>>>> Calling force_sig(SIGBUS) directly in ghes_do_proc() is not my favourite
>>>>>>> approach, but I can live with that. Please add
>>>>>>>
>>>>>>> Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
>>>>>>>
>>>>>>> Thanks
>>>>>>> Hanjun
>>>>>>
>>>>>> Thanks, Hanjun.
>>>>>>
>>>>>> @Rafael, @Catalin,
>>>>>>
>>>>>> Both patch 1 and patch 2 now have a Reviewed-by tag from Hanjun, one of the
>>>>>> arm64 ACPI maintainers. Are you happy to pick up and queue this patch set via
>>>>>> the ACPI tree or the arm64 tree?
>>>>>
>>>>> Since this primarily touches drivers/acpi/apei/ghes.c, I think it should
>>>>> go via the ACPI tree and not the arm64 one.
>>>>>
>>>>> Will
>>>>
>>>> Hi, Will,
>>>>
>>>> Thank you for your confirmation :)
>>>>
>>>> @Rafael, do you have any further comments on this patch set?
>>>>
>>>> Thank you.
>>>>
>>>> Best Regards,
>>>> Shuai
>>>
>>> Hi, all,
>>>
>>> Gentle ping.
>>>
>>> Is the ACPI or APEI tree still active? Looking forward to any response.
>>
>> For APEI changes, you need an ACK from one of the reviewers listed in
>> the MAINTAINERS entry for APEI.
>>
>> Thanks!
>
> Hi, Rafael
>
> Sorry, I missed your email, which went to spam :(
>
> Catalin, the arm64 maintainer, pointed out that:
>
>> James Morse is listed as reviewer of the ACPI APEI code but he's busy
>> with resctrl/MPAM. Adding Lorenzo, Sudeep and Hanjun as arm64 ACPI
>> maintainers, hopefully they can help.
>
> And Hanjun explicitly gave his Reviewed-by tag in this thread. Is that
> sufficient for you to merge the series?

Not really.

I need an ACK or R-by from a reviewer listed in the APEI entry in MAINTAINERS.

If James Morse is not able to fill that role (and AFAICS he's not been
for quite some time now), I'd expect someone else to step up.

Thanks!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-07-14 17:30                       ` Rafael J. Wysocki
@ 2025-07-15  2:03                         ` Shuai Xue
  2025-07-15  2:46                         ` Hanjun Guo
  2025-07-15 12:06                         ` Mauro Carvalho Chehab
  2 siblings, 0 replies; 22+ messages in thread
From: Shuai Xue @ 2025-07-15  2:03 UTC (permalink / raw)
  To: Rafael J. Wysocki, Luck, Tony, lenb, bp, james.morse,
	Catalin Marinas
  Cc: Will Deacon, Hanjun Guo, linux-mm, linux-kernel, akpm, linux-edac,
	x86, justin.he, ardb, ying.huang, ashish.kalra, baolin.wang, tglx,
	dave.hansen, hpa, robert.moore, lvying6, xiexiuqi, zhuo.song,
	sudeep.holla, lpieralisi, linux-acpi, yazen.ghannam, mark.rutland,
	mingo, robin.murphy, Jonathan.Cameron, linux-arm-kernel,
	wangkefeng.wang, tanxiaofei, mawupeng1, linmiaohe,
	naoya.horiguchi, james.morse, tongtiangen, gregkh, jarkko



On 2025/7/15 01:30, Rafael J. Wysocki wrote:
> Hi,
> 
> On Mon, Jul 14, 2025 at 1:54 PM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>>
>> On 2025/7/1 21:56, Rafael J. Wysocki wrote:
>>> On Tue, Jul 1, 2025 at 1:00 PM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>>>>
>>>>> On 2025/4/28 23:23, Will Deacon wrote:
>>>>>> On Fri, Apr 25, 2025 at 09:10:09AM +0800, Shuai Xue wrote:
>>>>>>> On 2025/4/25 09:00, Hanjun Guo wrote:
>>>>>>>> Calling force_sig(SIGBUS) directly in ghes_do_proc() is not my favourite
>>>>>>>> approach, but I can live with that. Please add
>>>>>>>>
>>>>>>>> Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Hanjun
>>>>>>>
>>>>>>> Thanks, Hanjun.
>>>>>>>
>>>>>>> @Rafael, @Catalin,
>>>>>>>
>>>>>>> Both patch 1 and patch 2 now have a Reviewed-by tag from Hanjun, one of the
>>>>>>> arm64 ACPI maintainers. Are you happy to pick up and queue this patch set via
>>>>>>> the ACPI tree or the arm64 tree?
>>>>>>
>>>>>> Since this primarily touches drivers/acpi/apei/ghes.c, I think it should
>>>>>> go via the ACPI tree and not the arm64 one.
>>>>>>
>>>>>> Will
>>>>>
>>>>> Hi, Will,
>>>>>
>>>>> Thank you for your confirmation :)
>>>>>
>>>>> @Rafael, do you have any further comments on this patch set?
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Best Regards,
>>>>> Shuai
>>>>
>>>> Hi, all,
>>>>
>>>> Gentle ping.
>>>>
>>>> Is the ACPI or APEI tree still active? Looking forward to any response.
>>>
>>> For APEI changes, you need an ACK from one of the reviewers listed in
>>> the MAINTAINERS entry for APEI.
>>>
>>> Thanks!
>>
>> Hi, Rafael
>>
>> Sorry, I missed your email, which went to spam :(
>>
>> Catalin, the arm64 maintainer, pointed out that:
>>
>>> James Morse is listed as reviewer of the ACPI APEI code but he's busy
>>> with resctrl/MPAM. Adding Lorenzo, Sudeep and Hanjun as arm64 ACPI
>>> maintainers, hopefully they can help.
>>
>> And Hanjun explicitly gave his Reviewed-by tag in this thread. Is that
>> sufficient for you to merge the series?
> 
> Not really.
> 
> I need an ACK or R-by from a reviewer listed in the APEI entry in MAINTAINERS.

Hi Rafael,

I understand your requirement for an ACK/R-by from the APEI reviewers 
listed in MAINTAINERS.

So, @Tony, @James, @Borislav, @Len,

Gentle ping, we need your help to review and ack this patch set.

If I recall correctly, Rafael has mentioned this issue at least three 
times in previous emails, but we still haven't received an explicit 
response from the APEI maintainers.


> 
> If James Morse is not able to fill that role (and AFAICS he's not been
> for quite some time now), I'd expect someone else to step up.
> 
> Thanks!

Thank you for the clarification.

I'd like to volunteer to help with APEI code reviews. I have been 
working with APEI-related code and am familiar with the ACPI error 
handling mechanisms.

I'm willing to start by contributing reviews and help move pending APEI 
patches forward. If the community finds my contributions valuable and 
believes I have the necessary expertise, I would welcome the opportunity 
to be formally set up as an APEI reviewer in the future.

Thanks for considering this, and I look forward to your guidance.

Thanks.
Shuai

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-07-14 17:30                       ` Rafael J. Wysocki
  2025-07-15  2:03                         ` Shuai Xue
@ 2025-07-15  2:46                         ` Hanjun Guo
  2025-07-15 12:06                         ` Mauro Carvalho Chehab
  2 siblings, 0 replies; 22+ messages in thread
From: Hanjun Guo @ 2025-07-15  2:46 UTC (permalink / raw)
  To: Rafael J. Wysocki, Shuai Xue
  Cc: Catalin Marinas, Will Deacon, Luck, Tony, linux-mm, linux-kernel,
	akpm, linux-edac, x86, justin.he, ardb, ying.huang, ashish.kalra,
	baolin.wang, tglx, dave.hansen, lenb, hpa, robert.moore, lvying6,
	xiexiuqi, zhuo.song, sudeep.holla, lpieralisi, linux-acpi,
	yazen.ghannam, mark.rutland, mingo, robin.murphy,
	Jonathan.Cameron, bp, linux-arm-kernel, wangkefeng.wang,
	tanxiaofei, mawupeng1, linmiaohe, naoya.horiguchi, james.morse,
	tongtiangen, gregkh, jarkko

On 2025/7/15 1:30, Rafael J. Wysocki wrote:
>>> For APEI changes, you need an ACK from one of the reviewers listed in
>>> the MAINTAINERS entry for APEI.
>>>
>>> Thanks!
>> Hi, Rafael
>>
>> Sorry, I missed your email, which went to spam (:
>>
>> The ARM maintainer, @Catalin, pointed out that:
>>
>>   > James Morse is listed as reviewer of the ACPI APEI code but he's busy
>>   > with resctrl/MPAM. Adding Lorenzo, Sudeep and Hanjun as arm64 ACPI
>>   > maintainers, hopefully they can help.
>>
>> And Hanjun explicitly gave his Reviewed-by tag in this thread. Is that
>> enough for you to merge?
> Not really.
> 
> I need an ACK or R-by from a reviewer listed in the APEI entry in MAINTAINERS.
> 
> If James Morse is not able to fill that role (and AFAICS he's not been
> for quite some time now), I'd expect someone else to step up.

Please count me in. I have been working on ACPI for years, and on RAS 
feature development for both the x86 and arm64 architectures.

I'm pretty familiar with the ACPI spec, including APEI, which will help
me do the review work.

Thanks
Hanjun

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-07-14 17:30                       ` Rafael J. Wysocki
  2025-07-15  2:03                         ` Shuai Xue
  2025-07-15  2:46                         ` Hanjun Guo
@ 2025-07-15 12:06                         ` Mauro Carvalho Chehab
  2025-07-15 12:40                           ` Rafael J. Wysocki
  2 siblings, 1 reply; 22+ messages in thread
From: Mauro Carvalho Chehab @ 2025-07-15 12:06 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Shuai Xue, Catalin Marinas, Will Deacon, Hanjun Guo, Luck, Tony,
	linux-mm, linux-kernel, akpm, linux-edac, x86, justin.he, ardb,
	ying.huang, ashish.kalra, baolin.wang, tglx, dave.hansen, lenb,
	hpa, robert.moore, lvying6, xiexiuqi, zhuo.song, sudeep.holla,
	lpieralisi, linux-acpi, yazen.ghannam, mark.rutland, mingo,
	robin.murphy, Jonathan.Cameron, bp, linux-arm-kernel,
	wangkefeng.wang, tanxiaofei, mawupeng1, linmiaohe,
	naoya.horiguchi, james.morse, tongtiangen, gregkh, jarkko

On Mon, 14 Jul 2025 19:30:19 +0200
"Rafael J. Wysocki" <rafael@kernel.org> wrote:

> Hi,
> 
> On Mon, Jul 14, 2025 at 1:54 PM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
> >
> > On 2025/7/1 21:56, Rafael J. Wysocki wrote:
> > > On Tue, Jul 1, 2025 at 1:00 PM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
> > >>
> > >>   >On 2025/4/28 23:23, Will Deacon wrote:
> > >>   >> On Fri, Apr 25, 2025 at 09:10:09AM +0800, Shuai Xue wrote:
> > >>   >>> On 2025/4/25 09:00, Hanjun Guo wrote:
> > >>   >>>> Call force_sig(SIGBUS) directly in ghes_do_proc() is not my favourite,
> > >>   >>>> but I can bear that, please add
> > >>   >>>>
> > >>   >>>> Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
> > >>   >>>>
> > >>   >>>> Thanks
> > >>   >>>> Hanjun  
> > >>   >>>
> > >>   >>> Thanks. Hanjun.
> > >>   >>>
> > >>   >>> @Rafael, @Catalin,
> > >>   >>>
> > >>   >>> Both patch 1 and 2 now have a Reviewed-by tag from one of the arm64
> > >>   >>> ACPI maintainers, Hanjun. Are you happy to pick and queue this patch
> > >>   >>> set in the acpi tree or the arm tree?
> > >>   >>
> > >>   >> Since this primarily touches drivers/acpi/apei/ghes.c, I think it should
> > >>   >> go via the ACPI tree and not the arm64 one.
> > >>   >>
> > >>   >> Will  
> > >>   >
> > >>   >Hi, Will,
> > >>   >
> > >>   >Thank you for your confirmation :)
> > >>   >
> > >>   >@Rafael, do you have more comments on this patch set?
> > >>   >
> > >>   >Thank you.
> > >>   >
> > >>   >Best Regards,
> > >>   >Shuai  
> > >>
> > >> Hi, all,
> > >>
> > >> Gentle ping.
> > >>
> > >> Is the ACPI or APEI tree still active? Looking forward to any response.
> > >
> > > For APEI changes, you need an ACK from one of the reviewers listed in
> > > the MAINTAINERS entry for APEI.
> > >
> > > Thanks!  
> >
> > Hi, Rafael
> >
> > Sorry, I missed your email, which went to spam (:
> >
> > The ARM maintainer, @Catalin, pointed out that:
> >  
> >  > James Morse is listed as reviewer of the ACPI APEI code but he's busy
> >  > with resctrl/MPAM. Adding Lorenzo, Sudeep and Hanjun as arm64 ACPI
> >  > maintainers, hopefully they can help.  
> >
> > And Hanjun explicitly gave his Reviewed-by tag in this thread. Is that
> > enough for you to merge?
> 
> Not really.
> 
> I need an ACK or R-by from a reviewer listed in the APEI entry in MAINTAINERS.
> 
> If James Morse is not able to fill that role (and AFAICS he's not been
> for quite some time now), I'd expect someone else to step up.

Rafael,

If you want, I can step up to help with APEI review. Besides my past
work with RAS/EDAC, I'm currently working on adding APEI injection
in QEMU (focused for now on ARM error injection via SEA and GPIO).

Regards,
Mauro

* Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  2025-07-15 12:06                         ` Mauro Carvalho Chehab
@ 2025-07-15 12:40                           ` Rafael J. Wysocki
  0 siblings, 0 replies; 22+ messages in thread
From: Rafael J. Wysocki @ 2025-07-15 12:40 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Rafael J. Wysocki, Shuai Xue, Catalin Marinas, Will Deacon,
	Hanjun Guo, Luck, Tony, linux-mm, linux-kernel, akpm, linux-edac,
	x86, justin.he, ardb, ying.huang, ashish.kalra, baolin.wang, tglx,
	dave.hansen, lenb, hpa, robert.moore, lvying6, xiexiuqi,
	zhuo.song, sudeep.holla, lpieralisi, linux-acpi, yazen.ghannam,
	mark.rutland, mingo, robin.murphy, Jonathan.Cameron, bp,
	linux-arm-kernel, wangkefeng.wang, tanxiaofei, mawupeng1,
	linmiaohe, naoya.horiguchi, james.morse, tongtiangen, gregkh,
	jarkko

On Tue, Jul 15, 2025 at 2:06 PM Mauro Carvalho Chehab
<mchehab+huawei@kernel.org> wrote:
>
> On Mon, 14 Jul 2025 19:30:19 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
> > Hi,
> >
> > On Mon, Jul 14, 2025 at 1:54 PM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
> > >
> > > On 2025/7/1 21:56, Rafael J. Wysocki wrote:
> > > > On Tue, Jul 1, 2025 at 1:00 PM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
> > > >>
> > > >>   >On 2025/4/28 23:23, Will Deacon wrote:
> > > >>   >> On Fri, Apr 25, 2025 at 09:10:09AM +0800, Shuai Xue wrote:
> > > >>   >>> On 2025/4/25 09:00, Hanjun Guo wrote:
> > > >>   >>>> Call force_sig(SIGBUS) directly in ghes_do_proc() is not my favourite,
> > > >>   >>>> but I can bear that, please add
> > > >>   >>>>
> > > >>   >>>> Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
> > > >>   >>>>
> > > >>   >>>> Thanks
> > > >>   >>>> Hanjun
> > > >>   >>>
> > > >>   >>> Thanks. Hanjun.
> > > >>   >>>
> > > >>   >>> @Rafael, @Catalin,
> > > >>   >>>
> > > >>   >>> Both patch 1 and 2 now have a Reviewed-by tag from one of the arm64
> > > >>   >>> ACPI maintainers, Hanjun. Are you happy to pick and queue this patch
> > > >>   >>> set in the acpi tree or the arm tree?
> > > >>   >>
> > > >>   >> Since this primarily touches drivers/acpi/apei/ghes.c, I think it should
> > > >>   >> go via the ACPI tree and not the arm64 one.
> > > >>   >>
> > > >>   >> Will
> > > >>   >
> > > >>   >Hi, Will,
> > > >>   >
> > > >>   >Thank you for your confirmation :)
> > > >>   >
> > > >>   >@Rafael, do you have more comments on this patch set?
> > > >>   >
> > > >>   >Thank you.
> > > >>   >
> > > >>   >Best Regards,
> > > >>   >Shuai
> > > >>
> > > >> Hi, all,
> > > >>
> > > >> Gentle ping.
> > > >>
> > > >> Is the ACPI or APEI tree still active? Looking forward to any response.
> > > >
> > > > For APEI changes, you need an ACK from one of the reviewers listed in
> > > > the MAINTAINERS entry for APEI.
> > > >
> > > > Thanks!
> > >
> > > Hi, Rafael
> > >
> > > Sorry, I missed your email, which went to spam (:
> > >
> > > The ARM maintainer, @Catalin, pointed out that:
> > >
> > >  > James Morse is listed as reviewer of the ACPI APEI code but he's busy
> > >  > with resctrl/MPAM. Adding Lorenzo, Sudeep and Hanjun as arm64 ACPI
> > >  > maintainers, hopefully they can help.
> > >
> > > And Hanjun explicitly gave his Reviewed-by tag in this thread. Is that
> > > enough for you to merge?
> >
> > Not really.
> >
> > I need an ACK or R-by from a reviewer listed in the APEI entry in MAINTAINERS.
> >
> > If James Morse is not able to fill that role (and AFAICS he's not been
> > for quite some time now), I'd expect someone else to step up.

Hi Mauro,

> Rafael,
>
> If you want, I can step up to help with APEI review. Besides my past
> work with RAS/EDAC, I'm currently working on adding APEI injection
> in QEMU (focused for now on ARM error injection via SEA and GPIO).

Thank you!

OK, let me send a MAINTAINERS update with a list of new APEI reviewers.

end of thread, other threads:[~2025-07-15 12:40 UTC | newest]

Thread overview: 22+ messages
2025-04-04 11:20 [RESEND PATCH v18 0/2] ACPI: APEI: handle synchronous errors in task work Shuai Xue
2025-04-04 11:20 ` [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered Shuai Xue
2025-04-14 14:37   ` Hanjun Guo
2025-04-14 15:02     ` Shuai Xue
2025-04-18  7:48       ` Hanjun Guo
2025-04-18 12:35         ` Shuai Xue
2025-04-25  1:00           ` Hanjun Guo
2025-04-25  1:10             ` Shuai Xue
2025-04-28 15:23               ` Will Deacon
2025-05-14  1:35                 ` Shuai Xue
2025-07-01 11:00                 ` Shuai Xue
2025-07-01 13:56                   ` Rafael J. Wysocki
2025-07-14 11:54                     ` Shuai Xue
2025-07-14 17:30                       ` Rafael J. Wysocki
2025-07-15  2:03                         ` Shuai Xue
2025-07-15  2:46                         ` Hanjun Guo
2025-07-15 12:06                         ` Mauro Carvalho Chehab
2025-07-15 12:40                           ` Rafael J. Wysocki
2025-04-04 11:20 ` [RESEND PATCH v18 2/2] ACPI: APEI: handle synchronous exceptions in task work Shuai Xue
2025-04-14 14:48   ` Hanjun Guo
2025-04-14 14:56     ` Shuai Xue
2025-04-08  2:34 ` [RESEND PATCH v18 0/2] ACPI: APEI: handle synchronous errors " Hanjun Guo
