Archive-only list for patches
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Shuai Xue <xueshuai@linux.alibaba.com>,
	Jarkko Sakkinen <jarkko@kernel.org>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Yazen Ghannam <yazen.ghannam@amd.com>,
	Jane Chu <jane.chu@oracle.com>, Hanjun Guo <guohanjun@huawei.com>,
	"Rafael J . Wysocki" <rafael.j.wysocki@intel.com>,
	Sasha Levin <sashal@kernel.org>,
	rafael@kernel.org, ira.weiny@intel.com, dave.jiang@intel.com,
	tony.luck@intel.com, Smita.KoralahalliChannabasappa@amd.com,
	peterz@infradead.org, quic_hyiwei@quicinc.com, bp@alien8.de,
	linux-acpi@vger.kernel.org
Subject: [PATCH AUTOSEL 6.12 12/69] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
Date: Sun,  3 Aug 2025 20:30:22 -0400	[thread overview]
Message-ID: <20250804003119.3620476-12-sashal@kernel.org> (raw)
In-Reply-To: <20250804003119.3620476-1-sashal@kernel.org>

From: Shuai Xue <xueshuai@linux.alibaba.com>

[ Upstream commit 79a5ae3c4c5eb7e38e0ebe4d6bf602d296080060 ]

If a synchronous error is detected as a result of user-space process
triggering a 2-bit uncorrected error, the CPU will take a synchronous
error exception such as Synchronous External Abort (SEA) on Arm64. The
kernel will queue a memory_failure() work which poisons the related
page, unmaps the page, and then sends a SIGBUS to the process, so that
a system wide panic can be avoided.

However, no memory_failure() work will be queued when abnormal
synchronous errors occur. These errors can include situations like
invalid PA, unexpected severity, no memory failure config support,
invalid GUID section, etc. In such a case, the user-space process will
trigger SEA again.  This loop can potentially exceed the platform
firmware threshold or even trigger a kernel hard lockup, leading to a
system reboot.

Fix it by performing a force kill if no memory_failure() work is queued
for synchronous errors.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Link: https://patch.msgid.link/20250714114212.31660-2-xueshuai@linux.alibaba.com
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

**Backport Status: YES**

This commit should be backported to stable kernel trees for the
following reasons:

1. **Critical Bug Fix**: The commit fixes a serious issue where
   synchronous memory errors that cannot be recovered (abnormal
   synchronous errors) would cause an infinite loop. The commit message
   clearly describes how "the user-space process will trigger SEA again.
   This loop can potentially exceed the platform firmware threshold or
   even trigger a kernel hard lockup, leading to a system reboot."

2. **Security and Stability Impact**: The bug can lead to:
   - Infinite exception loops
   - Platform firmware threshold exceeded
   - Kernel hard lockups
   - Unexpected system reboots

   These are severe stability issues that affect system reliability.

3. **Small and Contained Fix**: The code change is minimal and well-
   contained:
  ```c
  + /*
  +  * If no memory failure work is queued for abnormal synchronous
  +  * errors, do a force kill.
  +  */
  + if (sync && !queued) {
  +     dev_err(ghes->dev,
  +         HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error
  (SIGBUS)\n",
  +         current->comm, task_pid_nr(current));
  +     force_sig(SIGBUS);
  + }
  ```
  The fix adds only 10 lines of code that check if we're in a
  synchronous context (`sync`) and no memory failure work was queued
  (`!queued`), then sends SIGBUS to the current process.

4. **Clear Problem and Solution**: The commit addresses a specific gap
   in error handling. When `ghes_handle_memory_failure()` returns false
   (meaning no memory_failure() work was queued) for synchronous errors,
   the process that triggered the error continues execution and will hit
   the same error again, creating an infinite loop.

5. **Follows Stable Rules**: This fix meets the stable kernel criteria:
   - Fixes a real bug that affects users
   - Small change (< 100 lines)
   - Obviously correct and tested (has multiple Reviewed-by tags)
   - Does not add new features
   - Addresses a serious issue (system stability/reboot)

6. **Related to Previous Work**: This appears to be part of a series
   addressing synchronous error handling issues in GHES. The commit
   c1f1fda14137 mentioned in the git log shows ongoing work to properly
   handle synchronous exceptions, and this commit addresses a critical
   gap where abnormal synchronous errors weren't being handled at all.

The fix ensures that when a synchronous memory error cannot be properly
handled through the normal memory_failure() path, the kernel will at
least terminate the offending process with SIGBUS rather than allowing
it to continue and create an infinite exception loop that can crash the
system.

 drivers/acpi/apei/ghes.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 6cf40e8ac321..94e3d3fe11ae 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -799,6 +799,17 @@ static bool ghes_do_proc(struct ghes *ghes,
 		}
 	}
 
+	/*
+	 * If no memory failure work is queued for abnormal synchronous
+	 * errors, do a force kill.
+	 */
+	if (sync && !queued) {
+		dev_err(ghes->dev,
+			HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
+			current->comm, task_pid_nr(current));
+		force_sig(SIGBUS);
+	}
+
 	return queued;
 }
 
-- 
2.39.5


  parent reply	other threads:[~2025-08-04  0:31 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-04  0:30 [PATCH AUTOSEL 6.12 01/69] usb: xhci: print xhci->xhc_state when queue_command failed Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 02/69] platform/x86/amd: pmc: Add Lenovo Yoga 6 13ALC6 to pmc quirk list Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 03/69] cpufreq: CPPC: Mark driver with NEED_UPDATE_LIMITS flag Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 04/69] selftests/futex: Define SYS_futex on 32-bit architectures with 64-bit time_t Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 05/69] usb: typec: ucsi: psy: Set current max to 100mA for BC 1.2 and Default Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 06/69] regulator: core: repeat voltage setting request for stepped regulators Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 07/69] usb: xhci: Avoid showing warnings for dying controller Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 08/69] usb: xhci: Set avg_trb_len = 8 for EP0 during Address Device Command Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 09/69] usb: xhci: Avoid showing errors during surprise removal Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 10/69] firmware: qcom: scm: initialize tzmem before marking SCM as available Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 11/69] soc: qcom: rpmh-rsc: Add RSC version 4 support Sasha Levin
2025-08-04  0:30 ` Sasha Levin [this message]
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 13/69] remoteproc: imx_rproc: skip clock enable when M-core is managed by the SCU Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 14/69] usb: typec: tcpm/tcpci_maxim: fix irq wake usage Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 15/69] pmdomain: ti: Select PM_GENERIC_DOMAINS Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 16/69] gpio: wcd934x: check the return value of regmap_update_bits() Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 17/69] cpufreq: Exit governor when failed to start old governor Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 18/69] cpufreq: intel_pstate: Add Granite Rapids support in no-HWP mode Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 19/69] ARM: rockchip: fix kernel hang during smp initialization Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 20/69] PM / devfreq: governor: Replace sscanf() with kstrtoul() in set_freq_store() Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 21/69] EDAC/synopsys: Clear the ECC counters on init Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 22/69] ASoC: soc-dapm: set bias_level if snd_soc_dapm_set_bias_level() was successed Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 23/69] thermal/drivers/qcom-spmi-temp-alarm: Enable stage 2 shutdown when required Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 24/69] tools/nolibc: define time_t in terms of __kernel_old_time_t Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 25/69] iio: adc: ad_sigma_delta: don't overallocate scan buffer Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 26/69] gpio: tps65912: check the return value of regmap_update_bits() Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 27/69] mfd: tps6594: Add TI TPS652G1 support Sasha Levin
2025-08-18  6:34   ` Michael Walle
2025-08-19  2:01     ` Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 28/69] ARM: tegra: Use I/O memcpy to write to IRAM Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 29/69] tools/build: Fix s390(x) cross-compilation with clang Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 30/69] selftests: tracing: Use mutex_unlock for testing glob filter Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 31/69] ACPI: PRM: Reduce unnecessary printing to avoid user confusion Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 32/69] firmware: arm_scmi: power_control: Ensure SCMI_SYSPOWER_IDLE is set early during resume Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 33/69] firmware: tegra: Fix IVC dependency problems Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 34/69] pwm: sifive: Fix PWM algorithm and clarify inverted compare behavior Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 35/69] PM: runtime: Clear power.needs_force_resume in pm_runtime_reinit() Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 36/69] thermal: sysfs: Return ENODATA instead of EAGAIN for reads Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 37/69] PM: sleep: console: Fix the black screen issue Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 38/69] ACPI: processor: fix acpi_object initialization Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 39/69] mmc: sdhci-msm: Ensure SD card power isn't ON when card removed Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 40/69] ACPI: APEI: GHES: add TAINT_MACHINE_CHECK on GHES panic path Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 41/69] selftests: vDSO: vdso_test_getrandom: Always print TAP header Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 42/69] pps: clients: gpio: fix interrupt handling order in remove path Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 43/69] reset: brcmstb: Enable reset drivers for ARCH_BCM2835 Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 44/69] char: misc: Fix improper and inaccurate error code returned by misc_init() Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 45/69] mei: bus: Check for still connected devices in mei_cl_bus_dev_release() Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 46/69] mmc: rtsx_usb_sdmmc: Fix error-path in sd_set_power_mode() Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 47/69] platform/chrome: cros_ec_sensorhub: Retries when a sensor is not ready Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 48/69] ALSA: hda: Handle the jack polling always via a work Sasha Levin
2025-08-04  0:30 ` [PATCH AUTOSEL 6.12 49/69] ALSA: hda: Disable jack polling at shutdown Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 50/69] x86/bugs: Avoid warning when overriding return thunk Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 51/69] ASoC: hdac_hdmi: Rate limit logging on connection and disconnection Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 52/69] ALSA: intel8x0: Fix incorrect codec index usage in mixer for ICH4 Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 53/69] ASoC: SOF: topology: Parse the dapm_widget_tokens in case of DSPless mode Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 54/69] tty: serial: fix print format specifiers Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 55/69] ASoC: core: Check for rtd == NULL in snd_soc_remove_pcm_runtime() Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 56/69] usb: typec: intel_pmc_mux: Defer probe if SCU IPC isn't present Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 57/69] usb: core: usb_submit_urb: downgrade type check Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 58/69] usb: typec: fusb302: fix scheduling while atomic when using virtio-gpio Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 59/69] pm: cpupower: Fix the snapshot-order of tsc,mperf, clock in mperf_stop() Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 60/69] imx8m-blk-ctrl: set ISI panic write hurry level Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 61/69] soc: qcom: mdt_loader: Actually use the e_phoff Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 62/69] platform/x86: thinkpad_acpi: Handle KCOV __init vs inline mismatches Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 63/69] platform/chrome: cros_ec_typec: Defer probe on missing EC parent Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 64/69] ALSA: hda/ca0132: Fix buffer overflow in add_tuning_control Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 65/69] ALSA: pcm: Rewrite recalculate_boundary() to avoid costly loop Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 66/69] ALSA: usb-audio: Avoid precedence issues in mixer_quirks macros Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 67/69] iio: adc: ad7768-1: Ensure SYNC_IN pulse minimum timing requirement Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 68/69] ASoC: codecs: rt5640: Retry DEVICE_ID verification Sasha Levin
2025-08-04  0:31 ` [PATCH AUTOSEL 6.12 69/69] ASoC: qcom: use drvdata instead of component to keep id Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250804003119.3620476-12-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=bp@alien8.de \
    --cc=dave.jiang@intel.com \
    --cc=guohanjun@huawei.com \
    --cc=ira.weiny@intel.com \
    --cc=jane.chu@oracle.com \
    --cc=jarkko@kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=patches@lists.linux.dev \
    --cc=peterz@infradead.org \
    --cc=quic_hyiwei@quicinc.com \
    --cc=rafael.j.wysocki@intel.com \
    --cc=rafael@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tony.luck@intel.com \
    --cc=xueshuai@linux.alibaba.com \
    --cc=yazen.ghannam@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox