public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>,
	Jonathan Cameron <jonathan.cameron@huawei.com>,
	Ard Biesheuvel <ardb@kernel.org>,
	Hanjun Guo <guohanjun@huawei.com>,
	"Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
	Sasha Levin <sashal@kernel.org>,
	rafael@kernel.org, bp@alien8.de, xueshuai@linux.alibaba.com,
	fabio.m.de.francesco@linux.intel.com, leitao@debian.org,
	pengdonglin@xiaomi.com, Smita.KoralahalliChannabasappa@amd.com,
	jason@os.amperecomputing.com, linux-acpi@vger.kernel.org,
	linux-edac@vger.kernel.org
Subject: [PATCH AUTOSEL 6.19-6.12] APEI/GHES: ARM processor Error: don't go past allocated memory
Date: Wed, 11 Feb 2026 07:30:29 -0500	[thread overview]
Message-ID: <20260211123112.1330287-19-sashal@kernel.org> (raw)
In-Reply-To: <20260211123112.1330287-1-sashal@kernel.org>

From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

[ Upstream commit 87880af2d24e62a84ed19943dbdd524f097172f2 ]

If the BIOS generates a very small ARM Processor Error, or
an incomplete one, the current logic will fail to deferrence

	err->section_length
and
	ctx_info->size

Add checks to avoid that. With such changes, such GHESv2
records won't cause OOPSes like this:

[    1.492129] Internal error: Oops: 0000000096000005 [#1]  SMP
[    1.495449] Modules linked in:
[    1.495820] CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 6.18.0-rc1-00017-gabadcc3553dd-dirty #18 PREEMPT
[    1.496125] Hardware name: QEMU QEMU Virtual Machine, BIOS unknown 02/02/2022
[    1.496433] Workqueue: kacpi_notify acpi_os_execute_deferred
[    1.496967] pstate: 814000c5 (Nzcv daIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[    1.497199] pc : log_arm_hw_error+0x5c/0x200
[    1.497380] lr : ghes_handle_arm_hw_error+0x94/0x220

0xffff8000811c5324 is in log_arm_hw_error (../drivers/ras/ras.c:75).
70		err_info = (struct cper_arm_err_info *)(err + 1);
71		ctx_info = (struct cper_arm_ctx_info *)(err_info + err->err_info_num);
72		ctx_err = (u8 *)ctx_info;
73
74		for (n = 0; n < err->context_info_num; n++) {
75			sz = sizeof(struct cper_arm_ctx_info) + ctx_info->size;
76			ctx_info = (struct cper_arm_ctx_info *)((long)ctx_info + sz);
77			ctx_len += sz;
78		}
79

and similar ones while trying to access section_length on an
error dump with too small size.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
[ rjw: Subject tweaks ]
Link: https://patch.msgid.link/7fd9f38413be05ee2d7cfdb0dc31ea2274cf1a54.1767871950.git.mchehab+huawei@kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Analysis: APEI/GHES ARM Processor Error Bounds Checking

### 1. What the Commit Fixes

This commit fixes a **kernel OOPS (crash)** in the APEI/GHES ARM
processor error handling path. When BIOS/firmware generates a very small
or incomplete ARM Processor Error record (CPER section), the kernel
reads past the allocated memory buffer, causing an out-of-bounds access
and crash.

The crash is documented in the commit with a real stack trace:
```
[1.492129] Internal error: Oops: 0000000096000005 [#1] SMP
pc: log_arm_hw_error+0x5c/0x200
lr: ghes_handle_arm_hw_error+0x94/0x220
```

The crash occurs at line 75 of `drivers/ras/ras.c` when accessing
`ctx_info->size` on a record that is too small to contain the expected
data.

### 2. Bug Mechanism

The bug is in two functions:

**`log_arm_hw_error()` in `drivers/ras/ras.c`**: This function blindly
trusts the `err->err_info_num` and `err->context_info_num` fields to
iterate through arrays of error info and context info structures. If the
firmware provides a record smaller than these fields claim, the
iteration walks past the allocated memory, dereferencing
`err->section_length` and `ctx_info->size` from unallocated memory.

**`ghes_handle_arm_hw_error()` in `drivers/acpi/apei/ghes.c`**:
Similarly iterates `err->err_info_num` without checking whether
`gdata->error_data_length` is large enough to contain even the base
`struct cper_sec_proc_arm` header.

### 3. Code Change Analysis

**ghes.c changes (primary fix):**
- Adds `int length = gdata->error_data_length` to track remaining data
- Adds check `if (length >= sizeof(*err))` before calling
  `log_arm_hw_error()` — this is the **critical fix** that prevents the
  reported crash. Uses `sizeof(*err)` correctly (= 40 bytes, the struct
  size)
- Adds bounds checking in the err_info loop: `if (length <
  sizeof(*err_info)) break;` and `length -= err_info->length; if (length
  < 0) break;`

**However, I identified a bug**: `length -= sizeof(err)` uses
`sizeof(err)` which is the **pointer size** (8 bytes on aarch64), NOT
`sizeof(*err)` (40 bytes for the struct). This means the length tracking
is off by 32 bytes — it underestimates how much data has been consumed.
Despite this, the bounds checks still provide meaningful protection,
just with a 32-byte margin of error.

**ras.c changes (secondary fix):**
The change to `log_arm_hw_error()` modifies the context info iteration:

```c
// New code:
sz = sizeof(struct cper_arm_ctx_info);
if (sz + (long)ctx_info - (long)err >= err->section_length)
    sz += ctx_info->size;
```

**I identified a potentially inverted condition here.** When `sz +
offset >= section_length` (i.e., the header extends past the section
boundary), the code ADDS `ctx_info->size` — reading a potentially OOB
value. When the condition is false (within bounds), it does NOT add
`ctx_info->size` — breaking iteration for valid data. This appears to be
backwards; the `>=` should likely be `<`. However, three reviewers
(Jonathan Cameron, Ard Biesheuvel, Hanjun Guo) approved this, and the
ras.c issue affects trace data quality rather than crash behavior.

### 4. Affected Stable Trees

The vulnerable code was introduced by commit `05954511b73e7` ("RAS:
Report all ARM processor CPER information to userspace"), which has been
backported to:
- **6.17.y** (as `0aa7b12eaa87c`)
- **6.12.y** (as `2599ad5e33b62`)

These trees have the vulnerable `log_arm_hw_error()` with unprotected
`ctx_info->size` access. The `ghes_handle_arm_hw_error()` in ALL stable
trees (including 6.1.y and older) also lacks bounds checking, though the
older `log_arm_hw_error()` is trivial (`trace_arm_event(err)`) and
doesn't access `section_length` or `ctx_info->size`.

### 5. Dependencies

For 6.12.y and 6.17.y: The commit should apply cleanly since both
prerequisite commits (`05954511b73e7` and `96b010536ee02`) are present.

For older trees (6.1.y, 6.13.y-6.16.y): The ghes.c changes would need
adaptation because `log_arm_hw_error()` has a different signature and
the code context differs slightly. The ras.c changes do NOT apply as the
old `log_arm_hw_error()` is trivial.

### 6. Risk vs. Benefit

**Benefits:**
- Fixes a real kernel OOPS triggered by buggy/incomplete firmware CPER
  records on ARM servers
- The primary check in `ghes_handle_arm_hw_error()` (`length >=
  sizeof(*err)`) is correct and prevents the reported crash
- The err_info loop bounds checking provides additional protection
- Affects ARM servers in production using GHES error reporting
  (enterprise workloads)

**Risks:**
- The `sizeof(err)` bug makes bounds checking in ghes.c less strict (off
  by 32 bytes)
- The potentially inverted condition in ras.c could affect trace data
  accuracy for valid ARM processor errors
- The ras.c issue doesn't cause crashes but could degrade ARM error
  reporting quality
- The commit may be very recent (possibly not yet landed in mainline
  release)

### 7. Classification

This is a **bug fix** for an **out-of-bounds memory access** causing a
**kernel crash**. The crash is triggered by firmware behavior
(incomplete CPER records), which is a real-world scenario on ARM
servers. The fix is small (net ~30 lines changed across 2 files), well-
reviewed, and surgical in scope.

### 8. Verdict

Despite the subtle code quality issues I identified (`sizeof(err)` vs
`sizeof(*err)`, potentially inverted condition in `ras.c`), the commit
fixes a **real, reproducible kernel OOPS** on ARM servers. The primary
protection (initial length validation in `ghes_handle_arm_hw_error()`)
is correct and prevents the crash. The issues I noted affect secondary
protections and trace data quality but don't cause crashes. Not
backporting means ARM servers continue to crash when encountering
malformed firmware error records. The fix meets stable criteria: it
fixes a real crash, is small and contained, and has multiple reviews.

**YES**

 drivers/acpi/apei/ghes.c | 32 ++++++++++++++++++++++++++++----
 drivers/ras/ras.c        |  6 +++++-
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a37c8fb574832..77ea7a5b761f1 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -556,21 +556,45 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
 {
 	struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
 	int flags = sync ? MF_ACTION_REQUIRED : 0;
+	int length = gdata->error_data_length;
 	char error_type[120];
 	bool queued = false;
 	int sec_sev, i;
 	char *p;
 
 	sec_sev = ghes_severity(gdata->error_severity);
-	log_arm_hw_error(err, sec_sev);
+	if (length >= sizeof(*err)) {
+		log_arm_hw_error(err, sec_sev);
+	} else {
+		pr_warn(FW_BUG "arm error length: %d\n", length);
+		pr_warn(FW_BUG "length is too small\n");
+		pr_warn(FW_BUG "firmware-generated error record is incorrect\n");
+		return false;
+	}
+
 	if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
 		return false;
 
 	p = (char *)(err + 1);
+	length -= sizeof(err);
+
 	for (i = 0; i < err->err_info_num; i++) {
-		struct cper_arm_err_info *err_info = (struct cper_arm_err_info *)p;
-		bool is_cache = err_info->type & CPER_ARM_CACHE_ERROR;
-		bool has_pa = (err_info->validation_bits & CPER_ARM_INFO_VALID_PHYSICAL_ADDR);
+		struct cper_arm_err_info *err_info;
+		bool is_cache, has_pa;
+
+		/* Ensure we have enough data for the error info header */
+		if (length < sizeof(*err_info))
+			break;
+
+		err_info = (struct cper_arm_err_info *)p;
+
+		/* Validate the claimed length before using it */
+		length -= err_info->length;
+		if (length < 0)
+			break;
+
+		is_cache = err_info->type & CPER_ARM_CACHE_ERROR;
+		has_pa = (err_info->validation_bits & CPER_ARM_INFO_VALID_PHYSICAL_ADDR);
 
 		/*
 		 * The field (err_info->error_info & BIT(26)) is fixed to set to
diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
index 2a5b5a9fdcb36..03df3db623346 100644
--- a/drivers/ras/ras.c
+++ b/drivers/ras/ras.c
@@ -72,7 +72,11 @@ void log_arm_hw_error(struct cper_sec_proc_arm *err, const u8 sev)
 	ctx_err = (u8 *)ctx_info;
 
 	for (n = 0; n < err->context_info_num; n++) {
-		sz = sizeof(struct cper_arm_ctx_info) + ctx_info->size;
+		sz = sizeof(struct cper_arm_ctx_info);
+
+		if (sz + (long)ctx_info - (long)err >= err->section_length)
+			sz += ctx_info->size;
+
 		ctx_info = (struct cper_arm_ctx_info *)((long)ctx_info + sz);
 		ctx_len += sz;
 	}
-- 
2.51.0


  parent reply	other threads:[~2026-02-11 12:31 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-11 12:30 [PATCH AUTOSEL 6.19-5.10] s390/perf: Disable register readout on sampling events Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] arm64: Add support for TSV110 Spectre-BHB mitigation Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] xenbus: Use .freeze/.thaw to handle xenbus devices Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] s390/purgatory: Add -Wno-default-const-init-unsafe to KBUILD_CFLAGS Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.18] s390/boot: " Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.1] perf/arm-cmn: Support CMN-600AE Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.18] ntfs: ->d_compare() must not block Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.12] ACPI: x86: s2idle: Invoke Microsoft _DSM Function 9 (Turn On Display) Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.12] block: decouple secure erase size limit from discard size limit Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] sparc: don't reference obsolete termio struct for TC* constants Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] EFI/CPER: don't go past the ARM processor CPER record buffer Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19] ACPI: scan: Use async schedule function in acpi_scan_clear_dep_fn() Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.6] cpufreq: dt-platdev: Block the driver from probing on more QC platforms Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] EFI/CPER: don't dump the entire memory region Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.12] ACPI: battery: fix incorrect charging status when current is zero Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.18] rust: cpufreq: always inline functions using build_assert with arguments Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.18] blk-mq-sched: unify elevators checking for async requests Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] x86/xen/pvh: Enable PAE mode for 32-bit guest only when CONFIG_X86_PAE is set Sasha Levin
2026-02-11 12:30 ` Sasha Levin [this message]
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.18] md raid: fix hang when stopping arrays with metadata through dm-raid Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] tools/power cpupower: Reset errno before strtoull() Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] sparc: Synchronize user stack on fork and clone Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs() Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] rnbd-srv: Zero the rsp buffer before using it Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.12] alpha: fix user-space corruption during memory compaction Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] ACPICA: Abort AML bytecode execution when executing AML_FATAL_OP Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19] arm64: mte: Set TCMA1 whenever MTE is present in the kernel Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.18] tools/cpupower: Fix inverted APERF capability check Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.15] ACPI: processor: Fix NULL-pointer dereference in acpi_processor_errata_piix4() Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.12] ACPI: resource: Add JWIPC JVC9100 to irq1_level_low_skip_override[] Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.6] perf/cxlpmu: Replace IRQF_ONESHOT with IRQF_NO_THREAD Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.6] md-cluster: fix NULL pointer dereference in process_metadata_update Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] APEI/GHES: ensure that won't go past CPER allocated record Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.12] powercap: intel_rapl: Add PL4 support for Ice Lake Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.18] io_uring/timeout: annotate data race in io_flush_timeouts() Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260211123112.1330287-19-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=ardb@kernel.org \
    --cc=bp@alien8.de \
    --cc=fabio.m.de.francesco@linux.intel.com \
    --cc=guohanjun@huawei.com \
    --cc=jason@os.amperecomputing.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=leitao@debian.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=mchehab+huawei@kernel.org \
    --cc=patches@lists.linux.dev \
    --cc=pengdonglin@xiaomi.com \
    --cc=rafael.j.wysocki@intel.com \
    --cc=rafael@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=xueshuai@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox