From: sashiko-bot@kernel.org
To: "Ahmed Tiba" <ahmed.tiba@arm.com>
Cc: linux-cxl@vger.kernel.org, devicetree@vger.kernel.org,
robh@kernel.org, conor+dt@kernel.org
Subject: Re: [PATCH v6 08/10] ACPI: APEI: share GHES CPER helpers
Date: Wed, 17 Jun 2026 14:11:13 +0000 [thread overview]
Message-ID: <20260617141114.3061B1F000E9@smtp.kernel.org> (raw)
In-Reply-To: <20260617-topics-ahmtib01-ras_ffh_arm_internal_review-v6-8-91f725174aa0@arm.com>
Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
Pre-existing issues:
- [Critical] Incorrect size subtraction in `ghes_handle_arm_hw_error()` inflates the remaining payload length, leading to out-of-bounds reads.
- [High] Memory allocated for aer_info is permanently leaked if the aer_recover_ring kfifo overflows.
- [High] The `queued` boolean flag is overwritten on each loop iteration, which can unnecessarily force a task to be killed with SIGBUS.
- [High] Memory allocated for PCI AER recovery is leaked if the associated PCI device is not found.
- [High] Potential deadlock due to missing IRQ disable in CXL work registration functions.
--
commit 552b224fbe0e1a9a6cbfb11eef23d5902aa40cf8
Author: Ahmed Tiba <ahmed.tiba@arm.com>
ACPI: APEI: share GHES CPER helpers
This patch moves the GHES CPER helper routines from ghes.c to ghes_cper.c to
allow them to be shared with other firmware-first providers. It also
serializes the ghes_estatus_pool initialization to prevent race conditions
and pool reuse issues.
> diff --git a/drivers/acpi/apei/ghes_cper.c b/drivers/acpi/apei/ghes_cper.c
> index 66bf1af4db00b..460fe12b6513c 100644
> --- a/drivers/acpi/apei/ghes_cper.c
> +++ b/drivers/acpi/apei/ghes_cper.c
> @@ -36,9 +47,449 @@
[ ... ]
> +bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
> + int sev, bool sync)
> +{
> + struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
> + int flags = sync ? MF_ACTION_REQUIRED : 0;
> + int length = gdata->error_data_length;
> + char error_type[120];
> + bool queued = false;
> + int sec_sev, i;
> + char *p;
> +
> + sec_sev = ghes_severity(gdata->error_severity);
> + if (length >= sizeof(*err)) {
> + log_arm_hw_error(err, sec_sev);
> + } else {
> + pr_warn(FW_BUG "arm error length: %d\n", length);
> + pr_warn(FW_BUG "length is too small\n");
> + pr_warn(FW_BUG "firmware-generated error record is incorrect\n");
> + return false;
> + }
> +
> + if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
> + return false;
> +
> + p = (char *)(err + 1);
> + length -= sizeof(err);
[Severity: Critical]
This is a pre-existing issue, but does this subtract the size of the pointer
instead of the struct?
In drivers/acpi/apei/ghes_cper.c:ghes_handle_arm_hw_error(), err is a
pointer to struct cper_sec_proc_arm. Subtracting sizeof(err) subtracts
8 bytes instead of the actual structure size (40 bytes).
This artificially inflates the remaining length by 32 bytes. During the
subsequent loop, the boundary check if (length < sizeof(*err_info)) can
evaluate to true even when p has advanced past the end of the gdata
buffer, potentially causing out-of-bounds reads.
Should this use sizeof(*err) instead?
> +
> + for (i = 0; i < err->err_info_num; i++) {
> + struct cper_arm_err_info *err_info;
> + bool is_cache, has_pa;
> +
> + /* Ensure we have enough data for the error info header */
> + if (length < sizeof(*err_info))
> + break;
> +
> + err_info = (struct cper_arm_err_info *)p;
> +
> + /* Validate the claimed length before using it */
> + length -= err_info->length;
> + if (length < 0)
> + break;
> +
> + is_cache = err_info->type & CPER_ARM_CACHE_ERROR;
> + has_pa = (err_info->validation_bits & CPER_ARM_INFO_VALID_PHYSICAL_ADDR);
> +
> + /*
> + * The field (err_info->error_info & BIT(26)) is fixed to set to
> + * 1 in some old firmware of HiSilicon Kunpeng920. We assume that
> + * firmware won't mix corrected errors in an uncorrected section,
> + * and don't filter out 'corrected' error here.
> + */
> + if (is_cache && has_pa) {
> + queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
[Severity: High]
This isn't a regression introduced by this patch, but is it possible for
queued to be overwritten if there are multiple error infos?
In drivers/acpi/apei/ghes_cper.c:ghes_handle_arm_hw_error(), if an early error
info successfully queues memory failure work, but a subsequent one does not,
queued is overwritten to false.
Later in the caller, this might cause if (sync && !queued) to evaluate to true,
force-killing the task with SIGBUS even though the memory error was
actually queued for recovery.
Should this use queued |= ghes_do_memory_failure(...) to accumulate the
results?
> + p += err_info->length;
> + continue;
> + }
[ ... ]
> +void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
> +{
> +#ifdef CONFIG_ACPI_APEI_PCIEAER
> + struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
> +
> + if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> + pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
> + unsigned int devfn;
> + int aer_severity;
> + u8 *aer_info;
> +
> + devfn = PCI_DEVFN(pcie_err->device_id.device,
> + pcie_err->device_id.function);
> + aer_severity = cper_severity_to_aer(gdata->error_severity);
> +
> + /*
> + * If firmware reset the component to contain
> + * the error, we must reinitialize it before
> + * use, so treat it as a fatal AER error.
> + */
> + if (gdata->flags & CPER_SEC_RESET)
> + aer_severity = AER_FATAL;
> +
> + aer_info = (void *)gen_pool_alloc(ghes_estatus_pool,
> + sizeof(struct aer_capability_regs));
> + if (!aer_info)
> + return;
> + memcpy(aer_info, pcie_err->aer_info, sizeof(struct aer_capability_regs));
> +
> + aer_recover_queue(pcie_err->device_id.segment,
> + pcie_err->device_id.bus,
> + devfn, aer_severity,
> + (struct aer_capability_regs *)
> + aer_info);
[Severity: High]
This is a pre-existing issue, but does aer_info leak if aer_recover_queue()
fails to add the entry to the kfifo or if the device isn't found?
In drivers/pci/pcie/aer.c:aer_recover_queue():
if (kfifo_in_spinlocked(&aer_recover_ring, &entry, 1,
&aer_recover_ring_lock))
schedule_work(&aer_recover_work);
else
pr_err(...);
If the kfifo overflows, it drops the entry and returns without freeing the
buffer allocated from the fixed-size ghes_estatus_pool.
Additionally, in drivers/pci/pcie/aer.c:aer_recover_work_func(), if
pci_get_domain_bus_and_slot() returns NULL (e.g. if the device fell off the
bus), the code executes a continue statement which skips the
ghes_estatus_pool_region_free() call, leaking the memory.
Since ghes_estatus_pool is a fixed-size pool, repeated leaks could exhaust it
and prevent future RAS errors from being processed. Could these error paths
free the memory?
> + }
> +#endif
> +}
[ ... ]
> +void ghes_cper_handle_status(struct device *dev,
> + const struct acpi_hest_generic *generic,
> + const struct acpi_hest_generic_status *estatus,
> + bool sync)
> +{
> + int sev, sec_sev;
> + struct acpi_hest_generic_data *gdata;
> + guid_t *sec_type;
> + const guid_t *fru_id = &guid_null;
> + char *fru_text = "";
> + bool queued = false;
> +
> + sev = ghes_severity(estatus->error_severity);
> + apei_estatus_for_each_section(estatus, gdata) {
> + sec_type = (guid_t *)gdata->section_type;
> + sec_sev = ghes_severity(gdata->error_severity);
> + if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
> + fru_id = (guid_t *)gdata->fru_id;
> +
> + if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
> + fru_text = gdata->fru_text;
> +
> + ghes_log_hwerr(sev, sec_type);
> + if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
> + struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
> +
> + atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
> +
> + arch_apei_report_mem_error(sev, mem_err);
> + queued = ghes_handle_memory_failure(gdata, sev, sync);
[Severity: High]
This isn't a regression introduced by this patch, but does queued get
overwritten here as well?
Just like in ghes_handle_arm_hw_error() above, if an early section in
drivers/acpi/apei/ghes_cper.c:ghes_cper_handle_status() succeeds but a
later section does not, queued is overwritten to false.
This can result in the loop completing with queued == false despite a previous
successful queueing, which will then incorrectly trigger the sync && !queued
force kill logic below.
> + } else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
> + ghes_handle_aer(gdata);
> + } else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
> + queued = ghes_handle_arm_hw_error(gdata, sev, sync);
[Severity: High]
This is a pre-existing issue, but this direct assignment of queued has the
same overwriting problem across multiple sections as noted above.
> + } else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) {
[ ... ]
> diff --git a/include/acpi/ghes_cper.h b/include/acpi/ghes_cper.h
> index a853a5996cdfa..e09a33d343d79 100644
> --- a/include/acpi/ghes_cper.h
> +++ b/include/acpi/ghes_cper.h
[ ... ]
> @@ -107,6 +110,23 @@ void ghes_estatus_cache_add(struct acpi_hest_generic *generic,
> struct acpi_hest_generic_status *estatus);
> void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
> int sev);
[ ... ]
> void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
> int severity);
> int cxl_cper_register_prot_err_work(struct work_struct *work);
[Severity: High]
This is a pre-existing issue, but could the CXL work registration functions
deadlock with the GHES interrupt handler?
In drivers/acpi/apei/ghes_cper.c, the spinlocks cxl_cper_work_lock and
cxl_cper_prot_err_work_lock are acquired in hardirq/NMI-deferred context by
cxl_cper_post_event() and cxl_cper_post_prot_err() using guard(spinlock_irqsave).
However, the registration functions cxl_cper_register_work() and
cxl_cper_register_prot_err_work() acquire these locks using guard(spinlock),
leaving interrupts enabled.
If a GHES hardware error interrupt fires on the same CPU while the lock is held
by a registration function, the interrupt handler will attempt to acquire the
same lock, resulting in a deadlock. Could these registration functions use
guard(spinlock_irqsave)?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260617-topics-ahmtib01-ras_ffh_arm_internal_review-v6-0-91f725174aa0@arm.com?part=8
next prev parent reply other threads:[~2026-06-17 14:11 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-17 13:54 [PATCH v6 00/10] ACPI: APEI: share GHES CPER helpers and add DT FFH provider Ahmed Tiba
2026-06-17 13:54 ` [PATCH v6 01/10] ACPI: APEI: GHES: share macros via a private header Ahmed Tiba
2026-06-17 14:09 ` sashiko-bot
2026-06-17 13:54 ` [PATCH v6 02/10] ACPI: APEI: GHES: move CPER read helpers Ahmed Tiba
2026-06-17 14:15 ` sashiko-bot
2026-06-17 13:54 ` [PATCH v6 03/10] ACPI: APEI: GHES: move GHESv2 ack and alloc helpers Ahmed Tiba
2026-06-17 13:54 ` [PATCH v6 04/10] ACPI: APEI: GHES: move estatus cache helpers Ahmed Tiba
2026-06-17 14:10 ` sashiko-bot
2026-06-17 13:54 ` [PATCH v6 05/10] ACPI: APEI: GHES: move vendor record helpers Ahmed Tiba
2026-06-17 13:54 ` [PATCH v6 06/10] ACPI: APEI: GHES: move CXL CPER helpers Ahmed Tiba
2026-06-17 14:14 ` sashiko-bot
2026-06-17 13:54 ` [PATCH v6 07/10] ACPI: APEI: introduce GHES helper Ahmed Tiba
2026-06-17 17:17 ` Julian Braha
2026-06-17 13:54 ` [PATCH v6 08/10] ACPI: APEI: share GHES CPER helpers Ahmed Tiba
2026-06-17 14:11 ` sashiko-bot [this message]
2026-06-17 13:54 ` [PATCH v6 09/10] dt-bindings: firmware: add arm,ras-cper Ahmed Tiba
2026-06-17 13:54 ` [PATCH v6 10/10] RAS: add firmware-first CPER provider Ahmed Tiba
2026-06-17 14:10 ` sashiko-bot
2026-06-17 17:40 ` Julian Braha
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260617141114.3061B1F000E9@smtp.kernel.org \
--to=sashiko-bot@kernel.org \
--cc=ahmed.tiba@arm.com \
--cc=conor+dt@kernel.org \
--cc=devicetree@vger.kernel.org \
--cc=linux-cxl@vger.kernel.org \
--cc=robh@kernel.org \
--cc=sashiko-reviews@lists.linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox