Re: [PATCH v4 08/10] ACPI: APEI: share GHES CPER helpers

Devicetree
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Ahmed Tiba" <ahmed.tiba@arm.com>
Cc: robh@kernel.org, conor+dt@kernel.org, devicetree@vger.kernel.org
Subject: Re: [PATCH v4 08/10] ACPI: APEI: share GHES CPER helpers
Date: Mon, 18 May 2026 12:39:39 +0000	[thread overview]
Message-ID: <20260518123939.A9E15C2BCB7@smtp.kernel.org> (raw)
In-Reply-To: <20260518-topics-ahmtib01-ras_ffh_arm_internal_review-v4-8-42698675ba61@arm.com>

Thank you for your contribution! Sashiko AI review found 6 potential issue(s) to consider:
- [High] Incorrect pointer arithmetic leads to out-of-bounds reads in ARM hardware error handling.
- [High] State overwrite of the `queued` flag can mask unrecoverable synchronous errors.
- [High] Permanent memory leak from the fixed-size `ghes_estatus_pool` when `task_work_add` fails.
- [High] Permanent memory leak from the fixed-size `ghes_estatus_pool` when the AER recovery kfifo is full.
- [High] Unsafe initialization semantics of `ghes_estatus_pool_init` risk pool corruption or NULL pointer dereferences for shared providers.
- [Medium] Incomplete decoupling of APEI Kconfig dependencies prevents functional sharing with non-APEI providers.
--

commit 593ad4fbd91409a6d18986b04b8adfb0d9f1ad5a
Author: Ahmed Tiba <ahmed.tiba@arm.com>

ACPI: APEI: share GHES CPER helpers

Wire GHES up to the helper routines in ghes_cper.c and remove the local
copies from ghes.c. This keeps the control flow identical while letting
the helpers be shared with other firmware-first providers.

> diff --git a/drivers/acpi/apei/ghes_cper.c b/drivers/acpi/apei/ghes_cper.c
> index d7a666a163c3a..0ff9d06eb78f9 100644
> --- a/drivers/acpi/apei/ghes_cper.c
> +++ b/drivers/acpi/apei/ghes_cper.c
[ ... ]
> +int ghes_estatus_pool_init(unsigned int num_ghes)
> +{
> +	unsigned long addr, len;
> +	int rc;
> +
> +	ghes_estatus_pool = gen_pool_create(GHES_ESTATUS_POOL_MIN_ALLOC_ORDER, -1);
> +	if (!ghes_estatus_pool)
> +		return -ENOMEM;
[ ... ]
> +	return 0;
> +
> +err_pool_add:
> +	vfree((void *)addr);
> +
> +err_pool_alloc:
> +	gen_pool_destroy(ghes_estatus_pool);
> +
> +	return -ENOMEM;
> +}
> +EXPORT_SYMBOL_GPL(ghes_estatus_pool_init);

Is it safe to export this initialization function without safeguards against
multiple invocations?

If multiple firmware-first providers call it, ghes_estatus_pool might be
unconditionally overwritten, leading to leaked memory and corruption when
older objects are freed into the new pool. Also, if a non-GHES provider
uses the helpers without ensuring initialization, could gen_pool_alloc crash
on a NULL pool?

[ ... ]
> +static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> +{
> +	struct ghes_task_work *twcb;
> +	unsigned long pfn;
> +
> +	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> +		return false;

The commit message states the goal is to let these helpers be shared with
other firmware-first providers. However, functions like
ghes_do_memory_failure() and ghes_handle_aer() are guarded by
CONFIG_ACPI_APEI_MEMORY_FAILURE and CONFIG_ACPI_APEI_PCIEAER.

Since a non-APEI provider might not have these enabled, will memory and
AER errors be silently dropped?

[ ... ]
> +	if (flags == MF_ACTION_REQUIRED && current->mm) {
> +		twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
> +		if (!twcb)
> +			return false;
> +
> +		twcb->pfn = pfn;
> +		twcb->flags = flags;
> +		init_task_work(&twcb->twork, memory_failure_cb);
> +		task_work_add(current, &twcb->twork, TWA_RESUME);
> +		return true;
> +	}

This isn't a bug introduced by this patch, but pre-exists in the moved code.
Does this code leak the twcb allocation if task_work_add() fails?

If the target task is exiting, the callback might never be executed,
permanently leaking memory from the boot-time fixed-size ghes_estatus_pool.

[ ... ]
> +bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
> +			      int sev, bool sync)
> +{
> +	struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
[ ... ]
> +	p = (char *)(err + 1);
> +	length -= sizeof(err);

This isn't a bug introduced by this patch, but pre-exists in the moved code.
Does this incorrectly subtract the size of the pointer err (8 bytes) instead
of the size of the struct it points to?

This leaves length artificially inflated, which might allow the subsequent
bounds check to falsely pass and read past the end of the firmware-provided
payload into adjacent kernel memory.

[ ... ]
> +void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
> +{
> +#ifdef CONFIG_ACPI_APEI_PCIEAER
> +	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
[ ... ]
> +		aer_info = (void *)gen_pool_alloc(ghes_estatus_pool,
> +						  sizeof(struct aer_capability_regs));
> +		if (!aer_info)
> +			return;
> +		memcpy(aer_info, pcie_err->aer_info, sizeof(struct aer_capability_regs));
> +
> +		aer_recover_queue(pcie_err->device_id.segment,
> +				  pcie_err->device_id.bus,
> +				  devfn, aer_severity,
> +				  (struct aer_capability_regs *)
> +				  aer_info);
> +	}
> +#endif
> +}

This isn't a bug introduced by this patch, but pre-exists in the moved code.
Is it possible for aer_info to be leaked here?

If the internal aer_recover_ring kfifo in aer_recover_queue() is full,
the item is dropped with a log message, but the aer_info memory allocated
from ghes_estatus_pool is never freed. Could this silent leak eventually
deplete the finite error status pool during an AER storm?

[ ... ]
> +void ghes_cper_handle_status(struct device *dev,
> +			     const struct acpi_hest_generic *generic,
> +			     const struct acpi_hest_generic_status *estatus,
> +			     bool sync)
> +{
> +	int sev, sec_sev;
> +	struct acpi_hest_generic_data *gdata;
> +	guid_t *sec_type;
> +	const guid_t *fru_id = &guid_null;
> +	char *fru_text = "";
> +	bool queued = false;

Also regarding sharing helpers with other modules, should
ghes_cper_handle_status be exported so that loadable modules can use it
as intended?

> +
> +	sev = ghes_severity(estatus->error_severity);
> +	apei_estatus_for_each_section(estatus, gdata) {
[ ... ]
> +		if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
> +			struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
> +
> +			atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
> +
> +			arch_apei_report_mem_error(sev, mem_err);
> +			queued = ghes_handle_memory_failure(gdata, sev, sync);
> +		} else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
> +			ghes_handle_aer(gdata);
> +		} else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
> +			queued = ghes_handle_arm_hw_error(gdata, sev, sync);
[ ... ]
> +	}
> +
> +	/*
> +	 * If no memory failure work is queued for abnormal synchronous
> +	 * errors, do a force kill.
> +	 */
> +	if (sync && !queued) {
> +		dev_err(dev,
> +			HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
> +			current->comm, task_pid_nr(current));
> +		force_sig(SIGBUS);
> +	}
> +}

This isn't a bug introduced by this patch, but pre-exists in the moved code.
Can the queued flag be incorrectly overwritten during later loop iterations?

If an early section contains a synchronous unrecoverable error but fails to
queue recovery, and a later section successfully queues its recovery, the
queued flag might be overwritten to true. Could this bypass the force_sig
fallback at the end of the function, allowing the task to continue executing
despite a fatal memory error?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260518-topics-ahmtib01-ras_ffh_arm_internal_review-v4-0-42698675ba61@arm.com?part=8

next prev parent reply	other threads:[~2026-05-18 12:39 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-18 11:57 [PATCH v4 00/10] ACPI: APEI: share GHES CPER helpers and add DT FFH provider Ahmed Tiba
2026-05-18 11:57 ` [PATCH v4 01/10] ACPI: APEI: GHES: share macros via a private header Ahmed Tiba
2026-05-18 11:57 ` [PATCH v4 02/10] ACPI: APEI: GHES: move CPER read helpers Ahmed Tiba
2026-05-18 12:37   ` sashiko-bot
2026-05-18 11:57 ` [PATCH v4 03/10] ACPI: APEI: GHES: move GHESv2 ack and alloc helpers Ahmed Tiba
2026-05-18 11:57 ` [PATCH v4 04/10] ACPI: APEI: GHES: move estatus cache helpers Ahmed Tiba
2026-05-18 11:57 ` [PATCH v4 05/10] ACPI: APEI: GHES: move vendor record helpers Ahmed Tiba
2026-05-18 11:57 ` [PATCH v4 06/10] ACPI: APEI: GHES: move CXL CPER helpers Ahmed Tiba
2026-05-18 12:30   ` sashiko-bot
2026-05-18 11:57 ` [PATCH v4 07/10] ACPI: APEI: introduce GHES helper Ahmed Tiba
2026-05-18 11:57 ` [PATCH v4 08/10] ACPI: APEI: share GHES CPER helpers Ahmed Tiba
2026-05-18 12:39   ` sashiko-bot [this message]
2026-05-18 11:57 ` [PATCH v4 09/10] dt-bindings: firmware: add arm,ras-cper Ahmed Tiba
2026-05-18 12:23   ` sashiko-bot
2026-05-19  7:04   ` Krzysztof Kozlowski
2026-05-19  9:02     ` Ahmed Tiba
2026-05-19  9:22       ` Krzysztof Kozlowski
2026-05-19  9:57         ` Ahmed Tiba
2026-05-18 11:57 ` [PATCH v4 10/10] RAS: add firmware-first CPER provider Ahmed Tiba
2026-05-18 12:52   ` sashiko-bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260518123939.A9E15C2BCB7@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=ahmed.tiba@arm.com \
    --cc=conor+dt@kernel.org \
    --cc=devicetree@vger.kernel.org \
    --cc=robh@kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox