Re: [PATCH v4 10/10] RAS: add firmware-first CPER provider

Devicetree
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Ahmed Tiba" <ahmed.tiba@arm.com>
Cc: devicetree@vger.kernel.org, robh@kernel.org, conor+dt@kernel.org
Subject: Re: [PATCH v4 10/10] RAS: add firmware-first CPER provider
Date: Mon, 18 May 2026 12:52:08 +0000	[thread overview]
Message-ID: <20260518125209.47973C2BCB7@smtp.kernel.org> (raw)
In-Reply-To: <20260518-topics-ahmtib01-ras_ffh_arm_internal_review-v4-10-42698675ba61@arm.com>

Thank you for your contribution! Sashiko AI review found 7 potential issue(s) to consider:
- [Critical] Broken Synchronous External Abort (SEA) notification handling causes kernel thread crashes and probe failures.
- [High] Missing validation of firmware-provided CPER data lengths leads to out-of-bounds heap reads.
- [High] Lockless initialization of the global `ghes_estatus_pool` causes TOCTOU races and memory exhaustion.
- [High] Unbinding the device leaves a dangling pointer to `ctx->generic` in the global GHES estatus cache.
- [High] Failure to verify the `block_status` field before processing the CPER record.
- [Medium] Implicit truncation of the dynamically allocated `source_id`.
- [Low] Unconditional use of 64-bit I/O accessors will break compilation on 32-bit platforms.
--

commit cfb39f0834e377f821505be4d4466300f20e3265
Author: Ahmed Tiba <ahmed.tiba@arm.com>

RAS: add firmware-first CPER provider

Add a firmware-first CPER provider that reuses the shared
GHES helpers, wire it into the RAS Kconfig/Makefile and
document it in the admin guide.

> diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
> index fc4f4bb94a4c6..3c1c63b2fefc3 100644
> --- a/drivers/ras/Kconfig
> +++ b/drivers/ras/Kconfig
> @@ -34,6 +34,17 @@ if RAS
>  source "arch/x86/ras/Kconfig"
>  source "drivers/ras/amd/atl/Kconfig"
>  
> +config RAS_CPER_ESOURCE
> +	bool "Firmware-first CPER error source block provider"
> +	select GHES_CPER_HELPERS

Should this depend on 64BIT? The driver uses readq() and writeq()
unconditionally, which will break compilation on 32-bit architectures.

> +	help
> +	  Enable support for firmware-first Common Platform Error Record

[ ... ]

> diff --git a/drivers/ras/cper-esource.c b/drivers/ras/cper-esource.c
> new file mode 100644
> index 0000000000000..83f7a910e50a7
> --- /dev/null
> +++ b/drivers/ras/cper-esource.c
> @@ -0,0 +1,257 @@

[ ... ]

> +static int cper_esource_init_pool(void)
> +{
> +	if (ghes_estatus_pool)
> +		return 0;
> +
> +	return ghes_estatus_pool_init(1);
> +}

Is there a race condition here if multiple instances of this driver probe
concurrently?

Both could observe a NULL pool and call ghes_estatus_pool_init()
simultaneously, causing a double allocation.

Also, if the pool already exists, this returns without expanding its capacity.
Will this reduce the safety margin for gen_pool_alloc() and risk dropping
records during an error storm since the pool wasn't scaled for the new
sources?

[ ... ]

> +static void cper_esource_process(struct cper_esource *ctx)
> +{
> +	int sev;
> +
> +	guard(spinlock_irqsave)(&ctx->lock);
> +
> +	if (cper_esource_copy_status(ctx))
> +		return;

Should we verify that ctx->estatus->block_status is non-zero before
processing the record?

If the interrupt is shared or fires spuriously, we might process a zeroed
block. A zeroed error_severity maps to CPER_SEV_RECOVERABLE, which could
falsely report a hardware error and spuriously trigger the hardware ACK.

> +
> +	sev = ghes_severity(ctx->estatus->error_severity);
> +	if (sev >= GHES_SEV_PANIC)
> +		cper_esource_fatal(ctx);
> +
> +	if (!ghes_estatus_cached(ctx->estatus) &&
> +	    ghes_print_estatus(NULL, ctx->generic, ctx->estatus))
> +		ghes_estatus_cache_add(ctx->generic, ctx->estatus);

Could this lead to a use-after-free if the device is unbound?

ctx->generic is allocated with devm_kzalloc(), so it will be freed
automatically on device unbind. However, ghes_estatus_cache_add() places
this pointer into the global ghes_estatus_caches array, leaving a dangling
pointer if the device is ever unbound.

> +
> +	ghes_cper_handle_status(ctx->dev, ctx->generic, ctx->estatus, ctx->sync);

If ctx->sync is true, ghes_cper_handle_status() will eventually call
force_sig(SIGBUS) on current. Since this is called from a threaded IRQ
handler:

cper_esource_irq() -> cper_esource_process() -> ghes_cper_handle_status()

Would current point to the kernel IRQ thread rather than the faulting
userspace process? This seems like it would crash the kernel IRQ thread
instead of delivering the signal to the correct task.

> +	cper_esource_ack(ctx);
> +}

[ ... ]

> +static int cper_esource_probe(struct platform_device *pdev)
> +{
> +	struct device *dev = &pdev->dev;
> +	struct cper_esource *ctx;
> +	struct resource *res;
> +	int source_id;
> +	int rc;

[ ... ]

> +	ctx->status = devm_platform_get_and_ioremap_resource(pdev, 0, &res);
> +	if (IS_ERR(ctx->status))
> +		return dev_err_probe(dev, PTR_ERR(ctx->status),
> +				     "failed to map status region\n");
> +
> +	ctx->status_len = resource_size(res);
> +	if (!ctx->status_len)
> +		return dev_err_probe(dev, -EINVAL, "status region has zero length\n");
> +
> +	rc = cper_esource_init_ack(pdev, ctx);
> +	if (rc)
> +		return rc;
> +
> +	rc = cper_esource_init_pool();
> +	if (rc)
> +		return rc;
> +
> +	ctx->estatus = devm_kzalloc(dev, ctx->status_len, GFP_KERNEL);

Does this adequately validate the size of the firmware-provided data?

ctx->status_len is based strictly on the MMIO resource size. If the resource
size is smaller than sizeof(struct acpi_hest_generic_status), or if the
firmware populates an estatus->data_length that exceeds the allocated
ctx->status_len buffer, will the subsequent CPER parsing functions read
out-of-bounds on the heap?

> +	if (!ctx->estatus)
> +		return -ENOMEM;
> +
> +	ctx->generic = devm_kzalloc(dev, sizeof(*ctx->generic), GFP_KERNEL);
> +	if (!ctx->generic)
> +		return -ENOMEM;
> +
> +	source_id = ida_alloc_min(&cper_esource_source_ids, 1, GFP_KERNEL);
> +	if (source_id < 0)
> +		return source_id;
> +
> +	ctx->generic->header.type = ACPI_HEST_TYPE_GENERIC_ERROR;
> +	ctx->generic->header.source_id = source_id;

Could source_id be truncated here?

ida_alloc_min() can return values up to INT_MAX, but
ctx->generic->header.source_id is a u16. If the IDA allocates a value
greater than 65535, it will be silently truncated, potentially causing ID
collisions with other error sources.

> +
> +	rc = devm_add_action_or_reset(dev, cper_esource_release_source_id,
> +				      ctx->generic);
> +	if (rc)
> +		return rc;
> +
> +	ctx->generic->notify.type = ctx->sync ?
> +		ACPI_HEST_NOTIFY_SEA : ACPI_HEST_NOTIFY_EXTERNAL;
> +	ctx->generic->error_block_length = ctx->status_len;
> +
> +	ctx->irq = platform_get_irq(pdev, 0);

If the device tree specifies that it uses a Synchronous External Abort (SEA)
for notification via the arm,sea-notify property, does the platform still
provide a standard IRQ?

If not, platform_get_irq() will fail and prevent the driver from probing
when only the SEA is described.

> +	if (ctx->irq < 0)
> +		return ctx->irq;
> +
> +	rc = devm_request_threaded_irq(dev, ctx->irq, NULL, cper_esource_irq,
> +				       IRQF_ONESHOT,
> +				       dev_name(dev), ctx);
> +	if (rc)
> +		return dev_err_probe(dev, rc, "failed to request interrupt\n");
> +
> +	return 0;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260518-topics-ahmtib01-ras_ffh_arm_internal_review-v4-0-42698675ba61@arm.com?part=10

     prev parent reply	other threads:[~2026-05-18 12:52 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-18 11:57 [PATCH v4 00/10] ACPI: APEI: share GHES CPER helpers and add DT FFH provider Ahmed Tiba
2026-05-18 11:57 ` [PATCH v4 01/10] ACPI: APEI: GHES: share macros via a private header Ahmed Tiba
2026-05-18 11:57 ` [PATCH v4 02/10] ACPI: APEI: GHES: move CPER read helpers Ahmed Tiba
2026-05-18 12:37   ` sashiko-bot
2026-05-18 11:57 ` [PATCH v4 03/10] ACPI: APEI: GHES: move GHESv2 ack and alloc helpers Ahmed Tiba
2026-05-18 11:57 ` [PATCH v4 04/10] ACPI: APEI: GHES: move estatus cache helpers Ahmed Tiba
2026-05-18 11:57 ` [PATCH v4 05/10] ACPI: APEI: GHES: move vendor record helpers Ahmed Tiba
2026-05-18 11:57 ` [PATCH v4 06/10] ACPI: APEI: GHES: move CXL CPER helpers Ahmed Tiba
2026-05-18 12:30   ` sashiko-bot
2026-05-18 11:57 ` [PATCH v4 07/10] ACPI: APEI: introduce GHES helper Ahmed Tiba
2026-05-18 11:57 ` [PATCH v4 08/10] ACPI: APEI: share GHES CPER helpers Ahmed Tiba
2026-05-18 12:39   ` sashiko-bot
2026-05-18 11:57 ` [PATCH v4 09/10] dt-bindings: firmware: add arm,ras-cper Ahmed Tiba
2026-05-18 12:23   ` sashiko-bot
2026-05-19  7:04   ` Krzysztof Kozlowski
2026-05-19  9:02     ` Ahmed Tiba
2026-05-19  9:22       ` Krzysztof Kozlowski
2026-05-19  9:57         ` Ahmed Tiba
2026-05-18 11:57 ` [PATCH v4 10/10] RAS: add firmware-first CPER provider Ahmed Tiba
2026-05-18 12:52   ` sashiko-bot [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260518125209.47973C2BCB7@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=ahmed.tiba@arm.com \
    --cc=conor+dt@kernel.org \
    --cc=devicetree@vger.kernel.org \
    --cc=robh@kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox