From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C132E44B686 for ; Mon, 18 May 2026 12:52:09 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779108729; cv=none; b=Co+mFskHbZjhnGEXYK0RAqwXCti5GAZ1296JqhYC8/WtaC3LH6IKEjFKFZIX2TY68A6Sa2ILrtquvdMCQtTDu46izjMR2AVdVmDLa8Qb9aywwUWYtG1sGFC1w63bk0nOuzA1bp2ukTjb0XPYPJH2m/SMZw3KdwsiVVXbWq2sH4k= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779108729; c=relaxed/simple; bh=2Xik32R0JwiOc7I7Pp7ha6ZClywgnnYcuSngp55dzeQ=; h=From:Subject:To:Cc:In-Reply-To:References:Content-Type:Date: Message-Id; b=FUTOy1ogG/DF1wMSrLUEtCTN+2krlfMnsZTzq2N8a71n8+aRg58WC6igE/eXM042vcV1EvLH8hm4tpVhVA7t0MPEdkWMOENEvoGSxZQZTSEx2g8IGLLq+A5FzL7IfNhAGpPBpfRO/DxVFZ639wckqapxkrE28wm3dnhFojaSxQ0= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=KvmJYPVj; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="KvmJYPVj" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 47973C2BCB7; Mon, 18 May 2026 12:52:09 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1779108729; bh=2Xik32R0JwiOc7I7Pp7ha6ZClywgnnYcuSngp55dzeQ=; h=From:Subject:Reply-To:To:Cc:In-Reply-To:References:Date:From; b=KvmJYPVjEWfBoZO4APRWrU2n/ebh3Nghg2nXCLjf5TposN8Ph8jTa34PIaH4FSmpq HnR/nOyoRj6bCNRymW0r4UrtfzR1JLhHNSgwQoJDWyeWoANQyepWzoDk1ZoiXrf8Ad c+gc+pjSjiOF6KBV53lVDqPRz1HrNYLCcyJ+Zs0QOjLlaGEjLl/RiqEKWyexZNunAo cOiUfn7cBLyZZFJf6d7iaBiTc3FAUV/GniGfo+nv2zmo6mfYR4gzWpzBfm2NCKu5WM 1zfdE/D4Ko77cMtnL/svixh3Z9j67In9K9hFG2kpjHq7ieecdWkixESzkM9mbmV5iR Sr9QtpbWI9jMw== From: sashiko-bot@kernel.org Subject: Re: [PATCH v4 10/10] RAS: add firmware-first CPER provider Reply-To: sashiko-reviews@lists.linux.dev To: "Ahmed Tiba" Cc: devicetree@vger.kernel.org, robh@kernel.org, conor+dt@kernel.org In-Reply-To: <20260518-topics-ahmtib01-ras_ffh_arm_internal_review-v4-10-42698675ba61@arm.com> References: <20260518-topics-ahmtib01-ras_ffh_arm_internal_review-v4-10-42698675ba61@arm.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Date: Mon, 18 May 2026 12:52:08 +0000 Message-Id: <20260518125209.47973C2BCB7@smtp.kernel.org> Precedence: bulk X-Mailing-List: devicetree@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Thank you for your contribution! Sashiko AI review found 7 potential issue(= s) to consider: - [Critical] Broken Synchronous External Abort (SEA) notification handling = causes kernel thread crashes and probe failures. - [High] Missing validation of firmware-provided CPER data lengths leads to= out-of-bounds heap reads. - [High] Lockless initialization of the global `ghes_estatus_pool` causes T= OCTOU races and memory exhaustion. - [High] Unbinding the device leaves a dangling pointer to `ctx->generic` i= n the global GHES estatus cache. - [High] Failure to verify the `block_status` field before processing the C= PER record. - [Medium] Implicit truncation of the dynamically allocated `source_id`. - [Low] Unconditional use of 64-bit I/O accessors will break compilation on= 32-bit platforms. -- commit cfb39f0834e377f821505be4d4466300f20e3265 Author: Ahmed Tiba RAS: add firmware-first CPER provider =20 Add a firmware-first CPER provider that reuses the shared GHES helpers, wire it into the RAS Kconfig/Makefile and document it in the admin guide. > diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig > index fc4f4bb94a4c6..3c1c63b2fefc3 100644 > --- a/drivers/ras/Kconfig > +++ b/drivers/ras/Kconfig > @@ -34,6 +34,17 @@ if RAS > source "arch/x86/ras/Kconfig" > source "drivers/ras/amd/atl/Kconfig" > =20 > +config RAS_CPER_ESOURCE > + bool "Firmware-first CPER error source block provider" > + select GHES_CPER_HELPERS Should this depend on 64BIT? The driver uses readq() and writeq() unconditionally, which will break compilation on 32-bit architectures. > + help > + Enable support for firmware-first Common Platform Error Record [ ... ] > diff --git a/drivers/ras/cper-esource.c b/drivers/ras/cper-esource.c > new file mode 100644 > index 0000000000000..83f7a910e50a7 > --- /dev/null > +++ b/drivers/ras/cper-esource.c > @@ -0,0 +1,257 @@ [ ... ] > +static int cper_esource_init_pool(void) > +{ > + if (ghes_estatus_pool) > + return 0; > + > + return ghes_estatus_pool_init(1); > +} Is there a race condition here if multiple instances of this driver probe concurrently? Both could observe a NULL pool and call ghes_estatus_pool_init() simultaneously, causing a double allocation. Also, if the pool already exists, this returns without expanding its capaci= ty. Will this reduce the safety margin for gen_pool_alloc() and risk dropping records during an error storm since the pool wasn't scaled for the new sources? [ ... ] > +static void cper_esource_process(struct cper_esource *ctx) > +{ > + int sev; > + > + guard(spinlock_irqsave)(&ctx->lock); > + > + if (cper_esource_copy_status(ctx)) > + return; Should we verify that ctx->estatus->block_status is non-zero before processing the record? If the interrupt is shared or fires spuriously, we might process a zeroed block. A zeroed error_severity maps to CPER_SEV_RECOVERABLE, which could falsely report a hardware error and spuriously trigger the hardware ACK. > + > + sev =3D ghes_severity(ctx->estatus->error_severity); > + if (sev >=3D GHES_SEV_PANIC) > + cper_esource_fatal(ctx); > + > + if (!ghes_estatus_cached(ctx->estatus) && > + ghes_print_estatus(NULL, ctx->generic, ctx->estatus)) > + ghes_estatus_cache_add(ctx->generic, ctx->estatus); Could this lead to a use-after-free if the device is unbound? ctx->generic is allocated with devm_kzalloc(), so it will be freed automatically on device unbind. However, ghes_estatus_cache_add() places this pointer into the global ghes_estatus_caches array, leaving a dangling pointer if the device is ever unbound. > + > + ghes_cper_handle_status(ctx->dev, ctx->generic, ctx->estatus, ctx->sync= ); If ctx->sync is true, ghes_cper_handle_status() will eventually call force_sig(SIGBUS) on current. Since this is called from a threaded IRQ handler: cper_esource_irq() -> cper_esource_process() -> ghes_cper_handle_status() Would current point to the kernel IRQ thread rather than the faulting userspace process? This seems like it would crash the kernel IRQ thread instead of delivering the signal to the correct task. > + cper_esource_ack(ctx); > +} [ ... ] > +static int cper_esource_probe(struct platform_device *pdev) > +{ > + struct device *dev =3D &pdev->dev; > + struct cper_esource *ctx; > + struct resource *res; > + int source_id; > + int rc; [ ... ] > + ctx->status =3D devm_platform_get_and_ioremap_resource(pdev, 0, &res); > + if (IS_ERR(ctx->status)) > + return dev_err_probe(dev, PTR_ERR(ctx->status), > + "failed to map status region\n"); > + > + ctx->status_len =3D resource_size(res); > + if (!ctx->status_len) > + return dev_err_probe(dev, -EINVAL, "status region has zero length\n"); > + > + rc =3D cper_esource_init_ack(pdev, ctx); > + if (rc) > + return rc; > + > + rc =3D cper_esource_init_pool(); > + if (rc) > + return rc; > + > + ctx->estatus =3D devm_kzalloc(dev, ctx->status_len, GFP_KERNEL); Does this adequately validate the size of the firmware-provided data? ctx->status_len is based strictly on the MMIO resource size. If the resource size is smaller than sizeof(struct acpi_hest_generic_status), or if the firmware populates an estatus->data_length that exceeds the allocated ctx->status_len buffer, will the subsequent CPER parsing functions read out-of-bounds on the heap? > + if (!ctx->estatus) > + return -ENOMEM; > + > + ctx->generic =3D devm_kzalloc(dev, sizeof(*ctx->generic), GFP_KERNEL); > + if (!ctx->generic) > + return -ENOMEM; > + > + source_id =3D ida_alloc_min(&cper_esource_source_ids, 1, GFP_KERNEL); > + if (source_id < 0) > + return source_id; > + > + ctx->generic->header.type =3D ACPI_HEST_TYPE_GENERIC_ERROR; > + ctx->generic->header.source_id =3D source_id; Could source_id be truncated here? ida_alloc_min() can return values up to INT_MAX, but ctx->generic->header.source_id is a u16. If the IDA allocates a value greater than 65535, it will be silently truncated, potentially causing ID collisions with other error sources. > + > + rc =3D devm_add_action_or_reset(dev, cper_esource_release_source_id, > + ctx->generic); > + if (rc) > + return rc; > + > + ctx->generic->notify.type =3D ctx->sync ? > + ACPI_HEST_NOTIFY_SEA : ACPI_HEST_NOTIFY_EXTERNAL; > + ctx->generic->error_block_length =3D ctx->status_len; > + > + ctx->irq =3D platform_get_irq(pdev, 0); If the device tree specifies that it uses a Synchronous External Abort (SEA) for notification via the arm,sea-notify property, does the platform still provide a standard IRQ? If not, platform_get_irq() will fail and prevent the driver from probing when only the SEA is described. > + if (ctx->irq < 0) > + return ctx->irq; > + > + rc =3D devm_request_threaded_irq(dev, ctx->irq, NULL, cper_esource_irq, > + IRQF_ONESHOT, > + dev_name(dev), ctx); > + if (rc) > + return dev_err_probe(dev, rc, "failed to request interrupt\n"); > + > + return 0; > +} --=20 Sashiko AI review =C2=B7 https://sashiko.dev/#/patchset/20260518-topics-ahm= tib01-ras_ffh_arm_internal_review-v4-0-42698675ba61@arm.com?part=3D10