From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4F935291864 for ; Tue, 5 May 2026 21:07:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778015274; cv=none; b=h9S1qWVzbQ5u0mI95g4erYRJ3c0dL9GEnzFLRXihVmJi9bAGNQ1o2/AjXNOuHCEIItBkB0k9HdlbbzALy0TQCRfc0V3UAtRZXenpxGL1fc9RIW80D7R9AUVK2WcOWJfHUUW1hBDzWmIltJnhqpSPMXWyKLCvpu5kEXbao4zKDvE= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778015274; c=relaxed/simple; bh=Mu5QqxyqVGswjgV1q0Kxf4rbENVMtQfqfy3dr4PFfUc=; h=From:Subject:To:Cc:In-Reply-To:References:Content-Type:Date: Message-Id; b=fgh4ehp5lMX17VgUbl1gnmcNyjJAzHurkr05SckEpSmIq5gob6a3BoCVhCrnUxAVRzp/vzMdz6EAIzLrQ1uK4TBkjvbuj4s8XxfkNyYy6Wc4GPDh37xxkSzu8ZwZk94o2BpVsEXHfhzYU9R4XQy2UPp68/ksxr7sG3P76rXZGgg= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=q1k08Uy6; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="q1k08Uy6" Received: by smtp.kernel.org (Postfix) with ESMTPSA id B9B8FC2BCB4; Tue, 5 May 2026 21:07:53 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1778015273; bh=Mu5QqxyqVGswjgV1q0Kxf4rbENVMtQfqfy3dr4PFfUc=; h=From:Subject:Reply-To:To:Cc:In-Reply-To:References:Date:From; b=q1k08Uy6PnaulxxSgI5Jjsrq8tiK52wYGSuO51FV7x8VJ6q7+ZRet4n6n8pDDf8lI nMUiqalG5eKsB68qj9ZI8Fgg4GZjJ595WihDA9PCUobobPDL6bv/h4PEdE6bxm+PoE wczCUvClfvWJ7mvGBJhRSvUhLZ3srpFgpYlhvUN99kwoDIvImKXVjHgt8cojaBx5q+ lw31pjS3/Tao34So19feGKblT7F+PhKrzCosicSVYzilwrHzs6Deu2SpMZtZg+wjyc YDkjjw8h5RCzSuCiokS5M3LpW2jl7FC6444rWp0k7TuQYn3wmd+QvnKWyIk2jOqbFI heEnMxjj+Cffw== From: sashiko-bot@kernel.org Subject: Re: [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events Reply-To: sashiko@lists.linux.dev To: "Terry Bowman" Cc: linux-pci@vger.kernel.org In-Reply-To: <20260505173029.2718246-3-terry.bowman@amd.com> References: <20260505173029.2718246-3-terry.bowman@amd.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Date: Tue, 05 May 2026 21:07:53 +0000 Message-Id: <20260505210753.B9B8FC2BCB4@smtp.kernel.org> Precedence: bulk X-Mailing-List: linux-pci@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Thank you for your contribution! Sashiko AI review found 4 potential issue(= s) to consider: - [Critical] ABI breakage for user-space RAS monitoring tools. - [Critical] Massive kernel stack out-of-bounds read (Information Leak) in = trace_cxl_aer_uncorrectable_error. - [High] Unconditional PCI configuration space reads via pci_get_dsn() on p= otentially frozen links. - [High] Out-of-bounds MMIO read in header_log_copy() causing potential sys= tem crash. -- commit 7ac043ff49acf52a4133043e3ef651a4e11ba73f Author: Dan Williams cxl/ras: Unify Endpoint and Port AER trace events This commit unifies the trace events for CXL Endpoint and Port devices on both the CPER and native AER paths. It generalizes the correctable and uncorrectable error events to use a common device and serial number format. It also renames the memdev field to device to support all CXL device types. Link: https://lore.kernel.org/linux-cxl/69cb2d5ba3111_178904100b7@dwillia2-= mobl4.notmuch/ > **WARNING: ABI BREAK** > Rename the trace event field "memdev" to "device" so all CXL device t= ypes > (Ports and Endpoints) can be reported under a common field name. Note= this > is an ABI break for userspace tools that key off the old "memdev" fie= ld. > Specifically, rasdaemon's ras-cxl-handler.c looks up "memdev" and bai= ls on > NULL, so an unmodified rasdaemon will drop every CXL CE/UCE event onc= e this > kernel ships. A rasdaemon update is needed in a separate series. Does renaming this trace event field violate the stable ABI policy for trace events? Since unmodified system monitoring utilities like rasdaemon will drop these memory errors after this change, should a backward-compatible approach be used, such as retaining the memdev field or adding a new event without deleting the old one? > diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c > index 006c6ffc2f567..d7081caaf5d3f 100644 > --- a/drivers/cxl/core/ras.c > +++ b/drivers/cxl/core/ras.c > @@ -28,20 +28,24 @@ static void cxl_cper_trace_uncorr_port_prot_err(struc= t pci_dev *pdev, > else > fe =3D status; > =20 > - trace_cxl_port_aer_uncorrectable_error(&pdev->dev, status, fe, > - ras_cap.header_log); > + trace_cxl_aer_uncorrectable_error(&pdev->dev, status, fe, > + ras_cap.header_log, > + pci_get_dsn(pdev)); > } Can passing ras_cap.header_log here cause an out-of-bounds stack read? The ras_cap.header_log field is an array of 16 u32 elements (64 bytes). However, inside the trace_cxl_aer_uncorrectable_error() fast assign block, it copies CXL_HEADERLOG_SIZE (512 bytes): TP_fast_assign( ... memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE); ) Will this read 448 bytes past the end of the header_log array and leak adjacent uninitialized kernel stack data into the trace ring buffer, making it accessible to user-space? Also, can calling pci_get_dsn() here cause issues if the PCIe link is in a frozen state? Since pci_get_dsn() performs live PCI config space reads, executing it on a disconnected or frozen device could result in system hangs, extended timeou= ts, or Unsupported Request completions. Furthermore, since pci_get_dsn() is passed as an argument to the trace wrapper, won't the config space read execute unconditionally on every error even if tracing is disabled? [ ... ] > @@ -247,7 +252,7 @@ bool cxl_handle_ras(struct device *dev, void __iomem = *ras_base) > } > =20 > header_log_copy(ras_base, hl); > - trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl); > + trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial); > writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr); Can the call to header_log_copy() just above this trace event cause an out-of-bounds MMIO read? The header_log_copy() function iterates 128 times to read 512 bytes from the MMIO address: static void header_log_copy(void __iomem *ras_base, u32 *log) { ... int i, log_u32_size =3D CXL_HEADERLOG_SIZE / sizeof(u32); addr =3D ras_base + CXL_RAS_HEADER_LOG_OFFSET; log_addr =3D log; for (i =3D 0; i < log_u32_size; i++) { *log_addr =3D readl(addr); ... } However, during component probe, the CXL RAS capability block is mapped with a length of CXL_RAS_CAPABILITY_LENGTH, which is only 88 bytes. Could attempting to read 512 bytes from an 88-byte MMIO mapping send PCIe memory reads to unintended device offsets, potentially triggering Master Aborts, Unsupported Requests, or unhandled page faults? [ ... ] > @@ -299,7 +305,8 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *p= dev, > * chance the situation is recoverable dump the status of the RAS > * capability registers and bounce the active state of the memdev. > */ > - ue =3D cxl_handle_ras(&cxlds->cxlmd->dev, cxlmd->endpoint->regs.ras); > + ue =3D cxl_handle_ras(&cxlds->cxlmd->dev, pci_get_dsn(pdev), > + cxlmd->endpoint->regs.ras); > } In cxl_error_detected(), if the PCIe link is in a frozen state, will the call to pci_get_dsn() fail? Because pci_get_dsn() performs live PCI config space reads, calling it when the channel state is pci_channel_io_frozen could cause system hangs or Unsupported Request completions. Will this vulnerability persist even if subsequent patches in the series (like "cxl: Update Endpoint AER uncorrectable handler") refactor the error flow into a workqueue, since the pci_get_dsn() call remains on the fatal error path? --=20 Sashiko AI review =C2=B7 https://sashiko.dev/#/patchset/20260505173029.2718= 246-1-terry.bowman@amd.com?part=3D2