Re: [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events

Linux PCI subsystem development
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Terry Bowman" <terry.bowman@amd.com>
Cc: linux-pci@vger.kernel.org
Subject: Re: [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events
Date: Tue, 05 May 2026 21:07:53 +0000	[thread overview]
Message-ID: <20260505210753.B9B8FC2BCB4@smtp.kernel.org> (raw)
In-Reply-To: <20260505173029.2718246-3-terry.bowman@amd.com>

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [Critical] ABI breakage for user-space RAS monitoring tools.
- [Critical] Massive kernel stack out-of-bounds read (Information Leak) in trace_cxl_aer_uncorrectable_error.
- [High] Unconditional PCI configuration space reads via pci_get_dsn() on potentially frozen links.
- [High] Out-of-bounds MMIO read in header_log_copy() causing potential system crash.
--

commit 7ac043ff49acf52a4133043e3ef651a4e11ba73f
Author: Dan Williams <djbw@kernel.org>

cxl/ras: Unify Endpoint and Port AER trace events

This commit unifies the trace events for CXL Endpoint and Port devices on
both the CPER and native AER paths. It generalizes the correctable and
uncorrectable error events to use a common device and serial number format.
It also renames the memdev field to device to support all CXL device types.

Link: https://lore.kernel.org/linux-cxl/69cb2d5ba3111_178904100b7@dwillia2-mobl4.notmuch/

>     **WARNING: ABI BREAK**
>     Rename the trace event field "memdev" to "device" so all CXL device types
>     (Ports and Endpoints) can be reported under a common field name. Note this
>     is an ABI break for userspace tools that key off the old "memdev" field.
>     Specifically, rasdaemon's ras-cxl-handler.c looks up "memdev" and bails on
>     NULL, so an unmodified rasdaemon will drop every CXL CE/UCE event once this
>     kernel ships. A rasdaemon update is needed in a separate series.

Does renaming this trace event field violate the stable ABI policy for
trace events?

Since unmodified system monitoring utilities like rasdaemon will drop
these memory errors after this change, should a backward-compatible approach
be used, such as retaining the memdev field or adding a new event without
deleting the old one?

> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 006c6ffc2f567..d7081caaf5d3f 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -28,20 +28,24 @@ static void cxl_cper_trace_uncorr_port_prot_err(struct pci_dev *pdev,
>  	else
>  		fe = status;
>  
> -	trace_cxl_port_aer_uncorrectable_error(&pdev->dev, status, fe,
> -					       ras_cap.header_log);
> +	trace_cxl_aer_uncorrectable_error(&pdev->dev, status, fe,
> +					  ras_cap.header_log,
> +					  pci_get_dsn(pdev));
>  }

Can passing ras_cap.header_log here cause an out-of-bounds stack read?

The ras_cap.header_log field is an array of 16 u32 elements (64 bytes).
However, inside the trace_cxl_aer_uncorrectable_error() fast assign block,
it copies CXL_HEADERLOG_SIZE (512 bytes):

	TP_fast_assign(
		...
		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
	)

Will this read 448 bytes past the end of the header_log array and leak
adjacent uninitialized kernel stack data into the trace ring buffer, making
it accessible to user-space?

Also, can calling pci_get_dsn() here cause issues if the PCIe link is in a
frozen state?

Since pci_get_dsn() performs live PCI config space reads, executing it on a
disconnected or frozen device could result in system hangs, extended timeouts,
or Unsupported Request completions.

Furthermore, since pci_get_dsn() is passed as an argument to the trace
wrapper, won't the config space read execute unconditionally on every
error even if tracing is disabled?

[ ... ]

> @@ -247,7 +252,7 @@ bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>  	}
>  
>  	header_log_copy(ras_base, hl);
> -	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
> +	trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
>  	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);

Can the call to header_log_copy() just above this trace event cause an
out-of-bounds MMIO read?

The header_log_copy() function iterates 128 times to read 512 bytes from
the MMIO address:

static void header_log_copy(void __iomem *ras_base, u32 *log)
{
	...
	int i, log_u32_size = CXL_HEADERLOG_SIZE / sizeof(u32);

	addr = ras_base + CXL_RAS_HEADER_LOG_OFFSET;
	log_addr = log;

	for (i = 0; i < log_u32_size; i++) {
		*log_addr = readl(addr);
		...
}

However, during component probe, the CXL RAS capability block is mapped
with a length of CXL_RAS_CAPABILITY_LENGTH, which is only 88 bytes.

Could attempting to read 512 bytes from an 88-byte MMIO mapping send PCIe
memory reads to unintended device offsets, potentially triggering
Master Aborts, Unsupported Requests, or unhandled page faults?

[ ... ]

> @@ -299,7 +305,8 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  		 * chance the situation is recoverable dump the status of the RAS
>  		 * capability registers and bounce the active state of the memdev.
>  		 */
> -		ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlmd->endpoint->regs.ras);
> +		ue = cxl_handle_ras(&cxlds->cxlmd->dev, pci_get_dsn(pdev),
> +				    cxlmd->endpoint->regs.ras);
>  	}

In cxl_error_detected(), if the PCIe link is in a frozen state, will the
call to pci_get_dsn() fail?

Because pci_get_dsn() performs live PCI config space reads, calling it
when the channel state is pci_channel_io_frozen could cause system hangs
or Unsupported Request completions.

Will this vulnerability persist even if subsequent patches in the series
(like "cxl: Update Endpoint AER uncorrectable handler") refactor the error
flow into a workqueue, since the pci_get_dsn() call remains on the
fatal error path?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260505173029.2718246-1-terry.bowman@amd.com?part=2

next prev parent reply	other threads:[~2026-05-05 21:07 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2026-05-05 17:30 ` [PATCH v17 01/11] PCI/AER: Introduce AER-CXL Kfifo Terry Bowman
2026-05-05 20:26   ` sashiko-bot
2026-05-05 21:17   ` Dave Jiang
2026-05-07 17:53   ` Jonathan Cameron
2026-05-07 18:26     ` Bowman, Terry
2026-05-05 17:30 ` [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events Terry Bowman
2026-05-05 21:07   ` sashiko-bot [this message]
2026-05-05 21:46   ` Dave Jiang
2026-05-07 18:08   ` Jonathan Cameron
2026-05-07 18:33     ` Bowman, Terry
2026-05-08 14:05       ` Jonathan Cameron
2026-05-09  3:49         ` Dan Williams (nvidia)
2026-05-05 17:30 ` [PATCH v17 03/11] cxl: Use common CPER handling for all CXL devices Terry Bowman
2026-05-05 21:30   ` sashiko-bot
2026-05-05 22:02   ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 04/11] cxl: Rename find_cxl_port() to find_cxl_port_by_dport() Terry Bowman
2026-05-05 22:06   ` Dave Jiang
2026-05-07 18:11     ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 05/11] cxl: Limit CXL-CPER kfifo registration functions scope Terry Bowman
2026-05-05 21:52   ` sashiko-bot
2026-05-05 22:16   ` Dave Jiang
2026-05-07 18:14   ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 06/11] PCI: Establish common CXL Port protocol error flow Terry Bowman
2026-05-05 22:28   ` sashiko-bot
2026-05-07 18:22   ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 07/11] PCI/CXL: Add RCH support to CXL handlers Terry Bowman
2026-05-05 23:34   ` sashiko-bot
2026-05-05 23:59   ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 08/11] cxl: Remove Endpoint AER correctable handler Terry Bowman
2026-05-05 17:30 ` [PATCH v17 09/11] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
2026-05-06 17:43   ` Dave Jiang
2026-05-07 18:25     ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 10/11] PCI/CXL: Mask/Unmask CXL protocol errors Terry Bowman
2026-05-06  1:01   ` sashiko-bot
2026-05-06 18:00   ` Dave Jiang
2026-05-07 18:29   ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 11/11] Documentation: cxl: Document CXL protocol error handling Terry Bowman
2026-05-06 18:34   ` Dave Jiang
2026-05-07 18:51   ` Jonathan Cameron

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260505210753.B9B8FC2BCB4@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=sashiko@lists.linux.dev \
    --cc=terry.bowman@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox