Re: [PATCH v3 4/4] edac: xilinx: Add EDAC support for Versal XilSem

All of lore.kernel.org
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Rama devi Veggalam" <rama.devi.veggalam@amd.com>
Cc: devicetree@vger.kernel.org, conor+dt@kernel.org, robh@kernel.org
Subject: Re: [PATCH v3 4/4] edac: xilinx: Add EDAC support for Versal XilSem
Date: Wed, 24 Jun 2026 21:37:04 +0000	[thread overview]
Message-ID: <20260624213705.026D81F000E9@smtp.kernel.org> (raw)
In-Reply-To: <20260624212545.2850787-5-rama.devi.veggalam@amd.com>

Thank you for your contribution! Sashiko AI review found 13 potential issue(s) to consider:
- [Critical] Passing a 64-bit `dma_addr_t` to a 32-bit firmware API parameter truncates the physical address, causing memory corruption.
- [High] Missing 'semrtca' resource breaks driver probe for older Device Trees.
- [High] The sysfs `store` functions use the virtual EDAC device for DMA allocations instead of the platform device.
- [High] Error paths introduced in `mc_probe` fail to unregister the `VERSAL_EVENT_ERROR_PMC_ERR1` PM notifier, causing a Use-After-Free.
- [High] The error event handler uses an `if - else if` chain for bitmask evaluation, silently dropping uncorrectable errors.
- [Medium] Hexadecimal prefixes mistakenly used for decimal bit-shift macros cause undefined behavior and break CRAM frame decoding.
- [Medium] The `xsem_total_cframes_ssit_show` function mutates the driver's cached CRAM frames state in-place, corrupting data on subsequent reads.
- [Medium] Sysfs `store` functions unconditionally truncate the last character of user input if it is not newline-terminated.
- [Medium] XilSEM fabric and interconnect errors are incorrectly attributed to DDR memory DIMM 0.
- [Medium] Failure to handle an unrecognized `family_code` causes the driver to register an event with uninitialized parameters.
- [Medium] Lack of synchronization in sysfs callbacks causes data races on shared buffers.
- [Medium] The `xsem_err_callback` event handler performs unprotected read-modify-write operations on shared error counters.
- [Low] Copy-paste error in `xsem_scan_control_show` format string causes incorrect output.
--

commit 3ba94d35140e37f0feb1403bd80442cef2ad8ed1
Author: Rama devi Veggalam <rama.devi.veggalam@amd.com>
Subject: edac: xilinx: Add EDAC support for Versal XilSem

This patch adds support for Xilinx Versal Soft Error Mitigation (XilSEM)
to handle and optionally correct soft errors in Configuration Memory.
It also introduces sysfs interfaces for XilSEM scan operations, error
injection, and configuration reads.

> diff --git a/drivers/edac/versal_edac.c b/drivers/edac/versal_edac.c
> index 5a43b5d43ca28..024ddd90f6996 100644
> --- a/drivers/edac/versal_edac.c
> +++ b/drivers/edac/versal_edac.c
[ ... ]
> @@ -11,12 +11,88 @@
>  #include <linux/of_address.h>
>  #include <linux/of_device.h>
>  #include <linux/platform_device.h>
> +#include <linux/dma-mapping.h>
>  #include <linux/sizes.h>
> +#include <linux/io.h>
>  #include <linux/firmware/xlnx-zynqmp.h>
> +#include <linux/firmware/xlnx-versal-error-events.h>
>  #include <linux/firmware/xlnx-event-manager.h>
>  
>  #include "edac_module.h"
>  
> +/* XilSem CE Error log count */
> +#define XILSEM_MAX_CE_LOG_CNT	7
> +
> +/** Maximum CRAM error register count */
> +#define MAX_CRAMERR_REG_CNT		14
> +/** Maximum NPI slave skip count */
> +#define MAX_NPI_SLV_SKIP_CNT	8
> +/** Maximum NPI Error info count */
> +#define MAX_NPI_ERR_INFO_CNT	2
> +
> +/* Maximum SLR count */
> +#define MAX_SLR_ID	3
> +
> +/** Maximum number of cframe types  */
> +#define CFRAME_MAX_TYPE	7
> +
> +/** Mask for getting Type_0, Type_4 frames */
> +#define CFRAME_TYPE_0_4_MASK	GENMASK(19, 0)
> +
> +/** Low mask, High mask for getting Type_1, Type_5 frames */
> +#define CFRAME_TYPE_1_5_MASK_L	GENMASK(39, 20)
> +#define CFRAME_TYPE_1_5_MASK_H	GENMASK(7, 0)
> +
> +/** Shift for getting Type_1, Type_5 frames */
> +#define CFRAME_TYPE_1_5_SHIFT_R		0x20
> +#define CFRAME_TYPE_1_5_SHIFT_L		0x12
> +
> +/** Mask for getting Type_2, Type_6 frames */
> +#define CFRAME_TYPE_2_6_MASK	GENMASK(27, 8)
> +
> +/** Shift for getting Type_2, Type_6 frames */
> +#define CFRAME_TYPE_2_6_SHIFT_R		0x8
> +
> +/** Low mask, high mask for getting Type_3 frames */
> +#define CFRAME_TYPE_3_MASK_L		GENMASK(31, 28)
> +#define CFRAME_TYPE_3_MASK_H		GENMASK(15, 0)
> +
> +/** Shift for getting Type_3 */
> +#define CFRAME_TYPE_3_SHIFT_R	0x28

[Severity: Medium]
Are these macro definitions incorrectly using a hex prefix for decimal shift
values? Shifting a 32-bit value by 0x20 (32) or 0x28 (40) bits will invoke
undefined behavior.

> +#define CFRAME_TYPE_3_SHIFT_L	0x4
[ ... ]
> +static ssize_t xsem_scan_control_show(struct device *dev,
> +				      struct device_attribute *mattr,
> +				      char *data)
> +{
> +	struct mem_ctl_info *mci = to_mci(dev);
> +	struct edac_priv *priv = mci->pvt_info;
> +
> +	if (!priv->xsem_rtca)
> +		return -EINVAL;
> +
> +	return sprintf(data, "[0x%x][0x%x][0x%x][0x%x][0x%x][0x%x][0x%x]\n\r",
> +			priv->xsem_rtca->scan_ctrl_status[0],
> +			priv->xsem_rtca->scan_ctrl_status[0],

[Severity: Low]
Is there a typo here where scan_ctrl_status[0] is printed twice, causing the
output to miss the last element of the array?

> +			priv->xsem_rtca->scan_ctrl_status[1],
> +			priv->xsem_rtca->scan_ctrl_status[2],
> +			priv->xsem_rtca->scan_ctrl_status[3],
> +			priv->xsem_rtca->scan_ctrl_status[4],
> +			priv->xsem_rtca->scan_ctrl_status[5]);
> +}
> +
> +/**
> + * xsem_scan_control_store - Set scan control operation
[ ... ]
> +static ssize_t xsem_scan_control_store(struct device *dev,
> +				       struct device_attribute *mattr,
> +				       const char *data, size_t count)
> +{
> +	struct mem_ctl_info *mci = to_mci(dev);
> +	struct edac_priv *priv = mci->pvt_info;
> +	char *kern_buff, *inbuf, *tok;
> +	u32 cmd;
> +	u32 slrid;
> +	int ret;
> +
> +	if (!priv->xsem_rtca)
> +		return -EINVAL;
> +
> +	kern_buff = kzalloc(count, GFP_KERNEL);
> +	if (!kern_buff)
> +		return -ENOMEM;
> +	strscpy(kern_buff, data, count);

[Severity: Medium]
Does this silently truncate the last character of the user input if it is not
newline-terminated? When count matches the string length exactly, strscpy()
copies at most count - 1 bytes and inserts a null terminator.

> +
> +	inbuf = kern_buff;
[ ... ]
> +static ssize_t xsem_total_cframes_ssit_show(struct device *dev,
> +					    struct device_attribute *mattr,
> +					    char *data)
> +{
> +	struct mem_ctl_info *mci = to_mci(dev);
> +	struct edac_priv *priv = mci->pvt_info;
> +	u32 temp_buf[CFRAME_MAX_TYPE] = {0};
> +	u32 id;
> +	int offset = 0;
> +
> +	if (!priv->xsem_rtca)
> +		return -EINVAL;
> +
> +	for (id = 0; id < CFRAME_MAX_TYPE; id++)
> +		temp_buf[id] = priv->xsem_rtca->cram_total_frames[id];
> +
> +	priv->xsem_rtca->cram_total_frames[0]  = (temp_buf[0] & CFRAME_TYPE_0_4_MASK);
> +	priv->xsem_rtca->cram_total_frames[1]  = (temp_buf[0] &
> +					CFRAME_TYPE_1_5_MASK_L) >> CFRAME_TYPE_1_5_SHIFT_R;

[Severity: Medium]
Does this modify the shared cram_total_frames array in-place during a read
operation? Subsequent reads might apply the bitwise shifts to already-shifted
data, returning garbled output.

> +	priv->xsem_rtca->cram_total_frames[1] |= (temp_buf[1] &
[ ... ]
> +static ssize_t xsem_total_cframes_ssit_store(struct device *dev,
> +					     struct device_attribute *mattr,
> +					     const char *data, size_t count)
> +{
> +	struct mem_ctl_info *mci = to_mci(dev);
> +	struct edac_priv *priv = mci->pvt_info;
> +	char *kern_buff, *inbuf, *tok, *kbuf1;
> +	dma_addr_t dma_addr = 0;
> +	u32 row, slrid, dma_size;
> +	int ret;
[ ... ]
> +	dma_size = sizeof(priv->xsem_rtca->cram_total_frames);
> +	kbuf1 = dma_alloc_coherent(dev, dma_size, &dma_addr, GFP_KERNEL);

[Severity: High]
Are we passing the virtual EDAC device pointer instead of the platform device
here? Sysfs store callbacks for this attribute receive &mci->dev, which lacks
DMA configuration and could cause allocation failures.

> +	if (!kbuf1) {
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +
> +	ret = zynqmp_pm_xilsem_cram_ssit_totframes(slrid, row, dma_addr,
> +						   priv->xsem_rtca->cram_get_frames_status);

[Severity: Critical]
Can this truncate the physical address and cause memory corruption? On ARM64,
dma_addr_t is 64-bit, but the firmware API parameters are 32-bit.

> +	if (ret) {
> +		edac_printk(KERN_ERR, EDAC_MC, "ERROR: XilSEM Status PM API failed\n");
> +		dma_free_coherent(dev, dma_size, kbuf1, dma_addr);
> +		goto err;
> +	}
> +
> +	memcpy(priv->xsem_rtca->cram_total_frames, kbuf1, dma_size);

[Severity: Medium]
Can concurrent sysfs store operations corrupt this shared buffer due to a lack
of locking?

> +	dma_free_coherent(dev, dma_size, kbuf1, dma_addr);
[ ... ]
> +static ssize_t xsem_read_ssit_status_store(struct device *dev,
> +					   struct device_attribute *mattr,
> +					   const char *data, size_t count)
> +{
> +	struct mem_ctl_info *mci = to_mci(dev);
> +	struct edac_priv *priv = mci->pvt_info;
> +	char *kern_buff, *inbuf, *tok, *kbuf1;
> +	size_t dma_size;
> +	dma_addr_t dma_addr = 0;
> +	u32 slrid;
> +	int ret;
[ ... ]
> +	dma_size  = sizeof(struct xsem_ssit_status);
> +
> +	kbuf1 = dma_alloc_coherent(dev, dma_size, &dma_addr, GFP_KERNEL);

[Severity: High]
Does passing the virtual EDAC device pointer (which lacks DMA configuration)
instead of the platform device cause DMA allocation failures?

> +	if (!kbuf1) {
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +
> +	ret = zynqmp_pm_xilsem_read_ssit_status(slrid, dma_addr,
> +						priv->xsem_rtca->xilsem_ssit_status);

[Severity: Critical]
Could passing the 64-bit dma_addr to this 32-bit firmware API parameter result
in physical address truncation and subsequent memory corruption?

> +	if (ret) {
> +		edac_printk(KERN_ERR, EDAC_MC, "ERROR: XilSEM Status PM API failed\n");
> +		dma_free_coherent(dev, dma_size, kbuf1, dma_addr);
> +		goto err;
> +	}
> +
> +	memcpy(priv->xsem_rtca->slr_info, kbuf1, dma_size);

[Severity: Medium]
Can concurrent executions of this sysfs store operation result in data races on
this shared slr_info buffer?

> +
> +	dma_free_coherent(dev, dma_size, kbuf1, dma_addr);
[ ... ]
> +static void xsem_geterror_info(struct mem_ctl_info *mci, struct xsem_error_status *p,
> +			       int mask)
> +{
> +	struct edac_priv *priv = mci->pvt_info;
> +	u32 error_word_0, error_word_1, ce_count;
> +	u8 index;
> +
> +	if (!priv->xsem_rtca || !priv->sem_baseaddr)
> +		return;
> +
> +	if (mask & priv->xsem_rtca->cram_ce_mask) {
> +		p->ce_cnt++;
[ ... ]
> +		/* Read CRAM status */
> +		p->ceinfo.status = readl(priv->sem_baseaddr + CRAM_STS_INFO_OFFSET);
> +	} else if (mask & priv->xsem_rtca->cram_ue_mask) {

[Severity: High]
Does this if-else chain cause uncorrectable errors to be silently ignored if
the payload mask contains both a correctable and uncorrectable error?

> +		p->ue_cnt++;
> +		p->ueinfo.data0 = 0;
> +		p->ueinfo.data1 = 0;
> +		p->ueinfo.status = readl(priv->sem_baseaddr + CRAM_STS_INFO_OFFSET);
> +	} else if (mask & priv->xsem_rtca->npi_ue_mask) {
> +		p->ue_cnt++;
> +		p->ueinfo.data0 = readl(priv->sem_baseaddr + NPI_ERR0_INFO_OFFSET);
> +		p->ueinfo.data1 = readl(priv->sem_baseaddr + NPI_ERR1_INFO_OFFSET);
> +		p->ueinfo.status = readl(priv->sem_baseaddr);
> +	} else {
> +		edac_printk(KERN_ERR, EDAC_MC, "Invalid Event received %d\n", mask);
> +	}
> +}
> +
> +/**
> + * xsem_handle_error - Handle XilSem error types CE and UE
> + * @mci:	Pointer to the memory controller instance
> + * @p:		Pointer to the xilsem error status structure
> + *
> + * Handles the correctable and uncorrectable error.
> + */
> +static void xsem_handle_error(struct mem_ctl_info *mci, struct xsem_error_status *p)
> +{
> +	struct xsem_ecc_error_info *pinf;
> +	char message[XDDR_EDAC_MSG_SIZE];
> +
> +	if (p->ce_cnt) {
> +		pinf = &p->ceinfo;
> +		snprintf(message, XDDR_EDAC_MSG_SIZE,
> +			 "\n\rXILSEM CRAM error type :%s\n\r"
> +			 "\nFrame_Addr: [0x%X]\t Row_num: [0x%X]\t Bit_loc: [0x%X]\t Qword: [0x%X]\n\r",
> +			 "CE", pinf->frame_addr, pinf->row_id,
> +			 pinf->bit_loc, pinf->qword);
> +
> +		edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
> +				     p->ce_cnt, 0, 0, 0, 0, 0, -1,
> +				     message, "");

[Severity: Medium]
Are these fabric and interconnect errors being incorrectly attributed to the
first DDR memory module? Passing top_layer=0 and mid_layer=0 explicitly maps
them to csrow 0, channel 0.

> +	}
> +
> +	if (p->ue_cnt) {
> +		pinf = &p->ueinfo;
> +		snprintf(message, XDDR_EDAC_MSG_SIZE,
> +			 "\n\rXILSEM error type :%s\n\r"
> +			 "status: [0x%X]\n\rError_Info0: [0x%X]\n\r"
> +			 "Error_Info1: [0x%X]",
> +			 "UE", pinf->status, pinf->data0, pinf->data1);
> +
> +		edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
> +				     p->ue_cnt, 0, 0, 0, 0, 0, -1,
> +				     message, "");

[Severity: Medium]
Does mapping uncorrectable XilSEM errors to top_layer=0 incorrectly attribute
them to the first DDR memory DIMM?

> +	}
> +}
> +
> +/**
> + * xsem_err_callback - Handle Correctable and Uncorrectable errors.
> + * @payload:	payload data.
> + * @data:	controller data.
> + *
> + * Handles ECC correctable and uncorrectable errors.
> + */
> +static void xsem_err_callback(const u32 *payload, void *data)
> +{
> +	struct xsem_error_status stat;
> +	struct edac_priv *priv;
> +	struct mem_ctl_info *mci = (struct mem_ctl_info *)data;
> +	int event;
> +
> +	priv = mci->pvt_info;
> +	if (!priv->xsem_rtca)
> +		return;
> +
> +	memset(&stat, 0, sizeof(stat));
> +	/* Read payload to get the event type */
> +	event = payload[2];
> +	edac_printk(KERN_INFO, EDAC_MC, "Event received %x\n", event);
> +	xsem_geterror_info(mci, &stat, event);
> +
> +	priv->xsem_rtca->cram_ce_cnt += stat.ce_cnt;
> +	priv->xsem_rtca->cram_ue_cnt += stat.ue_cnt;

[Severity: Medium]
Could these read-modify-write operations on shared counters lose increments if
the hardware event callback is executed concurrently on multiple CPUs?

> +	xsem_handle_error(mci, &stat);
> +}
[ ... ]
> @@ -1094,6 +2365,10 @@ static int mc_probe(struct platform_device *pdev)
>  	if (!get_ecc_state(ddrmc_baseaddr))
>  		return -ENXIO;
>  
> +	sem_baseaddr = devm_platform_ioremap_resource_byname(pdev, "semrtca");
> +	if (IS_ERR(sem_baseaddr))
> +		return PTR_ERR(sem_baseaddr);

[Severity: High]
Does this make the "semrtca" resource mandatory, breaking the driver probe for
older device trees that do not define it?

> +
>  	/* Allocate ID number for the EMIF controller */
>  	edac_mc_id = emif_get_id(pdev->dev.of_node);
>  
[ ... ]
> +	rc = zynqmp_pm_get_family_info(&family_code);
> +	if (rc) {
> +		if (rc == -ENODEV)
> +			rc = -EPROBE_DEFER;
> +
> +		goto del_mc;
> +	}
> +	if (family_code == PM_VERSAL_FAMILY_CODE) {
> +		priv->xsem_rtca->sw_event_node_id = VERSAL_EVENT_ERROR_SW_ERR;
> +		priv->xsem_rtca->cram_ce_mask = XPM_VERSAL_EVENT_ERROR_MASK_XSEM_CRAM_CE_5;
> +		priv->xsem_rtca->cram_ue_mask = XPM_VERSAL_EVENT_ERROR_MASK_XSEM_CRAM_UE_6;
> +		priv->xsem_rtca->npi_ue_mask = XPM_VERSAL_EVENT_ERROR_MASK_XSEM_NPI_UE_7;
> +	} else {
> +		edac_printk(KERN_ERR, EDAC_MC, "Invalid Device family code %d\n", family_code);
> +	}
> +
> +	rc = xlnx_register_event(PM_NOTIFY_CB, priv->xsem_rtca->sw_event_node_id,

[Severity: Medium]
Does this execution path fall through when the family code is unrecognized,
causing the driver to register an event using uninitialized parameters (zeros)?

> +				 priv->xsem_rtca->cram_ce_mask | priv->xsem_rtca->cram_ue_mask |
> +				 priv->xsem_rtca->npi_ue_mask,
> +				 false, xsem_err_callback, mci);
> +	if (rc)
> +		goto del_mc;
> +
>  #ifdef CONFIG_EDAC_DEBUG
>  	create_debugfs_attributes(mci);
>  	setup_address_map(priv);
> @@ -1154,6 +2484,8 @@ static int mc_probe(struct platform_device *pdev)
>  	enable_intr(priv);
>  	return rc;
>  
> +remove_sysfs:
> +	xsem_edac_remove_sysfs_attributes(mci);
>  del_mc:
>  	edac_mc_del_mc(&pdev->dev);
>  free_edac_mc:
> @@ -1173,9 +2505,21 @@ static void mc_remove(struct platform_device *pdev)

[Severity: High]
Does this error path fail to unregister the VERSAL_EVENT_ERROR_PMC_ERR1 event
that was registered earlier in the function? If the event triggers after mci is
freed, could it cause a use-after-free in the callback?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624212545.2850787-1-rama.devi.veggalam@amd.com?part=4

     prev parent reply	other threads:[~2026-06-24 21:37 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-24 21:25 [PATCH v3 0/4] Add support for Versal Xilsem edac Rama devi Veggalam
2026-06-24 21:25 ` [PATCH v3 1/4] dt-bindings: edac: Add bindings for Xilinx Versal XilSEM Rama devi Veggalam
2026-06-24 21:33   ` sashiko-bot
2026-06-25  6:37   ` Krzysztof Kozlowski
2026-06-25  6:39   ` Krzysztof Kozlowski
2026-06-24 21:25 ` [PATCH v3 2/4] Documentation: ABI: Add ABI doc for versal edac sysfs Rama devi Veggalam
2026-06-24 21:32   ` sashiko-bot
2026-06-24 21:25 ` [PATCH v3 3/4] firmware: xilinx: Add support for Xilsem scan operations Rama devi Veggalam
2026-06-24 21:39   ` sashiko-bot
2026-06-24 21:25 ` [PATCH v3 4/4] edac: xilinx: Add EDAC support for Versal XilSem Rama devi Veggalam
2026-06-24 21:37   ` sashiko-bot [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260624213705.026D81F000E9@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=conor+dt@kernel.org \
    --cc=devicetree@vger.kernel.org \
    --cc=rama.devi.veggalam@amd.com \
    --cc=robh@kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.