* [PATCH v7 0/5] EDAC/Versal NET: Add support for error notification
@ 2025-05-29 7:00 Shubhrajyoti Datta
2025-05-29 7:00 ` [PATCH v7 1/5] cdx: add the headers to include/linux Shubhrajyoti Datta
` (5 more replies)
0 siblings, 6 replies; 14+ messages in thread
From: Shubhrajyoti Datta @ 2025-05-29 7:00 UTC (permalink / raw)
To: linux-kernel, devicetree, linux-edac
Cc: git, Krzysztof Kozlowski, Rob Herring, Conor Dooley,
Borislav Petkov, Tony Luck, James Morse, Mauro Carvalho Chehab,
Robert Richter, Nipun Gupta, Nikhil Agarwal, Shubhrajyoti Datta
Adds support for the error notification for the Versal NET EDAC driver.
The driver receives error events via RPMsg instead of directly accessing
hardware registers. The NMC((Network management controller), which has
secure access to DDRMC registers, gathers the necessary information and
transmits it through RPMsg.
During probe, the driver registers with RPMsg and retrieves DDR
configuration by scheduling a work item from the NMC.
Once this is completed, it registers the EDAC controller.
When an error occurs, the NMC sends an RPMsg, notifying the driver.
The EDAC driver handles error reporting for all events.
Also we register the EDAC once and it reports the errors for all the
events including the 8 DDRMC controllers. So while registering we give
the particulars of the 1st controller.
Currently 20 errors has been tested.
Changes in v7:
- add a minimal header instead moving them
- Add the kernel doc description
- Add the prototype from first patch to export patch
- Add the reviewed by tag
- Update the header paths
- merge edac_cdx_pcol.h
Changes in v6:
- Patch added
- Update commit description
- Update the commit message.
- update to the chip name as xlnx,versal-net
- Correct indentation
- Update to xlnx,versal-net-ddrmc5
- Update the kconfig message
- Make the messages uniform
- Add some more supported events
- rename regval to reglo
- combine/ reformat functions
- remove trailing comments
- Remove unneeded comments
- make the amd_mcdi function void
- rename versalnet_rpmsg_edac to versalnet_edac
- Remove the column bit and use them directly
- Update the comments
- Update the mod_name to versalnet_edac
- remove the global priv col and rows
- rename edac_priv to mc_priv
- Update the comment description for dwidth
- Remove error_id enum
- rename the variable par to parity
- make get_ddr_config void
- Fix memory leak of the mcdi structure
- Update the spelling
- Remove the workqueue
Changes in v5:
- Update the binding
- Update the compatible
- Update the handle_error documentation
Changes in v4:
- Update the compatible
- align the example
- Enhance the description for rproc
- Update the compatible
Changes in v3:
- make remove void
Changes in v2:
- Export the symbols for module compilation
- New patch addition
- rename EDAC to memory controller
- update the compatible name
- Add remote proc handle
- Read the data width from the registers
- Remove the dwidth, rank and channel number the same is
read from the RpMsg.
- remove reset
- Add the remote proc requests
- remove probe_once
- reorder the rpmsg registration
- the data width , rank and number of channel is read from message.
Shubhrajyoti Datta (5):
cdx: add the headers to include/linux
cdx: Export Symbols for MCDI RPC and Initialization
ras: Export log_non_standard_event for External Usage
dt-bindings: memory-controllers: Add support for Versal NET EDAC
EDAC/VersalNET: Add support for error notification
.../xlnx,versal-net-ddrmc5.yaml | 41 +
drivers/cdx/controller/mcdi.c | 29 +
drivers/edac/Kconfig | 11 +
drivers/edac/Makefile | 1 +
drivers/edac/versalnet_edac.c | 1108 +++++++++++++++++
drivers/ras/ras.c | 1 +
include/linux/cdx/bitfield.h | 78 ++
include/linux/cdx/edac_cdx_pcol.h | 28 +
include/linux/cdx/mcdi.h | 198 +++
9 files changed, 1495 insertions(+)
create mode 100644 Documentation/devicetree/bindings/memory-controllers/xlnx,versal-net-ddrmc5.yaml
create mode 100644 drivers/edac/versalnet_edac.c
create mode 100644 include/linux/cdx/bitfield.h
create mode 100644 include/linux/cdx/edac_cdx_pcol.h
create mode 100644 include/linux/cdx/mcdi.h
--
2.34.1
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v7 1/5] cdx: add the headers to include/linux
2025-05-29 7:00 [PATCH v7 0/5] EDAC/Versal NET: Add support for error notification Shubhrajyoti Datta
@ 2025-05-29 7:00 ` Shubhrajyoti Datta
2025-06-13 19:59 ` Yazen Ghannam
2025-05-29 7:00 ` [PATCH v7 2/5] cdx: Export Symbols for MCDI RPC and Initialization Shubhrajyoti Datta
` (4 subsequent siblings)
5 siblings, 1 reply; 14+ messages in thread
From: Shubhrajyoti Datta @ 2025-05-29 7:00 UTC (permalink / raw)
To: linux-kernel, devicetree, linux-edac
Cc: git, Krzysztof Kozlowski, Rob Herring, Conor Dooley,
Borislav Petkov, Tony Luck, James Morse, Mauro Carvalho Chehab,
Robert Richter, Nipun Gupta, Nikhil Agarwal, Shubhrajyoti Datta
Add a the bitfield.h and mcdi.h headers.
This is in preparation for VersalNET EDAC
driver that relies on it.
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
---
Changes in v7:
- add a minimal header instead moving them
Changes in v6:
- Patch added
include/linux/cdx/bitfield.h | 78 ++++++++++++++
include/linux/cdx/mcdi.h | 192 +++++++++++++++++++++++++++++++++++
2 files changed, 270 insertions(+)
create mode 100644 include/linux/cdx/bitfield.h
create mode 100644 include/linux/cdx/mcdi.h
diff --git a/include/linux/cdx/bitfield.h b/include/linux/cdx/bitfield.h
new file mode 100644
index 000000000000..3f0bf2dbb2c7
--- /dev/null
+++ b/include/linux/cdx/bitfield.h
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2005-2006 Fen Systems Ltd.
+ * Copyright 2006-2013 Solarflare Communications Inc.
+ * Copyright (C) 2022-2025, Advanced Micro Devices, Inc.
+ */
+
+#ifndef CDX_BITFIELD_H
+#define CDX_BITFIELD_H
+
+#include <linux/bitfield.h>
+
+/* Lowest bit numbers and widths */
+#define CDX_DWORD_LBN 0
+#define CDX_DWORD_WIDTH 32
+
+/* Specified attribute (e.g. LBN) of the specified field */
+#define CDX_VAL(field, attribute) field ## _ ## attribute
+/* Low bit number of the specified field */
+#define CDX_LOW_BIT(field) CDX_VAL(field, LBN)
+/* Bit width of the specified field */
+#define CDX_WIDTH(field) CDX_VAL(field, WIDTH)
+/* High bit number of the specified field */
+#define CDX_HIGH_BIT(field) (CDX_LOW_BIT(field) + CDX_WIDTH(field) - 1)
+
+/* A doubleword (i.e. 4 byte) datatype - little-endian in HW */
+struct cdx_dword {
+ __le32 cdx_u32;
+};
+
+/*
+ * Creates the portion of the named bit field that lies within the
+ * range [min,max).
+ */
+#define CDX_INSERT_FIELD(field, value) \
+ (FIELD_PREP(GENMASK(CDX_HIGH_BIT(field), \
+ CDX_LOW_BIT(field)), value))
+
+/*
+ * Creates the portion of the named bit fields that lie within the
+ * range [min,max).
+ */
+#define CDX_INSERT_FIELDS(field1, value1, \
+ field2, value2, \
+ field3, value3, \
+ field4, value4, \
+ field5, value5, \
+ field6, value6, \
+ field7, value7) \
+ (CDX_INSERT_FIELD(field1, (value1)) | \
+ CDX_INSERT_FIELD(field2, (value2)) | \
+ CDX_INSERT_FIELD(field3, (value3)) | \
+ CDX_INSERT_FIELD(field4, (value4)) | \
+ CDX_INSERT_FIELD(field5, (value5)) | \
+ CDX_INSERT_FIELD(field6, (value6)) | \
+ CDX_INSERT_FIELD(field7, (value7)))
+
+#define CDX_POPULATE_DWORD(dword, ...) \
+ (dword).cdx_u32 = cpu_to_le32(CDX_INSERT_FIELDS(__VA_ARGS__))
+
+/* Populate a dword field with various numbers of arguments */
+#define CDX_POPULATE_DWORD_7 CDX_POPULATE_DWORD
+#define CDX_POPULATE_DWORD_6(dword, ...) \
+ CDX_POPULATE_DWORD_7(dword, CDX_DWORD, 0, __VA_ARGS__)
+#define CDX_POPULATE_DWORD_5(dword, ...) \
+ CDX_POPULATE_DWORD_6(dword, CDX_DWORD, 0, __VA_ARGS__)
+#define CDX_POPULATE_DWORD_4(dword, ...) \
+ CDX_POPULATE_DWORD_5(dword, CDX_DWORD, 0, __VA_ARGS__)
+#define CDX_POPULATE_DWORD_3(dword, ...) \
+ CDX_POPULATE_DWORD_4(dword, CDX_DWORD, 0, __VA_ARGS__)
+#define CDX_POPULATE_DWORD_2(dword, ...) \
+ CDX_POPULATE_DWORD_3(dword, CDX_DWORD, 0, __VA_ARGS__)
+#define CDX_POPULATE_DWORD_1(dword, ...) \
+ CDX_POPULATE_DWORD_2(dword, CDX_DWORD, 0, __VA_ARGS__)
+#define CDX_SET_DWORD(dword) \
+ CDX_POPULATE_DWORD_1(dword, CDX_DWORD, 0xffffffff)
+
+#endif /* CDX_BITFIELD_H */
diff --git a/include/linux/cdx/mcdi.h b/include/linux/cdx/mcdi.h
new file mode 100644
index 000000000000..46e3f63b062a
--- /dev/null
+++ b/include/linux/cdx/mcdi.h
@@ -0,0 +1,192 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2008-2013 Solarflare Communications Inc.
+ * Copyright (C) 2022-2023, Advanced Micro Devices, Inc.
+ */
+
+#ifndef CDX_MCDI_H
+#define CDX_MCDI_H
+
+#include <linux/mutex.h>
+#include <linux/kref.h>
+#include <linux/rpmsg.h>
+
+#include "linux/cdx/bitfield.h"
+
+/**
+ * enum cdx_mcdi_mode - MCDI transaction mode
+ * @MCDI_MODE_EVENTS: wait for an mcdi response callback.
+ * @MCDI_MODE_FAIL: we think MCDI is dead, so fail-fast all calls
+ */
+enum cdx_mcdi_mode {
+ MCDI_MODE_EVENTS,
+ MCDI_MODE_FAIL,
+};
+
+#define MCDI_RPC_TIMEOUT (10 * HZ)
+#define MCDI_RPC_LONG_TIMEOU (60 * HZ)
+#define MCDI_RPC_POST_RST_TIME (10 * HZ)
+
+/**
+ * enum cdx_mcdi_cmd_state - State for an individual MCDI command
+ * @MCDI_STATE_QUEUED: Command not started and is waiting to run.
+ * @MCDI_STATE_RETRY: Command was submitted and MC rejected with no resources,
+ * as MC have too many outstanding commands. Command will be retried once
+ * another command returns.
+ * @MCDI_STATE_RUNNING: Command was accepted and is running.
+ * @MCDI_STATE_RUNNING_CANCELLED: Command is running but the issuer cancelled
+ * the command.
+ * @MCDI_STATE_FINISHED: Processing of this command has completed.
+ */
+
+enum cdx_mcdi_cmd_state {
+ MCDI_STATE_QUEUED,
+ MCDI_STATE_RETRY,
+ MCDI_STATE_RUNNING,
+ MCDI_STATE_RUNNING_CANCELLED,
+ MCDI_STATE_FINISHED,
+};
+
+/**
+ * struct cdx_mcdi - CDX MCDI Firmware interface, to interact
+ * with CDX controller.
+ * @mcdi: MCDI interface
+ * @mcdi_ops: MCDI operations
+ * @r5_rproc : R5 Remoteproc device handle
+ * @rpdev: RPMsg device
+ * @ept: RPMsg endpoint
+ * @work: Post probe work
+ */
+struct cdx_mcdi {
+ /* MCDI interface */
+ struct cdx_mcdi_data *mcdi;
+ const struct cdx_mcdi_ops *mcdi_ops;
+
+ struct rproc *r5_rproc;
+ struct rpmsg_device *rpdev;
+ struct rpmsg_endpoint *ept;
+ struct work_struct work;
+};
+
+struct cdx_mcdi_ops {
+ void (*mcdi_request)(struct cdx_mcdi *cdx,
+ const struct cdx_dword *hdr, size_t hdr_len,
+ const struct cdx_dword *sdu, size_t sdu_len);
+ unsigned int (*mcdi_rpc_timeout)(struct cdx_mcdi *cdx, unsigned int cmd);
+};
+
+typedef void cdx_mcdi_async_completer(struct cdx_mcdi *cdx,
+ unsigned long cookie, int rc,
+ struct cdx_dword *outbuf,
+ size_t outlen_actual);
+
+/**
+ * struct cdx_mcdi_cmd - An outstanding MCDI command
+ * @ref: Reference count. There will be one reference if the command is
+ * in the mcdi_iface cmd_list, another if it's on a cleanup list,
+ * and a third if it's queued in the work queue.
+ * @list: The data for this entry in mcdi->cmd_list
+ * @cleanup_list: The data for this entry in a cleanup list
+ * @work: The work item for this command, queued in mcdi->workqueue
+ * @mcdi: The mcdi_iface for this command
+ * @state: The state of this command
+ * @inlen: inbuf length
+ * @inbuf: Input buffer
+ * @quiet: Whether to silence errors
+ * @reboot_seen: Whether a reboot has been seen during this command,
+ * to prevent duplicates
+ * @seq: Sequence number
+ * @started: Jiffies this command was started at
+ * @cookie: Context for completion function
+ * @completer: Completion function
+ * @handle: Command handle
+ * @cmd: Command number
+ * @rc: Return code
+ * @outlen: Length of output buffer
+ * @outbuf: Output buffer
+ */
+struct cdx_mcdi_cmd {
+ struct kref ref;
+ struct list_head list;
+ struct list_head cleanup_list;
+ struct work_struct work;
+ struct cdx_mcdi_iface *mcdi;
+ enum cdx_mcdi_cmd_state state;
+ size_t inlen;
+ const struct cdx_dword *inbuf;
+ bool quiet;
+ bool reboot_seen;
+ u8 seq;
+ unsigned long started;
+ unsigned long cookie;
+ cdx_mcdi_async_completer *completer;
+ unsigned int handle;
+ unsigned int cmd;
+ int rc;
+ size_t outlen;
+ struct cdx_dword *outbuf;
+ /* followed by inbuf data if necessary */
+};
+
+/**
+ * struct cdx_mcdi_iface - MCDI protocol context
+ * @cdx: The associated NIC
+ * @iface_lock: Serialise access to this structure
+ * @outstanding_cleanups: Count of cleanups
+ * @cmd_list: List of outstanding and running commands
+ * @workqueue: Workqueue used for delayed processing
+ * @cmd_complete_wq: Waitqueue for command completion
+ * @db_held_by: Command the MC doorbell is in use by
+ * @seq_held_by: Command each sequence number is in use by
+ * @prev_handle: The last used command handle
+ * @mode: Poll for mcdi completion, or wait for an mcdi_event
+ * @prev_seq: The last used sequence number
+ * @new_epoch: Indicates start of day or start of MC reboot recovery
+ */
+struct cdx_mcdi_iface {
+ struct cdx_mcdi *cdx;
+ /* Serialise access */
+ struct mutex iface_lock;
+ unsigned int outstanding_cleanups;
+ struct list_head cmd_list;
+ struct workqueue_struct *workqueue;
+ wait_queue_head_t cmd_complete_wq;
+ struct cdx_mcdi_cmd *db_held_by;
+ struct cdx_mcdi_cmd *seq_held_by[16];
+ unsigned int prev_handle;
+ enum cdx_mcdi_mode mode;
+ u8 prev_seq;
+ bool new_epoch;
+};
+
+/**
+ * struct cdx_mcdi_data - extra state for NICs that implement MCDI
+ * @iface: Interface/protocol state
+ * @fn_flags: Flags for this function, as returned by %MC_CMD_DRV_ATTACH.
+ */
+struct cdx_mcdi_data {
+ struct cdx_mcdi_iface iface;
+ u32 fn_flags;
+};
+
+/*
+ * We expect that 16- and 32-bit fields in MCDI requests and responses
+ * are appropriately aligned, but 64-bit fields are only
+ * 32-bit-aligned.
+ */
+#define MCDI_DECLARE_BUF(_name, _len) struct cdx_dword _name[DIV_ROUND_UP(_len, 4)] = {{0}}
+#define _MCDI_PTR(_buf, _offset) \
+ ((u8 *)(_buf) + (_offset))
+#define MCDI_PTR(_buf, _field) \
+ _MCDI_PTR(_buf, MC_CMD_ ## _field ## _OFST)
+#define _MCDI_CHECK_ALIGN(_ofst, _align) \
+ ((void)BUILD_BUG_ON_ZERO((_ofst) & ((_align) - 1)), \
+ (_ofst))
+#define _MCDI_DWORD(_buf, _field) \
+ ((_buf) + (_MCDI_CHECK_ALIGN(MC_CMD_ ## _field ## _OFST, 4) >> 2))
+
+#define MCDI_SET_DWORD(_buf, _field, _value) \
+ CDX_POPULATE_DWORD_1(*_MCDI_DWORD(_buf, _field), CDX_DWORD, _value)
+#define MCDI_DWORD(_buf, _field) \
+ CDX_DWORD_FIELD(*_MCDI_DWORD(_buf, _field), CDX_DWORD)
+#endif /* CDX_MCDI_H */
--
2.34.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v7 2/5] cdx: Export Symbols for MCDI RPC and Initialization
2025-05-29 7:00 [PATCH v7 0/5] EDAC/Versal NET: Add support for error notification Shubhrajyoti Datta
2025-05-29 7:00 ` [PATCH v7 1/5] cdx: add the headers to include/linux Shubhrajyoti Datta
@ 2025-05-29 7:00 ` Shubhrajyoti Datta
2025-06-13 20:10 ` Yazen Ghannam
2025-05-29 7:00 ` [PATCH v7 3/5] ras: Export log_non_standard_event for External Usage Shubhrajyoti Datta
` (3 subsequent siblings)
5 siblings, 1 reply; 14+ messages in thread
From: Shubhrajyoti Datta @ 2025-05-29 7:00 UTC (permalink / raw)
To: linux-kernel, devicetree, linux-edac
Cc: git, Krzysztof Kozlowski, Rob Herring, Conor Dooley,
Borislav Petkov, Tony Luck, James Morse, Mauro Carvalho Chehab,
Robert Richter, Nipun Gupta, Nikhil Agarwal, Shubhrajyoti Datta
The cdx_mcdi_init, cdx_mcdi_process_cmd, and cdx_mcdi_rpc functions are
needed by VersalNET EDAC modules that interact with the MCDI (Management
Controller Direct Interface) framework. These functions facilitate
communication between different hardware components by enabling command
execution and status management.
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
---
Changes in v7:
- Add the kernel doc description
- Add the prototype from first patch to here
Changes in v6:
- Update commit description
Changes in v2:
- Export the symbols for module compilation
drivers/cdx/controller/mcdi.c | 29 +++++++++++++++++++++++++++++
include/linux/cdx/mcdi.h | 6 ++++++
2 files changed, 35 insertions(+)
diff --git a/drivers/cdx/controller/mcdi.c b/drivers/cdx/controller/mcdi.c
index e760f8d347cc..f3cca4c884ff 100644
--- a/drivers/cdx/controller/mcdi.c
+++ b/drivers/cdx/controller/mcdi.c
@@ -99,6 +99,19 @@ static unsigned long cdx_mcdi_rpc_timeout(struct cdx_mcdi *cdx, unsigned int cmd
return cdx->mcdi_ops->mcdi_rpc_timeout(cdx, cmd);
}
+/**
+ * cdx_mcdi_init - Initialize MCDI (Management Controller Driver Interface) state
+ * @cdx: NIC through which to issue the command
+ *
+ * This function allocates and initializes internal MCDI structures and resources
+ * for the CDX device, including the workqueue, locking primitives, and command
+ * tracking mechanisms. It sets the initial operating mode and prepares the device
+ * for MCDI operations.
+ *
+ * Return:
+ * * 0 - on success
+ * * -ENOMEM - if memory allocation or workqueue creation fails
+ */
int cdx_mcdi_init(struct cdx_mcdi *cdx)
{
struct cdx_mcdi_iface *mcdi;
@@ -128,6 +141,7 @@ int cdx_mcdi_init(struct cdx_mcdi *cdx)
fail:
return rc;
}
+EXPORT_SYMBOL_GPL(cdx_mcdi_init);
void cdx_mcdi_finish(struct cdx_mcdi *cdx)
{
@@ -553,6 +567,19 @@ static void cdx_mcdi_start_or_queue(struct cdx_mcdi_iface *mcdi,
cdx_mcdi_cmd_start_or_queue(mcdi, cmd);
}
+/**
+ * cdx_mcdi_process_cmd - Process an incoming MCDI response
+ * @cdx: NIC through which to issue the command
+ * @outbuf: Pointer to the response buffer received from the management controller
+ * @len: Length of the response buffer in bytes
+ *
+ * This function handles a response from the management controller. It locates the
+ * corresponding command using the sequence number embedded in the header,
+ * completes the command if it is still pending, and initiates any necessary cleanup.
+ *
+ * The function assumes that the response buffer is well-formed and at least one
+ * dword in size.
+ */
void cdx_mcdi_process_cmd(struct cdx_mcdi *cdx, struct cdx_dword *outbuf, int len)
{
struct cdx_mcdi_iface *mcdi;
@@ -590,6 +617,7 @@ void cdx_mcdi_process_cmd(struct cdx_mcdi *cdx, struct cdx_dword *outbuf, int le
cdx_mcdi_process_cleanup_list(mcdi->cdx, &cleanup_list);
}
+EXPORT_SYMBOL_GPL(cdx_mcdi_process_cmd);
static void cdx_mcdi_cmd_work(struct work_struct *context)
{
@@ -757,6 +785,7 @@ int cdx_mcdi_rpc(struct cdx_mcdi *cdx, unsigned int cmd,
return cdx_mcdi_rpc_sync(cdx, cmd, inbuf, inlen, outbuf, outlen,
outlen_actual, false);
}
+EXPORT_SYMBOL_GPL(cdx_mcdi_rpc);
/**
* cdx_mcdi_rpc_async - Schedule an MCDI command to run asynchronously
diff --git a/include/linux/cdx/mcdi.h b/include/linux/cdx/mcdi.h
index 46e3f63b062a..1344119e9a2c 100644
--- a/include/linux/cdx/mcdi.h
+++ b/include/linux/cdx/mcdi.h
@@ -169,6 +169,12 @@ struct cdx_mcdi_data {
u32 fn_flags;
};
+int cdx_mcdi_init(struct cdx_mcdi *cdx);
+void cdx_mcdi_process_cmd(struct cdx_mcdi *cdx, struct cdx_dword *outbuf, int len);
+int cdx_mcdi_rpc(struct cdx_mcdi *cdx, unsigned int cmd,
+ const struct cdx_dword *inbuf, size_t inlen,
+ struct cdx_dword *outbuf, size_t outlen, size_t *outlen_actual);
+
/*
* We expect that 16- and 32-bit fields in MCDI requests and responses
* are appropriately aligned, but 64-bit fields are only
--
2.34.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v7 3/5] ras: Export log_non_standard_event for External Usage
2025-05-29 7:00 [PATCH v7 0/5] EDAC/Versal NET: Add support for error notification Shubhrajyoti Datta
2025-05-29 7:00 ` [PATCH v7 1/5] cdx: add the headers to include/linux Shubhrajyoti Datta
2025-05-29 7:00 ` [PATCH v7 2/5] cdx: Export Symbols for MCDI RPC and Initialization Shubhrajyoti Datta
@ 2025-05-29 7:00 ` Shubhrajyoti Datta
2025-05-29 7:00 ` [PATCH v7 4/5] dt-bindings: memory-controllers: Add support for Versal NET EDAC Shubhrajyoti Datta
` (2 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Shubhrajyoti Datta @ 2025-05-29 7:00 UTC (permalink / raw)
To: linux-kernel, devicetree, linux-edac
Cc: git, Krzysztof Kozlowski, Rob Herring, Conor Dooley,
Borislav Petkov, Tony Luck, James Morse, Mauro Carvalho Chehab,
Robert Richter, Nipun Gupta, Nikhil Agarwal, Shubhrajyoti Datta
The function log_non_standard_event is responsible for logging
platform-specific or vendor-defined RAS (Reliability, Availability,
and Serviceability) events. Currently, this function is only available
within the RAS subsystem, preventing external modules from
leveraging its capabilities.
log_non_standard_event is exported so that external drivers like VersalNet
EDAC can log non-standard RAS events.
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
---
(no changes since v6)
Changes in v6:
- Update the commit message.
Changes in v2:
- New patch addition
drivers/ras/ras.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
index a6e4792a1b2e..ac0e132ccc3e 100644
--- a/drivers/ras/ras.c
+++ b/drivers/ras/ras.c
@@ -51,6 +51,7 @@ void log_non_standard_event(const guid_t *sec_type, const guid_t *fru_id,
{
trace_non_standard_event(sec_type, fru_id, fru_text, sev, err, len);
}
+EXPORT_SYMBOL_GPL(log_non_standard_event);
void log_arm_hw_error(struct cper_sec_proc_arm *err)
{
--
2.34.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v7 4/5] dt-bindings: memory-controllers: Add support for Versal NET EDAC
2025-05-29 7:00 [PATCH v7 0/5] EDAC/Versal NET: Add support for error notification Shubhrajyoti Datta
` (2 preceding siblings ...)
2025-05-29 7:00 ` [PATCH v7 3/5] ras: Export log_non_standard_event for External Usage Shubhrajyoti Datta
@ 2025-05-29 7:00 ` Shubhrajyoti Datta
2025-05-29 7:00 ` [PATCH v7 5/5] EDAC/VersalNET: Add support for error notification Shubhrajyoti Datta
2025-05-30 14:12 ` [PATCH v7 0/5] EDAC/Versal NET: " Michal Simek
5 siblings, 0 replies; 14+ messages in thread
From: Shubhrajyoti Datta @ 2025-05-29 7:00 UTC (permalink / raw)
To: linux-kernel, devicetree, linux-edac
Cc: git, Krzysztof Kozlowski, Rob Herring, Conor Dooley,
Borislav Petkov, Tony Luck, James Morse, Mauro Carvalho Chehab,
Robert Richter, Nipun Gupta, Nikhil Agarwal, Shubhrajyoti Datta
Add device tree bindings for AMD Versal NET EDAC for DDR controller.
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
---
Changes in v7:
- Add the reviewed by tag
Changes in v6:
- update to the chip name as xlnx,versal-net
- Correct indentation
Changes in v5:
- Update the binding
Changes in v4:
- Update the compatible
- align the example
- Enhance the description for rproc
Changes in v2:
- rename EDAC to memory controller
- update the compatible name
- Add remote proc handle
- Read the data width from the registers
- Remove the dwidth, rank and channel number the same is
read from the RpMsg.
.../xlnx,versal-net-ddrmc5.yaml | 41 +++++++++++++++++++
1 file changed, 41 insertions(+)
create mode 100644 Documentation/devicetree/bindings/memory-controllers/xlnx,versal-net-ddrmc5.yaml
diff --git a/Documentation/devicetree/bindings/memory-controllers/xlnx,versal-net-ddrmc5.yaml b/Documentation/devicetree/bindings/memory-controllers/xlnx,versal-net-ddrmc5.yaml
new file mode 100644
index 000000000000..479288567d0b
--- /dev/null
+++ b/Documentation/devicetree/bindings/memory-controllers/xlnx,versal-net-ddrmc5.yaml
@@ -0,0 +1,41 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/memory-controllers/xlnx,versal-net-ddrmc5.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Xilinx Versal NET Memory Controller
+
+maintainers:
+ - Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
+
+description:
+ The integrated DDR Memory Controllers (DDRMCs) support both DDR5 and LPDDR5
+ compact and extended memory interfaces. Versal NET DDR memory controller
+ has an optional ECC support which correct single bit ECC errors and detect
+ double bit ECC errors. It also has support for reporting other errors like
+ MMCM (Mixed-Mode Clock Manager) errors and General software errors.
+
+properties:
+ compatible:
+ const: xlnx,versal-net-ddrmc5
+
+ amd,rproc:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ phandle to the remoteproc_r5 rproc node using which APU interacts
+ with remote processor. APU primarily communicates with the RPU for
+ accessing the DDRMC address space and getting error notification.
+
+required:
+ - compatible
+ - amd,rproc
+
+additionalProperties: false
+
+examples:
+ - |
+ memory-controller {
+ compatible = "xlnx,versal-net-ddrmc5";
+ amd,rproc = <&remoteproc_r5>;
+ };
--
2.34.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v7 5/5] EDAC/VersalNET: Add support for error notification
2025-05-29 7:00 [PATCH v7 0/5] EDAC/Versal NET: Add support for error notification Shubhrajyoti Datta
` (3 preceding siblings ...)
2025-05-29 7:00 ` [PATCH v7 4/5] dt-bindings: memory-controllers: Add support for Versal NET EDAC Shubhrajyoti Datta
@ 2025-05-29 7:00 ` Shubhrajyoti Datta
2025-07-03 17:31 ` Borislav Petkov
2025-05-30 14:12 ` [PATCH v7 0/5] EDAC/Versal NET: " Michal Simek
5 siblings, 1 reply; 14+ messages in thread
From: Shubhrajyoti Datta @ 2025-05-29 7:00 UTC (permalink / raw)
To: linux-kernel, devicetree, linux-edac
Cc: git, Krzysztof Kozlowski, Rob Herring, Conor Dooley,
Borislav Petkov, Tony Luck, James Morse, Mauro Carvalho Chehab,
Robert Richter, Nipun Gupta, Nikhil Agarwal, Shubhrajyoti Datta
Hardware errors can corrupt the RAM or other system components, without
detection it can lead to crashes and system failure, this driver helps in
reporting of errors to user space, triggering corrective actions.
Add support for single bit error correction, double bit error detection
on AMD Versal NET DDR memory controller and other system errors
from various IP subsystems (e.g., RPU, NOCs, HNICX, PL) reporting.
The Versal NET EDAC listens to the notifications from NMC(Network
management controller) on RPMsg (Remote Processor Messaging).
The channel used for communicating to RPMsg is named "error_edac".
Upon receiving the notification the Versal NET edac driver
sends a RAS((Reliability, Availability, and Serviceability) event
trace. This aids the user space application to decide on the
corrective action.
For reporting events driver registers to the RAS framework
specifically:
Memory errors are reported through the Memory Controller (MC) events.
Non-memory errors are reported using non-standard RAS events.
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
---
Changes in v7:
- Update the header paths
- merge edac_cdx_pcol.h
Changes in v6:
- Update to xlnx,versal-net-ddrmc5
- Update the kconfig message
- Make the messages uniform
- Add some more supported events
- rename regval to reglo
- combine/ reformat functions
- remove trailing comments
- Remove unneeded comments
- make the amd_mcdi function void
- rename versalnet_rpmsg_edac to versalnet_edac
- Remove the column bit and use them directly
- Update the comments
- Update the mod_name to versalnet_edac
- remove the global priv col and rows
- rename edac_priv to mc_priv
- Update the comment description for dwidth
- Remove error_id enum
- rename the variable par to parity
- make get_ddr_config void
- Fix memory leak of the mcdi structure
- Update the spelling
- Remove the workqueue
Changes in v5:
- Update the compatible
- Update the handle_error documentation
Changes in v4:
- Update the compatible
Changes in v3:
- make remove void
Changes in v2:
- remove reset
- Add the remote proc requests
- remove probe_once
- reorder the rpmsg registration
- the data width , rank and number of channel is read from message.
drivers/edac/Kconfig | 11 +
drivers/edac/Makefile | 1 +
drivers/edac/versalnet_edac.c | 1108 +++++++++++++++++++++++++++++
include/linux/cdx/edac_cdx_pcol.h | 28 +
4 files changed, 1148 insertions(+)
create mode 100644 drivers/edac/versalnet_edac.c
create mode 100644 include/linux/cdx/edac_cdx_pcol.h
diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 19ad3c3b675d..081bccd3405b 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -576,4 +576,15 @@ config EDAC_LOONGSON
errors (CE) only. Loongson-3A5000/3C5000/3D5000/3A6000/3C6000
are compatible.
+config EDAC_VERSALNET
+ tristate "AMD Versal NET EDAC"
+ depends on CDX_CONTROLLER && ARCH_ZYNQMP
+ help
+ Support for single bit error correction, double bit error detection on
+ the AMD Versal NET DDR memory controller and other system errors
+ from various IP subsystems (e.g., RPU, NOCs, HNICX, PL).
+
+ Report single bit errors (CE), double bit errors (UE) and
+ errors from other IP subsystems like RPU, APU, NOC, HNICX and PL.
+
endif # EDAC
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index a8f2d8f6c894..8eca81f04160 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -88,3 +88,4 @@ obj-$(CONFIG_EDAC_NPCM) += npcm_edac.o
obj-$(CONFIG_EDAC_ZYNQMP) += zynqmp_edac.o
obj-$(CONFIG_EDAC_VERSAL) += versal_edac.o
obj-$(CONFIG_EDAC_LOONGSON) += loongson_edac.o
+obj-$(CONFIG_EDAC_VERSALNET) += versalnet_edac.o
diff --git a/drivers/edac/versalnet_edac.c b/drivers/edac/versalnet_edac.c
new file mode 100644
index 000000000000..2cc7c8681deb
--- /dev/null
+++ b/drivers/edac/versalnet_edac.c
@@ -0,0 +1,1108 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * AMD Versal NET memory controller driver
+ * Copyright (C) 2025 Advanced Micro Devices, Inc.
+ */
+
+#include <linux/cdx/edac_cdx_pcol.h>
+#include <linux/edac.h>
+#include <linux/module.h>
+#include <linux/of_device.h>
+#include <linux/ras.h>
+#include <linux/remoteproc.h>
+#include <linux/rpmsg.h>
+#include <linux/sizes.h>
+#include <ras/ras_event.h>
+
+#include "edac_module.h"
+
+/* Granularity of reported error in bytes */
+#define DDRMC5_EDAC_ERR_GRAIN 1
+#define MC_CMD_EDAC_GET_DDR_CONFIG_IN_LEN 4
+
+#define DDRMC5_EDAC_MSG_SIZE 256
+
+#define DDRMC5_IRQ_CE_MASK GENMASK(18, 15)
+#define DDRMC5_IRQ_UE_MASK GENMASK(14, 11)
+
+#define DDRMC5_RANK_1_MASK GENMASK(11, 6)
+#define MASK_24 GENMASK(29, 24)
+#define MASK_0 GENMASK(5, 0)
+
+#define DDRMC5_LRANK_1_MASK GENMASK(11, 6)
+#define DDRMC5_LRANK_2_MASK GENMASK(17, 12)
+#define DDRMC5_BANK1_MASK GENMASK(11, 6)
+#define DDRMC5_GRP_0_MASK GENMASK(17, 12)
+#define DDRMC5_GRP_1_MASK GENMASK(23, 18)
+
+#define ECCR_UE_CE_ADDR_HI_ROW_MASK GENMASK(10, 0)
+
+#define DDRMC5_MAX_ROW_CNT 18
+#define DDRMC5_MAX_COL_CNT 11
+#define DDRMC5_MAX_RANK_CNT 2
+#define DDRMC5_MAX_LRANK_CNT 4
+#define DDRMC5_MAX_BANK_CNT 2
+#define DDRMC5_MAX_GRP_CNT 3
+
+#define DDRMC5_REGHI_ROW 7
+#define DDRMC5_EACHBIT 1
+#define DDRMC5_ERR_TYPE_CE 0
+#define DDRMC5_ERR_TYPE_UE 1
+#define DDRMC5_HIGH_MEM_EN BIT(20)
+#define DDRMC5_MEM_MASK GENMASK(19, 0)
+#define DDRMC5_X16_BASE 256
+#define DDRMC5_X16_ECC 32
+#define DDRMC5_X16_SIZE (DDRMC5_X16_BASE + DDRMC5_X16_ECC)
+#define DDRMC5_X32_SIZE 576
+#define DDRMC5_HIMEM_BASE (256 * SZ_1M)
+#define DDRMC5_ILC_HIMEM_EN BIT(28)
+#define DDRMC5_ILC_MEM GENMASK(27, 0)
+#define DDRMC5_INTERLEAVE_SEL GENMASK(3, 0)
+#define DDRMC5_BUS_WIDTH_MASK GENMASK(19, 18)
+#define DDRMC5_NUM_CHANS_MASK BIT(17)
+#define DDRMC5_RANK_MASK GENMASK(15, 14)
+#define DDRMC5_DWIDTH_MASK GENMASK(5, 4)
+
+#define AMD_MIN_BUF_LEN 0x28
+#define AMD_ERROR_LEVEL 2
+#define AMD_ERRORID 3
+#define TOTAL_ERR_LENGTH 5
+#define AMD_MSG_ERR_OFFSET 8
+#define AMD_MSG_ERR_LENGTH 9
+#define AMD_ERR_DATA 10
+#define MCDI_RESPONSE 0xFF
+
+#define ERR_NOTIFICATION_MAX 96
+#define REG_MAX 152
+#define ADEC_MAX 152
+#define NUM_CONTROLLERS 8
+#define REGS_PER_CONTROLLER 19
+#define ADEC_NUM 19
+#define MC_CMD_EDAC_GET_OVERALL_DDR_CONFIG 2
+#define BUFFER_SZ 80
+
+#define XDDR5_BUS_WIDTH_64 0
+#define XDDR5_BUS_WIDTH_32 1
+#define XDDR5_BUS_WIDTH_16 2
+
+#define AMD_ERR "[VERSAL_EDAC_ERR_ID: %d] Error type:"
+/**
+ * struct ecc_error_info - ECC error log information.
+ * @burstpos: Burst position.
+ * @lrank: Logical Rank number.
+ * @rank: Rank number.
+ * @group: Group number.
+ * @bank: Bank number.
+ * @col: Column number.
+ * @row: Row number.
+ * @rowhi: Row number higher bits.
+ * @i: ECC error info.
+ */
+union ecc_error_info {
+ struct {
+ u32 burstpos:3;
+ u32 lrank:4;
+ u32 rank:2;
+ u32 group:3;
+ u32 bank:2;
+ u32 col:11;
+ u32 row:7;
+ u32 rowhi;
+ };
+ u64 i;
+} __packed;
+
+/**
+ * struct row_col_mapping - Row and column bit positions in ADEC(address decoder) registers.
+ * @row0: Row0 bit position.
+ * @row1: Row1 bit position.
+ * @row2: Row2 bit position.
+ * @row3: Row3 bit position.
+ * @row4: Row4 bit position.
+ * @reserved: Unused bits.
+ * @col1: Column 1 bit position.
+ * @col2: Column 2 bit position.
+ * @col3: Column 3 bit position.
+ * @col4: Column 4 bit position.
+ * @col5: Column 5 bit position.
+ * @reservedcol: Unused column bits.
+ * @i: ADEC register info.
+ */
+union row_col_mapping {
+ struct {
+ u32 row0:6;
+ u32 row1:6;
+ u32 row2:6;
+ u32 row3:6;
+ u32 row4:6;
+ u32 reserved:2;
+ };
+ struct {
+ u32 col1:6;
+ u32 col2:6;
+ u32 col3:6;
+ u32 col4:6;
+ u32 col5:6;
+ u32 reservedcol:2;
+ };
+ u32 i;
+} __packed;
+
+/**
+ * struct ecc_status - ECC status information to report.
+ * @ceinfo: Correctable error log information.
+ * @ueinfo: Uncorrected error log information.
+ * @channel: Channel number.
+ * @error_type: Error type information.
+ */
+struct ecc_status {
+ union ecc_error_info ceinfo[2];
+ union ecc_error_info ueinfo[2];
+ u8 channel;
+ u8 error_type;
+};
+
+/**
+ * struct mc_priv - DDR memory controller private instance data.
+ * @message: Buffer for framing the event specific info.
+ * @stat: ECC status information.
+ * @error_id: The error id.
+ * @error_level: The error level.
+ * @dwidth: Width of data bus excluding ECC bits.
+ * @part_len: The support of the message received.
+ * @regs: The registers sent on the rpmsg.
+ * @adec: Address decode registers.
+ * @mci: Memory controller interface.
+ * @ept: rpmsg endpoint.
+ * @mcdi: The mcdi handle.
+ */
+struct mc_priv {
+ char message[DDRMC5_EDAC_MSG_SIZE];
+ struct ecc_status stat;
+ u32 error_id;
+ u32 error_level;
+ u32 dwidth;
+ u32 part_len;
+ u32 regs[REG_MAX];
+ u32 adec[ADEC_MAX];
+ struct mem_ctl_info *mci;
+ struct rpmsg_endpoint *ept;
+ struct cdx_mcdi *mcdi;
+};
+
+/* Address decoder (ADEC) register information
+ * To match the order in which the register information is received from
+ * firmware
+ */
+enum adec_info {
+ CONF = 0,
+ ADEC0,
+ ADEC1,
+ ADEC2,
+ ADEC3,
+ ADEC4,
+ ADEC5,
+ ADEC6,
+ ADEC7,
+ ADEC8,
+ ADEC9,
+ ADEC10,
+ ADEC11,
+ ADEC12,
+ ADEC13,
+ ADEC14,
+ ADEC15,
+ ADEC16,
+ ADECILC,
+};
+
+enum reg_info {
+ ISR = 0,
+ IMR,
+ ECCR0_ERR_STATUS,
+ ECCR0_ADDR_LO,
+ ECCR0_ADDR_HI,
+ ECCR0_DATA_LO,
+ ECCR0_DATA_HI,
+ ECCR0_PAR,
+ ECCR1_ERR_STATUS,
+ ECCR1_ADDR_LO,
+ ECCR1_ADDR_HI,
+ ECCR1_DATA_LO,
+ ECCR1_DATA_HI,
+ ECCR1_PAR,
+ XMPU_ERR,
+ XMPU_ERR_ADDR_L0,
+ XMPU_ERR_ADDR_HI,
+ XMPU_ERR_AXI_ID,
+ ADEC_CHK_ERR_LOG,
+};
+
+static bool get_ddr_info(u32 *error_data, struct mc_priv *priv)
+{
+ u32 reglo, reghi, parity, eccr0_val, eccr1_val, isr;
+ struct ecc_status *p;
+
+ p = &priv->stat;
+
+ isr = error_data[ISR];
+
+ if (!(isr & (DDRMC5_IRQ_UE_MASK | DDRMC5_IRQ_CE_MASK)))
+ return false;
+
+ eccr0_val = error_data[ECCR0_ERR_STATUS];
+ eccr1_val = error_data[ECCR1_ERR_STATUS];
+
+ if (!eccr0_val && !eccr1_val)
+ return false;
+
+ if (!eccr0_val)
+ p->channel = 1;
+ else
+ p->channel = 0;
+
+ reglo = error_data[ECCR0_ADDR_LO];
+ reghi = error_data[ECCR0_ADDR_HI];
+ if ((isr & DDRMC5_IRQ_CE_MASK))
+ p->ceinfo[0].i = reglo | (u64)reghi << 32;
+ else if ((isr & DDRMC5_IRQ_UE_MASK))
+ p->ueinfo[0].i = reglo | (u64)reghi << 32;
+
+ parity = error_data[ECCR0_PAR];
+ edac_dbg(2, "ERR DATA: 0x%08X%08X ERR DATA PARITY: 0x%08X\n",
+ reghi, reglo, parity);
+
+ reglo = error_data[ECCR1_ADDR_LO];
+ reghi = error_data[ECCR1_ADDR_HI];
+ if ((isr & DDRMC5_IRQ_CE_MASK))
+ p->ceinfo[1].i = reglo | (u64)reghi << 32;
+ else if ((isr & DDRMC5_IRQ_UE_MASK))
+ p->ueinfo[1].i = reglo | (u64)reghi << 32;
+
+ parity = error_data[ECCR1_PAR];
+ edac_dbg(2, "ERR DATA: 0x%08X%08X ERR DATA PARITY: 0x%08X\n",
+ reghi, reglo, parity);
+
+ return true;
+}
+
+/**
+ * convert_to_physical - Convert to physical address.
+ * @priv: DDR memory controller private instance data.
+ * @pinf: ECC error info structure.
+ * @controller: Controller number of the DDRMC5
+ * @error_data: the DDRMC5 ADEC address decoder register data
+ *
+ * Return: Physical address of the DDR memory.
+ */
+static unsigned long convert_to_physical(struct mc_priv *priv,
+ union ecc_error_info pinf,
+ int controller, int *error_data)
+{
+ u32 row, blk, rsh_req_addr, interleave, ilc_base_ctrl_add, ilc_himem_en, reg, offset;
+ u64 high_mem_base, high_mem_offset, low_mem_offset, ilcmem_base;
+ unsigned long err_addr = 0, addr;
+ union row_col_mapping cols;
+ union row_col_mapping rows;
+ u32 col_bit_0;
+
+ row = pinf.rowhi << DDRMC5_REGHI_ROW | pinf.row;
+ offset = controller * ADEC_NUM;
+
+ reg = error_data[ADEC6];
+ rows.i = reg;
+ err_addr |= (row & BIT(0)) << rows.row0;
+ row >>= DDRMC5_EACHBIT;
+ err_addr |= (row & BIT(0)) << rows.row1;
+ row >>= DDRMC5_EACHBIT;
+ err_addr |= (row & BIT(0)) << rows.row2;
+ row >>= DDRMC5_EACHBIT;
+ err_addr |= (row & BIT(0)) << rows.row3;
+ row >>= DDRMC5_EACHBIT;
+ err_addr |= (row & BIT(0)) << rows.row4;
+ row >>= DDRMC5_EACHBIT;
+
+ reg = error_data[ADEC7];
+ rows.i = reg;
+ err_addr |= (row & BIT(0)) << rows.row0;
+ row >>= DDRMC5_EACHBIT;
+ err_addr |= (row & BIT(0)) << rows.row1;
+ row >>= DDRMC5_EACHBIT;
+ err_addr |= (row & BIT(0)) << rows.row2;
+ row >>= DDRMC5_EACHBIT;
+ err_addr |= (row & BIT(0)) << rows.row3;
+ row >>= DDRMC5_EACHBIT;
+ err_addr |= (row & BIT(0)) << rows.row4;
+ row >>= DDRMC5_EACHBIT;
+
+ reg = error_data[ADEC8];
+ rows.i = reg;
+ err_addr |= (row & BIT(0)) << rows.row0;
+ row >>= DDRMC5_EACHBIT;
+ err_addr |= (row & BIT(0)) << rows.row1;
+ row >>= DDRMC5_EACHBIT;
+ err_addr |= (row & BIT(0)) << rows.row2;
+ row >>= DDRMC5_EACHBIT;
+ err_addr |= (row & BIT(0)) << rows.row3;
+ row >>= DDRMC5_EACHBIT;
+ err_addr |= (row & BIT(0)) << rows.row4;
+
+ reg = error_data[ADEC9];
+ rows.i = reg;
+
+ err_addr |= (row & BIT(0)) << rows.row0;
+ row >>= DDRMC5_EACHBIT;
+ err_addr |= (row & BIT(0)) << rows.row1;
+ row >>= DDRMC5_EACHBIT;
+ err_addr |= (row & BIT(0)) << rows.row2;
+ row >>= DDRMC5_EACHBIT;
+
+ col_bit_0 = FIELD_GET(MASK_24, error_data[ADEC9]);
+ pinf.col >>= 1;
+ err_addr |= (pinf.col & 1) << col_bit_0;
+
+ cols.i = error_data[ADEC10];
+ err_addr |= (pinf.col & 1) << cols.col1;
+ pinf.col >>= 1;
+ err_addr |= (pinf.col & 1) << cols.col2;
+ pinf.col >>= 1;
+ err_addr |= (pinf.col & 1) << cols.col3;
+ pinf.col >>= 1;
+ err_addr |= (pinf.col & 1) << cols.col4;
+ pinf.col >>= 1;
+ err_addr |= (pinf.col & 1) << cols.col5;
+ pinf.col >>= 1;
+
+ cols.i = error_data[ADEC11];
+ err_addr |= (pinf.col & 1) << cols.col1;
+ pinf.col >>= 1;
+ err_addr |= (pinf.col & 1) << cols.col2;
+ pinf.col >>= 1;
+ err_addr |= (pinf.col & 1) << cols.col3;
+ pinf.col >>= 1;
+ err_addr |= (pinf.col & 1) << cols.col4;
+ pinf.col >>= 1;
+ err_addr |= (pinf.col & 1) << cols.col5;
+ pinf.col >>= 1;
+
+ reg = error_data[ADEC12];
+ err_addr |= (pinf.bank & BIT(0)) << (reg & MASK_0);
+ pinf.bank >>= DDRMC5_EACHBIT;
+ err_addr |= (pinf.bank & BIT(0)) << FIELD_GET(DDRMC5_BANK1_MASK, reg);
+ pinf.bank >>= DDRMC5_EACHBIT;
+
+ err_addr |= (pinf.bank & BIT(0)) << FIELD_GET(DDRMC5_GRP_0_MASK, reg);
+ pinf.group >>= DDRMC5_EACHBIT;
+ err_addr |= (pinf.bank & BIT(0)) << FIELD_GET(DDRMC5_GRP_1_MASK, reg);
+ pinf.group >>= DDRMC5_EACHBIT;
+ err_addr |= (pinf.bank & BIT(0)) << FIELD_GET(MASK_24, reg);
+ pinf.group >>= DDRMC5_EACHBIT;
+
+ reg = error_data[ADEC4];
+ err_addr |= (pinf.rank & BIT(0)) << (reg & MASK_0);
+ pinf.rank >>= DDRMC5_EACHBIT;
+ err_addr |= (pinf.rank & BIT(0)) << FIELD_GET(DDRMC5_RANK_1_MASK, reg);
+ pinf.rank >>= DDRMC5_EACHBIT;
+
+ reg = error_data[ADEC5];
+ err_addr |= (pinf.lrank & BIT(0)) << (reg & MASK_0);
+ pinf.lrank >>= DDRMC5_EACHBIT;
+ err_addr |= (pinf.lrank & BIT(0)) << FIELD_GET(DDRMC5_LRANK_1_MASK, reg);
+ pinf.lrank >>= DDRMC5_EACHBIT;
+ err_addr |= (pinf.lrank & BIT(0)) << FIELD_GET(DDRMC5_LRANK_2_MASK, reg);
+ pinf.lrank >>= DDRMC5_EACHBIT;
+ err_addr |= (pinf.lrank & BIT(0)) << FIELD_GET(MASK_24, reg);
+ pinf.lrank >>= DDRMC5_EACHBIT;
+
+ high_mem_base = (priv->adec[ADEC2 + offset] & DDRMC5_MEM_MASK) * DDRMC5_HIMEM_BASE;
+ interleave = priv->adec[ADEC13 + offset] & DDRMC5_INTERLEAVE_SEL;
+
+ high_mem_offset = priv->adec[ADEC3 + offset] & DDRMC5_MEM_MASK;
+ low_mem_offset = priv->adec[ADEC1 + offset] & DDRMC5_MEM_MASK;
+ reg = priv->adec[ADEC14 + offset];
+ ilc_himem_en = !!(reg & DDRMC5_ILC_HIMEM_EN);
+ ilcmem_base = (reg & DDRMC5_ILC_MEM) * SZ_1M;
+ if (ilc_himem_en)
+ ilc_base_ctrl_add = ilcmem_base - high_mem_offset;
+ else
+ ilc_base_ctrl_add = ilcmem_base - low_mem_offset;
+
+ if (priv->dwidth == DEV_X16) {
+ blk = err_addr / DDRMC5_X16_SIZE;
+ rsh_req_addr = (blk << 8) + ilc_base_ctrl_add;
+ err_addr = rsh_req_addr * interleave * 2;
+ } else {
+ blk = err_addr / DDRMC5_X32_SIZE;
+ rsh_req_addr = (blk << 9) + ilc_base_ctrl_add;
+ err_addr = rsh_req_addr * interleave * 2;
+ }
+
+ if ((priv->adec[ADEC2 + offset] & DDRMC5_HIGH_MEM_EN) && err_addr >= high_mem_base)
+ addr = err_addr - high_mem_offset;
+ else
+ addr = err_addr - low_mem_offset;
+
+ return addr;
+}
+
+/**
+ * handle_error - Handle Correctable and Uncorrectable errors.
+ * @priv: DDR memory controller private instance data.
+ * @stat: ECC status structure.
+ * @controller: Controller number of the DDRMC5
+ * @error_data: the DDRMC5 ADEC address decoder register data
+ *
+ * Handles ECC correctable and uncorrectable errors.
+ */
+static void handle_error(struct mc_priv *priv, struct ecc_status *stat,
+ int controller, int *error_data)
+{
+ struct mem_ctl_info *mci = priv->mci;
+ union ecc_error_info pinf;
+ unsigned long pa;
+ phys_addr_t pfn;
+ int err;
+
+ if (stat->error_type == DDRMC5_ERR_TYPE_CE) {
+ pinf = stat->ceinfo[stat->channel];
+ snprintf(priv->message, DDRMC5_EDAC_MSG_SIZE,
+ "Error type:%s Controller %d Addr at %lx\n",
+ "CE", controller, convert_to_physical(priv, pinf, controller, error_data));
+
+ edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
+ 1, 0, 0, 0, 0, 0, -1,
+ priv->message, "");
+ }
+
+ if (stat->error_type == DDRMC5_ERR_TYPE_UE) {
+ pinf = stat->ueinfo[stat->channel];
+ snprintf(priv->message, DDRMC5_EDAC_MSG_SIZE,
+ "Error type:%s controller %d Addr at %lx\n",
+ "UE", controller, convert_to_physical(priv, pinf, controller, error_data));
+
+ edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
+ 1, 0, 0, 0, 0, 0, -1,
+ priv->message, "");
+ pa = convert_to_physical(priv, pinf, controller, error_data);
+ pfn = PHYS_PFN(pa);
+
+ if (IS_ENABLED(CONFIG_MEMORY_FAILURE)) {
+ err = memory_failure(pfn, MF_ACTION_REQUIRED);
+ if (err)
+ edac_dbg(2, "In fail of memory_failure %d\n", err);
+ else
+ edac_dbg(2, "Page at PA 0x%lx is hardware poisoned\n", pa);
+ }
+ }
+}
+
+/**
+ * init_csrows - Initialize the csrow data.
+ * @mci: EDAC memory controller instance.
+ *
+ * Initialize the chip select rows associated with the EDAC memory
+ * controller instance.
+ */
+static void init_csrows(struct mem_ctl_info *mci)
+{
+ struct mc_priv *priv = mci->pvt_info;
+ struct csrow_info *csi;
+ struct dimm_info *dimm;
+ u32 row;
+ int ch;
+
+ for (row = 0; row < mci->nr_csrows; row++) {
+ csi = mci->csrows[row];
+ for (ch = 0; ch < csi->nr_channels; ch++) {
+ dimm = csi->channels[ch]->dimm;
+ dimm->edac_mode = EDAC_SECDED;
+ dimm->mtype = MEM_DDR5;
+ dimm->grain = DDRMC5_EDAC_ERR_GRAIN;
+ dimm->dtype = priv->dwidth;
+ }
+ }
+}
+
+static void mc_init(struct mem_ctl_info *mci, struct platform_device *pdev)
+{
+ mci->pdev = &pdev->dev;
+ platform_set_drvdata(pdev, mci);
+
+ /* Initialize controller capabilities and configuration */
+ mci->mtype_cap = MEM_FLAG_DDR5;
+ mci->edac_ctl_cap = EDAC_FLAG_NONE | EDAC_FLAG_SECDED;
+ mci->scrub_cap = SCRUB_HW_SRC;
+ mci->scrub_mode = SCRUB_NONE;
+
+ mci->edac_cap = EDAC_FLAG_SECDED;
+ mci->ctl_name = "amd_ddr_controller";
+ mci->dev_name = dev_name(&pdev->dev);
+ mci->mod_name = "versalnet_edac";
+
+ edac_op_state = EDAC_OPSTATE_INT;
+
+ init_csrows(mci);
+}
+
+#define to_mci(k) container_of(k, struct mem_ctl_info, dev)
+
+static unsigned int amd_mcdi_rpc_timeout(struct cdx_mcdi *cdx, unsigned int cmd)
+{
+ return MCDI_RPC_TIMEOUT;
+}
+
+static void amd_mcdi_request(struct cdx_mcdi *cdx,
+ const struct cdx_dword *hdr, size_t hdr_len,
+ const struct cdx_dword *sdu, size_t sdu_len)
+{
+ unsigned char *send_buf;
+ int ret;
+
+ send_buf = kzalloc(hdr_len + sdu_len, GFP_KERNEL);
+ if (!send_buf)
+ return;
+
+ memcpy(send_buf, hdr, hdr_len);
+ memcpy(send_buf + hdr_len, sdu, sdu_len);
+
+ ret = rpmsg_send(cdx->ept, send_buf, hdr_len + sdu_len);
+ if (ret)
+ dev_err(&cdx->rpdev->dev, "Failed to send rpmsg data\n");
+ kfree(send_buf);
+}
+
+static const struct cdx_mcdi_ops mcdi_ops = {
+ .mcdi_rpc_timeout = amd_mcdi_rpc_timeout,
+ .mcdi_request = amd_mcdi_request,
+};
+
+static void get_ddr_config(u32 index, u32 *buffer, struct cdx_mcdi *amd_mcdi)
+{
+ size_t outlen;
+ int ret;
+
+ MCDI_DECLARE_BUF(inbuf, MC_CMD_EDAC_GET_DDR_CONFIG_IN_LEN);
+ MCDI_DECLARE_BUF(outbuf, BUFFER_SZ);
+
+ MCDI_SET_DWORD(inbuf, EDAC_GET_DDR_CONFIG_IN_CONTROLLER_INDEX, index);
+
+ ret = cdx_mcdi_rpc(amd_mcdi, MC_CMD_EDAC_GET_DDR_CONFIG, inbuf, sizeof(inbuf),
+ outbuf, sizeof(outbuf), &outlen);
+ if (!ret)
+ memcpy(buffer, MCDI_PTR(outbuf, EDAC_GET_DDR_CONFIG_OUT_REGISTER_VALUES),
+ (ADEC_NUM * 4));
+}
+
+static void amd_setup_mcdi(struct mc_priv *mc_priv)
+{
+ struct cdx_mcdi *amd_mcdi;
+ int ret, i;
+
+ amd_mcdi = kzalloc(sizeof(*amd_mcdi), GFP_KERNEL);
+ if (!amd_mcdi)
+ return;
+
+ amd_mcdi->mcdi_ops = &mcdi_ops;
+ ret = cdx_mcdi_init(amd_mcdi);
+ if (ret) {
+ kfree(amd_mcdi);
+ return;
+ }
+
+ amd_mcdi->ept = mc_priv->ept;
+ mc_priv->mcdi = amd_mcdi;
+
+ for (i = 0; i < NUM_CONTROLLERS; i++)
+ get_ddr_config(i, &mc_priv->adec[ADEC_NUM * i], amd_mcdi);
+}
+
+static const guid_t amd_versalnet_guid = GUID_INIT(0x82678888, 0xa556, 0x44f2,
+ 0xb8, 0xb4, 0x45, 0x56, 0x2e,
+ 0x8c, 0x5b, 0xec);
+
+static int amd_rpmsg_cb(struct rpmsg_device *rpdev, void *data,
+ int len, void *priv, u32 src)
+{
+ struct mc_priv *mc_priv = dev_get_drvdata(&rpdev->dev);
+ const guid_t *sec_type = &guid_null;
+ u32 length, offset, error_id;
+ u32 *result = (u32 *)data;
+ struct ecc_status *p;
+ int i, j, k, sec_sev;
+ u32 *adec_data;
+
+ if (*(u8 *)data == MCDI_RESPONSE) {
+ cdx_mcdi_process_cmd(mc_priv->mcdi, (struct cdx_dword *)data, len);
+ return 0;
+ }
+
+ sec_sev = result[AMD_ERROR_LEVEL];
+ error_id = result[AMD_ERRORID];
+ length = result[AMD_MSG_ERR_LENGTH];
+ offset = result[AMD_MSG_ERR_OFFSET];
+
+ if (result[TOTAL_ERR_LENGTH] > length) {
+ if (!mc_priv->part_len)
+ mc_priv->part_len = length;
+ else
+ mc_priv->part_len += length;
+ /*
+ * The data can come in 2 stretches. Construct the regs from 2
+ * messages the offset indicates the offset from which the data is to
+ * be taken
+ */
+ for (i = 0 ; i < length; i++) {
+ k = offset + i;
+ j = AMD_ERR_DATA + i;
+ mc_priv->regs[k] = result[j];
+ }
+ if (mc_priv->part_len < result[TOTAL_ERR_LENGTH])
+ return 0;
+ mc_priv->part_len = 0;
+ }
+
+ mc_priv->error_id = error_id;
+ mc_priv->error_level = result[AMD_ERROR_LEVEL];
+
+ switch (error_id) {
+ /* GSW Non-Correctable error */
+ case 5:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "General Software Non-Correctable error", error_id);
+ break;
+ /* CFU error */
+ case 6:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "CFU error", error_id);
+ break;
+ /* CFRAME error */
+ case 7:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "CFRAME error", error_id);
+ break;
+ /* Microblaze correctable error */
+ case 10:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "DDRMC Microblaze Correctable ECC error", error_id);
+ break;
+ /* Microblaze Non-correctable */
+ case 11:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "DDRMC Microblaze Non-Correctable ECC error", error_id);
+ break;
+ /* MMCM error */
+ case 15:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "MMCM error", error_id);
+ break;
+ /* HNIX correctable */
+ case 16:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "HNICX Correctable error", error_id);
+ break;
+ /* HNIX Non-correctable */
+ case 17:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "HNICX Non-Correctable error", error_id);
+ break;
+ /* DDRMC correctable error */
+ case 18:
+ p = &mc_priv->stat;
+ memset(p, 0, sizeof(struct ecc_status));
+ p->error_type = DDRMC5_ERR_TYPE_CE;
+ for (i = 0 ; i < NUM_CONTROLLERS; i++) {
+ if (get_ddr_info(&mc_priv->regs[i * REGS_PER_CONTROLLER], mc_priv)) {
+ adec_data = mc_priv->adec + ADEC_NUM * i;
+ handle_error(mc_priv, &mc_priv->stat, i, adec_data);
+ }
+ }
+ return 0;
+ /* DDRMC Non-correctable */
+ case 19:
+ p = &mc_priv->stat;
+ memset(p, 0, sizeof(struct ecc_status));
+ p->error_type = DDRMC5_ERR_TYPE_UE;
+ for (i = 0 ; i < NUM_CONTROLLERS; i++) {
+ if (get_ddr_info(&mc_priv->regs[i * REGS_PER_CONTROLLER], mc_priv)) {
+ adec_data = mc_priv->adec + ADEC_NUM * i;
+ handle_error(mc_priv, &mc_priv->stat, i, adec_data);
+ }
+ }
+ return 0;
+ /* GT Correctable error */
+ case 21:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "GT Non-Correctable error", error_id);
+ break;
+ /* PL Sysmon correctable error */
+ case 22:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "PL Sysmon Correctable error", error_id);
+ break;
+ /* PL Sysmon Non-correctable error */
+ case 23:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "PL Sysmon Non-Correctable error", error_id);
+ break;
+ /* LPX unexpected dfx activation error */
+ case 111:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "LPX unexpected dfx activation error", error_id);
+ break;
+ /* INT LPD Non-Correctable error */
+ case 114:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "INT_LPD Non-Correctable error", error_id);
+ break;
+ /* INT OCM Non-Correctable error */
+ case 116:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "INT_OCM Non-Correctable error", error_id);
+ break;
+ /* INT FPD Correctable error */
+ case 117:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "INT_FPD Correctable error", error_id);
+ break;
+ /* INT FPD Non-Correctable error */
+ case 118:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "INT_FPD Non-Correctable error", error_id);
+ break;
+ /* INT IOU Non-Correctable error */
+ case 120:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "INT_IOU Non-Correctable error", error_id);
+ break;
+ /* GIC AXI err_int_irq */
+ case 123:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "err_int_irq from APU GIC Distributor", error_id);
+ break;
+ /* GIC ECC fault_int_irq */
+ case 124:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "fault_int_irq from APU GIC Distribute", error_id);
+ break;
+ /* FPX SPLITTER error */
+ case 132 ... 139:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "FPX SPLITTER error", error_id);
+ break;
+ /* APU0 error */
+ case 140:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "APU Cluster 0 error", error_id);
+ break;
+ /* APU1 error */
+ case 141:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "APU Cluster 1 error", error_id);
+ break;
+ /* APU2 error */
+ case 142:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "APU Cluster 2 error", error_id);
+ break;
+ /* APU3 error */
+ case 143:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "APU Cluster 3 error", error_id);
+ break;
+ /* Window watchdog LPX error */
+ case 145:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "WWDT1 LPX error", error_id);
+ break;
+ /* IPI error */
+ case 147:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "IPI error", error_id);
+ break;
+ /* LPX AFIFS error */
+ case 152 ... 153:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "AFIFS error", error_id);
+ break;
+ /* LPX glitch Errors */
+ case 154 ... 155:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "LPX glitch error", error_id);
+ break;
+ /* FPX AFIFS error */
+ case 185 ... 186:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "FPX AFIFS error", error_id);
+ break;
+ /* AFIFM Non-fatal error */
+ case 195 ... 199:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "AFIFM error", error_id);
+ break;
+ /* Firmware error */
+ case 108:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "PSM Correctable error", error_id);
+ break;
+ /* PMC Correctable error */
+ case 59:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "PMC correctable error", error_id);
+ break;
+ /* PMC Un-Correctable error */
+ case 60:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "PMC Un correctable error", error_id);
+ break;
+ /* PMC Sysmon temperature shutdown alert and power supply failure errors */
+ case 43 ... 47:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "PMC Sysmon error", error_id);
+ break;
+ /* RPU Error */
+ case 163 ... 184:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "RPU error", error_id);
+ break;
+ /* OCM0 correctable error */
+ case 148:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "OCM0 correctable error", error_id);
+ break;
+ /* OCM1 correctable error */
+ case 149:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "OCM1 correctable error", error_id);
+ break;
+ /* OCM0 Un-correctable error */
+ case 150:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "OCM0 Un-correctable error", error_id);
+ break;
+ /* OCM1 Un-correctable error */
+ case 151:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "OCM1 Un-correctable error", error_id);
+ break;
+ /* PSX_CMN_3 error */
+ case 189:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "PSX_CMN_3 PD block consolidated error", error_id);
+ break;
+ /* FPD_INT_WRAP error */
+ case 191:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "FPD_INT_WRAP PD block consolidated error", error_id);
+ break;
+ /* CRAM_CE error */
+ case 232:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ AMD_ERR "CRAM Un-Correctable error", error_id);
+ break;
+ default:
+ snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
+ "VERSAL_EDAC_ERR_ID: %d", error_id);
+ break;
+ }
+
+ /* Convert to bytes */
+ length = result[TOTAL_ERR_LENGTH] * 4;
+ log_non_standard_event(sec_type, &amd_versalnet_guid, mc_priv->message,
+ sec_sev, (void *)&result[AMD_ERR_DATA], length);
+
+ return 0;
+}
+
+static struct rpmsg_device_id amd_rpmsg_id_table[] = {
+ { .name = "error_ipc" },
+ { },
+};
+MODULE_DEVICE_TABLE(rpmsg, amd_rpmsg_id_table);
+
+static int amd_rpmsg_probe(struct rpmsg_device *rpdev)
+{
+ struct rpmsg_channel_info chinfo = {0};
+ struct mc_priv *pg;
+
+ pg = (struct mc_priv *)amd_rpmsg_id_table[0].driver_data;
+ chinfo.src = RPMSG_ADDR_ANY;
+ chinfo.dst = rpdev->dst;
+ strscpy(chinfo.name, amd_rpmsg_id_table[0].name,
+ strlen(amd_rpmsg_id_table[0].name));
+
+ pg->ept = rpmsg_create_ept(rpdev, amd_rpmsg_cb, NULL, chinfo);
+ if (!pg->ept)
+ return dev_err_probe(&rpdev->dev, -ENXIO,
+ "Failed to create ept for channel %s\n",
+ chinfo.name);
+
+ dev_set_drvdata(&rpdev->dev, pg);
+ return 0;
+}
+
+static void amd_rpmsg_remove(struct rpmsg_device *rpdev)
+{
+ struct mc_priv *mc_priv = dev_get_drvdata(&rpdev->dev);
+
+ rpmsg_destroy_ept(mc_priv->ept);
+ dev_set_drvdata(&rpdev->dev, NULL);
+}
+
+static struct rpmsg_driver amd_rpmsg_driver = {
+ .drv.name = KBUILD_MODNAME,
+ .probe = amd_rpmsg_probe,
+ .remove = amd_rpmsg_remove,
+ .callback = amd_rpmsg_cb,
+ .id_table = amd_rpmsg_id_table,
+};
+
+/**
+ * get_dwidth - Return the controller memory width.
+ * @width: data width read from the config reg.
+ *
+ * Get the EDAC device type width appropriate for the controller
+ * configuration.
+ *
+ * Return: a device type width enumeration.
+ */
+static enum dev_type get_dwidth(u32 width)
+{
+ enum dev_type dt;
+
+ switch (width) {
+ case XDDR5_BUS_WIDTH_16:
+ dt = DEV_X16;
+ break;
+ case XDDR5_BUS_WIDTH_32:
+ dt = DEV_X32;
+ break;
+ case XDDR5_BUS_WIDTH_64:
+ dt = DEV_X64;
+ break;
+ default:
+ dt = DEV_UNKNOWN;
+ }
+
+ return dt;
+}
+
+static int mc_probe(struct platform_device *pdev)
+{
+ u32 num_chans, rank, dwidth, config;
+ struct device_node *r5_core_node;
+ struct edac_mc_layer layers[2];
+ struct mem_ctl_info *mci;
+ struct mc_priv *priv;
+ struct rproc *rp;
+ enum dev_type dt;
+ int rc, i;
+
+ r5_core_node = of_parse_phandle(pdev->dev.of_node, "amd,rproc", 0);
+ if (!r5_core_node) {
+ dev_err(&pdev->dev, "amd,rproc: invalid phandle\n");
+ return -EINVAL;
+ }
+
+ rp = rproc_get_by_phandle(r5_core_node->phandle);
+ if (!rp)
+ return -EPROBE_DEFER;
+
+ rc = rproc_boot(rp);
+ if (rc) {
+ dev_err(&pdev->dev, "Failed to attach to remote processor\n");
+ rproc_put(rp);
+ return rc;
+ }
+
+ priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
+ amd_rpmsg_id_table[0].driver_data = (kernel_ulong_t)priv;
+ rc = register_rpmsg_driver(&amd_rpmsg_driver);
+ if (rc) {
+ edac_printk(KERN_ERR, EDAC_MC,
+ "Failed to register RPMsg driver: %d\n", rc);
+ goto free_rproc;
+ }
+
+ amd_setup_mcdi(priv);
+
+ for (i = 0; i < NUM_CONTROLLERS; i++) {
+ config = priv->adec[CONF + i * ADEC_NUM];
+ num_chans = FIELD_GET(DDRMC5_NUM_CHANS_MASK, config);
+ rank = FIELD_GET(DDRMC5_RANK_MASK, config);
+ rank = 1 << rank;
+ dwidth = FIELD_GET(DDRMC5_BUS_WIDTH_MASK, config);
+ dt = get_dwidth(dwidth);
+
+ /* Find the first enabled device and register that one. */
+ if (dt != DEV_UNKNOWN) {
+ layers[0].type = EDAC_MC_LAYER_CHIP_SELECT;
+ layers[0].size = rank;
+ layers[0].is_virt_csrow = true;
+ layers[1].type = EDAC_MC_LAYER_CHANNEL;
+ layers[1].size = num_chans;
+ layers[1].is_virt_csrow = false;
+
+ mci = edac_mc_alloc(0, ARRAY_SIZE(layers), layers,
+ sizeof(struct mc_priv));
+ if (!mci) {
+ edac_printk(KERN_ERR, EDAC_MC,
+ "Failed memory allocation for mc instance\n");
+ rc = -ENOMEM;
+ goto free_rpmsg;
+ }
+
+ priv->mci = mci;
+ priv->dwidth = dt;
+ mc_init(mci, pdev);
+ rc = edac_mc_add_mc(mci);
+ if (rc) {
+ edac_printk(KERN_ERR, EDAC_MC,
+ "Failed to register with EDAC core\n");
+ goto free_edac_mc;
+ }
+ return 0;
+ }
+ }
+
+ return 0;
+
+free_edac_mc:
+ edac_mc_free(mci);
+free_rpmsg:
+ unregister_rpmsg_driver(&amd_rpmsg_driver);
+free_rproc:
+ rproc_shutdown(rp);
+ return rc;
+}
+
+static void mc_remove(struct platform_device *pdev)
+{
+ struct mem_ctl_info *mci = platform_get_drvdata(pdev);
+ struct mc_priv *priv = mci->pvt_info;
+
+ unregister_rpmsg_driver(&amd_rpmsg_driver);
+ edac_mc_del_mc(&pdev->dev);
+ edac_mc_free(mci);
+ rproc_shutdown(priv->mcdi->r5_rproc);
+}
+
+static const struct of_device_id amd_edac_match[] = {
+ { .compatible = "xlnx,versal-net-ddrmc5", },
+ {}
+};
+MODULE_DEVICE_TABLE(of, amd_edac_match);
+
+static struct platform_driver amd_ddr_edac_mc_driver = {
+ .driver = {
+ .name = "amd-ddrmc-edac",
+ .of_match_table = amd_edac_match,
+ },
+ .probe = mc_probe,
+ .remove = mc_remove,
+};
+
+module_platform_driver(amd_ddr_edac_mc_driver);
+
+MODULE_AUTHOR("AMD Inc");
+MODULE_DESCRIPTION("AMD DDRMC ECC driver");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/cdx/edac_cdx_pcol.h b/include/linux/cdx/edac_cdx_pcol.h
new file mode 100644
index 000000000000..a72ec131f083
--- /dev/null
+++ b/include/linux/cdx/edac_cdx_pcol.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Driver for AMD network controllers and boards
+ *
+ * Copyright (C) 2021, Xilinx, Inc.
+ * Copyright (C) 2022-2023, Advanced Micro Devices, Inc.
+ */
+
+#ifndef MC_CDX_PCOL_H
+#define MC_CDX_PCOL_H
+#include <linux/cdx/mcdi.h>
+
+#define MC_CMD_EDAC_GET_DDR_CONFIG_OUT_WORD_LENGTH_LEN 4
+/* Number of registers for the DDR controller */
+#define MC_CMD_EDAC_GET_DDR_CONFIG_OUT_REGISTER_VALUES_OFST 4
+#define MC_CMD_EDAC_GET_DDR_CONFIG_OUT_REGISTER_VALUES_LEN 4
+
+/***********************************/
+/* MC_CMD_EDAC_GET_DDR_CONFIG
+ * Provides detailed configuration for the DDR controller of the given index.
+ */
+#define MC_CMD_EDAC_GET_DDR_CONFIG 0x3
+
+/* MC_CMD_EDAC_GET_DDR_CONFIG_IN msgrequest */
+#define MC_CMD_EDAC_GET_DDR_CONFIG_IN_CONTROLLER_INDEX_OFST 0
+#define MC_CMD_EDAC_GET_DDR_CONFIG_IN_CONTROLLER_INDEX_LEN 4
+
+#endif /* MC_CDX_PCOL_H */
--
2.34.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v7 0/5] EDAC/Versal NET: Add support for error notification
2025-05-29 7:00 [PATCH v7 0/5] EDAC/Versal NET: Add support for error notification Shubhrajyoti Datta
` (4 preceding siblings ...)
2025-05-29 7:00 ` [PATCH v7 5/5] EDAC/VersalNET: Add support for error notification Shubhrajyoti Datta
@ 2025-05-30 14:12 ` Michal Simek
5 siblings, 0 replies; 14+ messages in thread
From: Michal Simek @ 2025-05-30 14:12 UTC (permalink / raw)
To: Shubhrajyoti Datta, linux-kernel, devicetree, linux-edac
Cc: git, Krzysztof Kozlowski, Rob Herring, Conor Dooley,
Borislav Petkov, Tony Luck, James Morse, Mauro Carvalho Chehab,
Robert Richter, Nipun Gupta, Nikhil Agarwal
On 5/29/25 09:00, Shubhrajyoti Datta wrote:
> Adds support for the error notification for the Versal NET EDAC driver.
> The driver receives error events via RPMsg instead of directly accessing
> hardware registers. The NMC((Network management controller), which has
> secure access to DDRMC registers, gathers the necessary information and
> transmits it through RPMsg.
>
> During probe, the driver registers with RPMsg and retrieves DDR
> configuration by scheduling a work item from the NMC.
> Once this is completed, it registers the EDAC controller.
> When an error occurs, the NMC sends an RPMsg, notifying the driver.
> The EDAC driver handles error reporting for all events.
> Also we register the EDAC once and it reports the errors for all the
> events including the 8 DDRMC controllers. So while registering we give
> the particulars of the 1st controller.
>
> Currently 20 errors has been tested.
>
>
> Changes in v7:
> - add a minimal header instead moving them
> - Add the kernel doc description
> - Add the prototype from first patch to export patch
> - Add the reviewed by tag
> - Update the header paths
> - merge edac_cdx_pcol.h
>
> Changes in v6:
> - Patch added
> - Update commit description
> - Update the commit message.
> - update to the chip name as xlnx,versal-net
> - Correct indentation
> - Update to xlnx,versal-net-ddrmc5
> - Update the kconfig message
> - Make the messages uniform
> - Add some more supported events
> - rename regval to reglo
> - combine/ reformat functions
> - remove trailing comments
> - Remove unneeded comments
> - make the amd_mcdi function void
> - rename versalnet_rpmsg_edac to versalnet_edac
> - Remove the column bit and use them directly
> - Update the comments
> - Update the mod_name to versalnet_edac
> - remove the global priv col and rows
> - rename edac_priv to mc_priv
> - Update the comment description for dwidth
> - Remove error_id enum
> - rename the variable par to parity
> - make get_ddr_config void
> - Fix memory leak of the mcdi structure
> - Update the spelling
> - Remove the workqueue
>
> Changes in v5:
> - Update the binding
> - Update the compatible
> - Update the handle_error documentation
>
> Changes in v4:
> - Update the compatible
> - align the example
> - Enhance the description for rproc
> - Update the compatible
>
> Changes in v3:
> - make remove void
>
> Changes in v2:
> - Export the symbols for module compilation
> - New patch addition
> - rename EDAC to memory controller
> - update the compatible name
> - Add remote proc handle
> - Read the data width from the registers
> - Remove the dwidth, rank and channel number the same is
> read from the RpMsg.
> - remove reset
> - Add the remote proc requests
> - remove probe_once
> - reorder the rpmsg registration
> - the data width , rank and number of channel is read from message.
>
> Shubhrajyoti Datta (5):
> cdx: add the headers to include/linux
> cdx: Export Symbols for MCDI RPC and Initialization
> ras: Export log_non_standard_event for External Usage
> dt-bindings: memory-controllers: Add support for Versal NET EDAC
> EDAC/VersalNET: Add support for error notification
>
> .../xlnx,versal-net-ddrmc5.yaml | 41 +
> drivers/cdx/controller/mcdi.c | 29 +
> drivers/edac/Kconfig | 11 +
> drivers/edac/Makefile | 1 +
> drivers/edac/versalnet_edac.c | 1108 +++++++++++++++++
> drivers/ras/ras.c | 1 +
> include/linux/cdx/bitfield.h | 78 ++
> include/linux/cdx/edac_cdx_pcol.h | 28 +
> include/linux/cdx/mcdi.h | 198 +++
> 9 files changed, 1495 insertions(+)
> create mode 100644 Documentation/devicetree/bindings/memory-controllers/xlnx,versal-net-ddrmc5.yaml
> create mode 100644 drivers/edac/versalnet_edac.c
> create mode 100644 include/linux/cdx/bitfield.h
> create mode 100644 include/linux/cdx/edac_cdx_pcol.h
> create mode 100644 include/linux/cdx/mcdi.h
>
Acked-by: Michal Simek <michal.simek@amd.com>
Thanks,
Michal
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v7 1/5] cdx: add the headers to include/linux
2025-05-29 7:00 ` [PATCH v7 1/5] cdx: add the headers to include/linux Shubhrajyoti Datta
@ 2025-06-13 19:59 ` Yazen Ghannam
0 siblings, 0 replies; 14+ messages in thread
From: Yazen Ghannam @ 2025-06-13 19:59 UTC (permalink / raw)
To: Shubhrajyoti Datta
Cc: linux-kernel, devicetree, linux-edac, git, Krzysztof Kozlowski,
Rob Herring, Conor Dooley, Borislav Petkov, Tony Luck,
James Morse, Mauro Carvalho Chehab, Robert Richter, Nipun Gupta,
Nikhil Agarwal
On Thu, May 29, 2025 at 12:30:13PM +0530, Shubhrajyoti Datta wrote:
> Add a the bitfield.h and mcdi.h headers.
Extra "a"?
> This is in preparation for VersalNET EDAC
> driver that relies on it.
>
> Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
> ---
>
> Changes in v7:
> - add a minimal header instead moving them
>
> Changes in v6:
> - Patch added
>
> include/linux/cdx/bitfield.h | 78 ++++++++++++++
> include/linux/cdx/mcdi.h | 192 +++++++++++++++++++++++++++++++++++
These only get used by VersalNET EDAC driver at the end.
So the headers can go in drivers/edac.
Also, maybe these can all be a single header file? It seems like each
one is just included in the next one.
$ git grep "cdx/bitfield"
include/linux/cdx/mcdi.h:#include "linux/cdx/bitfield.h"
$ git grep "cdx/mcdi"
include/linux/cdx/edac_cdx_pcol.h:#include <linux/cdx/mcdi.h>
$ git grep "cdx/edac_cdx_pcol"
drivers/edac/versalnet_edac.c:#include <linux/cdx/edac_cdx_pcol.h>
If these are truly independent header files, then the "c" file should
include them all. You should not depend on a header file including other
header files, if possible.
From "Documentation/process/submit-checklist.rst":
1) If you use a facility then #include the file that defines/declares
that facility. Don't depend on other header files pulling in ones
that you use.
Thanks,
Yazen
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v7 2/5] cdx: Export Symbols for MCDI RPC and Initialization
2025-05-29 7:00 ` [PATCH v7 2/5] cdx: Export Symbols for MCDI RPC and Initialization Shubhrajyoti Datta
@ 2025-06-13 20:10 ` Yazen Ghannam
2025-06-16 12:20 ` Datta, Shubhrajyoti
0 siblings, 1 reply; 14+ messages in thread
From: Yazen Ghannam @ 2025-06-13 20:10 UTC (permalink / raw)
To: Shubhrajyoti Datta
Cc: linux-kernel, devicetree, linux-edac, git, Krzysztof Kozlowski,
Rob Herring, Conor Dooley, Borislav Petkov, Tony Luck,
James Morse, Mauro Carvalho Chehab, Robert Richter, Nipun Gupta,
Nikhil Agarwal
On Thu, May 29, 2025 at 12:30:14PM +0530, Shubhrajyoti Datta wrote:
> The cdx_mcdi_init, cdx_mcdi_process_cmd, and cdx_mcdi_rpc functions are
> needed by VersalNET EDAC modules that interact with the MCDI (Management
> Controller Direct Interface) framework. These functions facilitate
> communication between different hardware components by enabling command
> execution and status management.
>
> Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
> ---
>
> Changes in v7:
> - Add the kernel doc description
> - Add the prototype from first patch to here
>
> Changes in v6:
> - Update commit description
>
> Changes in v2:
> - Export the symbols for module compilation
>
> drivers/cdx/controller/mcdi.c | 29 +++++++++++++++++++++++++++++
> include/linux/cdx/mcdi.h | 6 ++++++
You've added the function prototypes to this new global header.
But you didn't remove them from the local header.
drivers/cdx/controller/mcdi.h
Also, you haven't included the new global header in the cdx/controller
code.
Even though this does compile, it doesn't seem proper.
I expect you would want to do the following:
1) Add the common code to the global header.
2) Remove the common code from the local header.
3) Include the global header everywhere the common code is needed.
Keeping the diff below for reference.
Thanks,
Yazen
> 2 files changed, 35 insertions(+)
>
> diff --git a/drivers/cdx/controller/mcdi.c b/drivers/cdx/controller/mcdi.c
> index e760f8d347cc..f3cca4c884ff 100644
> --- a/drivers/cdx/controller/mcdi.c
> +++ b/drivers/cdx/controller/mcdi.c
> @@ -99,6 +99,19 @@ static unsigned long cdx_mcdi_rpc_timeout(struct cdx_mcdi *cdx, unsigned int cmd
> return cdx->mcdi_ops->mcdi_rpc_timeout(cdx, cmd);
> }
>
> +/**
> + * cdx_mcdi_init - Initialize MCDI (Management Controller Driver Interface) state
> + * @cdx: NIC through which to issue the command
> + *
> + * This function allocates and initializes internal MCDI structures and resources
> + * for the CDX device, including the workqueue, locking primitives, and command
> + * tracking mechanisms. It sets the initial operating mode and prepares the device
> + * for MCDI operations.
> + *
> + * Return:
> + * * 0 - on success
> + * * -ENOMEM - if memory allocation or workqueue creation fails
> + */
> int cdx_mcdi_init(struct cdx_mcdi *cdx)
> {
> struct cdx_mcdi_iface *mcdi;
> @@ -128,6 +141,7 @@ int cdx_mcdi_init(struct cdx_mcdi *cdx)
> fail:
> return rc;
> }
> +EXPORT_SYMBOL_GPL(cdx_mcdi_init);
>
> void cdx_mcdi_finish(struct cdx_mcdi *cdx)
> {
> @@ -553,6 +567,19 @@ static void cdx_mcdi_start_or_queue(struct cdx_mcdi_iface *mcdi,
> cdx_mcdi_cmd_start_or_queue(mcdi, cmd);
> }
>
> +/**
> + * cdx_mcdi_process_cmd - Process an incoming MCDI response
> + * @cdx: NIC through which to issue the command
> + * @outbuf: Pointer to the response buffer received from the management controller
> + * @len: Length of the response buffer in bytes
> + *
> + * This function handles a response from the management controller. It locates the
> + * corresponding command using the sequence number embedded in the header,
> + * completes the command if it is still pending, and initiates any necessary cleanup.
> + *
> + * The function assumes that the response buffer is well-formed and at least one
> + * dword in size.
> + */
> void cdx_mcdi_process_cmd(struct cdx_mcdi *cdx, struct cdx_dword *outbuf, int len)
> {
> struct cdx_mcdi_iface *mcdi;
> @@ -590,6 +617,7 @@ void cdx_mcdi_process_cmd(struct cdx_mcdi *cdx, struct cdx_dword *outbuf, int le
>
> cdx_mcdi_process_cleanup_list(mcdi->cdx, &cleanup_list);
> }
> +EXPORT_SYMBOL_GPL(cdx_mcdi_process_cmd);
>
> static void cdx_mcdi_cmd_work(struct work_struct *context)
> {
> @@ -757,6 +785,7 @@ int cdx_mcdi_rpc(struct cdx_mcdi *cdx, unsigned int cmd,
> return cdx_mcdi_rpc_sync(cdx, cmd, inbuf, inlen, outbuf, outlen,
> outlen_actual, false);
> }
> +EXPORT_SYMBOL_GPL(cdx_mcdi_rpc);
>
> /**
> * cdx_mcdi_rpc_async - Schedule an MCDI command to run asynchronously
> diff --git a/include/linux/cdx/mcdi.h b/include/linux/cdx/mcdi.h
> index 46e3f63b062a..1344119e9a2c 100644
> --- a/include/linux/cdx/mcdi.h
> +++ b/include/linux/cdx/mcdi.h
> @@ -169,6 +169,12 @@ struct cdx_mcdi_data {
> u32 fn_flags;
> };
>
> +int cdx_mcdi_init(struct cdx_mcdi *cdx);
> +void cdx_mcdi_process_cmd(struct cdx_mcdi *cdx, struct cdx_dword *outbuf, int len);
> +int cdx_mcdi_rpc(struct cdx_mcdi *cdx, unsigned int cmd,
> + const struct cdx_dword *inbuf, size_t inlen,
> + struct cdx_dword *outbuf, size_t outlen, size_t *outlen_actual);
> +
> /*
> * We expect that 16- and 32-bit fields in MCDI requests and responses
> * are appropriately aligned, but 64-bit fields are only
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [PATCH v7 2/5] cdx: Export Symbols for MCDI RPC and Initialization
2025-06-13 20:10 ` Yazen Ghannam
@ 2025-06-16 12:20 ` Datta, Shubhrajyoti
2025-06-20 11:03 ` Datta, Shubhrajyoti
0 siblings, 1 reply; 14+ messages in thread
From: Datta, Shubhrajyoti @ 2025-06-16 12:20 UTC (permalink / raw)
To: Ghannam, Yazen
Cc: linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
linux-edac@vger.kernel.org, git (AMD-Xilinx), Krzysztof Kozlowski,
Rob Herring, Conor Dooley, Borislav Petkov, Tony Luck,
James Morse, Mauro Carvalho Chehab, Robert Richter, Gupta, Nipun,
Agarwal, Nikhil
[AMD Official Use Only - AMD Internal Distribution Only]
> -----Original Message-----
> From: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Sent: Saturday, June 14, 2025 1:41 AM
> To: Datta, Shubhrajyoti <shubhrajyoti.datta@amd.com>
> Cc: linux-kernel@vger.kernel.org; devicetree@vger.kernel.org; linux-
> edac@vger.kernel.org; git (AMD-Xilinx) <git@amd.com>; Krzysztof Kozlowski
> <krzk@kernel.org>; Rob Herring <robh@kernel.org>; Conor Dooley
> <conor+dt@kernel.org>; Borislav Petkov <bp@alien8.de>; Tony Luck
> <tony.luck@intel.com>; James Morse <james.morse@arm.com>; Mauro
> Carvalho Chehab <mchehab@kernel.org>; Robert Richter <rric@kernel.org>;
> Gupta, Nipun <Nipun.Gupta@amd.com>; Agarwal, Nikhil
> <nikhil.agarwal@amd.com>
> Subject: Re: [PATCH v7 2/5] cdx: Export Symbols for MCDI RPC and Initialization
>
> On Thu, May 29, 2025 at 12:30:14PM +0530, Shubhrajyoti Datta wrote:
> > The cdx_mcdi_init, cdx_mcdi_process_cmd, and cdx_mcdi_rpc functions
> > are needed by VersalNET EDAC modules that interact with the MCDI
> > (Management Controller Direct Interface) framework. These functions
> > facilitate communication between different hardware components by
> > enabling command execution and status management.
> >
> > Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
> > ---
> >
> > Changes in v7:
> > - Add the kernel doc description
> > - Add the prototype from first patch to here
> >
> > Changes in v6:
> > - Update commit description
> >
> > Changes in v2:
> > - Export the symbols for module compilation
> >
> > drivers/cdx/controller/mcdi.c | 29 +++++++++++++++++++++++++++++
> > include/linux/cdx/mcdi.h | 6 ++++++
>
> You've added the function prototypes to this new global header.
>
> But you didn't remove them from the local header.
> drivers/cdx/controller/mcdi.h
>
> Also, you haven't included the new global header in the cdx/controller code.
>
> Even though this does compile, it doesn't seem proper.
>
> I expect you would want to do the following:
>
> 1) Add the common code to the global header.
> 2) Remove the common code from the local header.
> 3) Include the global header everywhere the common code is needed.
>
> Keeping the diff below for reference.
Will do. However I am planning to compile test the cdx changes.
>
> Thanks,
> Yazen
>
> > 2 files changed, 35 insertions(+)
> >
> > diff --git a/drivers/cdx/controller/mcdi.c
> > b/drivers/cdx/controller/mcdi.c index e760f8d347cc..f3cca4c884ff
> > 100644
> > --- a/drivers/cdx/controller/mcdi.c
> > +++ b/drivers/cdx/controller/mcdi.c
> > @@ -99,6 +99,19 @@ static unsigned long cdx_mcdi_rpc_timeout(struct
> cdx_mcdi *cdx, unsigned int cmd
> > return cdx->mcdi_ops->mcdi_rpc_timeout(cdx, cmd); }
> >
> > +/**
> > + * cdx_mcdi_init - Initialize MCDI (Management Controller Driver
> > +Interface) state
> > + * @cdx: NIC through which to issue the command
> > + *
> > + * This function allocates and initializes internal MCDI structures
> > +and resources
> > + * for the CDX device, including the workqueue, locking primitives,
> > +and command
> > + * tracking mechanisms. It sets the initial operating mode and
> > +prepares the device
> > + * for MCDI operations.
> > + *
> > + * Return:
> > + * * 0 - on success
> > + * * -ENOMEM - if memory allocation or workqueue creation fails */
> > int cdx_mcdi_init(struct cdx_mcdi *cdx) {
> > struct cdx_mcdi_iface *mcdi;
> > @@ -128,6 +141,7 @@ int cdx_mcdi_init(struct cdx_mcdi *cdx)
> > fail:
> > return rc;
> > }
> > +EXPORT_SYMBOL_GPL(cdx_mcdi_init);
> >
> > void cdx_mcdi_finish(struct cdx_mcdi *cdx) { @@ -553,6 +567,19 @@
> > static void cdx_mcdi_start_or_queue(struct cdx_mcdi_iface *mcdi,
> > cdx_mcdi_cmd_start_or_queue(mcdi, cmd); }
> >
> > +/**
> > + * cdx_mcdi_process_cmd - Process an incoming MCDI response
> > + * @cdx: NIC through which to issue the command
> > + * @outbuf: Pointer to the response buffer received from the management
> controller
> > + * @len: Length of the response buffer in bytes
> > + *
> > + * This function handles a response from the management controller.
> > +It locates the
> > + * corresponding command using the sequence number embedded in the
> > +header,
> > + * completes the command if it is still pending, and initiates any necessary
> cleanup.
> > + *
> > + * The function assumes that the response buffer is well-formed and
> > +at least one
> > + * dword in size.
> > + */
> > void cdx_mcdi_process_cmd(struct cdx_mcdi *cdx, struct cdx_dword
> > *outbuf, int len) {
> > struct cdx_mcdi_iface *mcdi;
> > @@ -590,6 +617,7 @@ void cdx_mcdi_process_cmd(struct cdx_mcdi *cdx,
> > struct cdx_dword *outbuf, int le
> >
> > cdx_mcdi_process_cleanup_list(mcdi->cdx, &cleanup_list); }
> > +EXPORT_SYMBOL_GPL(cdx_mcdi_process_cmd);
> >
> > static void cdx_mcdi_cmd_work(struct work_struct *context) { @@
> > -757,6 +785,7 @@ int cdx_mcdi_rpc(struct cdx_mcdi *cdx, unsigned int cmd,
> > return cdx_mcdi_rpc_sync(cdx, cmd, inbuf, inlen, outbuf, outlen,
> > outlen_actual, false);
> > }
> > +EXPORT_SYMBOL_GPL(cdx_mcdi_rpc);
> >
> > /**
> > * cdx_mcdi_rpc_async - Schedule an MCDI command to run
> > asynchronously diff --git a/include/linux/cdx/mcdi.h
> > b/include/linux/cdx/mcdi.h index 46e3f63b062a..1344119e9a2c 100644
> > --- a/include/linux/cdx/mcdi.h
> > +++ b/include/linux/cdx/mcdi.h
> > @@ -169,6 +169,12 @@ struct cdx_mcdi_data {
> > u32 fn_flags;
> > };
> >
> > +int cdx_mcdi_init(struct cdx_mcdi *cdx); void
> > +cdx_mcdi_process_cmd(struct cdx_mcdi *cdx, struct cdx_dword *outbuf,
> > +int len); int cdx_mcdi_rpc(struct cdx_mcdi *cdx, unsigned int cmd,
> > + const struct cdx_dword *inbuf, size_t inlen,
> > + struct cdx_dword *outbuf, size_t outlen, size_t *outlen_actual);
> > +
> > /*
> > * We expect that 16- and 32-bit fields in MCDI requests and responses
> > * are appropriately aligned, but 64-bit fields are only
> > --
> > 2.34.1
> >
^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [PATCH v7 2/5] cdx: Export Symbols for MCDI RPC and Initialization
2025-06-16 12:20 ` Datta, Shubhrajyoti
@ 2025-06-20 11:03 ` Datta, Shubhrajyoti
2025-07-02 10:27 ` Borislav Petkov
0 siblings, 1 reply; 14+ messages in thread
From: Datta, Shubhrajyoti @ 2025-06-20 11:03 UTC (permalink / raw)
To: Ghannam, Yazen
Cc: linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
linux-edac@vger.kernel.org, git (AMD-Xilinx), Krzysztof Kozlowski,
Rob Herring, Conor Dooley, Borislav Petkov, Tony Luck,
James Morse, Mauro Carvalho Chehab, Robert Richter, Gupta, Nipun,
Agarwal, Nikhil
[AMD Official Use Only - AMD Internal Distribution Only]
> -----Original Message-----
> From: Datta, Shubhrajyoti
> Sent: Monday, June 16, 2025 5:51 PM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-kernel@vger.kernel.org; devicetree@vger.kernel.org; linux-
> edac@vger.kernel.org; git (AMD-Xilinx) <git@amd.com>; Krzysztof Kozlowski
> <krzk@kernel.org>; Rob Herring <robh@kernel.org>; Conor Dooley
> <conor+dt@kernel.org>; Borislav Petkov <bp@alien8.de>; Tony Luck
> <tony.luck@intel.com>; James Morse <james.morse@arm.com>; Mauro
> Carvalho Chehab <mchehab@kernel.org>; Robert Richter <rric@kernel.org>;
> Gupta, Nipun <Nipun.Gupta@amd.com>; Agarwal, Nikhil
> <nikhil.agarwal@amd.com>
> Subject: RE: [PATCH v7 2/5] cdx: Export Symbols for MCDI RPC and Initialization
>
>
>
> > -----Original Message-----
> > From: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> > Sent: Saturday, June 14, 2025 1:41 AM
> > To: Datta, Shubhrajyoti <shubhrajyoti.datta@amd.com>
> > Cc: linux-kernel@vger.kernel.org; devicetree@vger.kernel.org; linux-
> > edac@vger.kernel.org; git (AMD-Xilinx) <git@amd.com>; Krzysztof
> > Kozlowski <krzk@kernel.org>; Rob Herring <robh@kernel.org>; Conor
> > Dooley <conor+dt@kernel.org>; Borislav Petkov <bp@alien8.de>; Tony
> > Luck <tony.luck@intel.com>; James Morse <james.morse@arm.com>; Mauro
> > Carvalho Chehab <mchehab@kernel.org>; Robert Richter
> > <rric@kernel.org>; Gupta, Nipun <Nipun.Gupta@amd.com>; Agarwal, Nikhil
> > <nikhil.agarwal@amd.com>
> > Subject: Re: [PATCH v7 2/5] cdx: Export Symbols for MCDI RPC and
> > Initialization
> >
> > On Thu, May 29, 2025 at 12:30:14PM +0530, Shubhrajyoti Datta wrote:
> > > The cdx_mcdi_init, cdx_mcdi_process_cmd, and cdx_mcdi_rpc functions
> > > are needed by VersalNET EDAC modules that interact with the MCDI
> > > (Management Controller Direct Interface) framework. These functions
> > > facilitate communication between different hardware components by
> > > enabling command execution and status management.
> > >
> > > Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
> > > ---
> > >
> > > Changes in v7:
> > > - Add the kernel doc description
> > > - Add the prototype from first patch to here
> > >
> > > Changes in v6:
> > > - Update commit description
> > >
> > > Changes in v2:
> > > - Export the symbols for module compilation
> > >
> > > drivers/cdx/controller/mcdi.c | 29 +++++++++++++++++++++++++++++
> > > include/linux/cdx/mcdi.h | 6 ++++++
> >
> > You've added the function prototypes to this new global header.
> >
> > But you didn't remove them from the local header.
> > drivers/cdx/controller/mcdi.h
> >
> > Also, you haven't included the new global header in the cdx/controller code.
> >
> > Even though this does compile, it doesn't seem proper.
> >
> > I expect you would want to do the following:
> >
> > 1) Add the common code to the global header.
> > 2) Remove the common code from the local header.
> > 3) Include the global header everywhere the common code is needed.
> >
> > Keeping the diff below for reference.
>
I plan to have the current (in the cdx) mcdi.h to mcdid.h and move the common parts to
Include/linux/cdx/mcdi.h
And for the bitfield.h will move to include/linux/cdx/bitfield.h as most of it is needed
by others
I plan to rename the existing mcdi.h under cdx/controller/ to mcdid.h.
The common prototypes and definitions will be moved to the new global header at
include/linux/cdx/mcdi.h.
I am using most of (close to 90/95 percent) the bitfield.h, I'll move it to include/linux/cdx/bitfield.h to
make it accessible for other subsystems.
Let me know if that sounds good.
>
> >
> > Thanks,
> > Yazen
> >
> > > 2 files changed, 35 insertions(+)
> > >
> > > diff --git a/drivers/cdx/controller/mcdi.c
> > > b/drivers/cdx/controller/mcdi.c index e760f8d347cc..f3cca4c884ff
> > > 100644
> > > --- a/drivers/cdx/controller/mcdi.c
> > > +++ b/drivers/cdx/controller/mcdi.c
> > > @@ -99,6 +99,19 @@ static unsigned long cdx_mcdi_rpc_timeout(struct
> > cdx_mcdi *cdx, unsigned int cmd
> > > return cdx->mcdi_ops->mcdi_rpc_timeout(cdx, cmd); }
> > >
> > > +/**
> > > + * cdx_mcdi_init - Initialize MCDI (Management Controller Driver
> > > +Interface) state
> > > + * @cdx: NIC through which to issue the command
> > > + *
> > > + * This function allocates and initializes internal MCDI structures
> > > +and resources
> > > + * for the CDX device, including the workqueue, locking primitives,
> > > +and command
> > > + * tracking mechanisms. It sets the initial operating mode and
> > > +prepares the device
> > > + * for MCDI operations.
> > > + *
> > > + * Return:
> > > + * * 0 - on success
> > > + * * -ENOMEM - if memory allocation or workqueue creation fails
> > > +*/
> > > int cdx_mcdi_init(struct cdx_mcdi *cdx) {
> > > struct cdx_mcdi_iface *mcdi;
> > > @@ -128,6 +141,7 @@ int cdx_mcdi_init(struct cdx_mcdi *cdx)
> > > fail:
> > > return rc;
> > > }
> > > +EXPORT_SYMBOL_GPL(cdx_mcdi_init);
> > >
> > > void cdx_mcdi_finish(struct cdx_mcdi *cdx) { @@ -553,6 +567,19 @@
> > > static void cdx_mcdi_start_or_queue(struct cdx_mcdi_iface *mcdi,
> > > cdx_mcdi_cmd_start_or_queue(mcdi, cmd); }
> > >
> > > +/**
> > > + * cdx_mcdi_process_cmd - Process an incoming MCDI response
> > > + * @cdx: NIC through which to issue the command
> > > + * @outbuf: Pointer to the response buffer received from the
> > > +management
> > controller
> > > + * @len: Length of the response buffer in bytes
> > > + *
> > > + * This function handles a response from the management controller.
> > > +It locates the
> > > + * corresponding command using the sequence number embedded in the
> > > +header,
> > > + * completes the command if it is still pending, and initiates any
> > > +necessary
> > cleanup.
> > > + *
> > > + * The function assumes that the response buffer is well-formed and
> > > +at least one
> > > + * dword in size.
> > > + */
> > > void cdx_mcdi_process_cmd(struct cdx_mcdi *cdx, struct cdx_dword
> > > *outbuf, int len) {
> > > struct cdx_mcdi_iface *mcdi;
> > > @@ -590,6 +617,7 @@ void cdx_mcdi_process_cmd(struct cdx_mcdi *cdx,
> > > struct cdx_dword *outbuf, int le
> > >
> > > cdx_mcdi_process_cleanup_list(mcdi->cdx, &cleanup_list); }
> > > +EXPORT_SYMBOL_GPL(cdx_mcdi_process_cmd);
> > >
> > > static void cdx_mcdi_cmd_work(struct work_struct *context) { @@
> > > -757,6 +785,7 @@ int cdx_mcdi_rpc(struct cdx_mcdi *cdx, unsigned int cmd,
> > > return cdx_mcdi_rpc_sync(cdx, cmd, inbuf, inlen, outbuf, outlen,
> > > outlen_actual, false);
> > > }
> > > +EXPORT_SYMBOL_GPL(cdx_mcdi_rpc);
> > >
> > > /**
> > > * cdx_mcdi_rpc_async - Schedule an MCDI command to run
> > > asynchronously diff --git a/include/linux/cdx/mcdi.h
> > > b/include/linux/cdx/mcdi.h index 46e3f63b062a..1344119e9a2c 100644
> > > --- a/include/linux/cdx/mcdi.h
> > > +++ b/include/linux/cdx/mcdi.h
> > > @@ -169,6 +169,12 @@ struct cdx_mcdi_data {
> > > u32 fn_flags;
> > > };
> > >
> > > +int cdx_mcdi_init(struct cdx_mcdi *cdx); void
> > > +cdx_mcdi_process_cmd(struct cdx_mcdi *cdx, struct cdx_dword
> > > +*outbuf, int len); int cdx_mcdi_rpc(struct cdx_mcdi *cdx, unsigned int cmd,
> > > + const struct cdx_dword *inbuf, size_t inlen,
> > > + struct cdx_dword *outbuf, size_t outlen, size_t *outlen_actual);
> > > +
> > > /*
> > > * We expect that 16- and 32-bit fields in MCDI requests and responses
> > > * are appropriately aligned, but 64-bit fields are only
> > > --
> > > 2.34.1
> > >
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v7 2/5] cdx: Export Symbols for MCDI RPC and Initialization
2025-06-20 11:03 ` Datta, Shubhrajyoti
@ 2025-07-02 10:27 ` Borislav Petkov
0 siblings, 0 replies; 14+ messages in thread
From: Borislav Petkov @ 2025-07-02 10:27 UTC (permalink / raw)
To: Datta, Shubhrajyoti, Gupta, Nipun, Agarwal, Nikhil
Cc: Ghannam, Yazen, linux-kernel@vger.kernel.org,
devicetree@vger.kernel.org, linux-edac@vger.kernel.org,
git (AMD-Xilinx), Krzysztof Kozlowski, Rob Herring, Conor Dooley,
Tony Luck, James Morse, Mauro Carvalho Chehab, Robert Richter
On Fri, Jun 20, 2025 at 11:03:02AM +0000, Datta, Shubhrajyoti wrote:
> I plan to have the current (in the cdx) mcdi.h to mcdid.h and move the common parts to
> Include/linux/cdx/mcdi.h
>
> And for the bitfield.h will move to include/linux/cdx/bitfield.h as most of it is needed
> by others
>
>
> I plan to rename the existing mcdi.h under cdx/controller/ to mcdid.h.
> The common prototypes and definitions will be moved to the new global header at
> include/linux/cdx/mcdi.h.
>
> I am using most of (close to 90/95 percent) the bitfield.h, I'll move it to include/linux/cdx/bitfield.h to
> make it accessible for other subsystems.
>
> Let me know if that sounds good.
Well, I'd expect the maintainers of that thing to move:
./scripts/get_maintainer.pl -f drivers/cdx/
Nipun Gupta <nipun.gupta@amd.com> (maintainer:AMD CDX BUS DRIVER)
Nikhil Agarwal <nikhil.agarwal@amd.com> (maintainer:AMD CDX BUS DRIVER)
and maintain.
Perhaps ping them on internal chat to catch their attention.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v7 5/5] EDAC/VersalNET: Add support for error notification
2025-05-29 7:00 ` [PATCH v7 5/5] EDAC/VersalNET: Add support for error notification Shubhrajyoti Datta
@ 2025-07-03 17:31 ` Borislav Petkov
2025-07-21 4:08 ` Datta, Shubhrajyoti
0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2025-07-03 17:31 UTC (permalink / raw)
To: Shubhrajyoti Datta
Cc: linux-kernel, devicetree, linux-edac, git, Krzysztof Kozlowski,
Rob Herring, Conor Dooley, Tony Luck, James Morse,
Mauro Carvalho Chehab, Robert Richter, Nipun Gupta,
Nikhil Agarwal
On Thu, May 29, 2025 at 12:30:17PM +0530, Shubhrajyoti Datta wrote:
> Hardware errors can corrupt the RAM or other system components, without
> detection it can lead to crashes and system failure, this driver helps in
> reporting of errors to user space, triggering corrective actions.
Kinda useless paragraph.
> drivers/edac/Kconfig | 11 +
> drivers/edac/Makefile | 1 +
> drivers/edac/versalnet_edac.c | 1108 +++++++++++++++++++++++++++++
> include/linux/cdx/edac_cdx_pcol.h | 28 +
I'd need a MAINTAINERS entry for this driver so that you can get CCed on
fixes. And I'd need that email to be responsive and not stuff to disappear
into the void. If it does and no one cares for it, I don't have a problem with
removing this driver again from the tree.
> +/* Granularity of reported error in bytes */
> +#define DDRMC5_EDAC_ERR_GRAIN 1
> +#define MC_CMD_EDAC_GET_DDR_CONFIG_IN_LEN 4
> +
> +#define DDRMC5_EDAC_MSG_SIZE 256
> +
> +#define DDRMC5_IRQ_CE_MASK GENMASK(18, 15)
> +#define DDRMC5_IRQ_UE_MASK GENMASK(14, 11)
> +
> +#define DDRMC5_RANK_1_MASK GENMASK(11, 6)
> +#define MASK_24 GENMASK(29, 24)
> +#define MASK_0 GENMASK(5, 0)
> +
> +#define DDRMC5_LRANK_1_MASK GENMASK(11, 6)
> +#define DDRMC5_LRANK_2_MASK GENMASK(17, 12)
> +#define DDRMC5_BANK1_MASK GENMASK(11, 6)
> +#define DDRMC5_GRP_0_MASK GENMASK(17, 12)
> +#define DDRMC5_GRP_1_MASK GENMASK(23, 18)
> +
> +#define ECCR_UE_CE_ADDR_HI_ROW_MASK GENMASK(10, 0)
> +
> +#define DDRMC5_MAX_ROW_CNT 18
> +#define DDRMC5_MAX_COL_CNT 11
> +#define DDRMC5_MAX_RANK_CNT 2
> +#define DDRMC5_MAX_LRANK_CNT 4
> +#define DDRMC5_MAX_BANK_CNT 2
> +#define DDRMC5_MAX_GRP_CNT 3
> +
> +#define DDRMC5_REGHI_ROW 7
> +#define DDRMC5_EACHBIT 1
> +#define DDRMC5_ERR_TYPE_CE 0
> +#define DDRMC5_ERR_TYPE_UE 1
> +#define DDRMC5_HIGH_MEM_EN BIT(20)
> +#define DDRMC5_MEM_MASK GENMASK(19, 0)
> +#define DDRMC5_X16_BASE 256
> +#define DDRMC5_X16_ECC 32
> +#define DDRMC5_X16_SIZE (DDRMC5_X16_BASE + DDRMC5_X16_ECC)
> +#define DDRMC5_X32_SIZE 576
> +#define DDRMC5_HIMEM_BASE (256 * SZ_1M)
> +#define DDRMC5_ILC_HIMEM_EN BIT(28)
> +#define DDRMC5_ILC_MEM GENMASK(27, 0)
> +#define DDRMC5_INTERLEAVE_SEL GENMASK(3, 0)
> +#define DDRMC5_BUS_WIDTH_MASK GENMASK(19, 18)
> +#define DDRMC5_NUM_CHANS_MASK BIT(17)
> +#define DDRMC5_RANK_MASK GENMASK(15, 14)
> +#define DDRMC5_DWIDTH_MASK GENMASK(5, 4)
> +
> +#define AMD_MIN_BUF_LEN 0x28
> +#define AMD_ERROR_LEVEL 2
> +#define AMD_ERRORID 3
> +#define TOTAL_ERR_LENGTH 5
> +#define AMD_MSG_ERR_OFFSET 8
> +#define AMD_MSG_ERR_LENGTH 9
> +#define AMD_ERR_DATA 10
> +#define MCDI_RESPONSE 0xFF
> +
> +#define ERR_NOTIFICATION_MAX 96
> +#define REG_MAX 152
> +#define ADEC_MAX 152
> +#define NUM_CONTROLLERS 8
> +#define REGS_PER_CONTROLLER 19
> +#define ADEC_NUM 19
> +#define MC_CMD_EDAC_GET_OVERALL_DDR_CONFIG 2
> +#define BUFFER_SZ 80
> +
> +#define XDDR5_BUS_WIDTH_64 0
> +#define XDDR5_BUS_WIDTH_32 1
> +#define XDDR5_BUS_WIDTH_16 2
For all those defines above:
- remove the unused ones
- remove "EDAC" from the name
- shorten them
which will allow you to have shorter lines in general and vertical alignment
will be closer to the left.
> +
> +#define AMD_ERR "[VERSAL_EDAC_ERR_ID: %d] Error type:"
This is not now this is done - see the edac_*_printk() macros in
drivers/edac/edac_mc.h. You could use them, extend them and so on. No need for
homegrown hackery.
> +/**
> + * struct ecc_error_info - ECC error log information.
> + * @burstpos: Burst position.
> + * @lrank: Logical Rank number.
> + * @rank: Rank number.
> + * @group: Group number.
> + * @bank: Bank number.
> + * @col: Column number.
> + * @row: Row number.
> + * @rowhi: Row number higher bits.
> + * @i: ECC error info.
That's the combined vector of all the above bits - "ECC error info" is too
generic.
> + */
> +union ecc_error_info {
> + struct {
> + u32 burstpos:3;
> + u32 lrank:4;
> + u32 rank:2;
> + u32 group:3;
> + u32 bank:2;
> + u32 col:11;
> + u32 row:7;
> + u32 rowhi;
> + };
> + u64 i;
> +} __packed;
> +
> +/**
> + * struct row_col_mapping - Row and column bit positions in ADEC(address decoder) registers.
> + * @row0: Row0 bit position.
> + * @row1: Row1 bit position.
> + * @row2: Row2 bit position.
> + * @row3: Row3 bit position.
> + * @row4: Row4 bit position.
> + * @reserved: Unused bits.
> + * @col1: Column 1 bit position.
> + * @col2: Column 2 bit position.
> + * @col3: Column 3 bit position.
> + * @col4: Column 4 bit position.
> + * @col5: Column 5 bit position.
> + * @reservedcol: Unused column bits.
> + * @i: ADEC register info.
> + */
Simply:
/*
* Row and column bit positions in ADEC (address decoder) registers.
*/
Please think what would be the most optimal comment content for readers: a lot
of repetitive und useless gunk or proper variable naming and comment *helping*
to explain what the struct is.
...
> +static bool get_ddr_info(u32 *error_data, struct mc_priv *priv)
> +{
> + u32 reglo, reghi, parity, eccr0_val, eccr1_val, isr;
> + struct ecc_status *p;
> +
> + p = &priv->stat;
> +
> + isr = error_data[ISR];
> +
> + if (!(isr & (DDRMC5_IRQ_UE_MASK | DDRMC5_IRQ_CE_MASK)))
> + return false;
> +
> + eccr0_val = error_data[ECCR0_ERR_STATUS];
> + eccr1_val = error_data[ECCR1_ERR_STATUS];
> +
> + if (!eccr0_val && !eccr1_val)
> + return false;
> +
> + if (!eccr0_val)
> + p->channel = 1;
> + else
> + p->channel = 0;
> +
> + reglo = error_data[ECCR0_ADDR_LO];
> + reghi = error_data[ECCR0_ADDR_HI];
> + if ((isr & DDRMC5_IRQ_CE_MASK))
What are those double-brackets for?
> + p->ceinfo[0].i = reglo | (u64)reghi << 32;
> + else if ((isr & DDRMC5_IRQ_UE_MASK))
Ditto. Audit your whole driver pls.
> + p->ueinfo[0].i = reglo | (u64)reghi << 32;
> +
> + parity = error_data[ECCR0_PAR];
> + edac_dbg(2, "ERR DATA: 0x%08X%08X ERR DATA PARITY: 0x%08X\n",
So you can shorten the second one only to "PARITY"?
> + reghi, reglo, parity);
> +
> + reglo = error_data[ECCR1_ADDR_LO];
> + reghi = error_data[ECCR1_ADDR_HI];
> + if ((isr & DDRMC5_IRQ_CE_MASK))
> + p->ceinfo[1].i = reglo | (u64)reghi << 32;
> + else if ((isr & DDRMC5_IRQ_UE_MASK))
> + p->ueinfo[1].i = reglo | (u64)reghi << 32;
> +
> + parity = error_data[ECCR1_PAR];
> + edac_dbg(2, "ERR DATA: 0x%08X%08X ERR DATA PARITY: 0x%08X\n",
Ditto.
> + reghi, reglo, parity);
> +
> + return true;
> +}
...
> +/**
> + * handle_error - Handle Correctable and Uncorrectable errors.
> + * @priv: DDR memory controller private instance data.
> + * @stat: ECC status structure.
> + * @controller: Controller number of the DDRMC5
> + * @error_data: the DDRMC5 ADEC address decoder register data
> + *
> + * Handles ECC correctable and uncorrectable errors.
> + */
> +static void handle_error(struct mc_priv *priv, struct ecc_status *stat,
> + int controller, int *error_data)
> +{
> + struct mem_ctl_info *mci = priv->mci;
> + union ecc_error_info pinf;
> + unsigned long pa;
> + phys_addr_t pfn;
> + int err;
> +
> + if (stat->error_type == DDRMC5_ERR_TYPE_CE) {
> + pinf = stat->ceinfo[stat->channel];
> + snprintf(priv->message, DDRMC5_EDAC_MSG_SIZE,
> + "Error type:%s Controller %d Addr at %lx\n",
> + "CE", controller, convert_to_physical(priv, pinf, controller, error_data));
> +
> + edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
> + 1, 0, 0, 0, 0, 0, -1,
> + priv->message, "");
> + }
> +
> + if (stat->error_type == DDRMC5_ERR_TYPE_UE) {
> + pinf = stat->ueinfo[stat->channel];
> + snprintf(priv->message, DDRMC5_EDAC_MSG_SIZE,
> + "Error type:%s controller %d Addr at %lx\n",
> + "UE", controller, convert_to_physical(priv, pinf, controller, error_data));
> +
> + edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
> + 1, 0, 0, 0, 0, 0, -1,
> + priv->message, "");
> + pa = convert_to_physical(priv, pinf, controller, error_data);
> + pfn = PHYS_PFN(pa);
> +
> + if (IS_ENABLED(CONFIG_MEMORY_FAILURE)) {
> + err = memory_failure(pfn, MF_ACTION_REQUIRED);
> + if (err)
> + edac_dbg(2, "In fail of memory_failure %d\n", err);
"memory_failure() error: %d"
> + else
> + edac_dbg(2, "Page at PA 0x%lx is hardware poisoned\n", pa);
> + }
> + }
> +}
> +
> +/**
> + * init_csrows - Initialize the csrow data.
> + * @mci: EDAC memory controller instance.
> + *
> + * Initialize the chip select rows associated with the EDAC memory
> + * controller instance.
> + */
> +static void init_csrows(struct mem_ctl_info *mci)
Merge it into its only caller.
> +{
> + struct mc_priv *priv = mci->pvt_info;
> + struct csrow_info *csi;
> + struct dimm_info *dimm;
> + u32 row;
> + int ch;
> +
> + for (row = 0; row < mci->nr_csrows; row++) {
> + csi = mci->csrows[row];
> + for (ch = 0; ch < csi->nr_channels; ch++) {
> + dimm = csi->channels[ch]->dimm;
> + dimm->edac_mode = EDAC_SECDED;
> + dimm->mtype = MEM_DDR5;
> + dimm->grain = DDRMC5_EDAC_ERR_GRAIN;
> + dimm->dtype = priv->dwidth;
> + }
> + }
> +}
> +
> +static void mc_init(struct mem_ctl_info *mci, struct platform_device *pdev)
> +{
> + mci->pdev = &pdev->dev;
> + platform_set_drvdata(pdev, mci);
> +
> + /* Initialize controller capabilities and configuration */
> + mci->mtype_cap = MEM_FLAG_DDR5;
> + mci->edac_ctl_cap = EDAC_FLAG_NONE | EDAC_FLAG_SECDED;
> + mci->scrub_cap = SCRUB_HW_SRC;
> + mci->scrub_mode = SCRUB_NONE;
> +
> + mci->edac_cap = EDAC_FLAG_SECDED;
> + mci->ctl_name = "amd_ddr_controller";
"VersalNET DDR5 controller"
or something more specific.
> + mci->dev_name = dev_name(&pdev->dev);
> + mci->mod_name = "versalnet_edac";
> +
> + edac_op_state = EDAC_OPSTATE_INT;
> +
> + init_csrows(mci);
> +}
> +
> +#define to_mci(k) container_of(k, struct mem_ctl_info, dev)
> +
> +static unsigned int amd_mcdi_rpc_timeout(struct cdx_mcdi *cdx, unsigned int cmd)
> +{
> + return MCDI_RPC_TIMEOUT;
> +}
> +
> +static void amd_mcdi_request(struct cdx_mcdi *cdx,
No need for function prefixes of static functions. Audit your whole driver
pls.
> + const struct cdx_dword *hdr, size_t hdr_len,
> + const struct cdx_dword *sdu, size_t sdu_len)
> +{
> + unsigned char *send_buf;
void *send_buf;
> + int ret;
> +
> + send_buf = kzalloc(hdr_len + sdu_len, GFP_KERNEL);
> + if (!send_buf)
> + return;
> +
> + memcpy(send_buf, hdr, hdr_len);
> + memcpy(send_buf + hdr_len, sdu, sdu_len);
> +
> + ret = rpmsg_send(cdx->ept, send_buf, hdr_len + sdu_len);
> + if (ret)
> + dev_err(&cdx->rpdev->dev, "Failed to send rpmsg data\n");
<---- newline here.
> + kfree(send_buf);
> +}
> +
> +static const struct cdx_mcdi_ops mcdi_ops = {
> + .mcdi_rpc_timeout = amd_mcdi_rpc_timeout,
> + .mcdi_request = amd_mcdi_request,
> +};
> +
> +static void get_ddr_config(u32 index, u32 *buffer, struct cdx_mcdi *amd_mcdi)
> +{
> + size_t outlen;
> + int ret;
> +
> + MCDI_DECLARE_BUF(inbuf, MC_CMD_EDAC_GET_DDR_CONFIG_IN_LEN);
> + MCDI_DECLARE_BUF(outbuf, BUFFER_SZ);
> +
> + MCDI_SET_DWORD(inbuf, EDAC_GET_DDR_CONFIG_IN_CONTROLLER_INDEX, index);
> +
> + ret = cdx_mcdi_rpc(amd_mcdi, MC_CMD_EDAC_GET_DDR_CONFIG, inbuf, sizeof(inbuf),
> + outbuf, sizeof(outbuf), &outlen);
> + if (!ret)
> + memcpy(buffer, MCDI_PTR(outbuf, EDAC_GET_DDR_CONFIG_OUT_REGISTER_VALUES),
Those EDAC_GET* bla names are too long. Shorten pls.
> + (ADEC_NUM * 4));
> +}
> +
> +static void amd_setup_mcdi(struct mc_priv *mc_priv)
> +{
> + struct cdx_mcdi *amd_mcdi;
> + int ret, i;
> +
> + amd_mcdi = kzalloc(sizeof(*amd_mcdi), GFP_KERNEL);
> + if (!amd_mcdi)
> + return;
> +
> + amd_mcdi->mcdi_ops = &mcdi_ops;
> + ret = cdx_mcdi_init(amd_mcdi);
> + if (ret) {
> + kfree(amd_mcdi);
> + return;
> + }
> +
> + amd_mcdi->ept = mc_priv->ept;
> + mc_priv->mcdi = amd_mcdi;
Where does that amd_mcdi get freed?
> +
> + for (i = 0; i < NUM_CONTROLLERS; i++)
> + get_ddr_config(i, &mc_priv->adec[ADEC_NUM * i], amd_mcdi);
> +}
> +
> +static const guid_t amd_versalnet_guid = GUID_INIT(0x82678888, 0xa556, 0x44f2,
> + 0xb8, 0xb4, 0x45, 0x56, 0x2e,
> + 0x8c, 0x5b, 0xec);
> +
> +static int amd_rpmsg_cb(struct rpmsg_device *rpdev, void *data,
> + int len, void *priv, u32 src)
> +{
> + struct mc_priv *mc_priv = dev_get_drvdata(&rpdev->dev);
> + const guid_t *sec_type = &guid_null;
> + u32 length, offset, error_id;
> + u32 *result = (u32 *)data;
> + struct ecc_status *p;
> + int i, j, k, sec_sev;
> + u32 *adec_data;
> +
> + if (*(u8 *)data == MCDI_RESPONSE) {
> + cdx_mcdi_process_cmd(mc_priv->mcdi, (struct cdx_dword *)data, len);
> + return 0;
> + }
> +
> + sec_sev = result[AMD_ERROR_LEVEL];
> + error_id = result[AMD_ERRORID];
> + length = result[AMD_MSG_ERR_LENGTH];
> + offset = result[AMD_MSG_ERR_OFFSET];
> +
> + if (result[TOTAL_ERR_LENGTH] > length) {
> + if (!mc_priv->part_len)
> + mc_priv->part_len = length;
> + else
> + mc_priv->part_len += length;
> + /*
> + * The data can come in 2 stretches. Construct the regs from 2
> + * messages the offset indicates the offset from which the data is to
> + * be taken
> + */
> + for (i = 0 ; i < length; i++) {
> + k = offset + i;
> + j = AMD_ERR_DATA + i;
> + mc_priv->regs[k] = result[j];
> + }
> + if (mc_priv->part_len < result[TOTAL_ERR_LENGTH])
> + return 0;
> + mc_priv->part_len = 0;
> + }
> +
> + mc_priv->error_id = error_id;
> + mc_priv->error_level = result[AMD_ERROR_LEVEL];
> +
> + switch (error_id) {
> + /* GSW Non-Correctable error */
No need for those comments - the strings are already there.
> + case 5:
> + snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
> + AMD_ERR "General Software Non-Correctable error", error_id);
> + break;
> + /* CFU error */
> + case 6:
> + snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
> + AMD_ERR "CFU error", error_id);
> + break;
> + /* CFRAME error */
> + case 7:
> + snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
> + AMD_ERR "CFRAME error", error_id);
> + break;
> + /* Microblaze correctable error */
> + case 10:
> + snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
> + AMD_ERR "DDRMC Microblaze Correctable ECC error", error_id);
> + break;
...
> +static int amd_rpmsg_probe(struct rpmsg_device *rpdev)
> +{
> + struct rpmsg_channel_info chinfo = {0};
^^^^^^
Why clear it if you're initializing all its members below?
> + struct mc_priv *pg;
> +
> + pg = (struct mc_priv *)amd_rpmsg_id_table[0].driver_data;
> + chinfo.src = RPMSG_ADDR_ANY;
> + chinfo.dst = rpdev->dst;
> + strscpy(chinfo.name, amd_rpmsg_id_table[0].name,
> + strlen(amd_rpmsg_id_table[0].name));
> +
> + pg->ept = rpmsg_create_ept(rpdev, amd_rpmsg_cb, NULL, chinfo);
> + if (!pg->ept)
> + return dev_err_probe(&rpdev->dev, -ENXIO,
> + "Failed to create ept for channel %s\n",
> + chinfo.name);
> +
> + dev_set_drvdata(&rpdev->dev, pg);
> + return 0;
> +}
> +
> +static void amd_rpmsg_remove(struct rpmsg_device *rpdev)
> +{
> + struct mc_priv *mc_priv = dev_get_drvdata(&rpdev->dev);
> +
> + rpmsg_destroy_ept(mc_priv->ept);
> + dev_set_drvdata(&rpdev->dev, NULL);
> +}
> +
> +static struct rpmsg_driver amd_rpmsg_driver = {
> + .drv.name = KBUILD_MODNAME,
> + .probe = amd_rpmsg_probe,
> + .remove = amd_rpmsg_remove,
> + .callback = amd_rpmsg_cb,
> + .id_table = amd_rpmsg_id_table,
> +};
> +
> +/**
> + * get_dwidth - Return the controller memory width.
> + * @width: data width read from the config reg.
> + *
> + * Get the EDAC device type width appropriate for the controller
> + * configuration.
> + *
> + * Return: a device type width enumeration.
> + */
> +static enum dev_type get_dwidth(u32 width)
Merge it into its single caller.
> +{
> + enum dev_type dt;
> +
> + switch (width) {
> + case XDDR5_BUS_WIDTH_16:
> + dt = DEV_X16;
> + break;
> + case XDDR5_BUS_WIDTH_32:
> + dt = DEV_X32;
> + break;
> + case XDDR5_BUS_WIDTH_64:
> + dt = DEV_X64;
> + break;
> + default:
> + dt = DEV_UNKNOWN;
> + }
> +
> + return dt;
> +}
> +
> +static int mc_probe(struct platform_device *pdev)
> +{
> + u32 num_chans, rank, dwidth, config;
> + struct device_node *r5_core_node;
> + struct edac_mc_layer layers[2];
> + struct mem_ctl_info *mci;
> + struct mc_priv *priv;
> + struct rproc *rp;
> + enum dev_type dt;
> + int rc, i;
> +
> + r5_core_node = of_parse_phandle(pdev->dev.of_node, "amd,rproc", 0);
> + if (!r5_core_node) {
> + dev_err(&pdev->dev, "amd,rproc: invalid phandle\n");
> + return -EINVAL;
> + }
> +
> + rp = rproc_get_by_phandle(r5_core_node->phandle);
> + if (!rp)
> + return -EPROBE_DEFER;
> +
> + rc = rproc_boot(rp);
> + if (rc) {
> + dev_err(&pdev->dev, "Failed to attach to remote processor\n");
> + rproc_put(rp);
> + return rc;
> + }
> +
> + priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
> + amd_rpmsg_id_table[0].driver_data = (kernel_ulong_t)priv;
> + rc = register_rpmsg_driver(&amd_rpmsg_driver);
> + if (rc) {
> + edac_printk(KERN_ERR, EDAC_MC,
> + "Failed to register RPMsg driver: %d\n", rc);
> + goto free_rproc;
> + }
> +
> + amd_setup_mcdi(priv);
This function can fail. Why isn't it returning success/failure and why aren't
you checking it here?
> + for (i = 0; i < NUM_CONTROLLERS; i++) {
> + config = priv->adec[CONF + i * ADEC_NUM];
> + num_chans = FIELD_GET(DDRMC5_NUM_CHANS_MASK, config);
> + rank = FIELD_GET(DDRMC5_RANK_MASK, config);
> + rank = 1 << rank;
merge the two:
rank = 1 << FIELD_GET...
> + dwidth = FIELD_GET(DDRMC5_BUS_WIDTH_MASK, config);
> + dt = get_dwidth(dwidth);
> +
> + /* Find the first enabled device and register that one. */
> + if (dt != DEV_UNKNOWN) {
Save an indentation level:
if (dt == DEV_UNKNOWN)
continue;
layers...
> + layers[0].type = EDAC_MC_LAYER_CHIP_SELECT;
> + layers[0].size = rank;
> + layers[0].is_virt_csrow = true;
> + layers[1].type = EDAC_MC_LAYER_CHANNEL;
> + layers[1].size = num_chans;
> + layers[1].is_virt_csrow = false;
> +
> + mci = edac_mc_alloc(0, ARRAY_SIZE(layers), layers,
> + sizeof(struct mc_priv));
> + if (!mci) {
> + edac_printk(KERN_ERR, EDAC_MC,
> + "Failed memory allocation for mc instance\n");
> + rc = -ENOMEM;
> + goto free_rpmsg;
> + }
> +
> + priv->mci = mci;
> + priv->dwidth = dt;
> + mc_init(mci, pdev);
> + rc = edac_mc_add_mc(mci);
> + if (rc) {
> + edac_printk(KERN_ERR, EDAC_MC,
> + "Failed to register with EDAC core\n");
> + goto free_edac_mc;
This is wrong as it frees only the current mci which the loop has allocated.
You need to loop back from i to 0, free all those previous mcis and unwind all
the setup work properly.
> + }
> + return 0;
> + }
> + }
> +
> + return 0;
> +
> +free_edac_mc:
> + edac_mc_free(mci);
> +free_rpmsg:
> + unregister_rpmsg_driver(&amd_rpmsg_driver);
> +free_rproc:
> + rproc_shutdown(rp);
> + return rc;
> +}
> +
> +static void mc_remove(struct platform_device *pdev)
> +{
> + struct mem_ctl_info *mci = platform_get_drvdata(pdev);
> + struct mc_priv *priv = mci->pvt_info;
> +
> + unregister_rpmsg_driver(&amd_rpmsg_driver);
> + edac_mc_del_mc(&pdev->dev);
> + edac_mc_free(mci);
> + rproc_shutdown(priv->mcdi->r5_rproc);
Why isn't this a loop over NUM_CONTROLLERS unwinding all the stuff mc_probe()
has done, in the reverse order?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [PATCH v7 5/5] EDAC/VersalNET: Add support for error notification
2025-07-03 17:31 ` Borislav Petkov
@ 2025-07-21 4:08 ` Datta, Shubhrajyoti
0 siblings, 0 replies; 14+ messages in thread
From: Datta, Shubhrajyoti @ 2025-07-21 4:08 UTC (permalink / raw)
To: Borislav Petkov
Cc: linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
linux-edac@vger.kernel.org, git (AMD-Xilinx), Krzysztof Kozlowski,
Rob Herring, Conor Dooley, Tony Luck, James Morse,
Mauro Carvalho Chehab, Robert Richter, Gupta, Nipun,
Agarwal, Nikhil
[AMD Official Use Only - AMD Internal Distribution Only]
> -----Original Message-----
> From: Borislav Petkov <bp@alien8.de>
> Sent: Thursday, July 3, 2025 11:01 PM
> To: Datta, Shubhrajyoti <shubhrajyoti.datta@amd.com>
> Cc: linux-kernel@vger.kernel.org; devicetree@vger.kernel.org; linux-
> edac@vger.kernel.org; git (AMD-Xilinx) <git@amd.com>; Krzysztof Kozlowski
> <krzk@kernel.org>; Rob Herring <robh@kernel.org>; Conor Dooley
> <conor+dt@kernel.org>; Tony Luck <tony.luck@intel.com>; James Morse
> <james.morse@arm.com>; Mauro Carvalho Chehab <mchehab@kernel.org>;
> Robert Richter <rric@kernel.org>; Gupta, Nipun <Nipun.Gupta@amd.com>;
> Agarwal, Nikhil <nikhil.agarwal@amd.com>
> Subject: Re: [PATCH v7 5/5] EDAC/VersalNET: Add support for error notification
>
> Caution: This message originated from an External Source. Use proper caution
> when opening attachments, clicking links, or responding.
>
>
> On Thu, May 29, 2025 at 12:30:17PM +0530, Shubhrajyoti Datta wrote:
> > Hardware errors can corrupt the RAM or other system components,
> > without detection it can lead to crashes and system failure, this
> > driver helps in reporting of errors to user space, triggering corrective actions.
>
> Kinda useless paragraph.
>
> > drivers/edac/Kconfig | 11 +
> > drivers/edac/Makefile | 1 +
> > drivers/edac/versalnet_edac.c | 1108 +++++++++++++++++++++++++++++
> > include/linux/cdx/edac_cdx_pcol.h | 28 +
>
> I'd need a MAINTAINERS entry for this driver so that you can get CCed on fixes.
> And I'd need that email to be responsive and not stuff to disappear into the void.
> If it does and no one cares for it, I don't have a problem with removing this driver
> again from the tree.
>
> > +/* Granularity of reported error in bytes */
> > +#define DDRMC5_EDAC_ERR_GRAIN 1
> > +#define MC_CMD_EDAC_GET_DDR_CONFIG_IN_LEN 4
> > +
> > +#define DDRMC5_EDAC_MSG_SIZE 256
> > +
> > +#define DDRMC5_IRQ_CE_MASK GENMASK(18, 15)
> > +#define DDRMC5_IRQ_UE_MASK GENMASK(14, 11)
> > +
> > +#define DDRMC5_RANK_1_MASK GENMASK(11, 6)
> > +#define MASK_24 GENMASK(29, 24)
> > +#define MASK_0 GENMASK(5, 0)
> > +
> > +#define DDRMC5_LRANK_1_MASK GENMASK(11, 6)
> > +#define DDRMC5_LRANK_2_MASK GENMASK(17, 12)
> > +#define DDRMC5_BANK1_MASK GENMASK(11, 6)
> > +#define DDRMC5_GRP_0_MASK GENMASK(17, 12)
> > +#define DDRMC5_GRP_1_MASK GENMASK(23, 18)
> > +
> > +#define ECCR_UE_CE_ADDR_HI_ROW_MASK GENMASK(10, 0)
> > +
> > +#define DDRMC5_MAX_ROW_CNT 18
> > +#define DDRMC5_MAX_COL_CNT 11
> > +#define DDRMC5_MAX_RANK_CNT 2
> > +#define DDRMC5_MAX_LRANK_CNT 4
> > +#define DDRMC5_MAX_BANK_CNT 2
> > +#define DDRMC5_MAX_GRP_CNT 3
> > +
> > +#define DDRMC5_REGHI_ROW 7
> > +#define DDRMC5_EACHBIT 1
> > +#define DDRMC5_ERR_TYPE_CE 0
> > +#define DDRMC5_ERR_TYPE_UE 1
> > +#define DDRMC5_HIGH_MEM_EN BIT(20)
> > +#define DDRMC5_MEM_MASK GENMASK(19, 0)
> > +#define DDRMC5_X16_BASE 256
> > +#define DDRMC5_X16_ECC 32
> > +#define DDRMC5_X16_SIZE (DDRMC5_X16_BASE +
> DDRMC5_X16_ECC)
> > +#define DDRMC5_X32_SIZE 576
> > +#define DDRMC5_HIMEM_BASE (256 * SZ_1M)
> > +#define DDRMC5_ILC_HIMEM_EN BIT(28)
> > +#define DDRMC5_ILC_MEM GENMASK(27, 0)
> > +#define DDRMC5_INTERLEAVE_SEL GENMASK(3, 0)
> > +#define DDRMC5_BUS_WIDTH_MASK GENMASK(19, 18)
> > +#define DDRMC5_NUM_CHANS_MASK BIT(17)
> > +#define DDRMC5_RANK_MASK GENMASK(15, 14)
> > +#define DDRMC5_DWIDTH_MASK GENMASK(5, 4)
> > +
> > +#define AMD_MIN_BUF_LEN 0x28
> > +#define AMD_ERROR_LEVEL 2
> > +#define AMD_ERRORID 3
> > +#define TOTAL_ERR_LENGTH 5
> > +#define AMD_MSG_ERR_OFFSET 8
> > +#define AMD_MSG_ERR_LENGTH 9
> > +#define AMD_ERR_DATA 10
> > +#define MCDI_RESPONSE 0xFF
> > +
> > +#define ERR_NOTIFICATION_MAX 96
> > +#define REG_MAX 152
> > +#define ADEC_MAX 152
> > +#define NUM_CONTROLLERS 8
> > +#define REGS_PER_CONTROLLER 19
> > +#define ADEC_NUM 19
> > +#define MC_CMD_EDAC_GET_OVERALL_DDR_CONFIG 2
> > +#define BUFFER_SZ 80
> > +
> > +#define XDDR5_BUS_WIDTH_64 0
> > +#define XDDR5_BUS_WIDTH_32 1
> > +#define XDDR5_BUS_WIDTH_16 2
>
> For all those defines above:
>
> - remove the unused ones
> - remove "EDAC" from the name
> - shorten them
>
> which will allow you to have shorter lines in general and vertical alignment will be
> closer to the left.
>
> > +
> > +#define AMD_ERR "[VERSAL_EDAC_ERR_ID: %d] Error type:"
>
> This is not now this is done - see the edac_*_printk() macros in
> drivers/edac/edac_mc.h. You could use them, extend them and so on. No need
> for homegrown hackery.
The AMD_ERR is used for the snprintf and then the string is passed to the log_non_standard_event.
Should I plan to add a edac_snprintf to the drivers/edac/edac_mc.h.
Or should I just merge AMD_ERR to snprintf.
> > + snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
> > + AMD_ERR "CFU error", error_id);
> > + break;
> > + /* CFRAME error */
> > + case 7:
> > + snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
> > + AMD_ERR "CFRAME error", error_id);
> > + break;
> > + /* Microblaze correctable error */
> > + case 10:
> > + snprintf(mc_priv->message, DDRMC5_EDAC_MSG_SIZE,
> > + AMD_ERR "DDRMC Microblaze Correctable ECC error", error_id);
> > + break;
>
...
;
> > + }
> > +
> > + priv->mci = mci;
> > + priv->dwidth = dt;
> > + mc_init(mci, pdev);
> > + rc = edac_mc_add_mc(mci);
> > + if (rc) {
> > + edac_printk(KERN_ERR, EDAC_MC,
> > + "Failed to register with EDAC core\n");
> > + goto free_edac_mc;
>
> This is wrong as it frees only the current mci which the loop has allocated.
> You need to loop back from i to 0, free all those previous mcis and unwind all the
> setup work properly.
>
We are registering the first valid EDAC controller and free that .
However I understand from your comment that I should be registering all the valid
EDAC controllers and free them.
Will fix in the next version.
> > + }
> > + return 0;
> > + }
> > + }
> > +
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2025-07-21 4:08 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-29 7:00 [PATCH v7 0/5] EDAC/Versal NET: Add support for error notification Shubhrajyoti Datta
2025-05-29 7:00 ` [PATCH v7 1/5] cdx: add the headers to include/linux Shubhrajyoti Datta
2025-06-13 19:59 ` Yazen Ghannam
2025-05-29 7:00 ` [PATCH v7 2/5] cdx: Export Symbols for MCDI RPC and Initialization Shubhrajyoti Datta
2025-06-13 20:10 ` Yazen Ghannam
2025-06-16 12:20 ` Datta, Shubhrajyoti
2025-06-20 11:03 ` Datta, Shubhrajyoti
2025-07-02 10:27 ` Borislav Petkov
2025-05-29 7:00 ` [PATCH v7 3/5] ras: Export log_non_standard_event for External Usage Shubhrajyoti Datta
2025-05-29 7:00 ` [PATCH v7 4/5] dt-bindings: memory-controllers: Add support for Versal NET EDAC Shubhrajyoti Datta
2025-05-29 7:00 ` [PATCH v7 5/5] EDAC/VersalNET: Add support for error notification Shubhrajyoti Datta
2025-07-03 17:31 ` Borislav Petkov
2025-07-21 4:08 ` Datta, Shubhrajyoti
2025-05-30 14:12 ` [PATCH v7 0/5] EDAC/Versal NET: " Michal Simek
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox