linux-cxl.vger.kernel.org archive mirror
* [PATCH 0/4] cxl/events: Update to rev 3.2, improvements and add trace memory sparing event record
@ 2025-07-16 10:49 shiju.jose
  2025-07-16 10:49 ` [PATCH 1/4] cxl/events: Update Common Event Record to CXL spec rev 3.2 shiju.jose
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: shiju.jose @ 2025-07-16 10:49 UTC (permalink / raw)
  To: linux-cxl, dan.j.williams, dave.jiang, jonathan.cameron,
	alison.schofield, dave, vishal.l.verma, ira.weiny
  Cc: tanxiaofei, prime.zeng, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

This series makes the following changes to the CXL trace events:
1. Update the Common Event Record to CXL spec rev 3.2
   https://lore.kernel.org/all/20250522090002.831-1-shiju.jose@huawei.com/
2. Add extra validity checks for the corrected memory error count
   in the General Media Event Record
3. Add extra validity checks for the CVME count in the DRAM Event Record
4. Add tracing support for the Memory Sparing Event Record

Shiju Jose (4):
  cxl/events: Update Common Event Record to CXL spec rev 3.2
  cxl/events: Add extra validity checks for corrected memory error count
    in General Media Event Record
  cxl/events: Add extra validity checks for CVME count in DRAM Event
    Record
  cxl/events: Trace Memory Sparing Event Record

 drivers/cxl/core/mbox.c  |  24 ++++++++
 drivers/cxl/core/trace.h | 127 +++++++++++++++++++++++++++++++++++++--
 drivers/cxl/cxlmem.h     |   8 +++
 include/cxl/event.h      |  37 +++++++++++-
 4 files changed, 190 insertions(+), 6 deletions(-)

-- 
2.43.0



* [PATCH 1/4] cxl/events: Update Common Event Record to CXL spec rev 3.2
  2025-07-16 10:49 [PATCH 0/4] cxl/events: Update to rev 3.2, improvements and add trace memory sparing event record shiju.jose
@ 2025-07-16 10:49 ` shiju.jose
  2025-07-16 12:53   ` Jonathan Cameron
  2025-07-16 10:49 ` [PATCH 2/4] cxl/events: Add extra validity checks for corrected memory error count in General Media Event Record shiju.jose
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: shiju.jose @ 2025-07-16 10:49 UTC (permalink / raw)
  To: linux-cxl, dan.j.williams, dave.jiang, jonathan.cameron,
	alison.schofield, dave, vishal.l.verma, ira.weiny
  Cc: tanxiaofei, prime.zeng, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

CXL spec rev 3.2 section 8.2.10.2.1, Table 8-55 (Common Event Record
Format) defines the new fields LD-ID and Head ID.

LD-ID: ID of the logical device from which the event originated. It is
valid only if the LD-ID valid flag is set to 1.
CXL spec rev 3.2 section 2.4 describes that a Type 3 Multi-Logical Device
(MLD) can partition its resources into up to 16 isolated Logical Devices.
Each Logical Device is identified by a Logical Device Identifier (LD-ID)
in the CXL.mem and CXL.io protocols. LD-ID is a 16-bit identifier
applicable to CXL.io and CXL.mem requests and responses.
CXL.mem supports only the lower 4 bits of LD-ID and can therefore support
up to 16 unique LD-ID values over the link. Requests and responses
forwarded over an MLD Port are tagged with LD-ID.

Head ID: ID of the device head from which the event originated. It is
valid only if the Head ID valid flag is set to 1.

Add the above spec changes to the CXL event record and the CXL common
trace event implementation.
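
For illustration only, a consumer of the updated record might honour the
two new valid flags as in the user-space sketch below. The struct
evt_hdr_view, the helper names and the assumption that the flags were
already decoded into a host-endian integer are hypothetical and not part
of this patch. Only the bit positions (7 and 8) and the 4-bit CXL.mem
limit are taken from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as added by this patch; the names are illustrative. */
#define HDR_FLAG_LD_ID_VALID   (1u << 7)
#define HDR_FLAG_HEAD_ID_VALID (1u << 8)

/* Hypothetical, already-decoded (host-endian) view of the header fields. */
struct evt_hdr_view {
	uint32_t flags;   /* event record flags, widened to 32 bits */
	uint16_t ld_id;
	uint8_t  head_id;
};

/* Store the LD-ID in *out and return true only if the record marks it valid. */
static inline bool hdr_get_ld_id(const struct evt_hdr_view *h, uint16_t *out)
{
	if (!(h->flags & HDR_FLAG_LD_ID_VALID))
		return false;
	*out = h->ld_id;
	return true;
}

/* Store the Head ID in *out and return true only if the record marks it valid. */
static inline bool hdr_get_head_id(const struct evt_hdr_view *h, uint8_t *out)
{
	if (!(h->flags & HDR_FLAG_HEAD_ID_VALID))
		return false;
	*out = h->head_id;
	return true;
}

/* CXL.mem carries only the lower 4 bits of the 16-bit LD-ID. */
static inline uint8_t ld_id_cxl_mem(uint16_t ld_id)
{
	return (uint8_t)(ld_id & 0xf);
}
```

A caller would ignore ld_id or head_id whenever the matching helper
returns false, rather than reporting a value the device did not mark as
valid.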

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 drivers/cxl/core/trace.h | 18 ++++++++++++++----
 include/cxl/event.h      |  4 +++-
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index 25ebfbc1616c..a77487a257b3 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -214,12 +214,16 @@ TRACE_EVENT(cxl_overflow,
 #define CXL_EVENT_RECORD_FLAG_PERF_DEGRADED	BIT(4)
 #define CXL_EVENT_RECORD_FLAG_HW_REPLACE	BIT(5)
 #define CXL_EVENT_RECORD_FLAG_MAINT_OP_SUB_CLASS_VALID	BIT(6)
+#define CXL_EVENT_RECORD_FLAG_LD_ID_VALID	BIT(7)
+#define CXL_EVENT_RECORD_FLAG_HEAD_ID_VALID	BIT(8)
 #define show_hdr_flags(flags)	__print_flags(flags, " | ",			   \
 	{ CXL_EVENT_RECORD_FLAG_PERMANENT,	"PERMANENT_CONDITION"		}, \
 	{ CXL_EVENT_RECORD_FLAG_MAINT_NEEDED,	"MAINTENANCE_NEEDED"		}, \
 	{ CXL_EVENT_RECORD_FLAG_PERF_DEGRADED,	"PERFORMANCE_DEGRADED"		}, \
 	{ CXL_EVENT_RECORD_FLAG_HW_REPLACE,	"HARDWARE_REPLACEMENT_NEEDED"	},  \
-	{ CXL_EVENT_RECORD_FLAG_MAINT_OP_SUB_CLASS_VALID,	"MAINT_OP_SUB_CLASS_VALID" }	\
+	{ CXL_EVENT_RECORD_FLAG_MAINT_OP_SUB_CLASS_VALID,	"MAINT_OP_SUB_CLASS_VALID" }, \
+	{ CXL_EVENT_RECORD_FLAG_LD_ID_VALID,	"LD_ID_VALID" }, \
+	{ CXL_EVENT_RECORD_FLAG_HEAD_ID_VALID,	"HEAD_ID_VALID" } \
 )
 
 /*
@@ -247,7 +251,9 @@ TRACE_EVENT(cxl_overflow,
 	__field(u64, hdr_timestamp)				\
 	__field(u8, hdr_length)					\
 	__field(u8, hdr_maint_op_class)				\
-	__field(u8, hdr_maint_op_sub_class)
+	__field(u8, hdr_maint_op_sub_class)			\
+	__field(u16, hdr_ld_id)					\
+	__field(u8, hdr_head_id)
 
 #define CXL_EVT_TP_fast_assign(cxlmd, l, hdr)					\
 	__assign_str(memdev);				\
@@ -260,18 +266,22 @@ TRACE_EVENT(cxl_overflow,
 	__entry->hdr_related_handle = le16_to_cpu((hdr).related_handle);	\
 	__entry->hdr_timestamp = le64_to_cpu((hdr).timestamp);			\
 	__entry->hdr_maint_op_class = (hdr).maint_op_class;			\
-	__entry->hdr_maint_op_sub_class = (hdr).maint_op_sub_class
+	__entry->hdr_maint_op_sub_class = (hdr).maint_op_sub_class;		\
+	__entry->hdr_ld_id = le16_to_cpu((hdr).ld_id);				\
+	__entry->hdr_head_id = (hdr).head_id
 
 #define CXL_EVT_TP_printk(fmt, ...) \
 	TP_printk("memdev=%s host=%s serial=%lld log=%s : time=%llu uuid=%pUb "	\
 		"len=%d flags='%s' handle=%x related_handle=%x "		\
-		"maint_op_class=%u maint_op_sub_class=%u : " fmt,		\
+		"maint_op_class=%u maint_op_sub_class=%u "			\
+		"ld_id=%x head_id=%x : " fmt,					\
 		__get_str(memdev), __get_str(host), __entry->serial,		\
 		cxl_event_log_type_str(__entry->log),				\
 		__entry->hdr_timestamp, &__entry->hdr_uuid, __entry->hdr_length,\
 		show_hdr_flags(__entry->hdr_flags), __entry->hdr_handle,	\
 		__entry->hdr_related_handle, __entry->hdr_maint_op_class,	\
 		__entry->hdr_maint_op_sub_class,	\
+		__entry->hdr_ld_id, __entry->hdr_head_id,			\
 		##__VA_ARGS__)
 
 TRACE_EVENT(cxl_generic_event,
diff --git a/include/cxl/event.h b/include/cxl/event.h
index f9ae1796da85..f4cb8568566b 100644
--- a/include/cxl/event.h
+++ b/include/cxl/event.h
@@ -19,7 +19,9 @@ struct cxl_event_record_hdr {
 	__le64 timestamp;
 	u8 maint_op_class;
 	u8 maint_op_sub_class;
-	u8 reserved[14];
+	__le16 ld_id;
+	u8 head_id;
+	u8 reserved[11];
 } __packed;
 
 struct cxl_event_media_hdr {
-- 
2.43.0



* [PATCH 2/4] cxl/events: Add extra validity checks for corrected memory error count in General Media Event Record
  2025-07-16 10:49 [PATCH 0/4] cxl/events: Update to rev 3.2, improvements and add trace memory sparing event record shiju.jose
  2025-07-16 10:49 ` [PATCH 1/4] cxl/events: Update Common Event Record to CXL spec rev 3.2 shiju.jose
@ 2025-07-16 10:49 ` shiju.jose
  2025-07-16 13:04   ` Jonathan Cameron
                     ` (2 more replies)
  2025-07-16 10:49 ` [PATCH 3/4] cxl/events: Add extra validity checks for CVME count in DRAM " shiju.jose
  2025-07-16 10:49 ` [PATCH 4/4] cxl/events: Trace Memory Sparing " shiju.jose
  3 siblings, 3 replies; 15+ messages in thread
From: shiju.jose @ 2025-07-16 10:49 UTC (permalink / raw)
  To: linux-cxl, dan.j.williams, dave.jiang, jonathan.cameron,
	alison.schofield, dave, vishal.l.verma, ira.weiny
  Cc: tanxiaofei, prime.zeng, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

According to the CXL Specification Revision 3.2, Section 8.2.10.2.1.1,
Table 8-57 (General Media Event Record), the Corrected Memory Error Count
field is valid under the following conditions:
1. The Threshold Event bit is set in the Memory Event Descriptor field, and
2. The Corrected Memory Error Count must be greater than 0 for events where
   the Advanced Programmable Threshold Counter has expired.

Additionally, if the Advanced Programmable Corrected Memory Error Counter
Expire bit in the Memory Event Type field is set, then the Threshold Event
bit in the Memory Event Descriptor field shall also be set.

Add validity checks for the above conditions when reporting the event to
user space.
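
As a rough sketch of the rules above (not the code in this patch), the
consistency check can be written as a small pure function. The enum, the
function names and the boolean parameters are hypothetical. The count is
stored in the record as a 3-byte little-endian field (the trace code reads
it with get_unaligned_le24()), so the sketch decodes it before comparing
against zero; applying "!" to the array name itself would not inspect the
bytes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode a 3-byte little-endian counter as stored in the event record. */
static inline uint32_t le24_decode(const uint8_t b[3])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) | ((uint32_t)b[2] << 16);
}

/* Hypothetical result codes for the two spec violations described above. */
enum cme_check {
	CME_OK,
	CME_EXPIRE_WITHOUT_THRESHOLD,	/* counter expired, Threshold Event clear */
	CME_EXPIRE_WITH_ZERO_COUNT,	/* counter expired, count is 0 */
};

static inline enum cme_check cme_check_record(bool threshold_event,
					      bool counter_expired,
					      const uint8_t count[3])
{
	if (counter_expired && !threshold_event)
		return CME_EXPIRE_WITHOUT_THRESHOLD;
	if (counter_expired && le24_decode(count) == 0)
		return CME_EXPIRE_WITH_ZERO_COUNT;
	return CME_OK;
}

/* The count is reported only when the Threshold Event bit is set. */
static inline uint32_t cme_reported_count(bool threshold_event,
					  const uint8_t count[3])
{
	return threshold_event ? le24_decode(count) : 0;
}
```

Keeping the check separate from the reporting step leaves open whether a
violation is only warned about or the record is dropped.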

Note: CXL spec rev 3.2, Table 8-57 (General Media Event Record), field
"Corrected Memory Error Count at Event": "For events in which the
advanced programmable threshold counter has expired, this field value
shall be a value greater than 0. Counter expiration events in which
the corrected memory error count is 0 shall not generate a media event
record".
Q: Should the kernel drop the event record in this case, or should user
space handle it?

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 drivers/cxl/core/mbox.c  | 9 +++++++++
 drivers/cxl/core/trace.h | 5 ++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 2689e6453c5a..5a30d3891b17 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -926,6 +926,15 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 			if (cxl_store_rec_gen_media((struct cxl_memdev *)cxlmd, evt))
 				dev_dbg(&cxlmd->dev, "CXL store rec_gen_media failed\n");
 
+			if (evt->gen_media.media_hdr.descriptor &
+			    CXL_GMER_EVT_DESC_THRESHOLD_EVENT)
+				WARN_ON_ONCE((evt->gen_media.media_hdr.type &
+					      CXL_GMER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE) &&
+					     !evt->gen_media.cme_count);
+			else
+				WARN_ON_ONCE(evt->gen_media.media_hdr.type &
+					     CXL_GMER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE);
+
 			trace_cxl_general_media(cxlmd, type, cxlr, hpa,
 						hpa_alias, &evt->gen_media);
 		} else if (event_type == CXL_CPER_EVENT_DRAM) {
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index a77487a257b3..c38f94ca0ca1 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -506,7 +506,10 @@ TRACE_EVENT(cxl_general_media,
 			uuid_copy(&__entry->region_uuid, &uuid_null);
 		}
 		__entry->cme_threshold_ev_flags = rec->cme_threshold_ev_flags;
-		__entry->cme_count = get_unaligned_le24(rec->cme_count);
+		if (rec->media_hdr.descriptor & CXL_GMER_EVT_DESC_THRESHOLD_EVENT)
+			__entry->cme_count = get_unaligned_le24(rec->cme_count);
+		else
+			__entry->cme_count = 0;
 	),
 
 	CXL_EVT_TP_printk("dpa=%llx dpa_flags='%s' " \
-- 
2.43.0



* [PATCH 3/4] cxl/events: Add extra validity checks for CVME count in DRAM Event Record
  2025-07-16 10:49 [PATCH 0/4] cxl/events: Update to rev 3.2, improvements and add trace memory sparing event record shiju.jose
  2025-07-16 10:49 ` [PATCH 1/4] cxl/events: Update Common Event Record to CXL spec rev 3.2 shiju.jose
  2025-07-16 10:49 ` [PATCH 2/4] cxl/events: Add extra validity checks for corrected memory error count in General Media Event Record shiju.jose
@ 2025-07-16 10:49 ` shiju.jose
  2025-07-16 13:07   ` Jonathan Cameron
                     ` (2 more replies)
  2025-07-16 10:49 ` [PATCH 4/4] cxl/events: Trace Memory Sparing " shiju.jose
  3 siblings, 3 replies; 15+ messages in thread
From: shiju.jose @ 2025-07-16 10:49 UTC (permalink / raw)
  To: linux-cxl, dan.j.williams, dave.jiang, jonathan.cameron,
	alison.schofield, dave, vishal.l.verma, ira.weiny
  Cc: tanxiaofei, prime.zeng, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

According to the CXL Specification Revision 3.2, Section 8.2.10.2.1.2,
Table 8-58 (DRAM Event Record), the CVME (Corrected Volatile Memory Error)
Count field is valid under the following conditions:
1. The Threshold Event bit is set in the Memory Event Descriptor field, and
2. The CVME Count must be greater than 0 for events where the Advanced
   Programmable Threshold Counter has expired.

Additionally, if the Advanced Programmable Corrected Memory Error Counter
Expire bit in the Memory Event Type field is set, then the Threshold Event
bit in the Memory Event Descriptor field shall also be set.

Add validity checks for the above conditions when reporting the event to
user space.

Note: CXL spec rev 3.2, Table 8-58 (DRAM Event Record), field "CVME Count
at Event": "For events in which the advanced programmable threshold
counter has expired, this field value shall be a value greater than 0.
Counter expiration events in which the corrected memory error count
is 0 shall not generate a media event record".
Q: Should the kernel drop the event record in this case, or should user
space handle it?

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 drivers/cxl/core/mbox.c  | 9 +++++++++
 drivers/cxl/core/trace.h | 4 ++++
 2 files changed, 13 insertions(+)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 5a30d3891b17..e8c49ae28828 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -941,6 +941,15 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 			if (cxl_store_rec_dram((struct cxl_memdev *)cxlmd, evt))
 				dev_dbg(&cxlmd->dev, "CXL store rec_dram failed\n");
 
+			if (evt->dram.media_hdr.descriptor &
+			    CXL_GMER_EVT_DESC_THRESHOLD_EVENT)
+				WARN_ON_ONCE((evt->dram.media_hdr.type &
+					      CXL_DER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE) &&
+					     !evt->dram.cvme_count);
+			else
+				WARN_ON_ONCE(evt->dram.media_hdr.type &
+					     CXL_DER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE);
+
 			trace_cxl_dram(cxlmd, type, cxlr, hpa, hpa_alias,
 				       &evt->dram);
 		}
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index c38f94ca0ca1..c3cd871942c5 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -662,6 +662,10 @@ TRACE_EVENT(cxl_dram,
 		__entry->sub_channel = rec->sub_channel;
 		__entry->cme_threshold_ev_flags = rec->cme_threshold_ev_flags;
 		__entry->cvme_count = get_unaligned_le24(rec->cvme_count);
+		if (rec->media_hdr.descriptor & CXL_GMER_EVT_DESC_THRESHOLD_EVENT)
+			__entry->cvme_count = get_unaligned_le24(rec->cvme_count);
+		else
+			__entry->cvme_count = 0;
 	),
 
 	CXL_EVT_TP_printk("dpa=%llx dpa_flags='%s' descriptor='%s' type='%s' sub_type='%s' " \
-- 
2.43.0



* [PATCH 4/4] cxl/events: Trace Memory Sparing Event Record
  2025-07-16 10:49 [PATCH 0/4] cxl/events: Update to rev 3.2, improvements and add trace memory sparing event record shiju.jose
                   ` (2 preceding siblings ...)
  2025-07-16 10:49 ` [PATCH 3/4] cxl/events: Add extra validity checks for CVME count in DRAM " shiju.jose
@ 2025-07-16 10:49 ` shiju.jose
  2025-07-16 13:16   ` Jonathan Cameron
  2025-07-16 22:23   ` Dave Jiang
  3 siblings, 2 replies; 15+ messages in thread
From: shiju.jose @ 2025-07-16 10:49 UTC (permalink / raw)
  To: linux-cxl, dan.j.williams, dave.jiang, jonathan.cameron,
	alison.schofield, dave, vishal.l.verma, ira.weiny
  Cc: tanxiaofei, prime.zeng, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

CXL rev 3.2 section 8.2.10.2.1.4 Table 8-60 defines the Memory Sparing
Event Record.

Determine whether an event that was read is a Memory Sparing Event Record
and, if so, trace the record.

A memory device shall produce a Memory Sparing Event Record:
1. After completion of a PPR maintenance operation, if the memory sparing
   event record enable bit is set (Field: sPPR/hPPR Operation Mode in
   Table 8-128/Table 8-131).
2. In response to a query request by the host (see section 8.2.10.7.1.4)
   to determine the availability of sparing resources.
The device shall report the resource availability by producing the Memory
Sparing Event Record (see Table 8-60) in which the channel, rank, nibble
mask, bank group, bank, row, column, sub-channel fields are a copy of the
values specified in the request. If the controller does not support
reporting whether a resource is available, and a perform maintenance
operation for memory sparing is issued with query resources set to 1, the
controller shall return invalid input.

Example trace log for a Memory Sparing Event Record produced on completion
of a soft PPR operation:
cxl_memory_sparing: memdev=mem1 host=0000:0f:00.0 serial=3
log=Informational : time=55045163029
uuid=e71f3a40-2d29-4092-8a39-4d1c966c7c65 len=128 flags='0x1' handle=1
related_handle=0 maint_op_class=2 maint_op_sub_class=1
ld_id=0 head_id=0 : flags='' result=0
validity_flags='CHANNEL|RANK|NIBBLE|BANK GROUP|BANK|ROW|COLUMN'
spare resource avail=1 channel=2 rank=5 nibble_mask=a59c bank_group=2
bank=4 row=13 column=23 sub_channel=0
comp_id=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
comp_id_pldm_valid_flags='' pldm_entity_id=0x00 pldm_resource_id=0x00
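
To show how the validity_flags string in the log above relates to the raw
16-bit field, here is a small user-space sketch that joins the names of
the set bits with '|'. The function name and buffer handling are
hypothetical; the bit order and the names follow the
show_mem_sparing_valid_flags() table added by this patch. With bits 0 to 6
set (0x7f) it yields the string seen in the example log.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Index in this table == bit position in the validity flags field. */
static const char *const valid_names[] = {
	"CHANNEL", "RANK", "NIBBLE", "BANK GROUP", "BANK", "ROW", "COLUMN",
	"COMPONENT ID", "COMPONENT ID PLDM FORMAT", "SUB CHANNEL",
};

/* Write the '|'-joined names of the set bits into buf and return buf. */
static inline char *format_validity_flags(uint16_t flags, char *buf, size_t len)
{
	size_t used = 0;

	if (len == 0)
		return buf;
	buf[0] = '\0';

	for (size_t bit = 0; bit < sizeof(valid_names) / sizeof(valid_names[0]); bit++) {
		int n;

		if (!(flags & (1u << bit)))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", valid_names[bit]);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* buffer too small: stop with truncated output */
		used += (size_t)n;
	}
	return buf;
}
```

A parser going the other way would split the string on '|' and look each
name up in the same table.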

Note: For memory sparing event record, fields 'maintenance operation
class' and 'maintenance operation subclass' are defined twice, first
in the common event record (Table 8-55) and second in the memory
sparing event record (Table 8-60). Thus those in the sparing event
record are coded as reserved, to be removed when the spec is updated.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 drivers/cxl/core/mbox.c  |   6 +++
 drivers/cxl/core/trace.h | 100 +++++++++++++++++++++++++++++++++++++++
 drivers/cxl/cxlmem.h     |   8 ++++
 include/cxl/event.h      |  33 +++++++++++++
 4 files changed, 147 insertions(+)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index e8c49ae28828..682993ace4ae 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -899,6 +899,10 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 		trace_cxl_generic_event(cxlmd, type, uuid, &evt->generic);
 		return;
 	}
+	if (event_type == CXL_CPER_EVENT_MEM_SPARING) {
+		trace_cxl_memory_sparing(cxlmd, type, &evt->mem_sparing);
+		return;
+	}
 
 	if (trace_cxl_general_media_enabled() || trace_cxl_dram_enabled()) {
 		u64 dpa, hpa = ULLONG_MAX, hpa_alias = ULLONG_MAX;
@@ -970,6 +974,8 @@ static void __cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 		ev_type = CXL_CPER_EVENT_DRAM;
 	else if (uuid_equal(uuid, &CXL_EVENT_MEM_MODULE_UUID))
 		ev_type = CXL_CPER_EVENT_MEM_MODULE;
+	else if (uuid_equal(uuid, &CXL_EVENT_MEM_SPARING_UUID))
+		ev_type = CXL_CPER_EVENT_MEM_SPARING;
 
 	cxl_event_trace_record(cxlmd, type, ev_type, uuid, &record->event);
 }
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index c3cd871942c5..2c291fb1857c 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -888,6 +888,106 @@ TRACE_EVENT(cxl_memory_module,
 	)
 );
 
+#define CXL_MSER_QUERY_RESOURCE_FLAG			BIT(0)
+#define CXL_MSER_HARD_SPARING_FLAG			BIT(1)
+#define CXL_MSER_DEV_INITED_FLAG			BIT(2)
+#define show_mem_sparing_flags(flags)	__print_flags(flags, "|",		\
+	{ CXL_MSER_QUERY_RESOURCE_FLAG,		"Query Resources" },		\
+	{ CXL_MSER_HARD_SPARING_FLAG,		"Hard Sparing" },		\
+	{ CXL_MSER_DEV_INITED_FLAG,	"Device Initiated Sparing"	}	\
+)
+
+#define CXL_MSER_VALID_CHANNEL				BIT(0)
+#define CXL_MSER_VALID_RANK				BIT(1)
+#define CXL_MSER_VALID_NIBBLE				BIT(2)
+#define CXL_MSER_VALID_BANK_GROUP			BIT(3)
+#define CXL_MSER_VALID_BANK				BIT(4)
+#define CXL_MSER_VALID_ROW				BIT(5)
+#define CXL_MSER_VALID_COLUMN				BIT(6)
+#define CXL_MSER_VALID_COMPONENT_ID			BIT(7)
+#define CXL_MSER_VALID_COMPONENT_ID_FORMAT		BIT(8)
+#define CXL_MSER_VALID_SUB_CHANNEL			BIT(9)
+#define show_mem_sparing_valid_flags(flags)	__print_flags(flags, "|",		\
+	{ CXL_MSER_VALID_CHANNEL,			"CHANNEL"		},	\
+	{ CXL_MSER_VALID_RANK,				"RANK"			},	\
+	{ CXL_MSER_VALID_NIBBLE,			"NIBBLE"		},	\
+	{ CXL_MSER_VALID_BANK_GROUP,			"BANK GROUP"		},	\
+	{ CXL_MSER_VALID_BANK,				"BANK"			},	\
+	{ CXL_MSER_VALID_ROW,				"ROW"			},	\
+	{ CXL_MSER_VALID_COLUMN,			"COLUMN"		},	\
+	{ CXL_MSER_VALID_COMPONENT_ID,			"COMPONENT ID"		},	\
+	{ CXL_MSER_VALID_COMPONENT_ID_FORMAT,		"COMPONENT ID PLDM FORMAT" },	\
+	{ CXL_MSER_VALID_SUB_CHANNEL,			"SUB CHANNEL"		}	\
+)
+
+TRACE_EVENT(cxl_memory_sparing,
+
+	TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
+		 struct cxl_event_mem_sparing *rec),
+
+	TP_ARGS(cxlmd, log, rec),
+
+	TP_STRUCT__entry(
+		CXL_EVT_TP_entry
+
+		/* Memory Sparing Event */
+		__field(u8, flags)
+		__field(u8, result)
+		__field(u16, validity_flags)
+		__field(u16, res_avail)
+		__field(u8, channel)
+		__field(u8, rank)
+		__field(u32, nibble_mask)
+		__field(u8, bank_group)
+		__field(u8, bank)
+		__field(u32, row)
+		__field(u16, column)
+		__field(u8, sub_channel)
+		__array(u8, comp_id, CXL_EVENT_GEN_MED_COMP_ID_SIZE)
+	),
+
+	TP_fast_assign(
+		CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
+		__entry->hdr_uuid = CXL_EVENT_MEM_SPARING_UUID;
+
+		/* Memory Sparing Event */
+		__entry->flags = rec->flags;
+		__entry->result = rec->result;
+		__entry->validity_flags = le16_to_cpu(rec->validity_flags);
+		__entry->res_avail = le16_to_cpu(rec->res_avail);
+		__entry->channel = rec->channel;
+		__entry->rank = rec->rank;
+		__entry->nibble_mask = get_unaligned_le24(rec->nibble_mask);
+		__entry->bank_group = rec->bank_group;
+		__entry->bank = rec->bank;
+		__entry->row = get_unaligned_le24(rec->row);
+		__entry->column = le16_to_cpu(rec->column);
+		__entry->sub_channel = rec->sub_channel;
+		memcpy(__entry->comp_id, &rec->component_id,
+		       CXL_EVENT_GEN_MED_COMP_ID_SIZE);
+	),
+
+	CXL_EVT_TP_printk("flags='%s' result=%u validity_flags='%s' " \
+		"spare resource avail=%u channel=%u rank=%u " \
+		"nibble_mask=%x bank_group=%u bank=%u " \
+		"row=%u column=%u sub_channel=%u " \
+		"comp_id=%s comp_id_pldm_valid_flags='%s' " \
+		"pldm_entity_id=%s pldm_resource_id=%s",
+		show_mem_sparing_flags(__entry->flags),
+		__entry->result,
+		show_mem_sparing_valid_flags(__entry->validity_flags),
+		__entry->res_avail, __entry->channel, __entry->rank,
+		__entry->nibble_mask, __entry->bank_group, __entry->bank,
+		__entry->row, __entry->column, __entry->sub_channel,
+		__print_hex(__entry->comp_id, CXL_EVENT_GEN_MED_COMP_ID_SIZE),
+		show_comp_id_pldm_flags(__entry->comp_id[0]),
+		show_pldm_entity_id(__entry->validity_flags, CXL_MSER_VALID_COMPONENT_ID,
+				    CXL_MSER_VALID_COMPONENT_ID_FORMAT, __entry->comp_id),
+		show_pldm_resource_id(__entry->validity_flags, CXL_MSER_VALID_COMPONENT_ID,
+				      CXL_MSER_VALID_COMPONENT_ID_FORMAT, __entry->comp_id)
+	)
+);
+
 #define show_poison_trace_type(type)			\
 	__print_symbolic(type,				\
 	{ CXL_POISON_TRACE_LIST,	"List"   },	\
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 551b0ba2caa1..f98311f357b7 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -633,6 +633,14 @@ struct cxl_mbox_identify {
 	UUID_INIT(0xfe927475, 0xdd59, 0x4339, 0xa5, 0x86, 0x79, 0xba, 0xb1, \
 		  0x13, 0xb7, 0x74)
 
+/*
+ * Memory Sparing Event Record UUID
+ * CXL rev 3.2 section 8.2.10.2.1.4: Table 8-60
+ */
+#define CXL_EVENT_MEM_SPARING_UUID                                          \
+	UUID_INIT(0xe71f3a40, 0x2d29, 0x4092, 0x8a, 0x39, 0x4d, 0x1c, 0x96, \
+		  0x6c, 0x7c, 0x65)
+
 /*
  * Get Event Records output payload
  * CXL rev 3.0 section 8.2.9.2.2; Table 8-50
diff --git a/include/cxl/event.h b/include/cxl/event.h
index f4cb8568566b..6fd90f9cc203 100644
--- a/include/cxl/event.h
+++ b/include/cxl/event.h
@@ -110,11 +110,43 @@ struct cxl_event_mem_module {
 	u8 reserved[0x2a];
 } __packed;
 
+/*
+ * Memory Sparing Event Record - MSER
+ * CXL rev 3.2 section 8.2.10.2.1.4; Table 8-60
+ */
+struct cxl_event_mem_sparing {
+	struct cxl_event_record_hdr hdr;
+	/*
+	 * The fields maintenance operation class and maintenance operation
+	 * subclass defined in the Memory Sparing Event Record are the
+	 * duplication of the same in the common event record. Thus defined
+	 * as reserved and to be removed after the spec correction.
+	 */
+	u8 rsv1;
+	u8 rsv2;
+	u8 flags;
+	u8 result;
+	__le16 validity_flags;
+	u8 reserved1[6];
+	__le16 res_avail;
+	u8 channel;
+	u8 rank;
+	u8 nibble_mask[3];
+	u8 bank_group;
+	u8 bank;
+	u8 row[3];
+	__le16 column;
+	u8 component_id[CXL_EVENT_GEN_MED_COMP_ID_SIZE];
+	u8 sub_channel;
+	u8 reserved2[0x25];
+} __packed;
+
 union cxl_event {
 	struct cxl_event_generic generic;
 	struct cxl_event_gen_media gen_media;
 	struct cxl_event_dram dram;
 	struct cxl_event_mem_module mem_module;
+	struct cxl_event_mem_sparing mem_sparing;
 	/* dram & gen_media event header */
 	struct cxl_event_media_hdr media_hdr;
 } __packed;
@@ -133,6 +165,7 @@ enum cxl_event_type {
 	CXL_CPER_EVENT_GEN_MEDIA,
 	CXL_CPER_EVENT_DRAM,
 	CXL_CPER_EVENT_MEM_MODULE,
+	CXL_CPER_EVENT_MEM_SPARING,
 };
 
 #define CPER_CXL_DEVICE_ID_VALID		BIT(0)
-- 
2.43.0



* Re: [PATCH 1/4] cxl/events: Update Common Event Record to CXL spec rev 3.2
  2025-07-16 10:49 ` [PATCH 1/4] cxl/events: Update Common Event Record to CXL spec rev 3.2 shiju.jose
@ 2025-07-16 12:53   ` Jonathan Cameron
  0 siblings, 0 replies; 15+ messages in thread
From: Jonathan Cameron @ 2025-07-16 12:53 UTC (permalink / raw)
  To: shiju.jose
  Cc: linux-cxl, dan.j.williams, dave.jiang, alison.schofield, dave,
	vishal.l.verma, ira.weiny, tanxiaofei, prime.zeng, linuxarm

On Wed, 16 Jul 2025 11:49:42 +0100
<shiju.jose@huawei.com> wrote:

> From: Shiju Jose <shiju.jose@huawei.com>
> 
> CXL spec 3.2 section 8.2.10.2.1 Table 8-55, Common Event Record format
> defined new fields LD-ID and Head ID.
> 
> LD-ID: ID of logical device from where the event originated, which is
> valid only if LD-ID valid flag is set to 1.
> CXL spec 3.2 Section 2.4 describes, a Type 3 Multi-Logical Device (MLD)
> can partition its resources into up to 16 isolated Logical Devices.
> Each Logical Device is identified by a Logical Device Identifier (LD-ID)
> in CXL.mem and CXL.io protocols. LD-ID is a 16-bit Logical Device
> identifier applicable for CXL.io and CXL.mem requests and responses.
> CXL.mem supports only the lower 4 bits of LD-ID and therefore can support
> up to 16 unique LD-ID values over the link. Requests and responses
> forwarded over an MLD Port are tagged with LD-ID.
> 
> Head ID: ID of the device head, from where the event originated, which is
> valid only if head valid flag is set to 1.
> 
> Add updates for the above spec changes in the CXL events record and CXL
> common trace event implementation.
> 
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>


* Re: [PATCH 2/4] cxl/events: Add extra validity checks for corrected memory error count in General Media Event Record
  2025-07-16 10:49 ` [PATCH 2/4] cxl/events: Add extra validity checks for corrected memory error count in General Media Event Record shiju.jose
@ 2025-07-16 13:04   ` Jonathan Cameron
  2025-07-16 21:40   ` Dave Jiang
  2025-07-17  3:32   ` kernel test robot
  2 siblings, 0 replies; 15+ messages in thread
From: Jonathan Cameron @ 2025-07-16 13:04 UTC (permalink / raw)
  To: shiju.jose
  Cc: linux-cxl, dan.j.williams, dave.jiang, alison.schofield, dave,
	vishal.l.verma, ira.weiny, tanxiaofei, prime.zeng, linuxarm

On Wed, 16 Jul 2025 11:49:43 +0100
<shiju.jose@huawei.com> wrote:

> From: Shiju Jose <shiju.jose@huawei.com>
> 
> According to the CXL Specification Revision 3.2, Section 8.2.10.2.1.1,
> Table 8-57 (General Media Event Record), the Corrected Memory Error Count
> field is valid under the following conditions:
> 1. The Threshold Event bit is set in the Memory Event Descriptor field, and
> 2. The Corrected Memory Error Count must be greater than 0 for events where
>    the Advanced Programmable Threshold Counter has expired.
> 
> Additionally, if the Advanced Programmable Corrected Memory Error Counter
> Expire bit in the Memory Event Type field is set, then the Threshold Event
> bit in the Memory Event Descriptor field shall also be set.
> 
> Add validity checks for the above conditions while reporting the event to
> the userspace.
> 
> Note: CXL spec rev3.2 Table 8-57. General Media Event Record
> Field: Corrected Memory Error Count at Event) "For events in which the
> advanced programmable threshold counter has expired, this field value
> shall be a value greater than 0. Counter expiration events in which
> the corrected memory error count is 0 shall not generate a media event
> record".
> Q: Should kernel drop the event record in this case or user space
> to handle?

Personally I think this is a problem for user space. I don't mind the
additional warnings though. I'd have put the question under the ---
though, as then, if everyone is happy, there is no need to resend to drop
the question (or for Dave to remove it).

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>


> 
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> ---
>  drivers/cxl/core/mbox.c  | 9 +++++++++
>  drivers/cxl/core/trace.h | 5 ++++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 2689e6453c5a..5a30d3891b17 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -926,6 +926,15 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  			if (cxl_store_rec_gen_media((struct cxl_memdev *)cxlmd, evt))
>  				dev_dbg(&cxlmd->dev, "CXL store rec_gen_media failed\n");
>  
> +			if (evt->gen_media.media_hdr.descriptor &
> +			    CXL_GMER_EVT_DESC_THRESHOLD_EVENT)
> +				WARN_ON_ONCE((evt->gen_media.media_hdr.type &
> +					      CXL_GMER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE) &&
> +					     !evt->gen_media.cme_count);
> +			else
> +				WARN_ON_ONCE(evt->gen_media.media_hdr.type &
> +					     CXL_GMER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE);
> +
>  			trace_cxl_general_media(cxlmd, type, cxlr, hpa,
>  						hpa_alias, &evt->gen_media);
>  		} else if (event_type == CXL_CPER_EVENT_DRAM) {
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index a77487a257b3..c38f94ca0ca1 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -506,7 +506,10 @@ TRACE_EVENT(cxl_general_media,
>  			uuid_copy(&__entry->region_uuid, &uuid_null);
>  		}
>  		__entry->cme_threshold_ev_flags = rec->cme_threshold_ev_flags;
> -		__entry->cme_count = get_unaligned_le24(rec->cme_count);
> +		if (rec->media_hdr.descriptor & CXL_GMER_EVT_DESC_THRESHOLD_EVENT)
> +			__entry->cme_count = get_unaligned_le24(rec->cme_count);
> +		else
> +			__entry->cme_count = 0;
>  	),
>  
>  	CXL_EVT_TP_printk("dpa=%llx dpa_flags='%s' " \



* Re: [PATCH 3/4] cxl/events: Add extra validity checks for CVME count in DRAM Event Record
  2025-07-16 10:49 ` [PATCH 3/4] cxl/events: Add extra validity checks for CVME count in DRAM " shiju.jose
@ 2025-07-16 13:07   ` Jonathan Cameron
  2025-07-16 21:53   ` Dave Jiang
  2025-07-17  5:16   ` kernel test robot
  2 siblings, 0 replies; 15+ messages in thread
From: Jonathan Cameron @ 2025-07-16 13:07 UTC (permalink / raw)
  To: shiju.jose
  Cc: linux-cxl, dan.j.williams, dave.jiang, alison.schofield, dave,
	vishal.l.verma, ira.weiny, tanxiaofei, prime.zeng, linuxarm

On Wed, 16 Jul 2025 11:49:44 +0100
<shiju.jose@huawei.com> wrote:

> From: Shiju Jose <shiju.jose@huawei.com>
> 
> According to the CXL Specification Revision 3.2, Section 8.2.10.2.1.2,
> Table 8-58 (DRAM Event Record), the CVME (Corrected Volatile Memory Error)
> Count field is valid under the following conditions:
> 1. The Threshold Event bit is set in the Memory Event Descriptor field,
> and
> 2. The CVME Count must be greater than 0 for events where the Advanced
> Programmable Threshold Counter has expired.
> 
> Additionally, if the Advanced Programmable Corrected Memory Error Counter
> Expire bit in the Memory Event Type field is set, then the Threshold Event
> bit in the Memory Event Descriptor field shall also be set.
> 
> Add validity checks for the above conditions while reporting the event to
> the userspace.
> 
> Note: CXL spec rev3.2 Table 8-58. DRAM Event Record, (Field: CVME Count
> at Event), "For events in which the advanced programmable threshold
> counter has expired, this field value shall be a value greater than 0.
> Counter expiration events in which the corrected memory error count
> is 0 shall not generate a media event record".
> Q: Should kernel drop the event record in this case or user space
> to handle?
> 
Would be reasonable to combine this with previous patch but I don't mind
that much.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4] cxl/events: Trace Memory Sparing Event Record
  2025-07-16 10:49 ` [PATCH 4/4] cxl/events: Trace Memory Sparing " shiju.jose
@ 2025-07-16 13:16   ` Jonathan Cameron
  2025-07-16 15:07     ` Shiju Jose
  2025-07-16 22:23   ` Dave Jiang
  1 sibling, 1 reply; 15+ messages in thread
From: Jonathan Cameron @ 2025-07-16 13:16 UTC (permalink / raw)
  To: shiju.jose
  Cc: linux-cxl, dan.j.williams, dave.jiang, alison.schofield, dave,
	vishal.l.verma, ira.weiny, tanxiaofei, prime.zeng, linuxarm

On Wed, 16 Jul 2025 11:49:45 +0100
<shiju.jose@huawei.com> wrote:

> From: Shiju Jose <shiju.jose@huawei.com>
> 
> CXL rev 3.2 section 8.2.10.2.1.4 Table 8-60 defines the Memory Sparing
> Event Record.
> 
> Determine if the event read is memory sparing record and if so trace the
> record.
> 
> Memory device shall produce a memory sparing event record
> 1. After completion of a PPR maintenance operation if the memory sparing
> event record enable bit is set (Field: sPPR/hPPR Operation Mode in
> Table 8-128/Table 8-131).
> 2. In response to a query request by the host (see section 8.2.10.7.1.4)
> to determine the availability of sparing resources.
> The device shall report the resource availability by producing the Memory
> Sparing Event Record (see Table 8-60) in which the channel, rank, nibble
> mask, bank group, bank, row, column, sub-channel fields are a copy of the
> values specified in the request. If the controller does not support
> reporting whether a resource is available, and a perform maintenance
> operation for memory sparing is issued with query resources set to 1, the
> controller shall return invalid input.
> 
> Example trace log for produce memory sparing event record on completion
> of a soft PPR operation,
> cxl_memory_sparing: memdev=mem1 host=0000:0f:00.0 serial=3
> log=Informational : time=55045163029
> uuid=e71f3a40-2d29-4092-8a39-4d1c966c7c65 len=128 flags='0x1' handle=1
> related_handle=0 maint_op_class=2 maint_op_sub_class=1
> ld_id=0 head_id=0 : flags='' result=0
> validity_flags='CHANNEL|RANK|NIBBLE|BANK GROUP|BANK|ROW|COLUMN'
> spare resource avail=1 channel=2 rank=5 nibble_mask=a59c bank_group=2
> bank=4 row=13 column=23 sub_channel=0
> comp_id=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> comp_id_pldm_valid_flags='' pldm_entity_id=0x00 pldm_resource_id=0x00
> 
> Note: For memory sparing event record, fields 'maintenance operation
> class' and 'maintenance operation subclass' are defined twice, first
> in the common event record (Table 8-55) and second in the memory
> sparing event record (Table 8-60). Thus those in the sparing event
> record coded as reserved, to be removed when the spec is updated.
> 
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Only comment formatting related.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> ---
>  drivers/cxl/core/mbox.c  |   6 +++
>  drivers/cxl/core/trace.h | 100 +++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/cxlmem.h     |   8 ++++
>  include/cxl/event.h      |  33 +++++++++++++
>  4 files changed, 147 insertions(+)
> 

> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index c3cd871942c5..2c291fb1857c 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -888,6 +888,106 @@ TRACE_EVENT(cxl_memory_module,
>  	)
>  );
>  
> +#define CXL_MSER_QUERY_RESOURCE_FLAG			BIT(0)
> +#define CXL_MSER_HARD_SPARING_FLAG			BIT(1)
> +#define CXL_MSER_DEV_INITED_FLAG			BIT(2)
> +#define show_mem_sparing_flags(flags)	__print_flags(flags, "|",		\
> +	{ CXL_MSER_QUERY_RESOURCE_FLAG,		"Query Resources" },		\
> +	{ CXL_MSER_HARD_SPARING_FLAG,		"Hard Sparing" },		\
> +	{ CXL_MSER_DEV_INITED_FLAG,	"Device Initiated Sparing"	}	\

Spacing before the } is inconsistent for this last line.  Copy whatever we have
in the file already and if it is inconsistent (which it is) pick most common option.

> +)
> +
> +#define CXL_MSER_VALID_CHANNEL				BIT(0)
> +#define CXL_MSER_VALID_RANK				BIT(1)
> +#define CXL_MSER_VALID_NIBBLE				BIT(2)
> +#define CXL_MSER_VALID_BANK_GROUP			BIT(3)
> +#define CXL_MSER_VALID_BANK				BIT(4)
> +#define CXL_MSER_VALID_ROW				BIT(5)
> +#define CXL_MSER_VALID_COLUMN				BIT(6)
> +#define CXL_MSER_VALID_COMPONENT_ID			BIT(7)
> +#define CXL_MSER_VALID_COMPONENT_ID_FORMAT		BIT(8)
> +#define CXL_MSER_VALID_SUB_CHANNEL			BIT(9)
> +#define show_mem_sparing_valid_flags(flags)	__print_flags(flags, "|",		\
> +	{ CXL_MSER_VALID_CHANNEL,			"CHANNEL"		},	\
> +	{ CXL_MSER_VALID_RANK,				"RANK"			},	\
> +	{ CXL_MSER_VALID_NIBBLE,			"NIBBLE"		},	\
> +	{ CXL_MSER_VALID_BANK_GROUP,			"BANK GROUP"		},	\
> +	{ CXL_MSER_VALID_BANK,				"BANK"			},	\
> +	{ CXL_MSER_VALID_ROW,				"ROW"			},	\
> +	{ CXL_MSER_VALID_COLUMN,			"COLUMN"		},	\
> +	{ CXL_MSER_VALID_COMPONENT_ID,			"COMPONENT ID"		},	\
> +	{ CXL_MSER_VALID_COMPONENT_ID_FORMAT,		"COMPONENT ID PLDM FORMAT" },	\
> +	{ CXL_MSER_VALID_SUB_CHANNEL,			"SUB CHANNEL"		}	\
> +)
> +
> +TRACE_EVENT(cxl_memory_sparing,
> +
> +	TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
> +		 struct cxl_event_mem_sparing *rec),
> +
> +	TP_ARGS(cxlmd, log, rec),
> +
> +	TP_STRUCT__entry(
> +		CXL_EVT_TP_entry
> +
> +		/* Memory Sparing Event */
> +		__field(u8, flags)
> +		__field(u8, result)
> +		__field(u16, validity_flags)
> +		__field(u16, res_avail)
> +		__field(u8, channel)
> +		__field(u8, rank)
> +		__field(u32, nibble_mask)
> +		__field(u8, bank_group)
> +		__field(u8, bank)
> +		__field(u32, row)
> +		__field(u16, column)
> +		__field(u8, sub_channel)
> +		__array(u8, comp_id, CXL_EVENT_GEN_MED_COMP_ID_SIZE)
> +	),
> +
> +	TP_fast_assign(
> +		CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
> +		__entry->hdr_uuid = CXL_EVENT_MEM_SPARING_UUID;
> +
> +		/* Memory Sparing Event */
> +		__entry->flags = rec->flags;
> +		__entry->result = rec->result;
> +		__entry->validity_flags = le16_to_cpu(rec->validity_flags);
> +		__entry->res_avail = le16_to_cpu(rec->res_avail);
> +		__entry->channel = rec->channel;
> +		__entry->rank = rec->rank;
> +		__entry->nibble_mask = get_unaligned_le24(rec->nibble_mask);
> +		__entry->bank_group = rec->bank_group;
> +		__entry->bank = rec->bank;
> +		__entry->row = get_unaligned_le24(rec->row);
> +		__entry->column = le16_to_cpu(rec->column);
> +		__entry->sub_channel = rec->sub_channel;
> +		memcpy(__entry->comp_id, &rec->component_id,
> +		       CXL_EVENT_GEN_MED_COMP_ID_SIZE);
> +	),
> +
> +	CXL_EVT_TP_printk("flags='%s' result=%u validity_flags='%s' " \
> +		"spare resource avail=%u channel=%u rank=%u " \
> +		"nibble_mask=%x bank_group=%u bank=%u " \
> +		"row=%u column=%u sub_channel=%u " \
> +		"comp_id=%s comp_id_pldm_valid_flags='%s' " \
> +		"pldm_entity_id=%s pldm_resource_id=%s",
> +		show_mem_sparing_flags(__entry->flags),
> +		__entry->result,
> +		show_mem_sparing_valid_flags(__entry->validity_flags),
> +		__entry->res_avail, __entry->channel, __entry->rank,
> +		__entry->nibble_mask, __entry->bank_group, __entry->bank,
> +		__entry->row, __entry->column, __entry->sub_channel,
> +		__print_hex(__entry->comp_id, CXL_EVENT_GEN_MED_COMP_ID_SIZE),
> +		show_comp_id_pldm_flags(__entry->comp_id[0]),
> +		show_pldm_entity_id(__entry->validity_flags, CXL_MSER_VALID_COMPONENT_ID,
> +				    CXL_MSER_VALID_COMPONENT_ID_FORMAT, __entry->comp_id),
> +		show_pldm_resource_id(__entry->validity_flags, CXL_MSER_VALID_COMPONENT_ID,
> +				      CXL_MSER_VALID_COMPONENT_ID_FORMAT, __entry->comp_id)
> +	)
> +);
> +
>  #define show_poison_trace_type(type)			\
>  	__print_symbolic(type,				\
>  	{ CXL_POISON_TRACE_LIST,	"List"   },	\
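
For readers unfamiliar with how the validity_flags string in the example trace is produced, here is a minimal userspace sketch, assuming only the CXL_MSER_VALID_* bit positions shown in the hunk above. The helper mser_valid_flags_str() is a hypothetical stand-in for the kernel's __print_flags(), not the kernel implementation; unlike the real helper, it silently ignores bits that have no name.

```c
#include <stdint.h>
#include <stdio.h>

/* Bit positions copied from the CXL_MSER_VALID_* defines quoted above. */
struct flag_name {
	uint16_t mask;
	const char *name;
};

static const struct flag_name mser_valid_names[] = {
	{ 1u << 0, "CHANNEL" },
	{ 1u << 1, "RANK" },
	{ 1u << 2, "NIBBLE" },
	{ 1u << 3, "BANK GROUP" },
	{ 1u << 4, "BANK" },
	{ 1u << 5, "ROW" },
	{ 1u << 6, "COLUMN" },
	{ 1u << 7, "COMPONENT ID" },
	{ 1u << 8, "COMPONENT ID PLDM FORMAT" },
	{ 1u << 9, "SUB CHANNEL" },
};

/*
 * Hypothetical stand-in for __print_flags(): join the names of all set
 * bits with '|'. The output stops early if buf is too small.
 */
static const char *mser_valid_flags_str(uint16_t flags, char *buf, size_t len)
{
	size_t i, used = 0;

	if (len == 0)
		return buf;
	buf[0] = '\0';
	for (i = 0; i < sizeof(mser_valid_names) / sizeof(mser_valid_names[0]); i++) {
		int n;

		if (!(flags & mser_valid_names[i].mask))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", mser_valid_names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* no room left for further names */
		used += (size_t)n;
	}
	return buf;
}
```

With bits 0 to 6 set (0x7f), this yields the same string as the validity_flags field in the example trace log.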



^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH 4/4] cxl/events: Trace Memory Sparing Event Record
  2025-07-16 13:16   ` Jonathan Cameron
@ 2025-07-16 15:07     ` Shiju Jose
  0 siblings, 0 replies; 15+ messages in thread
From: Shiju Jose @ 2025-07-16 15:07 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl@vger.kernel.org, dan.j.williams@intel.com,
	dave.jiang@intel.com, alison.schofield@intel.com,
	dave@stgolabs.net, vishal.l.verma@intel.com, ira.weiny@intel.com,
	tanxiaofei, Zengtao (B), Linuxarm

>-----Original Message-----
>From: Jonathan Cameron <jonathan.cameron@huawei.com>
>Sent: 16 July 2025 14:17
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-cxl@vger.kernel.org; dan.j.williams@intel.com; dave.jiang@intel.com;
>alison.schofield@intel.com; dave@stgolabs.net; vishal.l.verma@intel.com;
>ira.weiny@intel.com; tanxiaofei <tanxiaofei@huawei.com>; Zengtao (B)
><prime.zeng@hisilicon.com>; Linuxarm <linuxarm@huawei.com>
>Subject: Re: [PATCH 4/4] cxl/events: Trace Memory Sparing Event Record
>
>On Wed, 16 Jul 2025 11:49:45 +0100
><shiju.jose@huawei.com> wrote:
>
>> From: Shiju Jose <shiju.jose@huawei.com>
>>
>> CXL rev 3.2 section 8.2.10.2.1.4 Table 8-60 defines the Memory Sparing
>> Event Record.
>>
>> Determine if the event read is memory sparing record and if so trace
>> the record.
>>
>> Memory device shall produce a memory sparing event record
>> 1. After completion of a PPR maintenance operation if the memory sparing
>> event record enable bit is set (Field: sPPR/hPPR Operation Mode in
>> Table 8-128/Table 8-131).
>> 2. In response to a query request by the host (see section
>> 8.2.10.7.1.4) to determine the availability of sparing resources.
>> The device shall report the resource availability by producing the
>> Memory Sparing Event Record (see Table 8-60) in which the channel,
>> rank, nibble mask, bank group, bank, row, column, sub-channel fields
>> are a copy of the values specified in the request. If the controller
>> does not support reporting whether a resource is available, and a
>> perform maintenance operation for memory sparing is issued with query
>> resources set to 1, the controller shall return invalid input.
>>
>> Example trace log for produce memory sparing event record on
>> completion of a soft PPR operation,
>> cxl_memory_sparing: memdev=mem1 host=0000:0f:00.0 serial=3
>> log=Informational : time=55045163029
>> uuid=e71f3a40-2d29-4092-8a39-4d1c966c7c65 len=128 flags='0x1' handle=1
>> related_handle=0 maint_op_class=2 maint_op_sub_class=1
>> ld_id=0 head_id=0 : flags='' result=0
>> validity_flags='CHANNEL|RANK|NIBBLE|BANK GROUP|BANK|ROW|COLUMN'
>> spare resource avail=1 channel=2 rank=5 nibble_mask=a59c bank_group=2
>> bank=4 row=13 column=23 sub_channel=0
>> comp_id=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> comp_id_pldm_valid_flags='' pldm_entity_id=0x00 pldm_resource_id=0x00
>>
>> Note: For memory sparing event record, fields 'maintenance operation
>> class' and 'maintenance operation subclass' are defined twice, first
>> in the common event record (Table 8-55) and second in the memory
>> sparing event record (Table 8-60). Thus those in the sparing event
>> record coded as reserved, to be removed when the spec is updated.
>>
>> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
>Only comment formatting related.
>
>Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>
>> ---
>>  drivers/cxl/core/mbox.c  |   6 +++
>>  drivers/cxl/core/trace.h | 100 +++++++++++++++++++++++++++++++++++++++
>>  drivers/cxl/cxlmem.h     |   8 ++++
>>  include/cxl/event.h      |  33 +++++++++++++
>>  4 files changed, 147 insertions(+)
>>
>
>> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
>> index c3cd871942c5..2c291fb1857c 100644
>> --- a/drivers/cxl/core/trace.h
>> +++ b/drivers/cxl/core/trace.h
>> @@ -888,6 +888,106 @@ TRACE_EVENT(cxl_memory_module,
>>  	)
>>  );
>>
>> +#define CXL_MSER_QUERY_RESOURCE_FLAG			BIT(0)
>> +#define CXL_MSER_HARD_SPARING_FLAG			BIT(1)
>> +#define CXL_MSER_DEV_INITED_FLAG			BIT(2)
>> +#define show_mem_sparing_flags(flags)	__print_flags(flags, "|",		\
>> +	{ CXL_MSER_QUERY_RESOURCE_FLAG,		"Query Resources" },		\
>> +	{ CXL_MSER_HARD_SPARING_FLAG,		"Hard Sparing" },		\
>> +	{ CXL_MSER_DEV_INITED_FLAG,	"Device Initiated Sparing"	}	\
>
>Spacing before the } is inconsistent for this last line.  Copy whatever we have in
>the file already and if it is inconsistent (which it is) pick most common option.

Thanks Jonathan.
I will correct this in v2.
>
>> +)
>

Thanks
Shiju

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/4] cxl/events: Add extra validity checks for corrected memory error count in General Media Event Record
  2025-07-16 10:49 ` [PATCH 2/4] cxl/events: Add extra validity checks for corrected memory error count in General Media Event Record shiju.jose
  2025-07-16 13:04   ` Jonathan Cameron
@ 2025-07-16 21:40   ` Dave Jiang
  2025-07-17  3:32   ` kernel test robot
  2 siblings, 0 replies; 15+ messages in thread
From: Dave Jiang @ 2025-07-16 21:40 UTC (permalink / raw)
  To: shiju.jose, linux-cxl, dan.j.williams, jonathan.cameron,
	alison.schofield, dave, vishal.l.verma, ira.weiny
  Cc: tanxiaofei, prime.zeng, linuxarm



On 7/16/25 3:49 AM, shiju.jose@huawei.com wrote:
> From: Shiju Jose <shiju.jose@huawei.com>
> 
> According to the CXL Specification Revision 3.2, Section 8.2.10.2.1.1,
> Table 8-57 (General Media Event Record), the Corrected Memory Error Count
> field is valid under the following conditions:
> 1. The Threshold Event bit is set in the Memory Event Descriptor field, and
> 2. The Corrected Memory Error Count must be greater than 0 for events where
>    the Advanced Programmable Threshold Counter has expired.
> 
> Additionally, if the Advanced Programmable Corrected Memory Error Counter
> Expire bit in the Memory Event Type field is set, then the Threshold Event
> bit in the Memory Event Descriptor field shall also be set.
> 
> Add validity checks for the above conditions while reporting the event to
> the userspace.
> 
> Note: CXL spec rev3.2 Table 8-57. General Media Event Record
> Field: Corrected Memory Error Count at Event) "For events in which the
> advanced programmable threshold counter has expired, this field value
> shall be a value greater than 0. Counter expiration events in which
> the corrected memory error count is 0 shall not generate a media event
> record".
> Q: Should kernel drop the event record in this case or user space
> to handle?
> 
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>

As Jonathan mentioned, the Q doesn't belong in the commit log.

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/cxl/core/mbox.c  | 9 +++++++++
>  drivers/cxl/core/trace.h | 5 ++++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 2689e6453c5a..5a30d3891b17 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -926,6 +926,15 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  			if (cxl_store_rec_gen_media((struct cxl_memdev *)cxlmd, evt))
>  				dev_dbg(&cxlmd->dev, "CXL store rec_gen_media failed\n");
>  
> +			if (evt->gen_media.media_hdr.descriptor &
> +			    CXL_GMER_EVT_DESC_THRESHOLD_EVENT)
> +				WARN_ON_ONCE((evt->gen_media.media_hdr.type &
> +					      CXL_GMER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE) &&
> +					     !evt->gen_media.cme_count);
> +			else
> +				WARN_ON_ONCE(evt->gen_media.media_hdr.type &
> +					     CXL_GMER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE);
> +
>  			trace_cxl_general_media(cxlmd, type, cxlr, hpa,
>  						hpa_alias, &evt->gen_media);
>  		} else if (event_type == CXL_CPER_EVENT_DRAM) {
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index a77487a257b3..c38f94ca0ca1 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -506,7 +506,10 @@ TRACE_EVENT(cxl_general_media,
>  			uuid_copy(&__entry->region_uuid, &uuid_null);
>  		}
>  		__entry->cme_threshold_ev_flags = rec->cme_threshold_ev_flags;
> -		__entry->cme_count = get_unaligned_le24(rec->cme_count);
> +		if (rec->media_hdr.descriptor & CXL_GMER_EVT_DESC_THRESHOLD_EVENT)
> +			__entry->cme_count = get_unaligned_le24(rec->cme_count);
> +		else
> +			__entry->cme_count = 0;
>  	),
>  
>  	CXL_EVT_TP_printk("dpa=%llx dpa_flags='%s' " \
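
The rules from the commit message, and the two hunks above, can be condensed into a small userspace sketch. It assumes the caller has already extracted the Threshold Event bit and the counter-expire condition as booleans, so no descriptor or type constants are guessed here. The names le24_to_u32(), reported_cme_count() and cme_record_inconsistent() are hypothetical and do not exist in the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's get_unaligned_le24(). */
static uint32_t le24_to_u32(const uint8_t p[3])
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
}

/* Count to report: the field is only meaningful for threshold events. */
static uint32_t reported_cme_count(bool threshold_event, const uint8_t raw[3])
{
	return threshold_event ? le24_to_u32(raw) : 0;
}

/*
 * True when the record breaks one of the rules quoted in the commit
 * message: a counter-expire event must have the Threshold Event bit set
 * and must carry a non-zero count.
 */
static bool cme_record_inconsistent(bool threshold_event, bool counter_expired,
				    const uint8_t raw[3])
{
	if (!counter_expired)
		return false;
	return !threshold_event || le24_to_u32(raw) == 0;
}
```

The first helper mirrors the trace.h hunk and the second mirrors the two WARN_ON_ONCE() branches in the mbox.c hunk. Note that the count is decoded before it is compared with zero.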


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/4] cxl/events: Add extra validity checks for CVME count in DRAM Event Record
  2025-07-16 10:49 ` [PATCH 3/4] cxl/events: Add extra validity checks for CVME count in DRAM " shiju.jose
  2025-07-16 13:07   ` Jonathan Cameron
@ 2025-07-16 21:53   ` Dave Jiang
  2025-07-17  5:16   ` kernel test robot
  2 siblings, 0 replies; 15+ messages in thread
From: Dave Jiang @ 2025-07-16 21:53 UTC (permalink / raw)
  To: shiju.jose, linux-cxl, dan.j.williams, jonathan.cameron,
	alison.schofield, dave, vishal.l.verma, ira.weiny
  Cc: tanxiaofei, prime.zeng, linuxarm



On 7/16/25 3:49 AM, shiju.jose@huawei.com wrote:
> From: Shiju Jose <shiju.jose@huawei.com>
> 
> According to the CXL Specification Revision 3.2, Section 8.2.10.2.1.2,
> Table 8-58 (DRAM Event Record), the CVME (Corrected Volatile Memory Error)
> Count field is valid under the following conditions:
> 1. The Threshold Event bit is set in the Memory Event Descriptor field,
> and
> 2. The CVME Count must be greater than 0 for events where the Advanced
> Programmable Threshold Counter has expired.
> 
> Additionally, if the Advanced Programmable Corrected Memory Error Counter
> Expire bit in the Memory Event Type field is set, then the Threshold Event
> bit in the Memory Event Descriptor field shall also be set.
> 
> Add validity checks for the above conditions while reporting the event to
> the userspace.
> 
> Note: CXL spec rev3.2 Table 8-58. DRAM Event Record, (Field: CVME Count
> at Event), "For events in which the advanced programmable threshold
> counter has expired, this field value shall be a value greater than 0.
> Counter expiration events in which the corrected memory error count
> is 0 shall not generate a media event record".
> Q: Should kernel drop the event record in this case or user space
> to handle?
> 
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>

The Q should be after '---' or dropped from the commit log.

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/cxl/core/mbox.c  | 9 +++++++++
>  drivers/cxl/core/trace.h | 4 ++++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 5a30d3891b17..e8c49ae28828 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -941,6 +941,15 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  			if (cxl_store_rec_dram((struct cxl_memdev *)cxlmd, evt))
>  				dev_dbg(&cxlmd->dev, "CXL store rec_dram failed\n");
>  
> +			if (evt->dram.media_hdr.descriptor &
> +			    CXL_GMER_EVT_DESC_THRESHOLD_EVENT)
> +				WARN_ON_ONCE((evt->dram.media_hdr.type &
> +					      CXL_DER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE) &&
> +					     !evt->dram.cvme_count);
> +			else
> +				WARN_ON_ONCE(evt->dram.media_hdr.type &
> +					     CXL_DER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE);
> +
>  			trace_cxl_dram(cxlmd, type, cxlr, hpa, hpa_alias,
>  				       &evt->dram);
>  		}
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index c38f94ca0ca1..c3cd871942c5 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -662,6 +662,10 @@ TRACE_EVENT(cxl_dram,
>  		__entry->sub_channel = rec->sub_channel;
>  		__entry->cme_threshold_ev_flags = rec->cme_threshold_ev_flags;
>  		__entry->cvme_count = get_unaligned_le24(rec->cvme_count);
> +		if (rec->media_hdr.descriptor & CXL_GMER_EVT_DESC_THRESHOLD_EVENT)
> +			__entry->cvme_count = get_unaligned_le24(rec->cvme_count);
> +		else
> +			__entry->cvme_count = 0;
>  	),
>  
>  	CXL_EVT_TP_printk("dpa=%llx dpa_flags='%s' descriptor='%s' type='%s' sub_type='%s' " \


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4] cxl/events: Trace Memory Sparing Event Record
  2025-07-16 10:49 ` [PATCH 4/4] cxl/events: Trace Memory Sparing " shiju.jose
  2025-07-16 13:16   ` Jonathan Cameron
@ 2025-07-16 22:23   ` Dave Jiang
  1 sibling, 0 replies; 15+ messages in thread
From: Dave Jiang @ 2025-07-16 22:23 UTC (permalink / raw)
  To: shiju.jose, linux-cxl, dan.j.williams, jonathan.cameron,
	alison.schofield, dave, vishal.l.verma, ira.weiny
  Cc: tanxiaofei, prime.zeng, linuxarm



On 7/16/25 3:49 AM, shiju.jose@huawei.com wrote:
> From: Shiju Jose <shiju.jose@huawei.com>
> 
> CXL rev 3.2 section 8.2.10.2.1.4 Table 8-60 defines the Memory Sparing
> Event Record.
> 
> Determine if the event read is memory sparing record and if so trace the
> record.
> 
> Memory device shall produce a memory sparing event record
> 1. After completion of a PPR maintenance operation if the memory sparing
> event record enable bit is set (Field: sPPR/hPPR Operation Mode in
> Table 8-128/Table 8-131).
> 2. In response to a query request by the host (see section 8.2.10.7.1.4)
> to determine the availability of sparing resources.
> The device shall report the resource availability by producing the Memory
> Sparing Event Record (see Table 8-60) in which the channel, rank, nibble
> mask, bank group, bank, row, column, sub-channel fields are a copy of the
> values specified in the request. If the controller does not support
> reporting whether a resource is available, and a perform maintenance
> operation for memory sparing is issued with query resources set to 1, the
> controller shall return invalid input.
> 
> Example trace log for produce memory sparing event record on completion
> of a soft PPR operation,
> cxl_memory_sparing: memdev=mem1 host=0000:0f:00.0 serial=3
> log=Informational : time=55045163029
> uuid=e71f3a40-2d29-4092-8a39-4d1c966c7c65 len=128 flags='0x1' handle=1
> related_handle=0 maint_op_class=2 maint_op_sub_class=1
> ld_id=0 head_id=0 : flags='' result=0
> validity_flags='CHANNEL|RANK|NIBBLE|BANK GROUP|BANK|ROW|COLUMN'
> spare resource avail=1 channel=2 rank=5 nibble_mask=a59c bank_group=2
> bank=4 row=13 column=23 sub_channel=0
> comp_id=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> comp_id_pldm_valid_flags='' pldm_entity_id=0x00 pldm_resource_id=0x00
> 
> Note: For memory sparing event record, fields 'maintenance operation
> class' and 'maintenance operation subclass' are defined twice, first
> in the common event record (Table 8-55) and second in the memory
> sparing event record (Table 8-60). Thus those in the sparing event
> record coded as reserved, to be removed when the spec is updated.
> 
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/cxl/core/mbox.c  |   6 +++
>  drivers/cxl/core/trace.h | 100 +++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/cxlmem.h     |   8 ++++
>  include/cxl/event.h      |  33 +++++++++++++
>  4 files changed, 147 insertions(+)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index e8c49ae28828..682993ace4ae 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -899,6 +899,10 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  		trace_cxl_generic_event(cxlmd, type, uuid, &evt->generic);
>  		return;
>  	}
> +	if (event_type == CXL_CPER_EVENT_MEM_SPARING) {
> +		trace_cxl_memory_sparing(cxlmd, type, &evt->mem_sparing);
> +		return;
> +	}
>  
>  	if (trace_cxl_general_media_enabled() || trace_cxl_dram_enabled()) {
>  		u64 dpa, hpa = ULLONG_MAX, hpa_alias = ULLONG_MAX;
> @@ -970,6 +974,8 @@ static void __cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  		ev_type = CXL_CPER_EVENT_DRAM;
>  	else if (uuid_equal(uuid, &CXL_EVENT_MEM_MODULE_UUID))
>  		ev_type = CXL_CPER_EVENT_MEM_MODULE;
> +	else if (uuid_equal(uuid, &CXL_EVENT_MEM_SPARING_UUID))
> +		ev_type = CXL_CPER_EVENT_MEM_SPARING;
>  
>  	cxl_event_trace_record(cxlmd, type, ev_type, uuid, &record->event);
>  }
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index c3cd871942c5..2c291fb1857c 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -888,6 +888,106 @@ TRACE_EVENT(cxl_memory_module,
>  	)
>  );
>  
> +#define CXL_MSER_QUERY_RESOURCE_FLAG			BIT(0)
> +#define CXL_MSER_HARD_SPARING_FLAG			BIT(1)
> +#define CXL_MSER_DEV_INITED_FLAG			BIT(2)
> +#define show_mem_sparing_flags(flags)	__print_flags(flags, "|",		\
> +	{ CXL_MSER_QUERY_RESOURCE_FLAG,		"Query Resources" },		\
> +	{ CXL_MSER_HARD_SPARING_FLAG,		"Hard Sparing" },		\
> +	{ CXL_MSER_DEV_INITED_FLAG,	"Device Initiated Sparing"	}	\
> +)
> +
> +#define CXL_MSER_VALID_CHANNEL				BIT(0)
> +#define CXL_MSER_VALID_RANK				BIT(1)
> +#define CXL_MSER_VALID_NIBBLE				BIT(2)
> +#define CXL_MSER_VALID_BANK_GROUP			BIT(3)
> +#define CXL_MSER_VALID_BANK				BIT(4)
> +#define CXL_MSER_VALID_ROW				BIT(5)
> +#define CXL_MSER_VALID_COLUMN				BIT(6)
> +#define CXL_MSER_VALID_COMPONENT_ID			BIT(7)
> +#define CXL_MSER_VALID_COMPONENT_ID_FORMAT		BIT(8)
> +#define CXL_MSER_VALID_SUB_CHANNEL			BIT(9)
> +#define show_mem_sparing_valid_flags(flags)	__print_flags(flags, "|",		\
> +	{ CXL_MSER_VALID_CHANNEL,			"CHANNEL"		},	\
> +	{ CXL_MSER_VALID_RANK,				"RANK"			},	\
> +	{ CXL_MSER_VALID_NIBBLE,			"NIBBLE"		},	\
> +	{ CXL_MSER_VALID_BANK_GROUP,			"BANK GROUP"		},	\
> +	{ CXL_MSER_VALID_BANK,				"BANK"			},	\
> +	{ CXL_MSER_VALID_ROW,				"ROW"			},	\
> +	{ CXL_MSER_VALID_COLUMN,			"COLUMN"		},	\
> +	{ CXL_MSER_VALID_COMPONENT_ID,			"COMPONENT ID"		},	\
> +	{ CXL_MSER_VALID_COMPONENT_ID_FORMAT,		"COMPONENT ID PLDM FORMAT" },	\
> +	{ CXL_MSER_VALID_SUB_CHANNEL,			"SUB CHANNEL"		}	\
> +)
> +
> +TRACE_EVENT(cxl_memory_sparing,
> +
> +	TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
> +		 struct cxl_event_mem_sparing *rec),
> +
> +	TP_ARGS(cxlmd, log, rec),
> +
> +	TP_STRUCT__entry(
> +		CXL_EVT_TP_entry
> +
> +		/* Memory Sparing Event */
> +		__field(u8, flags)
> +		__field(u8, result)
> +		__field(u16, validity_flags)
> +		__field(u16, res_avail)
> +		__field(u8, channel)
> +		__field(u8, rank)
> +		__field(u32, nibble_mask)
> +		__field(u8, bank_group)
> +		__field(u8, bank)
> +		__field(u32, row)
> +		__field(u16, column)
> +		__field(u8, sub_channel)
> +		__array(u8, comp_id, CXL_EVENT_GEN_MED_COMP_ID_SIZE)
> +	),
> +
> +	TP_fast_assign(
> +		CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
> +		__entry->hdr_uuid = CXL_EVENT_MEM_SPARING_UUID;
> +
> +		/* Memory Sparing Event */
> +		__entry->flags = rec->flags;
> +		__entry->result = rec->result;
> +		__entry->validity_flags = le16_to_cpu(rec->validity_flags);
> +		__entry->res_avail = le16_to_cpu(rec->res_avail);
> +		__entry->channel = rec->channel;
> +		__entry->rank = rec->rank;
> +		__entry->nibble_mask = get_unaligned_le24(rec->nibble_mask);
> +		__entry->bank_group = rec->bank_group;
> +		__entry->bank = rec->bank;
> +		__entry->row = get_unaligned_le24(rec->row);
> +		__entry->column = le16_to_cpu(rec->column);
> +		__entry->sub_channel = rec->sub_channel;
> +		memcpy(__entry->comp_id, &rec->component_id,
> +		       CXL_EVENT_GEN_MED_COMP_ID_SIZE);
> +	),
> +
> +	CXL_EVT_TP_printk("flags='%s' result=%u validity_flags='%s' " \
> +		"spare resource avail=%u channel=%u rank=%u " \
> +		"nibble_mask=%x bank_group=%u bank=%u " \
> +		"row=%u column=%u sub_channel=%u " \
> +		"comp_id=%s comp_id_pldm_valid_flags='%s' " \
> +		"pldm_entity_id=%s pldm_resource_id=%s",
> +		show_mem_sparing_flags(__entry->flags),
> +		__entry->result,
> +		show_mem_sparing_valid_flags(__entry->validity_flags),
> +		__entry->res_avail, __entry->channel, __entry->rank,
> +		__entry->nibble_mask, __entry->bank_group, __entry->bank,
> +		__entry->row, __entry->column, __entry->sub_channel,
> +		__print_hex(__entry->comp_id, CXL_EVENT_GEN_MED_COMP_ID_SIZE),
> +		show_comp_id_pldm_flags(__entry->comp_id[0]),
> +		show_pldm_entity_id(__entry->validity_flags, CXL_MSER_VALID_COMPONENT_ID,
> +				    CXL_MSER_VALID_COMPONENT_ID_FORMAT, __entry->comp_id),
> +		show_pldm_resource_id(__entry->validity_flags, CXL_MSER_VALID_COMPONENT_ID,
> +				      CXL_MSER_VALID_COMPONENT_ID_FORMAT, __entry->comp_id)
> +	)
> +);
> +
>  #define show_poison_trace_type(type)			\
>  	__print_symbolic(type,				\
>  	{ CXL_POISON_TRACE_LIST,	"List"   },	\
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 551b0ba2caa1..f98311f357b7 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -633,6 +633,14 @@ struct cxl_mbox_identify {
>  	UUID_INIT(0xfe927475, 0xdd59, 0x4339, 0xa5, 0x86, 0x79, 0xba, 0xb1, \
>  		  0x13, 0xb7, 0x74)
>  
> +/*
> + * Memory Sparing Event Record UUID
> + * CXL rev 3.2 section 8.2.10.2.1.4: Table 8-60
> + */
> +#define CXL_EVENT_MEM_SPARING_UUID                                          \
> +	UUID_INIT(0xe71f3a40, 0x2d29, 0x4092, 0x8a, 0x39, 0x4d, 0x1c, 0x96, \
> +		  0x6c, 0x7c, 0x65)
> +
>  /*
>   * Get Event Records output payload
>   * CXL rev 3.0 section 8.2.9.2.2; Table 8-50
> diff --git a/include/cxl/event.h b/include/cxl/event.h
> index f4cb8568566b..6fd90f9cc203 100644
> --- a/include/cxl/event.h
> +++ b/include/cxl/event.h
> @@ -110,11 +110,43 @@ struct cxl_event_mem_module {
>  	u8 reserved[0x2a];
>  } __packed;
>  
> +/*
> + * Memory Sparing Event Record - MSER
> + * CXL rev 3.2 section 8.2.10.2.1.4; Table 8-60
> + */
> +struct cxl_event_mem_sparing {
> +	struct cxl_event_record_hdr hdr;
> +	/*
> +	 * The fields maintenance operation class and maintenance operation
> +	 * subclass defined in the Memory Sparing Event Record are the
> +	 * duplication of the same in the common event record. Thus defined
> +	 * as reserved and to be removed after the spec correction.
> +	 */
> +	u8 rsv1;
> +	u8 rsv2;
> +	u8 flags;
> +	u8 result;
> +	__le16 validity_flags;
> +	u8 reserved1[6];
> +	__le16 res_avail;
> +	u8 channel;
> +	u8 rank;
> +	u8 nibble_mask[3];
> +	u8 bank_group;
> +	u8 bank;
> +	u8 row[3];
> +	__le16 column;
> +	u8 component_id[CXL_EVENT_GEN_MED_COMP_ID_SIZE];
> +	u8 sub_channel;
> +	u8 reserved2[0x25];
> +} __packed;
> +
>  union cxl_event {
>  	struct cxl_event_generic generic;
>  	struct cxl_event_gen_media gen_media;
>  	struct cxl_event_dram dram;
>  	struct cxl_event_mem_module mem_module;
> +	struct cxl_event_mem_sparing mem_sparing;
>  	/* dram & gen_media event header */
>  	struct cxl_event_media_hdr media_hdr;
>  } __packed;
> @@ -133,6 +165,7 @@ enum cxl_event_type {
>  	CXL_CPER_EVENT_GEN_MEDIA,
>  	CXL_CPER_EVENT_DRAM,
>  	CXL_CPER_EVENT_MEM_MODULE,
> +	CXL_CPER_EVENT_MEM_SPARING,
>  };
>  
>  #define CPER_CXL_DEVICE_ID_VALID		BIT(0)
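
As a cross-check of the new struct layout, the following userspace sketch mirrors the fields that follow hdr in struct cxl_event_mem_sparing and verifies their packed size at compile time. Two values are assumptions not shown in this patch: that CXL_EVENT_GEN_MED_COMP_ID_SIZE is 0x10 (consistent with the 16 comp_id bytes in the example trace) and that the payload after the common header is 0x50 bytes. All demo_/DEMO_ names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_COMP_ID_SIZE	0x10	/* assumed CXL_EVENT_GEN_MED_COMP_ID_SIZE */
#define DEMO_RECORD_DATA_LEN	0x50	/* assumed payload size after the header */

/* Userspace mirror of the fields after 'hdr' in the struct quoted above. */
struct demo_mem_sparing_body {
	uint8_t rsv1;
	uint8_t rsv2;
	uint8_t flags;
	uint8_t result;
	uint16_t validity_flags;	/* __le16 in the kernel struct */
	uint8_t reserved1[6];
	uint16_t res_avail;		/* __le16 */
	uint8_t channel;
	uint8_t rank;
	uint8_t nibble_mask[3];
	uint8_t bank_group;
	uint8_t bank;
	uint8_t row[3];
	uint16_t column;		/* __le16 */
	uint8_t component_id[DEMO_COMP_ID_SIZE];
	uint8_t sub_channel;
	uint8_t reserved2[0x25];
} __attribute__((packed));

/* The field sizes add up to 80 bytes, matching the assumed payload size. */
_Static_assert(sizeof(struct demo_mem_sparing_body) == DEMO_RECORD_DATA_LEN,
	       "sparing payload must fill the event data area");
/* 'row' starts at an odd offset, hence the unaligned 24-bit read. */
_Static_assert(offsetof(struct demo_mem_sparing_body, row) == 21,
	       "unexpected offset for row");
```

If either assumed constant differs in the tree the patch is applied to, the static assertions fail at build time rather than producing a silently misparsed record.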


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/4] cxl/events: Add extra validity checks for corrected memory error count in General Media Event Record
  2025-07-16 10:49 ` [PATCH 2/4] cxl/events: Add extra validity checks for corrected memory error count in General Media Event Record shiju.jose
  2025-07-16 13:04   ` Jonathan Cameron
  2025-07-16 21:40   ` Dave Jiang
@ 2025-07-17  3:32   ` kernel test robot
  2 siblings, 0 replies; 15+ messages in thread
From: kernel test robot @ 2025-07-17  3:32 UTC (permalink / raw)
  To: shiju.jose, linux-cxl, dan.j.williams, dave.jiang,
	jonathan.cameron, alison.schofield, dave, vishal.l.verma,
	ira.weiny
  Cc: llvm, oe-kbuild-all, tanxiaofei, prime.zeng, linuxarm, shiju.jose

Hi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on cxl/next]
[also build test WARNING on linus/master v6.16-rc6 next-20250716]
[cannot apply to cxl/pending]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/shiju-jose-huawei-com/cxl-events-Update-Common-Event-Record-to-CXL-spec-rev-3-2/20250716-192336
base:   https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git next
patch link:    https://lore.kernel.org/r/20250716104945.2002-3-shiju.jose%40huawei.com
patch subject: [PATCH 2/4] cxl/events: Add extra validity checks for corrected memory error count in General Media Event Record
config: arm-randconfig-004-20250717 (https://download.01.org/0day-ci/archive/20250717/202507171153.p2RrAdN4-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project 16534d19bf50bde879a83f0ae62875e2c5120e64)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250717/202507171153.p2RrAdN4-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507171153.p2RrAdN4-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/cxl/core/mbox.c:933:27: warning: address of array 'evt->gen_media.cme_count' will always evaluate to 'true' [-Wpointer-bool-conversion]
     933 |                                              !evt->gen_media.cme_count);
         |                                              ~~~~~~~~~~~~~~~~^~~~~~~~~
   include/asm-generic/bug.h:148:18: note: expanded from macro 'WARN_ON_ONCE'
     148 |         DO_ONCE_LITE_IF(condition, WARN_ON, 1)
         |                         ^~~~~~~~~
   include/linux/once_lite.h:28:27: note: expanded from macro 'DO_ONCE_LITE_IF'
      28 |                 bool __ret_do_once = !!(condition);                     \
         |                                         ^~~~~~~~~
   1 warning generated.
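
The diagnostic above fires because cme_count is an array member, so
`!evt->gen_media.cme_count` negates the array's address rather than its
contents. That address is never NULL, so the negated test is always false
and the intended "count is zero" check never triggers. A minimal userspace
sketch of a content-based check follows. The names `demo_gen_media`,
`count_is_zero` and `cme_count_missing` are hypothetical, and the 3-byte
field size is an assumption; this is not the kernel code or the fix chosen
for the series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the event record; the 3-byte size is assumed. */
struct demo_gen_media {
	uint8_t cme_count[3];
};

/* Return true when every byte of the count field is zero. */
static bool count_is_zero(const uint8_t *count, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (count[i])
			return false;
	return true;
}

/*
 * The flagged form, `!rec->cme_count`, tests the array's address and is
 * therefore always false. Inspect the bytes of the field instead.
 */
static bool cme_count_missing(const struct demo_gen_media *rec)
{
	return count_is_zero(rec->cme_count, sizeof(rec->cme_count));
}
```

With a helper like this, the second operand of the WARN_ON_ONCE() condition
would depend on the record contents instead of being a constant.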


vim +933 drivers/cxl/core/mbox.c

   888	
   889	void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
   890				    enum cxl_event_log_type type,
   891				    enum cxl_event_type event_type,
   892				    const uuid_t *uuid, union cxl_event *evt)
   893	{
   894		if (event_type == CXL_CPER_EVENT_MEM_MODULE) {
   895			trace_cxl_memory_module(cxlmd, type, &evt->mem_module);
   896			return;
   897		}
   898		if (event_type == CXL_CPER_EVENT_GENERIC) {
   899			trace_cxl_generic_event(cxlmd, type, uuid, &evt->generic);
   900			return;
   901		}
   902	
   903		if (trace_cxl_general_media_enabled() || trace_cxl_dram_enabled()) {
   904			u64 dpa, hpa = ULLONG_MAX, hpa_alias = ULLONG_MAX;
   905			struct cxl_region *cxlr;
   906	
   907			/*
   908			 * These trace points are annotated with HPA and region
   909			 * translations. Take topology mutation locks and lookup
   910			 * { HPA, REGION } from { DPA, MEMDEV } in the event record.
   911			 */
   912			guard(rwsem_read)(&cxl_region_rwsem);
   913			guard(rwsem_read)(&cxl_dpa_rwsem);
   914	
   915			dpa = le64_to_cpu(evt->media_hdr.phys_addr) & CXL_DPA_MASK;
   916			cxlr = cxl_dpa_to_region(cxlmd, dpa);
   917			if (cxlr) {
   918				u64 cache_size = cxlr->params.cache_size;
   919	
   920				hpa = cxl_dpa_to_hpa(cxlr, cxlmd, dpa);
   921				if (cache_size)
   922					hpa_alias = hpa - cache_size;
   923			}
   924	
   925			if (event_type == CXL_CPER_EVENT_GEN_MEDIA) {
   926				if (cxl_store_rec_gen_media((struct cxl_memdev *)cxlmd, evt))
   927					dev_dbg(&cxlmd->dev, "CXL store rec_gen_media failed\n");
   928	
   929				if (evt->gen_media.media_hdr.descriptor &
   930				    CXL_GMER_EVT_DESC_THRESHOLD_EVENT)
   931					WARN_ON_ONCE((evt->gen_media.media_hdr.type &
   932						      CXL_GMER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE) &&
 > 933						     !evt->gen_media.cme_count);
   934				else
   935					WARN_ON_ONCE(evt->gen_media.media_hdr.type &
   936						     CXL_GMER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE);
   937	
   938				trace_cxl_general_media(cxlmd, type, cxlr, hpa,
   939							hpa_alias, &evt->gen_media);
   940			} else if (event_type == CXL_CPER_EVENT_DRAM) {
   941				if (cxl_store_rec_dram((struct cxl_memdev *)cxlmd, evt))
   942					dev_dbg(&cxlmd->dev, "CXL store rec_dram failed\n");
   943	
   944				trace_cxl_dram(cxlmd, type, cxlr, hpa, hpa_alias,
   945					       &evt->dram);
   946			}
   947		}
   948	}
   949	EXPORT_SYMBOL_NS_GPL(cxl_event_trace_record, "CXL");
   950	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH 3/4] cxl/events: Add extra validity checks for CVME count in DRAM Event Record
  2025-07-16 10:49 ` [PATCH 3/4] cxl/events: Add extra validity checks for CVME count in DRAM " shiju.jose
  2025-07-16 13:07   ` Jonathan Cameron
  2025-07-16 21:53   ` Dave Jiang
@ 2025-07-17  5:16   ` kernel test robot
  2 siblings, 0 replies; 15+ messages in thread
From: kernel test robot @ 2025-07-17  5:16 UTC (permalink / raw)
  To: shiju.jose, linux-cxl, dan.j.williams, dave.jiang,
	jonathan.cameron, alison.schofield, dave, vishal.l.verma,
	ira.weiny
  Cc: llvm, oe-kbuild-all, tanxiaofei, prime.zeng, linuxarm, shiju.jose

Hi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on cxl/next]
[also build test WARNING on linus/master v6.16-rc6 next-20250716]
[cannot apply to cxl/pending]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/shiju-jose-huawei-com/cxl-events-Update-Common-Event-Record-to-CXL-spec-rev-3-2/20250716-192336
base:   https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git next
patch link:    https://lore.kernel.org/r/20250716104945.2002-4-shiju.jose%40huawei.com
patch subject: [PATCH 3/4] cxl/events: Add extra validity checks for CVME count in DRAM Event Record
config: arm-randconfig-004-20250717 (https://download.01.org/0day-ci/archive/20250717/202507171217.6p5GHqr0-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project 16534d19bf50bde879a83f0ae62875e2c5120e64)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250717/202507171217.6p5GHqr0-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507171217.6p5GHqr0-lkp@intel.com/

All warnings (new ones prefixed by >>):

   drivers/cxl/core/mbox.c:933:27: warning: address of array 'evt->gen_media.cme_count' will always evaluate to 'true' [-Wpointer-bool-conversion]
     933 |                                              !evt->gen_media.cme_count);
         |                                              ~~~~~~~~~~~~~~~~^~~~~~~~~
   include/asm-generic/bug.h:148:18: note: expanded from macro 'WARN_ON_ONCE'
     148 |         DO_ONCE_LITE_IF(condition, WARN_ON, 1)
         |                         ^~~~~~~~~
   include/linux/once_lite.h:28:27: note: expanded from macro 'DO_ONCE_LITE_IF'
      28 |                 bool __ret_do_once = !!(condition);                     \
         |                                         ^~~~~~~~~
>> drivers/cxl/core/mbox.c:948:22: warning: address of array 'evt->dram.cvme_count' will always evaluate to 'true' [-Wpointer-bool-conversion]
     948 |                                              !evt->dram.cvme_count);
         |                                              ~~~~~~~~~~~^~~~~~~~~~
   include/asm-generic/bug.h:148:18: note: expanded from macro 'WARN_ON_ONCE'
     148 |         DO_ONCE_LITE_IF(condition, WARN_ON, 1)
         |                         ^~~~~~~~~
   include/linux/once_lite.h:28:27: note: expanded from macro 'DO_ONCE_LITE_IF'
      28 |                 bool __ret_do_once = !!(condition);                     \
         |                                         ^~~~~~~~~
   2 warnings generated.
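
The new warning at line 948 has the same cause: cvme_count is an array, so
negating it tests an address that is never NULL. One way to make such a
check meaningful is to decode the field into an integer and compare that
against zero. The sketch below assumes the count is a 3-byte little-endian
value; `demo_le24_to_u32` is a hypothetical userspace helper. The kernel has
a helper of this kind, get_unaligned_le24(), but whether the series adopts
it is not stated in this report.

```c
#include <stdint.h>

/* Hypothetical helper: decode a 3-byte little-endian counter (assumed layout). */
static uint32_t demo_le24_to_u32(const uint8_t b[3])
{
	/* Byte 0 is least significant; widen before shifting. */
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) | ((uint32_t)b[2] << 16);
}
```

A check written as `!demo_le24_to_u32(rec->cvme_count)` then evaluates the
stored count rather than the location of the field.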


vim +948 drivers/cxl/core/mbox.c

   888	
   889	void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
   890				    enum cxl_event_log_type type,
   891				    enum cxl_event_type event_type,
   892				    const uuid_t *uuid, union cxl_event *evt)
   893	{
   894		if (event_type == CXL_CPER_EVENT_MEM_MODULE) {
   895			trace_cxl_memory_module(cxlmd, type, &evt->mem_module);
   896			return;
   897		}
   898		if (event_type == CXL_CPER_EVENT_GENERIC) {
   899			trace_cxl_generic_event(cxlmd, type, uuid, &evt->generic);
   900			return;
   901		}
   902	
   903		if (trace_cxl_general_media_enabled() || trace_cxl_dram_enabled()) {
   904			u64 dpa, hpa = ULLONG_MAX, hpa_alias = ULLONG_MAX;
   905			struct cxl_region *cxlr;
   906	
   907			/*
   908			 * These trace points are annotated with HPA and region
   909			 * translations. Take topology mutation locks and lookup
   910			 * { HPA, REGION } from { DPA, MEMDEV } in the event record.
   911			 */
   912			guard(rwsem_read)(&cxl_region_rwsem);
   913			guard(rwsem_read)(&cxl_dpa_rwsem);
   914	
   915			dpa = le64_to_cpu(evt->media_hdr.phys_addr) & CXL_DPA_MASK;
   916			cxlr = cxl_dpa_to_region(cxlmd, dpa);
   917			if (cxlr) {
   918				u64 cache_size = cxlr->params.cache_size;
   919	
   920				hpa = cxl_dpa_to_hpa(cxlr, cxlmd, dpa);
   921				if (cache_size)
   922					hpa_alias = hpa - cache_size;
   923			}
   924	
   925			if (event_type == CXL_CPER_EVENT_GEN_MEDIA) {
   926				if (cxl_store_rec_gen_media((struct cxl_memdev *)cxlmd, evt))
   927					dev_dbg(&cxlmd->dev, "CXL store rec_gen_media failed\n");
   928	
   929				if (evt->gen_media.media_hdr.descriptor &
   930				    CXL_GMER_EVT_DESC_THRESHOLD_EVENT)
   931					WARN_ON_ONCE((evt->gen_media.media_hdr.type &
   932						      CXL_GMER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE) &&
   933						     !evt->gen_media.cme_count);
   934				else
   935					WARN_ON_ONCE(evt->gen_media.media_hdr.type &
   936						     CXL_GMER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE);
   937	
   938				trace_cxl_general_media(cxlmd, type, cxlr, hpa,
   939							hpa_alias, &evt->gen_media);
   940			} else if (event_type == CXL_CPER_EVENT_DRAM) {
   941				if (cxl_store_rec_dram((struct cxl_memdev *)cxlmd, evt))
   942					dev_dbg(&cxlmd->dev, "CXL store rec_dram failed\n");
   943	
   944				if (evt->dram.media_hdr.descriptor &
   945				    CXL_GMER_EVT_DESC_THRESHOLD_EVENT)
   946					WARN_ON_ONCE((evt->dram.media_hdr.type &
   947						      CXL_DER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE) &&
 > 948						     !evt->dram.cvme_count);
   949				else
   950					WARN_ON_ONCE(evt->dram.media_hdr.type &
   951						     CXL_DER_MEM_EVT_TYPE_AP_CME_COUNTER_EXPIRE);
   952	
   953				trace_cxl_dram(cxlmd, type, cxlr, hpa, hpa_alias,
   954					       &evt->dram);
   955			}
   956		}
   957	}
   958	EXPORT_SYMBOL_NS_GPL(cxl_event_trace_record, "CXL");
   959	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


