* [RESEND PATCH V3 0/4] rasdaemon: Add support for the CXL error events
@ 2023-02-02 18:18 shiju.jose
  2023-02-02 18:18 ` [RESEND PATCH V3 1/4] rasdaemon: Move definition for BIT and BIT_ULL to a common file shiju.jose
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: shiju.jose @ 2023-02-02 18:18 UTC (permalink / raw)
  To: mchehab, linux-edac, linux-cxl; +Cc: jonathan.cameron, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Log and record the following CXL errors reported through the kernel
trace events: CXL poison errors, CXL AER uncorrectable errors, and CXL AER
correctable errors.

Note1: This V3 patch set is being resent because of email delivery issues
to some of the recipients.

Note2: The default poll-and-read method that rasdaemon uses to receive
trace events does not work because of a commit in the kernel trace system.
The pthread method was therefore used to test the CXL error events.
To do the same, please make the following change in ras-events.c:
<change start ...>
/* rc = read_ras_event_all_cpus(data, cpus); */
rc = -255;
< ...change end >
/* Poll doesn't work on this kernel. Fallback to pthread way */
if (rc == -255) {
...
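
For illustration only, the fallback described in Note2 can be sketched
as below. This is not the rasdaemon code: read_events(), reader_fn and
the three stub readers are hypothetical names, and only the -255 return
value is taken from the snippet above.

```c
#define POLL_UNSUPPORTED (-255)	/* value checked in the snippet above */

/* Hypothetical reader type: returns 0 on success, negative on error. */
typedef int (*reader_fn)(void);

/* Hypothetical stubs standing in for the real readers. */
static int stub_poll_unsupported(void) { return POLL_UNSUPPORTED; }
static int stub_poll_ok(void) { return 0; }
static int stub_thread_ok(void) { return 0; }

/*
 * Try the poll-based reader first. When it reports that poll is not
 * usable, fall back to the per-CPU thread reader. *used_fallback is
 * set to 1 when the fallback path was taken, 0 otherwise.
 */
static int read_events(reader_fn poll_reader, reader_fn thread_reader,
		       int *used_fallback)
{
	int rc = poll_reader();

	*used_fallback = 0;
	if (rc == POLL_UNSUPPORTED) {
		*used_fallback = 1;
		rc = thread_reader();
	}
	return rc;
}
```

Forcing rc to -255, as in the change above, corresponds to always
passing a poll reader that behaves like stub_poll_unsupported().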

Shiju Jose (4):
  rasdaemon: Move definition for BIT and BIT_ULL to a common file
  rasdaemon: Add support for the CXL poison events
  rasdaemon: Add support for the CXL AER uncorrectable errors
  rasdaemon: Add support for the CXL AER correctable errors

Changes:
RFC V2 -> V3
1. Address the review comments from Dave Jiang.

RFC V1 -> V2
1. Rename uuid to region_uuid in the log and SQLite DB.
2. Rebase to the latest rasdaemon code.
3. Modify to match the name changes of the interface structures and
   functions in the latest libtraceevent-dev, which rasdaemon uses.

 Makefile.am                |   7 +-
 configure.ac               |  11 ++
 ras-cxl-handler.c          | 378 +++++++++++++++++++++++++++++++++++++
 ras-cxl-handler.h          |  32 ++++
 ras-events.c               |  33 ++++
 ras-events.h               |   3 +
 ras-non-standard-handler.h |   3 -
 ras-record.c               | 203 ++++++++++++++++++++
 ras-record.h               |  49 +++++
 ras-report.c               | 219 +++++++++++++++++++++
 ras-report.h               |   6 +
 11 files changed, 940 insertions(+), 4 deletions(-)
 create mode 100644 ras-cxl-handler.c
 create mode 100644 ras-cxl-handler.h

-- 
2.25.1



* [RESEND PATCH V3 1/4] rasdaemon: Move definition for BIT and BIT_ULL to a common file
  2023-02-02 18:18 [RESEND PATCH V3 0/4] rasdaemon: Add support for the CXL error events shiju.jose
@ 2023-02-02 18:18 ` shiju.jose
  2023-02-02 18:18 ` [RESEND PATCH V3 2/4] rasdaemon: Add support for the CXL poison events shiju.jose
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: shiju.jose @ 2023-02-02 18:18 UTC (permalink / raw)
  To: mchehab, linux-edac, linux-cxl; +Cc: jonathan.cameron, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Move the definitions of BIT() and BIT_ULL() to the
common file ras-record.h.
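
As a reminder of how the two macros differ, here is a minimal sketch.
The macro definitions are copied from this patch; flag_is_set() and
high_bit_example() are hypothetical helpers used only for illustration.

```c
#include <stdint.h>

#define BIT(nr)                 (1UL << (nr))
#define BIT_ULL(nr)             (1ULL << (nr))

/* Hypothetical helper: 1 when every bit of `mask` is set in `value`. */
static inline int flag_is_set(uint64_t value, uint64_t mask)
{
	return (value & mask) == mask;
}

/*
 * BIT() yields an unsigned long, which is only 32 bits wide on some
 * ABIs, so bit positions of 32 and above should use BIT_ULL().
 */
static inline uint64_t high_bit_example(void)
{
	return BIT_ULL(40);
}
```

With both macros in ras-record.h, any handler that already includes
that header can use them without redefining them locally.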

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
---
 ras-non-standard-handler.h | 3 ---
 ras-record.h               | 3 +++
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/ras-non-standard-handler.h b/ras-non-standard-handler.h
index 4d9f938..c360eaf 100644
--- a/ras-non-standard-handler.h
+++ b/ras-non-standard-handler.h
@@ -17,9 +17,6 @@
 #include "ras-events.h"
 #include <traceevent/event-parse.h>
 
-#define BIT(nr)                 (1UL << (nr))
-#define BIT_ULL(nr)             (1ULL << (nr))
-
 struct ras_ns_ev_decoder {
 	struct ras_ns_ev_decoder *next;
 	const char *sec_type;
diff --git a/ras-record.h b/ras-record.h
index d9f7733..219f10b 100644
--- a/ras-record.h
+++ b/ras-record.h
@@ -25,6 +25,9 @@
 
 #define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x)))
 
+#define BIT(nr)                 (1UL << (nr))
+#define BIT_ULL(nr)             (1ULL << (nr))
+
 extern long user_hz;
 
 struct ras_events;
-- 
2.25.1



* [RESEND PATCH V3 2/4] rasdaemon: Add support for the CXL poison events
  2023-02-02 18:18 [RESEND PATCH V3 0/4] rasdaemon: Add support for the CXL error events shiju.jose
  2023-02-02 18:18 ` [RESEND PATCH V3 1/4] rasdaemon: Move definition for BIT and BIT_ULL to a common file shiju.jose
@ 2023-02-02 18:18 ` shiju.jose
  2023-02-02 19:41   ` Alison Schofield
  2023-02-02 18:18 ` [RESEND PATCH V3 3/4] rasdaemon: Add support for the CXL AER uncorrectable errors shiju.jose
  2023-02-02 18:18 ` [RESEND PATCH V3 4/4] rasdaemon: Add support for the CXL AER correctable errors shiju.jose
  3 siblings, 1 reply; 7+ messages in thread
From: shiju.jose @ 2023-02-02 18:18 UTC (permalink / raw)
  To: mchehab, linux-edac, linux-cxl; +Cc: jonathan.cameron, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Add support to log and record the CXL poison events.

The corresponding kernel patches are here:
https://lore.kernel.org/linux-cxl/de11785ff05844299b40b100f8e0f56c7eef7f08.1674070170.git.alison.schofield@intel.com/

This is presently an RFC draft version for logging only. It could be
extended to policy-based recovery actions for frequent poison events,
depending on the above kernel patches.
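
The overflow timestamp handling in the handler can be sketched roughly
as follows. This is a simplified illustration, not the code in the
patch: format_overflow_ts() is a hypothetical helper, and it uses
gmtime() with a literal "+0000" so that the output does not depend on
the local time zone, whereas the patch uses localtime() and "%z".

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define CXL_POISON_FLAG_OVERFLOW	(1UL << 1)	/* BIT(1) in the patch */

/*
 * Hypothetical helper: format the overflow timestamp, given in
 * nanoseconds since 01-Jan-1970 UTC, as text. When the overflow flag
 * is clear, the value is zero, or the conversion fails, the epoch
 * string used as a fallback in the patch is written instead.
 */
static void format_overflow_ts(uint8_t flags, uint64_t overflow_ns,
			       char *buf, size_t len)
{
	time_t secs = (time_t)(overflow_ns / 1000000000ULL);
	struct tm *tm;

	if (len == 0)
		return;

	if ((flags & CXL_POISON_FLAG_OVERFLOW) && overflow_ns) {
		tm = gmtime(&secs);
		if (tm && strftime(buf, len, "%Y-%m-%d %H:%M:%S +0000", tm) > 0)
			return;
	}
	snprintf(buf, len, "%s", "1970-01-01 00:00:00 +0000");
}
```

A caller would pass the "flags" and "overflow_t" trace fields together
with a buffer sized like ev.overflow_ts.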

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 Makefile.am       |   7 +-
 configure.ac      |  11 +++
 ras-cxl-handler.c | 172 ++++++++++++++++++++++++++++++++++++++++++++++
 ras-cxl-handler.h |  24 +++++++
 ras-events.c      |  15 ++++
 ras-events.h      |   1 +
 ras-record.c      |  81 ++++++++++++++++++++++
 ras-record.h      |  20 ++++++
 ras-report.c      |  83 ++++++++++++++++++++++
 ras-report.h      |   2 +
 10 files changed, 415 insertions(+), 1 deletion(-)
 create mode 100644 ras-cxl-handler.c
 create mode 100644 ras-cxl-handler.h

diff --git a/Makefile.am b/Makefile.am
index a9832cc..bd7b2ae 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -73,6 +73,11 @@ endif
 if WITH_CPU_FAULT_ISOLATION
    rasdaemon_SOURCES += ras-cpu-isolation.c queue.c
 endif
+
+if WITH_CXL
+   rasdaemon_SOURCES += ras-cxl-handler.c
+endif
+
 rasdaemon_LDADD = -lpthread $(SQLITE3_LIBS) $(LIBTRACEEVENT_LIBS)
 rasdaemon_CFLAGS = $(SQLITE3_CFLAGS) $(LIBTRACEEVENT_CFLAGS)
 
@@ -81,7 +86,7 @@ include_HEADERS = config.h  ras-events.h  ras-logger.h  ras-mc-handler.h \
 		  ras-extlog-handler.h ras-arm-handler.h ras-non-standard-handler.h \
 		  ras-devlink-handler.h ras-diskerror-handler.h rbtree.h ras-page-isolation.h \
 		  non-standard-hisilicon.h non-standard-ampere.h ras-memory-failure-handler.h \
-		  ras-cpu-isolation.h queue.h
+		  ras-cxl-handler.h ras-cpu-isolation.h queue.h
 
 # This rule can't be called with more than one Makefile job (like make -j8)
 # I can't figure out a way to fix that
diff --git a/configure.ac b/configure.ac
index c973aaf..028b9b3 100644
--- a/configure.ac
+++ b/configure.ac
@@ -127,6 +127,16 @@ AS_IF([test "x$enable_memory_failure" = "xyes" || test "x$enable_all" = "xyes"],
 AM_CONDITIONAL([WITH_MEMORY_FAILURE], [test x$enable_memory_failure = xyes || test x$enable_all = xyes])
 AM_COND_IF([WITH_MEMORY_FAILURE], [USE_MEMORY_FAILURE="yes"], [USE_MEMORY_FAILURE="no"])
 
+AC_ARG_ENABLE([cxl],
+    AS_HELP_STRING([--enable-cxl], [enable CXL events (currently experimental)]))
+
+AS_IF([test "x$enable_cxl" = "xyes" || test "x$enable_all" = "xyes"], [
+  AC_DEFINE(HAVE_CXL,1,"have CXL events collect")
+  AC_SUBST([WITH_CXL])
+])
+AM_CONDITIONAL([WITH_CXL], [test x$enable_cxl = xyes || test x$enable_all = xyes])
+AM_COND_IF([WITH_CXL], [USE_CXL="yes"], [USE_CXL="no"])
+
 AC_ARG_ENABLE([abrt_report],
     AS_HELP_STRING([--enable-abrt-report], [enable report event to ABRT (currently experimental)]))
 
@@ -215,6 +225,7 @@ compile time options summary
     DEVLINK             : $USE_DEVLINK
     Disk I/O errors     : $USE_DISKERROR
     Memory Failure      : $USE_MEMORY_FAILURE
+    CXL events          : $USE_CXL
     Memory CE PFA       : $USE_MEMORY_CE_PFA
     AMP RAS errors      : $USE_AMP_NS_DECODE
     CPU fault isolation : $USE_CPU_FAULT_ISOLATION
diff --git a/ras-cxl-handler.c b/ras-cxl-handler.c
new file mode 100644
index 0000000..0ba2519
--- /dev/null
+++ b/ras-cxl-handler.c
@@ -0,0 +1,172 @@
+/*
+ * Copyright (c) Huawei Technologies Co., Ltd. 2023. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <traceevent/kbuffer.h>
+#include "ras-cxl-handler.h"
+#include "ras-record.h"
+#include "ras-logger.h"
+#include "ras-report.h"
+
+/* Poison List: Payload out flags */
+#define CXL_POISON_FLAG_MORE            BIT(0)
+#define CXL_POISON_FLAG_OVERFLOW        BIT(1)
+#define CXL_POISON_FLAG_SCANNING        BIT(2)
+
+/* CXL poison - source types */
+enum cxl_poison_source {
+	CXL_POISON_SOURCE_UNKNOWN = 0,
+	CXL_POISON_SOURCE_EXTERNAL = 1,
+	CXL_POISON_SOURCE_INTERNAL = 2,
+	CXL_POISON_SOURCE_INJECTED = 3,
+	CXL_POISON_SOURCE_VENDOR = 7,
+};
+
+int ras_cxl_poison_event_handler(struct trace_seq *s,
+				 struct tep_record *record,
+				 struct tep_event *event, void *context)
+{
+	int len;
+	unsigned long long val;
+	struct ras_events *ras = context;
+	time_t now;
+	struct tm *tm;
+	struct ras_cxl_poison_event ev;
+
+	now = record->ts / user_hz + ras->uptime_diff;
+	tm = localtime(&now);
+	if (tm)
+		strftime(ev.timestamp, sizeof(ev.timestamp),
+			 "%Y-%m-%d %H:%M:%S %z", tm);
+	else
+		strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000", sizeof(ev.timestamp));
+	if (trace_seq_printf(s, "%s ", ev.timestamp) <= 0)
+		return -1;
+
+	ev.memdev = tep_get_field_raw(s, event, "memdev",
+				      record, &len, 1);
+	if (!ev.memdev)
+		return -1;
+	if (trace_seq_printf(s, "memdev:%s ", ev.memdev) <= 0)
+		return -1;
+
+	ev.pcidev = tep_get_field_raw(s, event, "pcidev",
+				      record, &len, 1);
+	if (!ev.pcidev)
+		return -1;
+	if (trace_seq_printf(s, "pcidev:%s ", ev.pcidev) <= 0)
+		return -1;
+
+	ev.region = tep_get_field_raw(s, event, "region",
+				      record, &len, 1);
+	if (!ev.region)
+		return -1;
+	if (trace_seq_printf(s, "region:%s ", ev.region) <= 0)
+		return -1;
+
+	ev.uuid = tep_get_field_raw(s, event, "uuid",
+				    record, &len, 1);
+	if (!ev.uuid)
+		return -1;
+	if (trace_seq_printf(s, "region_uuid:%s ", ev.uuid) <= 0)
+		return -1;
+
+	if (tep_get_field_val(s, event, "hpa", record, &val, 1) < 0)
+		return -1;
+	ev.hpa = val;
+	if (trace_seq_printf(s, "poison list: hpa:0x%llx ", (unsigned long long)ev.hpa) <= 0)
+		return -1;
+
+	if (tep_get_field_val(s, event, "dpa", record, &val, 1) < 0)
+		return -1;
+	ev.dpa = val;
+	if (trace_seq_printf(s, "dpa:0x%llx ", (unsigned long long)ev.dpa) <= 0)
+		return -1;
+
+	if (tep_get_field_val(s, event, "length", record, &val, 1) < 0)
+		return -1;
+	ev.length = val;
+	if (trace_seq_printf(s, "length:%u ", ev.length) <= 0)
+		return -1;
+
+	if (tep_get_field_val(s,  event, "source", record, &val, 1) < 0)
+		return -1;
+
+	switch (val) {
+	case CXL_POISON_SOURCE_UNKNOWN:
+		ev.source = "Unknown";
+		break;
+	case CXL_POISON_SOURCE_EXTERNAL:
+		ev.source = "External";
+		break;
+	case CXL_POISON_SOURCE_INTERNAL:
+		ev.source = "Internal";
+		break;
+	case CXL_POISON_SOURCE_INJECTED:
+		ev.source = "Injected";
+		break;
+	case CXL_POISON_SOURCE_VENDOR:
+		ev.source = "Vendor";
+		break;
+	default:
+		ev.source = "Invalid";
+	}
+	if (trace_seq_printf(s, "source:%s ", ev.source) <= 0)
+		return -1;
+
+	if (tep_get_field_val(s,  event, "flags", record, &val, 1) < 0)
+		return -1;
+	ev.flags = val;
+	if (trace_seq_printf(s, "flags:%d ", ev.flags) <= 0)
+		return -1;
+
+	if (ev.flags & CXL_POISON_FLAG_OVERFLOW) {
+		if (tep_get_field_val(s,  event, "overflow_t", record, &val, 1) < 0)
+			return -1;
+		if (val) {
+			/* CXL Specification 3.0
+			 * Overflow timestamp - The number of unsigned nanoseconds
+			 * that have elapsed since midnight, 01-Jan-1970 UTC
+			 */
+			time_t ovf_ts_secs = val / 1000000000ULL;
+
+			tm = localtime(&ovf_ts_secs);
+			if (tm) {
+				strftime(ev.overflow_ts, sizeof(ev.overflow_ts),
+					 "%Y-%m-%d %H:%M:%S %z", tm);
+			}
+		}
+		if (!val || !tm)
+			strncpy(ev.overflow_ts, "1970-01-01 00:00:00 +0000",
+				sizeof(ev.overflow_ts));
+	} else
+		strncpy(ev.overflow_ts, "1970-01-01 00:00:00 +0000", sizeof(ev.overflow_ts));
+	if (trace_seq_printf(s, "overflow timestamp:%s\n", ev.overflow_ts) <= 0)
+		return -1;
+
+	/* Insert data into the SGBD */
+#ifdef HAVE_SQLITE3
+	ras_store_cxl_poison_event(ras, &ev);
+#endif
+
+#ifdef HAVE_ABRT_REPORT
+	/* Report event to ABRT */
+	ras_report_cxl_poison_event(ras, &ev);
+#endif
+
+	return 0;
+}
diff --git a/ras-cxl-handler.h b/ras-cxl-handler.h
new file mode 100644
index 0000000..84d5cc6
--- /dev/null
+++ b/ras-cxl-handler.h
@@ -0,0 +1,24 @@
+/*
+ * Copyright (c) Huawei Technologies Co., Ltd. 2023. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __RAS_CXL_HANDLER_H
+#define __RAS_CXL_HANDLER_H
+
+#include "ras-events.h"
+#include <traceevent/event-parse.h>
+
+int ras_cxl_poison_event_handler(struct trace_seq *s,
+				 struct tep_record *record,
+				 struct tep_event *event, void *context);
+#endif
diff --git a/ras-events.c b/ras-events.c
index 39f9ce2..6555125 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -40,6 +40,7 @@
 #include "ras-devlink-handler.h"
 #include "ras-diskerror-handler.h"
 #include "ras-memory-failure-handler.h"
+#include "ras-cxl-handler.h"
 #include "ras-record.h"
 #include "ras-logger.h"
 #include "ras-page-isolation.h"
@@ -243,6 +244,10 @@ int toggle_ras_mc_event(int enable)
 	rc |= __toggle_ras_mc_event(ras, "ras", "memory_failure_event", enable);
 #endif
 
+#ifdef HAVE_CXL
+	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_poison", enable);
+#endif
+
 free_ras:
 	free(ras);
 	return rc;
@@ -951,6 +956,16 @@ int handle_ras_events(int record_events)
 		    "ras", "memory_failure_event");
 #endif
 
+#ifdef HAVE_CXL
+	rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_poison",
+			       ras_cxl_poison_event_handler, NULL, CXL_POISON_EVENT);
+	if (!rc)
+		num_events++;
+	else
+		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
+		    "cxl", "cxl_poison");
+#endif
+
 	if (!num_events) {
 		log(ALL, LOG_INFO,
 		    "Failed to trace all supported RAS events. Aborting.\n");
diff --git a/ras-events.h b/ras-events.h
index 6c9f507..fc51070 100644
--- a/ras-events.h
+++ b/ras-events.h
@@ -39,6 +39,7 @@ enum {
 	DEVLINK_EVENT,
 	DISKERROR_EVENT,
 	MF_EVENT,
+	CXL_POISON_EVENT,
 	NR_EVENTS
 };
 
diff --git a/ras-record.c b/ras-record.c
index a367939..f54fb41 100644
--- a/ras-record.c
+++ b/ras-record.c
@@ -559,6 +559,67 @@ int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev)
 }
 #endif
 
+#ifdef HAVE_CXL
+/*
+ * Table and functions to handle cxl:cxl_poison
+ */
+static const struct db_fields cxl_poison_event_fields[] = {
+	{ .name = "id",                   .type = "INTEGER PRIMARY KEY" },
+	{ .name = "timestamp",            .type = "TEXT" },
+	{ .name = "memdev",               .type = "TEXT" },
+	{ .name = "pcidev",               .type = "TEXT" },
+	{ .name = "region",               .type = "TEXT" },
+	{ .name = "region_uuid",          .type = "TEXT" },
+	{ .name = "hpa",                  .type = "INTEGER" },
+	{ .name = "dpa",                  .type = "INTEGER" },
+	{ .name = "length",               .type = "INTEGER" },
+	{ .name = "source",               .type = "TEXT" },
+	{ .name = "flags",                .type = "INTEGER" },
+	{ .name = "overflow_ts",          .type = "TEXT" },
+};
+
+static const struct db_table_descriptor cxl_poison_event_tab = {
+	.name = "cxl_poison_event",
+	.fields = cxl_poison_event_fields,
+	.num_fields = ARRAY_SIZE(cxl_poison_event_fields),
+};
+
+int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev)
+{
+	int rc;
+	struct sqlite3_priv *priv = ras->db_priv;
+
+	if (!priv || !priv->stmt_cxl_poison_event)
+		return 0;
+	log(TERM, LOG_INFO, "cxl_poison_event store: %p\n", priv->stmt_cxl_poison_event);
+
+	sqlite3_bind_text(priv->stmt_cxl_poison_event, 1, ev->timestamp, -1, NULL);
+	sqlite3_bind_text(priv->stmt_cxl_poison_event, 2, ev->memdev, -1, NULL);
+	sqlite3_bind_text(priv->stmt_cxl_poison_event, 3, ev->pcidev, -1, NULL);
+	sqlite3_bind_text(priv->stmt_cxl_poison_event, 4, ev->region, -1, NULL);
+	sqlite3_bind_text(priv->stmt_cxl_poison_event, 5, ev->uuid, -1, NULL);
+	sqlite3_bind_int64(priv->stmt_cxl_poison_event, 6, ev->hpa);
+	sqlite3_bind_int64(priv->stmt_cxl_poison_event, 7, ev->dpa);
+	sqlite3_bind_int(priv->stmt_cxl_poison_event, 8, ev->length);
+	sqlite3_bind_text(priv->stmt_cxl_poison_event, 9, ev->source, -1, NULL);
+	sqlite3_bind_int(priv->stmt_cxl_poison_event, 10, ev->flags);
+	sqlite3_bind_text(priv->stmt_cxl_poison_event, 11, ev->overflow_ts, -1, NULL);
+
+	rc = sqlite3_step(priv->stmt_cxl_poison_event);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed to do cxl_poison_event step on sqlite: error = %d\n", rc);
+	rc = sqlite3_reset(priv->stmt_cxl_poison_event);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed reset cxl_poison_event on sqlite: error = %d\n",
+		    rc);
+	log(TERM, LOG_INFO, "register inserted at db\n");
+
+	return rc;
+}
+#endif
+
 /*
  * Generic code
  */
@@ -896,6 +957,16 @@ int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras)
 	}
 #endif
 
+#ifdef HAVE_CXL
+	rc = ras_mc_create_table(priv, &cxl_poison_event_tab);
+	if (rc == SQLITE_OK) {
+		rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_poison_event,
+					 &cxl_poison_event_tab);
+		if (rc != SQLITE_OK)
+			goto error;
+	}
+#endif
+
 	ras->db_priv = priv;
 	return 0;
 
@@ -1008,6 +1079,16 @@ int ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras)
 	}
 #endif
 
+#ifdef HAVE_CXL
+	if (priv->stmt_cxl_poison_event) {
+		rc = sqlite3_finalize(priv->stmt_cxl_poison_event);
+		if (rc != SQLITE_OK)
+			log(TERM, LOG_ERR,
+			    "cpu %u: Failed to finalize cxl_poison_event sqlite: error = %d\n",
+			    cpu, rc);
+	}
+#endif
+
 	rc = sqlite3_close_v2(db);
 	if (rc != SQLITE_OK)
 		log(TERM, LOG_ERR,
diff --git a/ras-record.h b/ras-record.h
index 219f10b..e5bf483 100644
--- a/ras-record.h
+++ b/ras-record.h
@@ -114,6 +114,20 @@ struct ras_mf_event {
 	const char *action_result;
 };
 
+struct ras_cxl_poison_event {
+	char timestamp[64];
+	const char *memdev;
+	const char *pcidev;
+	const char *region;
+	const char *uuid;
+	uint64_t hpa;
+	uint64_t dpa;
+	uint32_t length;
+	const char *source;
+	uint8_t flags;
+	char overflow_ts[64];
+};
+
 struct ras_mc_event;
 struct ras_aer_event;
 struct ras_extlog_event;
@@ -123,6 +137,7 @@ struct mce_event;
 struct devlink_event;
 struct diskerror_event;
 struct ras_mf_event;
+struct ras_cxl_poison_event;
 
 #ifdef HAVE_SQLITE3
 
@@ -155,6 +170,9 @@ struct sqlite3_priv {
 #ifdef HAVE_MEMORY_FAILURE
 	sqlite3_stmt	*stmt_mf_event;
 #endif
+#ifdef HAVE_CXL
+	sqlite3_stmt	*stmt_cxl_poison_event;
+#endif
 };
 
 struct db_fields {
@@ -182,6 +200,7 @@ int ras_store_arm_record(struct ras_events *ras, struct ras_arm_event *ev);
 int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev);
 int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
 int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
+int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
 
 #else
 static inline int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras) { return 0; };
@@ -195,6 +214,7 @@ static inline int ras_store_arm_record(struct ras_events *ras, struct ras_arm_ev
 static inline int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev) { return 0; };
 static inline int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
 static inline int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
+static inline int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
 
 #endif
 
diff --git a/ras-report.c b/ras-report.c
index 62d5eb7..589e640 100644
--- a/ras-report.c
+++ b/ras-report.c
@@ -331,6 +331,42 @@ static int set_mf_event_backtrace(char *buf, struct ras_mf_event *ev)
 	return 0;
 }
 
+static int set_cxl_poison_event_backtrace(char *buf, struct ras_cxl_poison_event *ev)
+{
+	char bt_buf[MAX_BACKTRACE_SIZE];
+
+	if (!buf || !ev)
+		return -1;
+
+	sprintf(bt_buf, "BACKTRACE="	\
+						"timestamp=%s\n"	\
+						"memdev=%s\n"		\
+						"pcidev=%s\n"		\
+						"region=%s\n"		\
+						"uuid=%s\n"		\
+						"hpa=0x%llx\n"		\
+						"dpa=0x%llx\n"		\
+						"length=%u\n"		\
+						"source=%s\n"		\
+						"flags=%d\n"		\
+						"overflow_timestamp=%s\n", \
+						ev->timestamp,		\
+						ev->memdev,		\
+						ev->pcidev,		\
+						ev->region,		\
+						ev->uuid,		\
+						(unsigned long long)ev->hpa,	\
+						(unsigned long long)ev->dpa,	\
+						ev->length,		\
+						ev->source,		\
+						ev->flags,		\
+						ev->overflow_ts);
+
+	strcat(buf, bt_buf);
+
+	return 0;
+}
+
 static int commit_report_backtrace(int sockfd, int type, void *ev){
 	char buf[MAX_BACKTRACE_SIZE];
 	char *pbuf = buf;
@@ -368,6 +404,9 @@ static int commit_report_backtrace(int sockfd, int type, void *ev){
 	case MF_EVENT:
 		rc = set_mf_event_backtrace(buf, (struct ras_mf_event *)ev);
 		break;
+	case CXL_POISON_EVENT:
+		rc = set_cxl_poison_event_backtrace(buf, (struct ras_cxl_poison_event *)ev);
+		break;
 	default:
 		return -1;
 	}
@@ -776,3 +815,47 @@ mf_fail:
 	else
 		return -1;
 }
+
+int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev)
+{
+	char buf[MAX_MESSAGE_SIZE];
+	int sockfd = 0;
+	int done = 0;
+	int rc = -1;
+
+	memset(buf, 0, sizeof(buf));
+
+	sockfd = setup_report_socket();
+	if (sockfd < 0)
+		return -1;
+
+	rc = commit_report_basic(sockfd);
+	if (rc < 0)
+		goto cxl_poison_fail;
+
+	rc = commit_report_backtrace(sockfd, CXL_POISON_EVENT, ev);
+	if (rc < 0)
+		goto cxl_poison_fail;
+
+	sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl-poison");
+	rc = write(sockfd, buf, strlen(buf) + 1);
+	if (rc < 0 || (size_t)rc < strlen(buf) + 1)
+		goto cxl_poison_fail;
+
+	sprintf(buf, "REASON=%s", "CXL poison");
+	rc = write(sockfd, buf, strlen(buf) + 1);
+	if (rc < 0 || (size_t)rc < strlen(buf) + 1)
+		goto cxl_poison_fail;
+
+	done = 1;
+
+cxl_poison_fail:
+
+	if (sockfd >= 0)
+		close(sockfd);
+
+	if (done)
+		return 0;
+	else
+		return -1;
+}
diff --git a/ras-report.h b/ras-report.h
index e605eb1..d1591ce 100644
--- a/ras-report.h
+++ b/ras-report.h
@@ -39,6 +39,7 @@ int ras_report_arm_event(struct ras_events *ras, struct ras_arm_event *ev);
 int ras_report_devlink_event(struct ras_events *ras, struct devlink_event *ev);
 int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
 int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
+int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
 
 #else
 
@@ -50,6 +51,7 @@ static inline int ras_report_arm_event(struct ras_events *ras, struct ras_arm_ev
 static inline int ras_report_devlink_event(struct ras_events *ras, struct devlink_event *ev) { return 0; };
 static inline int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
 static inline int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
+static inline int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
 
 #endif
 
-- 
2.25.1



* [RESEND PATCH V3 3/4] rasdaemon: Add support for the CXL AER uncorrectable errors
  2023-02-02 18:18 [RESEND PATCH V3 0/4] rasdaemon: Add support for the CXL error events shiju.jose
  2023-02-02 18:18 ` [RESEND PATCH V3 1/4] rasdaemon: Move definition for BIT and BIT_ULL to a common file shiju.jose
  2023-02-02 18:18 ` [RESEND PATCH V3 2/4] rasdaemon: Add support for the CXL poison events shiju.jose
@ 2023-02-02 18:18 ` shiju.jose
  2023-02-02 18:18 ` [RESEND PATCH V3 4/4] rasdaemon: Add support for the CXL AER correctable errors shiju.jose
  3 siblings, 0 replies; 7+ messages in thread
From: shiju.jose @ 2023-02-02 18:18 UTC (permalink / raw)
  To: mchehab, linux-edac, linux-cxl; +Cc: jonathan.cameron, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Add support to log and record the CXL AER uncorrectable errors.
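
The status decoding amounts to walking a table of bit/name pairs and
printing the name of every bit that is set. A minimal sketch, with the
caveat that decode_status(), struct err_name and ue_names[] are
hypothetical and that only three of the table entries from this patch
are reproduced:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct err_name {
	uint32_t bit;
	const char *name;
};

/* Three entries taken from the cxl_aer_ue[] table in this patch. */
static const struct err_name ue_names[] = {
	{ 1U << 0,  "Cache Data Parity Error" },
	{ 1U << 10, "Received Poison From Peer" },
	{ 1U << 14, "Component Specific Error" },
};

/*
 * Hypothetical helper: append the quoted name of every known bit set
 * in `status` to `buf`. Returns the number of names written in full;
 * decoding stops early if the buffer is too small.
 */
static int decode_status(uint32_t status, char *buf, size_t len)
{
	size_t used = 0, i;
	int matched = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	for (i = 0; i < sizeof(ue_names) / sizeof(ue_names[0]); i++) {
		int n;

		if (!(status & ue_names[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "'%s' ", ue_names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;
		used += (size_t)n;
		matched++;
	}
	return matched;
}
```

The handler in the patch writes into the trace_seq buffer with
trace_seq_printf() rather than into a plain character array.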

The corresponding kernel patch is here:
https://patchwork.kernel.org/project/cxl/patch/166974413388.1608150.5875712482260436188.stgit@djiang5-desk3.ch.intel.com/

It was found that the header log data has to be converted to
big-endian format to be stored correctly in the SQLite database, likely
because the SQLite database seems to use big-endian storage.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 ras-cxl-handler.c | 138 ++++++++++++++++++++++++++++++++++++++++++++++
 ras-cxl-handler.h |   5 ++
 ras-events.c      |   9 +++
 ras-events.h      |   1 +
 ras-record.c      |  65 ++++++++++++++++++++++
 ras-record.h      |  16 ++++++
 ras-report.c      |  69 +++++++++++++++++++++++
 ras-report.h      |   2 +
 8 files changed, 305 insertions(+)

diff --git a/ras-cxl-handler.c b/ras-cxl-handler.c
index 0ba2519..50bbdb0 100644
--- a/ras-cxl-handler.c
+++ b/ras-cxl-handler.c
@@ -21,6 +21,7 @@
 #include "ras-record.h"
 #include "ras-logger.h"
 #include "ras-report.h"
+#include <endian.h>
 
 /* Poison List: Payload out flags */
 #define CXL_POISON_FLAG_MORE            BIT(0)
@@ -170,3 +171,140 @@ int ras_cxl_poison_event_handler(struct trace_seq *s,
 
 	return 0;
 }
+
+/* CXL AER Errors */
+
+#define CXL_AER_UE_CACHE_DATA_PARITY	BIT(0)
+#define CXL_AER_UE_CACHE_ADDR_PARITY	BIT(1)
+#define CXL_AER_UE_CACHE_BE_PARITY	BIT(2)
+#define CXL_AER_UE_CACHE_DATA_ECC	BIT(3)
+#define CXL_AER_UE_MEM_DATA_PARITY	BIT(4)
+#define CXL_AER_UE_MEM_ADDR_PARITY	BIT(5)
+#define CXL_AER_UE_MEM_BE_PARITY	BIT(6)
+#define CXL_AER_UE_MEM_DATA_ECC		BIT(7)
+#define CXL_AER_UE_REINIT_THRESH	BIT(8)
+#define CXL_AER_UE_RSVD_ENCODE		BIT(9)
+#define CXL_AER_UE_POISON		BIT(10)
+#define CXL_AER_UE_RECV_OVERFLOW	BIT(11)
+#define CXL_AER_UE_INTERNAL_ERR		BIT(14)
+#define CXL_AER_UE_IDE_TX_ERR		BIT(15)
+#define CXL_AER_UE_IDE_RX_ERR		BIT(16)
+
+struct cxl_error_list {
+	uint32_t bit;
+	const char *error;
+};
+
+static const struct cxl_error_list cxl_aer_ue[] = {
+	{ .bit = CXL_AER_UE_CACHE_DATA_PARITY, .error = "Cache Data Parity Error" },
+	{ .bit = CXL_AER_UE_CACHE_ADDR_PARITY, .error = "Cache Address Parity Error" },
+	{ .bit = CXL_AER_UE_CACHE_BE_PARITY, .error = "Cache Byte Enable Parity Error" },
+	{ .bit = CXL_AER_UE_CACHE_DATA_ECC, .error = "Cache Data ECC Error" },
+	{ .bit = CXL_AER_UE_MEM_DATA_PARITY, .error = "Memory Data Parity Error" },
+	{ .bit = CXL_AER_UE_MEM_ADDR_PARITY, .error = "Memory Address Parity Error" },
+	{ .bit = CXL_AER_UE_MEM_BE_PARITY, .error = "Memory Byte Enable Parity Error" },
+	{ .bit = CXL_AER_UE_MEM_DATA_ECC, .error = "Memory Data ECC Error" },
+	{ .bit = CXL_AER_UE_REINIT_THRESH, .error = "REINIT Threshold Hit" },
+	{ .bit = CXL_AER_UE_RSVD_ENCODE, .error = "Received Unrecognized Encoding" },
+	{ .bit = CXL_AER_UE_POISON, .error = "Received Poison From Peer" },
+	{ .bit = CXL_AER_UE_RECV_OVERFLOW, .error = "Receiver Overflow" },
+	{ .bit = CXL_AER_UE_INTERNAL_ERR, .error = "Component Specific Error" },
+	{ .bit = CXL_AER_UE_IDE_TX_ERR, .error = "IDE Tx Error" },
+	{ .bit = CXL_AER_UE_IDE_RX_ERR, .error = "IDE Rx Error" },
+};
+
+static int decode_cxl_error_status(struct trace_seq *s, uint32_t status,
+				   const struct cxl_error_list *cxl_error_list,
+				   uint8_t num_elems)
+{
+	int i;
+
+	for (i = 0; i < num_elems; i++) {
+		if (status & cxl_error_list[i].bit)
+			if (trace_seq_printf(s, "\'%s\' ", cxl_error_list[i].error) <= 0)
+				return -1;
+	}
+	return 0;
+}
+
+int ras_cxl_aer_ue_event_handler(struct trace_seq *s,
+				 struct tep_record *record,
+				 struct tep_event *event, void *context)
+{
+	int len, i;
+	unsigned long long val;
+	time_t now;
+	struct tm *tm;
+	struct ras_events *ras = context;
+	struct ras_cxl_aer_ue_event ev;
+
+	memset(&ev, 0, sizeof(ev));
+	now = record->ts / user_hz + ras->uptime_diff;
+	tm = localtime(&now);
+	if (tm)
+		strftime(ev.timestamp, sizeof(ev.timestamp),
+			 "%Y-%m-%d %H:%M:%S %z", tm);
+	else
+		strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000", sizeof(ev.timestamp));
+	if (trace_seq_printf(s, "%s ", ev.timestamp) <= 0)
+		return -1;
+
+	ev.dev_name = tep_get_field_raw(s, event, "dev_name",
+					record, &len, 1);
+	if (!ev.dev_name)
+		return -1;
+	if (trace_seq_printf(s, "dev_name:%s ", ev.dev_name) <= 0)
+		return -1;
+
+	if (tep_get_field_val(s, event, "status", record, &val, 1) < 0)
+		return -1;
+	ev.error_status = val;
+
+	if (trace_seq_printf(s, "error status:") <= 0)
+		return -1;
+	if (decode_cxl_error_status(s, ev.error_status,
+				    cxl_aer_ue, ARRAY_SIZE(cxl_aer_ue)) < 0)
+		return -1;
+
+	if (tep_get_field_val(s,  event, "first_error", record, &val, 1) < 0)
+		return -1;
+	ev.first_error = val;
+
+	if (trace_seq_printf(s, "first error:") <= 0)
+		return -1;
+	if (decode_cxl_error_status(s, ev.first_error,
+				    cxl_aer_ue, ARRAY_SIZE(cxl_aer_ue)) < 0)
+		return -1;
+
+	ev.header_log = tep_get_field_raw(s, event, "header_log",
+					  record, &len, 1);
+	if (!ev.header_log)
+		return -1;
+	if (trace_seq_printf(s, "header log:\n") <= 0)
+		return -1;
+	for (i = 0; i < CXL_HEADERLOG_SIZE_U32; i++) {
+		if (trace_seq_printf(s, "%08x ", ev.header_log[i]) <= 0)
+			break;
+		if ((i > 0) && ((i % 20) == 0))
+			if (trace_seq_printf(s, "\n") <= 0)
+				break;
+		/* Convert header log data to the big-endian format because
+		 * the SQLite database seems to use big-endian storage.
+		 */
+		ev.header_log[i] = htobe32(ev.header_log[i]);
+	}
+	if (i < CXL_HEADERLOG_SIZE_U32)
+		return -1;
+
+	/* Insert data into the SGBD */
+#ifdef HAVE_SQLITE3
+	ras_store_cxl_aer_ue_event(ras, &ev);
+#endif
+
+#ifdef HAVE_ABRT_REPORT
+	/* Report event to ABRT */
+	ras_report_cxl_aer_ue_event(ras, &ev);
+#endif
+
+	return 0;
+}
diff --git a/ras-cxl-handler.h b/ras-cxl-handler.h
index 84d5cc6..18b3120 100644
--- a/ras-cxl-handler.h
+++ b/ras-cxl-handler.h
@@ -21,4 +21,9 @@
 int ras_cxl_poison_event_handler(struct trace_seq *s,
 				 struct tep_record *record,
 				 struct tep_event *event, void *context);
+
+int ras_cxl_aer_ue_event_handler(struct trace_seq *s,
+				 struct tep_record *record,
+				 struct tep_event *event, void *context);
+
 #endif
diff --git a/ras-events.c b/ras-events.c
index 6555125..ead792b 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -246,6 +246,7 @@ int toggle_ras_mc_event(int enable)
 
 #ifdef HAVE_CXL
 	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_poison", enable);
+	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_aer_uncorrectable_error", enable);
 #endif
 
 free_ras:
@@ -964,6 +965,14 @@ int handle_ras_events(int record_events)
 	else
 		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
 		    "cxl", "cxl_poison");
+
+	rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_aer_uncorrectable_error",
+			       ras_cxl_aer_ue_event_handler, NULL, CXL_AER_UE_EVENT);
+	if (!rc)
+		num_events++;
+	else
+		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
+		    "cxl", "cxl_aer_uncorrectable_error");
 #endif
 
 	if (!num_events) {
diff --git a/ras-events.h b/ras-events.h
index fc51070..65f9d9a 100644
--- a/ras-events.h
+++ b/ras-events.h
@@ -40,6 +40,7 @@ enum {
 	DISKERROR_EVENT,
 	MF_EVENT,
 	CXL_POISON_EVENT,
+	CXL_AER_UE_EVENT,
 	NR_EVENTS
 };
 
diff --git a/ras-record.c b/ras-record.c
index f54fb41..4703790 100644
--- a/ras-record.c
+++ b/ras-record.c
@@ -618,6 +618,54 @@ int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_eve
 
 	return rc;
 }
+
+/*
+ * Table and functions to handle cxl:cxl_aer_uncorrectable_error
+ */
+static const struct db_fields cxl_aer_ue_event_fields[] = {
+	{ .name = "id",                   .type = "INTEGER PRIMARY KEY" },
+	{ .name = "timestamp",            .type = "TEXT" },
+	{ .name = "dev_name",             .type = "TEXT" },
+	{ .name = "error_status",         .type = "INTEGER" },
+	{ .name = "first_error",          .type = "INTEGER" },
+	{ .name = "header_log",           .type = "BLOB" },
+};
+
+static const struct db_table_descriptor cxl_aer_ue_event_tab = {
+	.name = "cxl_aer_ue_event",
+	.fields = cxl_aer_ue_event_fields,
+	.num_fields = ARRAY_SIZE(cxl_aer_ue_event_fields),
+};
+
+int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev)
+{
+	int rc;
+	struct sqlite3_priv *priv = ras->db_priv;
+
+	if (!priv || !priv->stmt_cxl_aer_ue_event)
+		return 0;
+	log(TERM, LOG_INFO, "cxl_aer_ue_event store: %p\n", priv->stmt_cxl_aer_ue_event);
+
+	sqlite3_bind_text(priv->stmt_cxl_aer_ue_event, 1, ev->timestamp, -1, NULL);
+	sqlite3_bind_text(priv->stmt_cxl_aer_ue_event, 2, ev->dev_name, -1, NULL);
+	sqlite3_bind_int(priv->stmt_cxl_aer_ue_event, 3, ev->error_status);
+	sqlite3_bind_int(priv->stmt_cxl_aer_ue_event, 4, ev->first_error);
+	sqlite3_bind_blob(priv->stmt_cxl_aer_ue_event, 5, ev->header_log, CXL_HEADERLOG_SIZE, NULL);
+
+	rc = sqlite3_step(priv->stmt_cxl_aer_ue_event);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed to do cxl_aer_ue_event step on sqlite: error = %d\n", rc);
+	rc = sqlite3_reset(priv->stmt_cxl_aer_ue_event);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed to reset cxl_aer_ue_event on sqlite: error = %d\n",
+		    rc);
+	log(TERM, LOG_INFO, "register inserted at db\n");
+
+	return rc;
+}
+
 #endif
 
 /*
@@ -965,6 +1013,15 @@ int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras)
 		if (rc != SQLITE_OK)
 			goto error;
 	}
+
+	rc = ras_mc_create_table(priv, &cxl_aer_ue_event_tab);
+	if (rc == SQLITE_OK) {
+		rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_aer_ue_event,
+					 &cxl_aer_ue_event_tab);
+		if (rc != SQLITE_OK)
+			goto error;
+	}
+
 #endif
 
 	ras->db_priv = priv;
@@ -1087,6 +1144,14 @@ int ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras)
 			    "cpu %u: Failed to finalize cxl_poison_event sqlite: error = %d\n",
 			    cpu, rc);
 	}
+
+	if (priv->stmt_cxl_aer_ue_event) {
+		rc = sqlite3_finalize(priv->stmt_cxl_aer_ue_event);
+		if (rc != SQLITE_OK)
+			log(TERM, LOG_ERR,
+			    "cpu %u: Failed to finalize cxl_aer_ue_event sqlite: error = %d\n",
+			    cpu, rc);
+	}
 #endif
 
 	rc = sqlite3_close_v2(db);
diff --git a/ras-record.h b/ras-record.h
index e5bf483..0e2c178 100644
--- a/ras-record.h
+++ b/ras-record.h
@@ -128,6 +128,18 @@ struct ras_cxl_poison_event {
 	char overflow_ts[64];
 };
 
+#define SZ_512                          0x200
+#define CXL_HEADERLOG_SIZE              SZ_512
+#define CXL_HEADERLOG_SIZE_U32          (SZ_512 / sizeof(uint32_t))
+
+struct ras_cxl_aer_ue_event {
+	char timestamp[64];
+	const char *dev_name;
+	uint32_t error_status;
+	uint32_t first_error;
+	uint32_t *header_log;
+};
+
 struct ras_mc_event;
 struct ras_aer_event;
 struct ras_extlog_event;
@@ -138,6 +150,7 @@ struct devlink_event;
 struct diskerror_event;
 struct ras_mf_event;
 struct ras_cxl_poison_event;
+struct ras_cxl_aer_ue_event;
 
 #ifdef HAVE_SQLITE3
 
@@ -172,6 +185,7 @@ struct sqlite3_priv {
 #endif
 #ifdef HAVE_CXL
 	sqlite3_stmt	*stmt_cxl_poison_event;
+	sqlite3_stmt	*stmt_cxl_aer_ue_event;
 #endif
 };
 
@@ -201,6 +215,7 @@ int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev);
 int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
 int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
 int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
+int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev);
 
 #else
 static inline int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras) { return 0; };
@@ -215,6 +230,7 @@ static inline int ras_store_devlink_event(struct ras_events *ras, struct devlink
 static inline int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
 static inline int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
 static inline int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
+static inline int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev) { return 0; };
 
 #endif
 
diff --git a/ras-report.c b/ras-report.c
index 589e640..4c09061 100644
--- a/ras-report.c
+++ b/ras-report.c
@@ -367,6 +367,28 @@ static int set_cxl_poison_event_backtrace(char *buf, struct ras_cxl_poison_event
 	return 0;
 }
 
+static int set_cxl_aer_ue_event_backtrace(char *buf, struct ras_cxl_aer_ue_event *ev)
+{
+	char bt_buf[MAX_BACKTRACE_SIZE];
+
+	if (!buf || !ev)
+		return -1;
+
+	sprintf(bt_buf, "BACKTRACE="	\
+						"timestamp=%s\n"	\
+						"dev_name=%s\n"		\
+						"error_status=%u\n"	\
+						"first_error=%u\n",	\
+						ev->timestamp,		\
+						ev->dev_name,		\
+						ev->error_status,	\
+						ev->first_error);
+
+	strcat(buf, bt_buf);
+
+	return 0;
+}
+
 static int commit_report_backtrace(int sockfd, int type, void *ev){
 	char buf[MAX_BACKTRACE_SIZE];
 	char *pbuf = buf;
@@ -407,6 +429,9 @@ static int commit_report_backtrace(int sockfd, int type, void *ev){
 	case CXL_POISON_EVENT:
 		rc = set_cxl_poison_event_backtrace(buf, (struct ras_cxl_poison_event *)ev);
 		break;
+	case CXL_AER_UE_EVENT:
+		rc = set_cxl_aer_ue_event_backtrace(buf, (struct ras_cxl_aer_ue_event *)ev);
+		break;
 	default:
 		return -1;
 	}
@@ -859,3 +884,47 @@ cxl_poison_fail:
 	else
 		return -1;
 }
+
+int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev)
+{
+	char buf[MAX_MESSAGE_SIZE];
+	int sockfd = 0;
+	int done = 0;
+	int rc = -1;
+
+	memset(buf, 0, sizeof(buf));
+
+	sockfd = setup_report_socket();
+	if (sockfd < 0)
+		return -1;
+
+	rc = commit_report_basic(sockfd);
+	if (rc < 0)
+		goto cxl_aer_ue_fail;
+
+	rc = commit_report_backtrace(sockfd, CXL_AER_UE_EVENT, ev);
+	if (rc < 0)
+		goto cxl_aer_ue_fail;
+
+	sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl-aer-uncorrectable-error");
+	rc = write(sockfd, buf, strlen(buf) + 1);
+	if (rc < 0 || (size_t)rc < strlen(buf) + 1)
+		goto cxl_aer_ue_fail;
+
+	sprintf(buf, "REASON=%s", "CXL AER uncorrectable error");
+	rc = write(sockfd, buf, strlen(buf) + 1);
+	if (rc < 0 || (size_t)rc < strlen(buf) + 1)
+		goto cxl_aer_ue_fail;
+
+	done = 1;
+
+cxl_aer_ue_fail:
+
+	if (sockfd >= 0)
+		close(sockfd);
+
+	if (done)
+		return 0;
+	else
+		return -1;
+}
diff --git a/ras-report.h b/ras-report.h
index d1591ce..dfe89d1 100644
--- a/ras-report.h
+++ b/ras-report.h
@@ -40,6 +40,7 @@ int ras_report_devlink_event(struct ras_events *ras, struct devlink_event *ev);
 int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
 int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
 int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
+int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev);
 
 #else
 
@@ -52,6 +53,7 @@ static inline int ras_report_devlink_event(struct ras_events *ras, struct devlin
 static inline int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
 static inline int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
 static inline int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
+static inline int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev) { return 0; };
 
 #endif
 
-- 
2.25.1



* [RESEND PATCH V3 4/4] rasdaemon: Add support for the CXL AER correctable errors
  2023-02-02 18:18 [RESEND PATCH V3 0/4] rasdaemon: Add support for the CXL error events shiju.jose
                   ` (2 preceding siblings ...)
  2023-02-02 18:18 ` [RESEND PATCH V3 3/4] rasdaemon: Add support for the CXL AER uncorrectable errors shiju.jose
@ 2023-02-02 18:18 ` shiju.jose
  3 siblings, 0 replies; 7+ messages in thread
From: shiju.jose @ 2023-02-02 18:18 UTC (permalink / raw)
  To: mchehab, linux-edac, linux-cxl; +Cc: jonathan.cameron, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Add support to log and record the CXL AER correctable errors.

The corresponding kernel patch is here:
https://patchwork.kernel.org/project/cxl/patch/166974413388.1608150.5875712482260436188.stgit@djiang5-desk3.ch.intel.com/

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 ras-cxl-handler.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++
 ras-cxl-handler.h |  3 +++
 ras-events.c      |  9 +++++++
 ras-events.h      |  1 +
 ras-record.c      | 57 +++++++++++++++++++++++++++++++++++++++
 ras-record.h      | 10 +++++++
 ras-report.c      | 67 ++++++++++++++++++++++++++++++++++++++++++++++
 ras-report.h      |  2 ++
 8 files changed, 217 insertions(+)

diff --git a/ras-cxl-handler.c b/ras-cxl-handler.c
index 50bbdb0..5ba350a 100644
--- a/ras-cxl-handler.c
+++ b/ras-cxl-handler.c
@@ -190,6 +190,14 @@ int ras_cxl_poison_event_handler(struct trace_seq *s,
 #define CXL_AER_UE_IDE_TX_ERR		BIT(15)
 #define CXL_AER_UE_IDE_RX_ERR		BIT(16)
 
+#define CXL_AER_CE_CACHE_DATA_ECC	BIT(0)
+#define CXL_AER_CE_MEM_DATA_ECC		BIT(1)
+#define CXL_AER_CE_CRC_THRESH		BIT(2)
+#define CXL_AER_CE_RETRY_THRESH		BIT(3)
+#define CXL_AER_CE_CACHE_POISON		BIT(4)
+#define CXL_AER_CE_MEM_POISON		BIT(5)
+#define CXL_AER_CE_PHYS_LAYER_ERR	BIT(6)
+
 struct cxl_error_list {
 	uint32_t bit;
 	const char *error;
@@ -213,6 +221,16 @@ static const struct cxl_error_list cxl_aer_ue[] = {
 	{ .bit = CXL_AER_UE_IDE_RX_ERR, .error = "IDE Rx Error" },
 };
 
+static const struct cxl_error_list cxl_aer_ce[] = {
+	{ .bit = CXL_AER_CE_CACHE_DATA_ECC, .error = "Cache Data ECC Error" },
+	{ .bit = CXL_AER_CE_MEM_DATA_ECC, .error = "Memory Data ECC Error" },
+	{ .bit = CXL_AER_CE_CRC_THRESH, .error = "CRC Threshold Hit" },
+	{ .bit = CXL_AER_CE_RETRY_THRESH, .error = "Retry Threshold" },
+	{ .bit = CXL_AER_CE_CACHE_POISON, .error = "Received Cache Poison From Peer" },
+	{ .bit = CXL_AER_CE_MEM_POISON, .error = "Received Memory Poison From Peer" },
+	{ .bit = CXL_AER_CE_PHYS_LAYER_ERR, .error = "Received Error From Physical Layer" },
+};
+
 static int decode_cxl_error_status(struct trace_seq *s, uint32_t status,
 				   const struct cxl_error_list *cxl_error_list,
 				   uint8_t num_elems)
@@ -308,3 +326,53 @@ int ras_cxl_aer_ue_event_handler(struct trace_seq *s,
 
 	return 0;
 }
+
+int ras_cxl_aer_ce_event_handler(struct trace_seq *s,
+				 struct tep_record *record,
+				 struct tep_event *event, void *context)
+{
+	int len;
+	unsigned long long val;
+	time_t now;
+	struct tm *tm;
+	struct ras_events *ras = context;
+	struct ras_cxl_aer_ce_event ev;
+
+	now = record->ts / user_hz + ras->uptime_diff;
+	tm = localtime(&now);
+	if (tm)
+		strftime(ev.timestamp, sizeof(ev.timestamp),
+			 "%Y-%m-%d %H:%M:%S %z", tm);
+	else
+		strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000", sizeof(ev.timestamp));
+	if (trace_seq_printf(s, "%s ", ev.timestamp) <= 0)
+		return -1;
+
+	ev.dev_name = tep_get_field_raw(s, event, "dev_name",
+					record, &len, 1);
+	if (!ev.dev_name)
+		return -1;
+	if (trace_seq_printf(s, "dev_name:%s ", ev.dev_name) <= 0)
+		return -1;
+
+	if (tep_get_field_val(s, event, "status", record, &val, 1) < 0)
+		return -1;
+	ev.error_status = val;
+	if (trace_seq_printf(s, "error status:") <= 0)
+		return -1;
+	if (decode_cxl_error_status(s, ev.error_status,
+				    cxl_aer_ce, ARRAY_SIZE(cxl_aer_ce)) < 0)
+		return -1;
+
+	/* Insert data into the database */
+#ifdef HAVE_SQLITE3
+	ras_store_cxl_aer_ce_event(ras, &ev);
+#endif
+
+#ifdef HAVE_ABRT_REPORT
+	/* Report event to ABRT */
+	ras_report_cxl_aer_ce_event(ras, &ev);
+#endif
+
+	return 0;
+}
diff --git a/ras-cxl-handler.h b/ras-cxl-handler.h
index 18b3120..711daf4 100644
--- a/ras-cxl-handler.h
+++ b/ras-cxl-handler.h
@@ -26,4 +26,7 @@ int ras_cxl_aer_ue_event_handler(struct trace_seq *s,
 				 struct tep_record *record,
 				 struct tep_event *event, void *context);
 
+int ras_cxl_aer_ce_event_handler(struct trace_seq *s,
+				 struct tep_record *record,
+				 struct tep_event *event, void *context);
 #endif
diff --git a/ras-events.c b/ras-events.c
index ead792b..3691311 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -247,6 +247,7 @@ int toggle_ras_mc_event(int enable)
 #ifdef HAVE_CXL
 	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_poison", enable);
 	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_aer_uncorrectable_error", enable);
+	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_aer_correctable_error", enable);
 #endif
 
 free_ras:
@@ -973,6 +974,14 @@ int handle_ras_events(int record_events)
 	else
 		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
 		    "cxl", "cxl_aer_uncorrectable_error");
+
+	rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_aer_correctable_error",
+			       ras_cxl_aer_ce_event_handler, NULL, CXL_AER_CE_EVENT);
+	if (!rc)
+		num_events++;
+	else
+		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
+		    "cxl", "cxl_aer_correctable_error");
 #endif
 
 	if (!num_events) {
diff --git a/ras-events.h b/ras-events.h
index 65f9d9a..dc7bdfb 100644
--- a/ras-events.h
+++ b/ras-events.h
@@ -41,6 +41,7 @@ enum {
 	MF_EVENT,
 	CXL_POISON_EVENT,
 	CXL_AER_UE_EVENT,
+	CXL_AER_CE_EVENT,
 	NR_EVENTS
 };
 
diff --git a/ras-record.c b/ras-record.c
index 4703790..c318a18 100644
--- a/ras-record.c
+++ b/ras-record.c
@@ -666,6 +666,48 @@ int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_eve
 	return rc;
 }
 
+/*
+ * Table and functions to handle cxl:cxl_aer_correctable_error
+ */
+static const struct db_fields cxl_aer_ce_event_fields[] = {
+	{ .name = "id",                   .type = "INTEGER PRIMARY KEY" },
+	{ .name = "timestamp",            .type = "TEXT" },
+	{ .name = "dev_name",             .type = "TEXT" },
+	{ .name = "error_status",         .type = "INTEGER" },
+};
+
+static const struct db_table_descriptor cxl_aer_ce_event_tab = {
+	.name = "cxl_aer_ce_event",
+	.fields = cxl_aer_ce_event_fields,
+	.num_fields = ARRAY_SIZE(cxl_aer_ce_event_fields),
+};
+
+int ras_store_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev)
+{
+	int rc;
+	struct sqlite3_priv *priv = ras->db_priv;
+
+	if (!priv || !priv->stmt_cxl_aer_ce_event)
+		return 0;
+	log(TERM, LOG_INFO, "cxl_aer_ce_event store: %p\n", priv->stmt_cxl_aer_ce_event);
+
+	sqlite3_bind_text(priv->stmt_cxl_aer_ce_event, 1, ev->timestamp, -1, NULL);
+	sqlite3_bind_text(priv->stmt_cxl_aer_ce_event, 2, ev->dev_name, -1, NULL);
+	sqlite3_bind_int(priv->stmt_cxl_aer_ce_event, 3, ev->error_status);
+
+	rc = sqlite3_step(priv->stmt_cxl_aer_ce_event);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed to do cxl_aer_ce_event step on sqlite: error = %d\n", rc);
+	rc = sqlite3_reset(priv->stmt_cxl_aer_ce_event);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed to reset cxl_aer_ce_event on sqlite: error = %d\n",
+		    rc);
+	log(TERM, LOG_INFO, "register inserted at db\n");
+
+	return rc;
+}
 #endif
 
 /*
@@ -1022,6 +1064,13 @@ int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras)
 			goto error;
 	}
 
+	rc = ras_mc_create_table(priv, &cxl_aer_ce_event_tab);
+	if (rc == SQLITE_OK) {
+		rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_aer_ce_event,
+					 &cxl_aer_ce_event_tab);
+		if (rc != SQLITE_OK)
+			goto error;
+	}
 #endif
 
 	ras->db_priv = priv;
@@ -1152,6 +1201,14 @@ int ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras)
 			    "cpu %u: Failed to finalize cxl_aer_ue_event sqlite: error = %d\n",
 			    cpu, rc);
 	}
+
+	if (priv->stmt_cxl_aer_ce_event) {
+		rc = sqlite3_finalize(priv->stmt_cxl_aer_ce_event);
+		if (rc != SQLITE_OK)
+			log(TERM, LOG_ERR,
+			    "cpu %u: Failed to finalize cxl_aer_ce_event sqlite: error = %d\n",
+			    cpu, rc);
+	}
 #endif
 
 	rc = sqlite3_close_v2(db);
diff --git a/ras-record.h b/ras-record.h
index 0e2c178..1f28cc1 100644
--- a/ras-record.h
+++ b/ras-record.h
@@ -140,6 +140,12 @@ struct ras_cxl_aer_ue_event {
 	uint32_t *header_log;
 };
 
+struct ras_cxl_aer_ce_event {
+	char timestamp[64];
+	const char *dev_name;
+	uint32_t error_status;
+};
+
 struct ras_mc_event;
 struct ras_aer_event;
 struct ras_extlog_event;
@@ -151,6 +157,7 @@ struct diskerror_event;
 struct ras_mf_event;
 struct ras_cxl_poison_event;
 struct ras_cxl_aer_ue_event;
+struct ras_cxl_aer_ce_event;
 
 #ifdef HAVE_SQLITE3
 
@@ -186,6 +193,7 @@ struct sqlite3_priv {
 #ifdef HAVE_CXL
 	sqlite3_stmt	*stmt_cxl_poison_event;
 	sqlite3_stmt	*stmt_cxl_aer_ue_event;
+	sqlite3_stmt	*stmt_cxl_aer_ce_event;
 #endif
 };
 
@@ -216,6 +224,7 @@ int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev
 int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
 int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
 int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev);
+int ras_store_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev);
 
 #else
 static inline int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras) { return 0; };
@@ -231,6 +240,7 @@ static inline int ras_store_diskerror_event(struct ras_events *ras, struct diske
 static inline int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
 static inline int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
 static inline int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev) { return 0; };
+static inline int ras_store_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev) { return 0; };
 
 #endif
 
diff --git a/ras-report.c b/ras-report.c
index 4c09061..796abab 100644
--- a/ras-report.c
+++ b/ras-report.c
@@ -389,6 +389,26 @@ static int set_cxl_aer_ue_event_backtrace(char *buf, struct ras_cxl_aer_ue_event
 	return 0;
 }
 
+static int set_cxl_aer_ce_event_backtrace(char *buf, struct ras_cxl_aer_ce_event *ev)
+{
+	char bt_buf[MAX_BACKTRACE_SIZE];
+
+	if (!buf || !ev)
+		return -1;
+
+	sprintf(bt_buf, "BACKTRACE="	\
+						"timestamp=%s\n"	\
+						"dev_name=%s\n"		\
+						"error_status=%u\n",	\
+						ev->timestamp,		\
+						ev->dev_name,		\
+						ev->error_status);
+
+	strcat(buf, bt_buf);
+
+	return 0;
+}
+
 static int commit_report_backtrace(int sockfd, int type, void *ev){
 	char buf[MAX_BACKTRACE_SIZE];
 	char *pbuf = buf;
@@ -432,6 +452,9 @@ static int commit_report_backtrace(int sockfd, int type, void *ev){
 	case CXL_AER_UE_EVENT:
 		rc = set_cxl_aer_ue_event_backtrace(buf, (struct ras_cxl_aer_ue_event *)ev);
 		break;
+	case CXL_AER_CE_EVENT:
+		rc = set_cxl_aer_ce_event_backtrace(buf, (struct ras_cxl_aer_ce_event *)ev);
+		break;
 	default:
 		return -1;
 	}
@@ -928,3 +951,47 @@ cxl_aer_ue_fail:
 	else
 		return -1;
 }
+
+int ras_report_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev)
+{
+	char buf[MAX_MESSAGE_SIZE];
+	int sockfd = 0;
+	int done = 0;
+	int rc = -1;
+
+	memset(buf, 0, sizeof(buf));
+
+	sockfd = setup_report_socket();
+	if (sockfd < 0)
+		return -1;
+
+	rc = commit_report_basic(sockfd);
+	if (rc < 0)
+		goto cxl_aer_ce_fail;
+
+	rc = commit_report_backtrace(sockfd, CXL_AER_CE_EVENT, ev);
+	if (rc < 0)
+		goto cxl_aer_ce_fail;
+
+	sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl-aer-correctable-error");
+	rc = write(sockfd, buf, strlen(buf) + 1);
+	if (rc < 0 || (size_t)rc < strlen(buf) + 1)
+		goto cxl_aer_ce_fail;
+
+	sprintf(buf, "REASON=%s", "CXL AER correctable error");
+	rc = write(sockfd, buf, strlen(buf) + 1);
+	if (rc < 0 || (size_t)rc < strlen(buf) + 1)
+		goto cxl_aer_ce_fail;
+
+	done = 1;
+
+cxl_aer_ce_fail:
+
+	if (sockfd >= 0)
+		close(sockfd);
+
+	if (done)
+		return 0;
+	else
+		return -1;
+}
diff --git a/ras-report.h b/ras-report.h
index dfe89d1..46155ee 100644
--- a/ras-report.h
+++ b/ras-report.h
@@ -41,6 +41,7 @@ int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *e
 int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
 int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
 int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev);
+int ras_report_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev);
 
 #else
 
@@ -54,6 +55,7 @@ static inline int ras_report_diskerror_event(struct ras_events *ras, struct disk
 static inline int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
 static inline int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
 static inline int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev) { return 0; };
+static inline int ras_report_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev) { return 0; };
 
 #endif
 
-- 
2.25.1
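
To illustrate how the cxl_aer_ce[] table above maps a status word to error
names, here is a minimal standalone sketch. sketch_decode_status() is a
hypothetical stand-in written for this note; it is not the patch's
decode_cxl_error_status(), which writes to a struct trace_seq and whose body
is not shown in this message. The local BIT() and ARRAY_SIZE() definitions
and the quoted, space-separated output format are likewise assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BIT(nr)		(1U << (nr))	/* local stand-in for the common BIT() */
#define ARRAY_SIZE(x)	(sizeof(x) / sizeof((x)[0]))

struct cxl_error_list {
	uint32_t bit;
	const char *error;
};

/* Bit assignments and strings copied from the patch's cxl_aer_ce[] table. */
static const struct cxl_error_list cxl_aer_ce[] = {
	{ .bit = BIT(0), .error = "Cache Data ECC Error" },
	{ .bit = BIT(1), .error = "Memory Data ECC Error" },
	{ .bit = BIT(2), .error = "CRC Threshold Hit" },
	{ .bit = BIT(3), .error = "Retry Threshold" },
	{ .bit = BIT(4), .error = "Received Cache Poison From Peer" },
	{ .bit = BIT(5), .error = "Received Memory Poison From Peer" },
	{ .bit = BIT(6), .error = "Received Error From Physical Layer" },
};

/*
 * Hypothetical decoder: append the name of every bit set in status to out,
 * each name in single quotes and followed by one space.
 * Returns 0 on success, -1 if out is too small for the decoded text.
 */
static int sketch_decode_status(char *out, size_t size, uint32_t status,
				const struct cxl_error_list *list, size_t n)
{
	size_t used = 0;
	size_t i;

	assert(out && size > 0);
	out[0] = '\0';
	for (i = 0; i < n; i++) {
		int len;

		if (!(status & list[i].bit))
			continue;
		len = snprintf(out + used, size - used, "'%s' ", list[i].error);
		if (len < 0 || (size_t)len >= size - used)
			return -1;
		used += (size_t)len;
	}
	return 0;
}
```

With this sketch, a status of 0x5 (bits 0 and 2) decodes to the two names
"Cache Data ECC Error" and "CRC Threshold Hit", and a status of 0 yields an
empty string.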



* Re: [RESEND PATCH V3 2/4] rasdaemon: Add support for the CXL poison events
  2023-02-02 18:18 ` [RESEND PATCH V3 2/4] rasdaemon: Add support for the CXL poison events shiju.jose
@ 2023-02-02 19:41   ` Alison Schofield
  2023-02-03  9:31     ` Shiju Jose
  0 siblings, 1 reply; 7+ messages in thread
From: Alison Schofield @ 2023-02-02 19:41 UTC (permalink / raw)
  To: shiju.jose; +Cc: mchehab, linux-edac, linux-cxl, jonathan.cameron, linuxarm

On Thu, Feb 02, 2023 at 06:18:44PM +0000, shiju.jose@huawei.com wrote:
> From: Shiju Jose <shiju.jose@huawei.com>
> 
> Add support to log and record the CXL poison events.
> 
> The corresponding Kernel patches here:
> https://lore.kernel.org/linux-cxl/de11785ff05844299b40b100f8e0f56c7eef7f08.1674070170.git.alison.schofield@intel.com/
> 
> Presently RFC draft version for logging, could be extended for the policy
> based recovery action for the frequent poison events depending on the above
> kernel patches.

Hi Shiju,

Looks good to me based on the kernel patches you reference above.
I want to let you know that a v6 is in the works, and will lead to
some naming changes below - but minor stuff.

Alison

> 
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  Makefile.am       |   7 +-
>  configure.ac      |  11 +++
>  ras-cxl-handler.c | 172 ++++++++++++++++++++++++++++++++++++++++++++++
>  ras-cxl-handler.h |  24 +++++++
>  ras-events.c      |  15 ++++
>  ras-events.h      |   1 +
>  ras-record.c      |  81 ++++++++++++++++++++++
>  ras-record.h      |  20 ++++++
>  ras-report.c      |  83 ++++++++++++++++++++++
>  ras-report.h      |   2 +
>  10 files changed, 415 insertions(+), 1 deletion(-)
>  create mode 100644 ras-cxl-handler.c
>  create mode 100644 ras-cxl-handler.h
> 
> diff --git a/Makefile.am b/Makefile.am
> index a9832cc..bd7b2ae 100644
> --- a/Makefile.am
> +++ b/Makefile.am
> @@ -73,6 +73,11 @@ endif
>  if WITH_CPU_FAULT_ISOLATION
>     rasdaemon_SOURCES += ras-cpu-isolation.c queue.c
>  endif
> +
> +if WITH_CXL
> +   rasdaemon_SOURCES += ras-cxl-handler.c
> +endif
> +
>  rasdaemon_LDADD = -lpthread $(SQLITE3_LIBS) $(LIBTRACEEVENT_LIBS)
>  rasdaemon_CFLAGS = $(SQLITE3_CFLAGS) $(LIBTRACEEVENT_CFLAGS)
>  
> @@ -81,7 +86,7 @@ include_HEADERS = config.h  ras-events.h  ras-logger.h  ras-mc-handler.h \
>  		  ras-extlog-handler.h ras-arm-handler.h ras-non-standard-handler.h \
>  		  ras-devlink-handler.h ras-diskerror-handler.h rbtree.h ras-page-isolation.h \
>  		  non-standard-hisilicon.h non-standard-ampere.h ras-memory-failure-handler.h \
> -		  ras-cpu-isolation.h queue.h
> +		  ras-cxl-handler.h ras-cpu-isolation.h queue.h
>  
>  # This rule can't be called with more than one Makefile job (like make -j8)
>  # I can't figure out a way to fix that
> diff --git a/configure.ac b/configure.ac
> index c973aaf..028b9b3 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -127,6 +127,16 @@ AS_IF([test "x$enable_memory_failure" = "xyes" || test "x$enable_all" = "xyes"],
>  AM_CONDITIONAL([WITH_MEMORY_FAILURE], [test x$enable_memory_failure = xyes || test x$enable_all = xyes])
>  AM_COND_IF([WITH_MEMORY_FAILURE], [USE_MEMORY_FAILURE="yes"], [USE_MEMORY_FAILURE="no"])
>  
> +AC_ARG_ENABLE([cxl],
> +    AS_HELP_STRING([--enable-cxl], [enable CXL events (currently experimental)]))
> +
> +AS_IF([test "x$enable_cxl" = "xyes" || test "x$enable_all" = "xyes"], [
> +  AC_DEFINE(HAVE_CXL,1,"have CXL events collect")
> +  AC_SUBST([WITH_CXL])
> +])
> +AM_CONDITIONAL([WITH_CXL], [test x$enable_cxl = xyes || test x$enable_all = xyes])
> +AM_COND_IF([WITH_CXL], [USE_CXL="yes"], [USE_CXL="no"])
> +
>  AC_ARG_ENABLE([abrt_report],
>      AS_HELP_STRING([--enable-abrt-report], [enable report event to ABRT (currently experimental)]))
>  
> @@ -215,6 +225,7 @@ compile time options summary
>      DEVLINK             : $USE_DEVLINK
>      Disk I/O errors     : $USE_DISKERROR
>      Memory Failure      : $USE_MEMORY_FAILURE
> +    CXL events          : $USE_CXL
>      Memory CE PFA       : $USE_MEMORY_CE_PFA
>      AMP RAS errors      : $USE_AMP_NS_DECODE
>      CPU fault isolation : $USE_CPU_FAULT_ISOLATION
> diff --git a/ras-cxl-handler.c b/ras-cxl-handler.c
> new file mode 100644
> index 0000000..0ba2519
> --- /dev/null
> +++ b/ras-cxl-handler.c
> @@ -0,0 +1,172 @@
> +/*
> + * Copyright (c) Huawei Technologies Co., Ltd. 2023. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <traceevent/kbuffer.h>
> +#include "ras-cxl-handler.h"
> +#include "ras-record.h"
> +#include "ras-logger.h"
> +#include "ras-report.h"
> +
> +/* Poison List: Payload out flags */
> +#define CXL_POISON_FLAG_MORE            BIT(0)
> +#define CXL_POISON_FLAG_OVERFLOW        BIT(1)
> +#define CXL_POISON_FLAG_SCANNING        BIT(2)
> +
> +/* CXL poison - source types */
> +enum cxl_poison_source {
> +	CXL_POISON_SOURCE_UNKNOWN = 0,
> +	CXL_POISON_SOURCE_EXTERNAL = 1,
> +	CXL_POISON_SOURCE_INTERNAL = 2,
> +	CXL_POISON_SOURCE_INJECTED = 3,
> +	CXL_POISON_SOURCE_VENDOR = 7,
> +};
> +
> +int ras_cxl_poison_event_handler(struct trace_seq *s,
> +				 struct tep_record *record,
> +				 struct tep_event *event, void *context)
> +{
> +	int len;
> +	unsigned long long val;
> +	struct ras_events *ras = context;
> +	time_t now;
> +	struct tm *tm;
> +	struct ras_cxl_poison_event ev;
> +
> +	now = record->ts / user_hz + ras->uptime_diff;
> +	tm = localtime(&now);
> +	if (tm)
> +		strftime(ev.timestamp, sizeof(ev.timestamp),
> +			 "%Y-%m-%d %H:%M:%S %z", tm);
> +	else
> +		strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000", sizeof(ev.timestamp));
> +	if (trace_seq_printf(s, "%s ", ev.timestamp) <= 0)
> +		return -1;
> +
> +	ev.memdev = tep_get_field_raw(s, event, "memdev",
> +				      record, &len, 1);
> +	if (!ev.memdev)
> +		return -1;
> +	if (trace_seq_printf(s, "memdev:%s ", ev.memdev) <= 0)
> +		return -1;
> +
> +	ev.pcidev = tep_get_field_raw(s, event, "pcidev",
> +				      record, &len, 1);
> +	if (!ev.pcidev)
> +		return -1;
> +	if (trace_seq_printf(s, "pcidev:%s ", ev.pcidev) <= 0)
> +		return -1;
> +
> +	ev.region = tep_get_field_raw(s, event, "region",
> +				      record, &len, 1);
> +	if (!ev.region)
> +		return -1;
> +	if (trace_seq_printf(s, "region:%s ", ev.region) <= 0)
> +		return -1;
> +
> +	ev.uuid = tep_get_field_raw(s, event, "uuid",
> +				    record, &len, 1);
> +	if (!ev.uuid)
> +		return -1;
> +	if (trace_seq_printf(s, "region_uuid:%s ", ev.uuid) <= 0)
> +		return -1;
> +
> +	if (tep_get_field_val(s, event, "hpa", record, &val, 1) < 0)
> +		return -1;
> +	ev.hpa = val;
> +	if (trace_seq_printf(s, "poison list: hpa:0x%llx ", (unsigned long long)ev.hpa) <= 0)
> +		return -1;
> +
> +	if (tep_get_field_val(s, event, "dpa", record, &val, 1) < 0)
> +		return -1;
> +	ev.dpa = val;
> +	if (trace_seq_printf(s, "dpa:0x%llx ", (unsigned long long)ev.dpa) <= 0)
> +		return -1;
> +
> +	if (tep_get_field_val(s, event, "length", record, &val, 1) < 0)
> +		return -1;
> +	ev.length = val;
> +	if (trace_seq_printf(s, "length:%d ", ev.length) <= 0)
> +		return -1;
> +
> +	if (tep_get_field_val(s,  event, "source", record, &val, 1) < 0)
> +		return -1;
> +
> +	switch (val) {
> +	case CXL_POISON_SOURCE_UNKNOWN:
> +		ev.source = "Unknown";
> +		break;
> +	case CXL_POISON_SOURCE_EXTERNAL:
> +		ev.source = "External";
> +		break;
> +	case CXL_POISON_SOURCE_INTERNAL:
> +		ev.source = "Internal";
> +		break;
> +	case CXL_POISON_SOURCE_INJECTED:
> +		ev.source = "Injected";
> +		break;
> +	case CXL_POISON_SOURCE_VENDOR:
> +		ev.source = "Vendor";
> +		break;
> +	default:
> +		ev.source = "Invalid";
> +	}
> +	if (trace_seq_printf(s, "source:%s ", ev.source) <= 0)
> +		return -1;
> +
> +	if (tep_get_field_val(s,  event, "flags", record, &val, 1) < 0)
> +		return -1;
> +	ev.flags = val;
> +	if (trace_seq_printf(s, "flags:%d ", ev.flags) <= 0)
> +		return -1;
> +
> +	if (ev.flags & CXL_POISON_FLAG_OVERFLOW) {
> +		if (tep_get_field_val(s,  event, "overflow_t", record, &val, 1) < 0)
> +			return -1;
> +		if (val) {
> +			/* CXL Specification 3.0
> +			 * Overflow timestamp - The number of unsigned nanoseconds
> +			 * that have elapsed since midnight, 01-Jan-1970 UTC
> +			 */
> +			time_t ovf_ts_secs = val / 1000000000ULL;
> +
> +			tm = localtime(&ovf_ts_secs);
> +			if (tm) {
> +				strftime(ev.overflow_ts, sizeof(ev.overflow_ts),
> +					 "%Y-%m-%d %H:%M:%S %z", tm);
> +			}
> +		}
> +		if (!val || !tm)
> +			strncpy(ev.overflow_ts, "1970-01-01 00:00:00 +0000",
> +				sizeof(ev.overflow_ts));
> +	} else
> +		strncpy(ev.overflow_ts, "1970-01-01 00:00:00 +0000", sizeof(ev.overflow_ts));
> +	if (trace_seq_printf(s, "overflow timestamp:%s\n", ev.overflow_ts) <= 0)
> +		return -1;
> +
> +	/* Insert data into the SGBD */
> +#ifdef HAVE_SQLITE3
> +	ras_store_cxl_poison_event(ras, &ev);
> +#endif
> +
> +#ifdef HAVE_ABRT_REPORT
> +	/* Report event to ABRT */
> +	ras_report_cxl_poison_event(ras, &ev);
> +#endif
> +
> +	return 0;
> +}
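
A note on the overflow timestamp handling above: the field is a count of
nanoseconds since the epoch, so the conversion is a divide by 10^9 plus a
strftime() call, with the epoch string as the fallback. A minimal
standalone sketch of that step follows. format_overflow_ts() and EPOCH_TS
are hypothetical names, not part of the patch, and the sketch uses
gmtime() (UTC) where the patch uses localtime() with %z, so that the
result does not depend on the local timezone.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NSEC_PER_SEC	1000000000ULL
#define EPOCH_TS	"1970-01-01 00:00:00 +0000"

/*
 * Format a CXL overflow timestamp (nanoseconds since the Unix epoch) as
 * "YYYY-MM-DD HH:MM:SS +0000" in UTC. Falls back to the epoch string when
 * the value is zero or cannot be converted.
 * Hypothetical helper for illustration, not part of the patch.
 */
static void format_overflow_ts(uint64_t ns, char *buf, size_t len)
{
	time_t secs = (time_t)(ns / NSEC_PER_SEC);
	struct tm *tm = ns ? gmtime(&secs) : NULL;

	if (!len)
		return;
	if (!tm || !strftime(buf, len, "%Y-%m-%d %H:%M:%S +0000", tm))
		snprintf(buf, len, "%s", EPOCH_TS);
}
```

With a 64-byte buffer, 1675361924000000000 ns formats as
"2023-02-02 18:18:44 +0000", and 0 yields the epoch string.
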
> diff --git a/ras-cxl-handler.h b/ras-cxl-handler.h
> new file mode 100644
> index 0000000..84d5cc6
> --- /dev/null
> +++ b/ras-cxl-handler.h
> @@ -0,0 +1,24 @@
> +/*
> + * Copyright (c) Huawei Technologies Co., Ltd. 2023. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#ifndef __RAS_CXL_HANDLER_H
> +#define __RAS_CXL_HANDLER_H
> +
> +#include "ras-events.h"
> +#include <traceevent/event-parse.h>
> +
> +int ras_cxl_poison_event_handler(struct trace_seq *s,
> +				 struct tep_record *record,
> +				 struct tep_event *event, void *context);
> +#endif
> diff --git a/ras-events.c b/ras-events.c
> index 39f9ce2..6555125 100644
> --- a/ras-events.c
> +++ b/ras-events.c
> @@ -40,6 +40,7 @@
>  #include "ras-devlink-handler.h"
>  #include "ras-diskerror-handler.h"
>  #include "ras-memory-failure-handler.h"
> +#include "ras-cxl-handler.h"
>  #include "ras-record.h"
>  #include "ras-logger.h"
>  #include "ras-page-isolation.h"
> @@ -243,6 +244,10 @@ int toggle_ras_mc_event(int enable)
>  	rc |= __toggle_ras_mc_event(ras, "ras", "memory_failure_event", enable);
>  #endif
>  
> +#ifdef HAVE_CXL
> +	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_poison", enable);
> +#endif
> +
>  free_ras:
>  	free(ras);
>  	return rc;
> @@ -951,6 +956,16 @@ int handle_ras_events(int record_events)
>  		    "ras", "memory_failure_event");
>  #endif
>  
> +#ifdef HAVE_CXL
> +	rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_poison",
> +			       ras_cxl_poison_event_handler, NULL, CXL_POISON_EVENT);
> +	if (!rc)
> +		num_events++;
> +	else
> +		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
> +		    "cxl", "cxl_poison");
> +#endif
> +
>  	if (!num_events) {
>  		log(ALL, LOG_INFO,
>  		    "Failed to trace all supported RAS events. Aborting.\n");
> diff --git a/ras-events.h b/ras-events.h
> index 6c9f507..fc51070 100644
> --- a/ras-events.h
> +++ b/ras-events.h
> @@ -39,6 +39,7 @@ enum {
>  	DEVLINK_EVENT,
>  	DISKERROR_EVENT,
>  	MF_EVENT,
> +	CXL_POISON_EVENT,
>  	NR_EVENTS
>  };
>  
> diff --git a/ras-record.c b/ras-record.c
> index a367939..f54fb41 100644
> --- a/ras-record.c
> +++ b/ras-record.c
> @@ -559,6 +559,67 @@ int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev)
>  }
>  #endif
>  
> +#ifdef HAVE_CXL
> +/*
> + * Table and functions to handle cxl:cxl_poison
> + */
> +static const struct db_fields cxl_poison_event_fields[] = {
> +	{ .name = "id",                   .type = "INTEGER PRIMARY KEY" },
> +	{ .name = "timestamp",            .type = "TEXT" },
> +	{ .name = "memdev",               .type = "TEXT" },
> +	{ .name = "pcidev",               .type = "TEXT" },
> +	{ .name = "region",               .type = "TEXT" },
> +	{ .name = "region_uuid",          .type = "TEXT" },
> +	{ .name = "hpa",                  .type = "INTEGER" },
> +	{ .name = "dpa",                  .type = "INTEGER" },
> +	{ .name = "length",               .type = "INTEGER" },
> +	{ .name = "source",               .type = "TEXT" },
> +	{ .name = "flags",                .type = "INTEGER" },
> +	{ .name = "overflow_ts",          .type = "TEXT" },
> +};
> +
> +static const struct db_table_descriptor cxl_poison_event_tab = {
> +	.name = "cxl_poison_event",
> +	.fields = cxl_poison_event_fields,
> +	.num_fields = ARRAY_SIZE(cxl_poison_event_fields),
> +};
> +
> +int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev)
> +{
> +	int rc;
> +	struct sqlite3_priv *priv = ras->db_priv;
> +
> +	if (!priv || !priv->stmt_cxl_poison_event)
> +		return 0;
> +	log(TERM, LOG_INFO, "cxl_poison_event store: %p\n", priv->stmt_cxl_poison_event);
> +
> +	sqlite3_bind_text(priv->stmt_cxl_poison_event, 1, ev->timestamp, -1, NULL);
> +	sqlite3_bind_text(priv->stmt_cxl_poison_event, 2, ev->memdev, -1, NULL);
> +	sqlite3_bind_text(priv->stmt_cxl_poison_event, 3, ev->pcidev, -1, NULL);
> +	sqlite3_bind_text(priv->stmt_cxl_poison_event, 4, ev->region, -1, NULL);
> +	sqlite3_bind_text(priv->stmt_cxl_poison_event, 5, ev->uuid, -1, NULL);
> +	sqlite3_bind_int64(priv->stmt_cxl_poison_event, 6, ev->hpa);
> +	sqlite3_bind_int64(priv->stmt_cxl_poison_event, 7, ev->dpa);
> +	sqlite3_bind_int(priv->stmt_cxl_poison_event, 8, ev->length);
> +	sqlite3_bind_text(priv->stmt_cxl_poison_event, 9, ev->source, -1, NULL);
> +	sqlite3_bind_int(priv->stmt_cxl_poison_event, 10, ev->flags);
> +	sqlite3_bind_text(priv->stmt_cxl_poison_event, 11, ev->overflow_ts, -1, NULL);
> +
> +	rc = sqlite3_step(priv->stmt_cxl_poison_event);
> +	if (rc != SQLITE_OK && rc != SQLITE_DONE)
> +		log(TERM, LOG_ERR,
> +		    "Failed to do cxl_poison_event step on sqlite: error = %d\n", rc);
> +	rc = sqlite3_reset(priv->stmt_cxl_poison_event);
> +	if (rc != SQLITE_OK && rc != SQLITE_DONE)
> +		log(TERM, LOG_ERR,
> +		    "Failed to reset cxl_poison_event on sqlite: error = %d\n",
> +		    rc);
> +	log(TERM, LOG_INFO, "register inserted at db\n");
> +
> +	return rc;
> +}
> +#endif
> +
>  /*
>   * Generic code
>   */
> @@ -896,6 +957,16 @@ int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras)
>  	}
>  #endif
>  
> +#ifdef HAVE_CXL
> +	rc = ras_mc_create_table(priv, &cxl_poison_event_tab);
> +	if (rc == SQLITE_OK) {
> +		rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_poison_event,
> +					 &cxl_poison_event_tab);
> +		if (rc != SQLITE_OK)
> +			goto error;
> +	}
> +#endif
> +
>  	ras->db_priv = priv;
>  	return 0;
>  
> @@ -1008,6 +1079,16 @@ int ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras)
>  	}
>  #endif
>  
> +#ifdef HAVE_CXL
> +	if (priv->stmt_cxl_poison_event) {
> +		rc = sqlite3_finalize(priv->stmt_cxl_poison_event);
> +		if (rc != SQLITE_OK)
> +			log(TERM, LOG_ERR,
> +			    "cpu %u: Failed to finalize cxl_poison_event sqlite: error = %d\n",
> +			    cpu, rc);
> +	}
> +#endif
> +
>  	rc = sqlite3_close_v2(db);
>  	if (rc != SQLITE_OK)
>  		log(TERM, LOG_ERR,
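
On the bind indexes in ras_store_cxl_poison_event(): the table has twelve
columns but only eleven values are bound, starting at index 1 with
"timestamp". That is consistent with an INSERT statement that passes NULL
for the "id" primary key and one '?' for each remaining column. The sketch
below builds such a statement from a field list. It is an assumption about
the shape of the statement that ras_mc_prepare_stmt() prepares, not a copy
of it; struct field, append() and build_insert_sql() are names made up for
this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Same shape as struct db_fields in the patch: column name and SQL type. */
struct field {
	const char *name;
	const char *type;
};

/* Append a followed by b at sql + *off; -1 if the result does not fit. */
static int append(char *sql, size_t len, size_t *off,
		  const char *a, const char *b)
{
	int n = snprintf(sql + *off, len - *off, "%s%s", a, b);

	if (n < 0 || (size_t)n >= len - *off)
		return -1;
	*off += (size_t)n;
	return 0;
}

/*
 * Build "INSERT INTO <table> (c0, c1, ...) VALUES (NULL, ?, ...)".
 * Column 0 is assumed to be the INTEGER PRIMARY KEY and gets NULL so that
 * SQLite assigns it; column i (i >= 1) gets one '?' and is therefore bound
 * with parameter index i. Returns the number of '?' placeholders, or -1 if
 * there are no fields or the buffer is too small.
 */
static int build_insert_sql(char *sql, size_t len, const char *table,
			    const struct field *f, size_t nf)
{
	size_t off = 0, i;

	if (!len || !nf)
		return -1;
	if (append(sql, len, &off, "INSERT INTO ", table))
		return -1;
	for (i = 0; i < nf; i++)
		if (append(sql, len, &off, i ? ", " : " (", f[i].name))
			return -1;
	if (append(sql, len, &off, ") VALUES (NULL", ""))
		return -1;
	for (i = 1; i < nf; i++)
		if (append(sql, len, &off, ", ", "?"))
			return -1;
	if (append(sql, len, &off, ")", ""))
		return -1;
	return (int)(nf - 1);
}
```

Under that assumption, the twelve entries of cxl_poison_event_fields give
eleven placeholders, which matches bind indexes 1 to 11 in the store
function.
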
> diff --git a/ras-record.h b/ras-record.h
> index 219f10b..e5bf483 100644
> --- a/ras-record.h
> +++ b/ras-record.h
> @@ -114,6 +114,20 @@ struct ras_mf_event {
>  	const char *action_result;
>  };
>  
> +struct ras_cxl_poison_event {
> +	char timestamp[64];
> +	const char *memdev;
> +	const char *pcidev;
> +	const char *region;
> +	const char *uuid;
> +	uint64_t hpa;
> +	uint64_t dpa;
> +	uint32_t length;
> +	const char *source;
> +	uint8_t flags;
> +	char overflow_ts[64];
> +};
> +
>  struct ras_mc_event;
>  struct ras_aer_event;
>  struct ras_extlog_event;
> @@ -123,6 +137,7 @@ struct mce_event;
>  struct devlink_event;
>  struct diskerror_event;
>  struct ras_mf_event;
> +struct ras_cxl_poison_event;
>  
>  #ifdef HAVE_SQLITE3
>  
> @@ -155,6 +170,9 @@ struct sqlite3_priv {
>  #ifdef HAVE_MEMORY_FAILURE
>  	sqlite3_stmt	*stmt_mf_event;
>  #endif
> +#ifdef HAVE_CXL
> +	sqlite3_stmt	*stmt_cxl_poison_event;
> +#endif
>  };
>  
>  struct db_fields {
> @@ -182,6 +200,7 @@ int ras_store_arm_record(struct ras_events *ras, struct ras_arm_event *ev);
>  int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev);
>  int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
>  int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
> +int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
>  
>  #else
>  static inline int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras) { return 0; };
> @@ -195,6 +214,7 @@ static inline int ras_store_arm_record(struct ras_events *ras, struct ras_arm_ev
>  static inline int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev) { return 0; };
>  static inline int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
>  static inline int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
> +static inline int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
>  
>  #endif
>  
> diff --git a/ras-report.c b/ras-report.c
> index 62d5eb7..589e640 100644
> --- a/ras-report.c
> +++ b/ras-report.c
> @@ -331,6 +331,42 @@ static int set_mf_event_backtrace(char *buf, struct ras_mf_event *ev)
>  	return 0;
>  }
>  
> +static int set_cxl_poison_event_backtrace(char *buf, struct ras_cxl_poison_event *ev)
> +{
> +	char bt_buf[MAX_BACKTRACE_SIZE];
> +
> +	if (!buf || !ev)
> +		return -1;
> +
> +	sprintf(bt_buf, "BACKTRACE="	\
> +						"timestamp=%s\n"	\
> +						"memdev=%s\n"		\
> +						"pcidev=%s\n"		\
> +						"region=%s\n"		\
> +						"uuid=%s\n"		\
> +						"hpa=0x%lx\n"		\
> +						"dpa=0x%lx\n"		\
> +						"length=%d\n"		\
> +						"source=%s\n"		\
> +						"flags=%d\n"		\
> +						"overflow_timestamp=%s\n", \
> +						ev->timestamp,		\
> +						ev->memdev,		\
> +						ev->pcidev,		\
> +						ev->region,		\
> +						ev->uuid,		\
> +						ev->hpa,		\
> +						ev->dpa,		\
> +						ev->length,		\
> +						ev->source,		\
> +						ev->flags,		\
> +						ev->overflow_ts);
> +
> +	strcat(buf, bt_buf);
> +
> +	return 0;
> +}
> +
>  static int commit_report_backtrace(int sockfd, int type, void *ev){
>  	char buf[MAX_BACKTRACE_SIZE];
>  	char *pbuf = buf;
> @@ -368,6 +404,9 @@ static int commit_report_backtrace(int sockfd, int type, void *ev){
>  	case MF_EVENT:
>  		rc = set_mf_event_backtrace(buf, (struct ras_mf_event *)ev);
>  		break;
> +	case CXL_POISON_EVENT:
> +		rc = set_cxl_poison_event_backtrace(buf, (struct ras_cxl_poison_event *)ev);
> +		break;
>  	default:
>  		return -1;
>  	}
> @@ -776,3 +815,47 @@ mf_fail:
>  	else
>  		return -1;
>  }
> +
> +int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev)
> +{
> +	char buf[MAX_MESSAGE_SIZE];
> +	int sockfd = 0;
> +	int done = 0;
> +	int rc = -1;
> +
> +	memset(buf, 0, sizeof(buf));
> +
> +	sockfd = setup_report_socket();
> +	if (sockfd < 0)
> +		return -1;
> +
> +	rc = commit_report_basic(sockfd);
> +	if (rc < 0)
> +		goto cxl_poison_fail;
> +
> +	rc = commit_report_backtrace(sockfd, CXL_POISON_EVENT, ev);
> +	if (rc < 0)
> +		goto cxl_poison_fail;
> +
> +	sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl-poison");
> +	rc = write(sockfd, buf, strlen(buf) + 1);
> +	if (rc < 0 || (size_t)rc < strlen(buf) + 1)
> +		goto cxl_poison_fail;
> +
> +	sprintf(buf, "REASON=%s", "CXL poison");
> +	rc = write(sockfd, buf, strlen(buf) + 1);
> +	if (rc < 0 || (size_t)rc < strlen(buf) + 1)
> +		goto cxl_poison_fail;
> +
> +	done = 1;
> +
> +cxl_poison_fail:
> +
> +	if (sockfd >= 0)
> +		close(sockfd);
> +
> +	if (done)
> +		return 0;
> +	else
> +		return -1;
> +}
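
On checking write() results in report code like the function above:
write() returns a signed count and strlen() an unsigned one. If a signed
result is compared directly against strlen(buf) + 1, a return of -1 is
converted to a very large unsigned value and the comparison does not flag
the failure. Testing for the negative case first avoids this. write_field()
below is a hypothetical helper for illustration, not something rasdaemon
provides.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write a NUL-terminated string, including its terminator, to fd.
 * Returns 0 only if the whole string went out in a single write() call,
 * -1 on error or on a short write.
 * Hypothetical helper for illustration, not part of the patch.
 */
static int write_field(int fd, const char *s)
{
	size_t len = strlen(s) + 1;
	ssize_t rc = write(fd, s, len);

	/* Check for failure before comparing against the unsigned length. */
	if (rc < 0 || (size_t)rc != len)
		return -1;
	return 0;
}
```

A caller would use it as write_field(sockfd, "REASON=CXL poison") and
jump to its failure label on a non-zero return.
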
> diff --git a/ras-report.h b/ras-report.h
> index e605eb1..d1591ce 100644
> --- a/ras-report.h
> +++ b/ras-report.h
> @@ -39,6 +39,7 @@ int ras_report_arm_event(struct ras_events *ras, struct ras_arm_event *ev);
>  int ras_report_devlink_event(struct ras_events *ras, struct devlink_event *ev);
>  int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
>  int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
> +int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
>  
>  #else
>  
> @@ -50,6 +51,7 @@ static inline int ras_report_arm_event(struct ras_events *ras, struct ras_arm_ev
>  static inline int ras_report_devlink_event(struct ras_events *ras, struct devlink_event *ev) { return 0; };
>  static inline int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
>  static inline int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
> +static inline int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
>  
>  #endif
>  
> -- 
> 2.25.1
> 


* RE: [RESEND PATCH V3 2/4] rasdaemon: Add support for the CXL poison events
  2023-02-02 19:41   ` Alison Schofield
@ 2023-02-03  9:31     ` Shiju Jose
  0 siblings, 0 replies; 7+ messages in thread
From: Shiju Jose @ 2023-02-03  9:31 UTC (permalink / raw)
  To: Alison Schofield
  Cc: mchehab@kernel.org, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, Jonathan Cameron, Linuxarm

>-----Original Message-----
>From: Alison Schofield <alison.schofield@intel.com>
>Sent: 02 February 2023 19:42
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: mchehab@kernel.org; linux-edac@vger.kernel.org; linux-
>cxl@vger.kernel.org; Jonathan Cameron <jonathan.cameron@huawei.com>;
>Linuxarm <linuxarm@huawei.com>
>Subject: Re: [RESEND PATCH V3 2/4] rasdaemon: Add support for the CXL poison
>events
>
>On Thu, Feb 02, 2023 at 06:18:44PM +0000, shiju.jose@huawei.com wrote:
>> From: Shiju Jose <shiju.jose@huawei.com>
>>
>> Add support to log and record the CXL poison events.
>>
>> The corresponding kernel patches are here:
>> https://lore.kernel.org/linux-cxl/de11785ff05844299b40b100f8e0f56c7eef7f08.1674070170.git.alison.schofield@intel.com/
>>
>> This is presently an RFC draft that only logs the events. It could be
>> extended with policy-based recovery actions for frequent poison
>> events, depending on the above kernel patches.
>
>Hi Shiju,
>
>Looks good to me based on the kernel patches you reference above.
>I want to let you know that a v6 is in the works, and will lead to some naming
>changes below - but minor stuff.

Hi Alison,

Thanks for the information. I will make changes according to the v6.
>
>Alison
>

Thanks,
Shiju





Thread overview: 7+ messages
2023-02-02 18:18 [RESEND PATCH V3 0/4] rasdaemon: Add support for the CXL error events shiju.jose
2023-02-02 18:18 ` [RESEND PATCH V3 1/4] rasdaemon: Move definition for BIT and BIT_ULL to a common file shiju.jose
2023-02-02 18:18 ` [RESEND PATCH V3 2/4] rasdaemon: Add support for the CXL poison events shiju.jose
2023-02-02 19:41   ` Alison Schofield
2023-02-03  9:31     ` Shiju Jose
2023-02-02 18:18 ` [RESEND PATCH V3 3/4] rasdaemon: Add support for the CXL AER uncorrectable errors shiju.jose
2023-02-02 18:18 ` [RESEND PATCH V3 4/4] rasdaemon: Add support for the CXL AER correctable errors shiju.jose
