Linux CXL
* [ndctl PATCH 0/3] Support poison list retrieval
@ 2022-11-11  3:20 alison.schofield
  2022-11-11  3:20 ` [ndctl PATCH 1/5] libcxl: add interfaces for GET_POISON_LIST mailbox commands alison.schofield
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: alison.schofield @ 2022-11-11  3:20 UTC (permalink / raw)
  To: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky
  Cc: Alison Schofield, nvdimm, linux-cxl

From: Alison Schofield <alison.schofield@intel.com>

Changes RFC->v1:
- Resync with DaveJ's v5 monitor patchset. [1]
  (It provides the event tracing functionality used here.)
- Resync with the kernel patchset adding poison list support. [2]
- Add cxl-get-poison.sh unit test to cxl test suite.
- JSON object naming cleanups, replace spaces with '_'.
- Use common event pid field to restrict events to this cxl list instance.
- Use json_object_get_int64() for addresses.
- Remove empty hpa fields. Add back with dpa->hpa translation.

[1] https://lore.kernel.org/linux-cxl/166803877747.145141.11853418648969334939.stgit@djiang5-desk3.ch.intel.com/
[2] https://lore.kernel.org/linux-cxl/cover.1668115235.git.alison.schofield@intel.com/

The first patch adds a libcxl API for triggering the read of a
poison list from a memory device. Users of that API will need to
trace the kernel events to collect the error records.

Patch 2 adds a PID filtering option to event tracing, and then
patches 3 & 4 add a pretty option, --media-errors, to cxl list.
The last patch (5) adds a unit test to the cxl test suite.

Examples:
# cxl list -m mem2 --media-errors
[
  {
    "memdev":"mem2",
    "pmem_size":1073741824,
    "ram_size":0,
    "serial":2,
    "host":"cxl_mem.2",
    "media_errors":{
      "nr_media_errors":2,
      "media_error_records":[
        {
          "dpa":64,
          "length":128,
          "source":"Injected",
          "flags":"Overflow,",
          "overflow_time":1656711046
        },
        {
          "dpa":192,
          "length":192,
          "source":"Internal",
          "flags":"Overflow,",
          "overflow_time":1656711046
        }
      ]
    }
  }
]

# cxl list -r region5 --media-errors
[
  {
    "region":"region5",
    "resource":1035623989248,
    "size":2147483648,
    "interleave_ways":2,
    "interleave_granularity":4096,
    "decode_state":"commit",
    "media_errors":{
      "nr_media_errors":2,
      "media_error_records":[
        {
          "memdev":"mem2",
          "dpa":0,
          "length":64,
          "source":"Internal",
          "flags":"",
          "overflow_time":0
        },
        {
          "memdev":"mem5",
          "dpa":0,
          "length":256,
          "source":"Injected",
          "flags":"",
          "overflow_time":0
        }
      ]
    }
  }
]

Alison Schofield (5):
  libcxl: add interfaces for GET_POISON_LIST mailbox commands
  cxl: add an optional pid check to event parsing
  cxl/list: collect and parse the poison list records
  cxl/list: add --media-errors option to cxl list
  test: add a cxl-get-poison test

 Documentation/cxl/cxl-list.txt |  64 ++++++++++++
 cxl/event_trace.c              |   5 +
 cxl/event_trace.h              |   1 +
 cxl/filter.c                   |   2 +
 cxl/filter.h                   |   1 +
 cxl/json.c                     | 185 +++++++++++++++++++++++++++++++++
 cxl/lib/libcxl.c               |  44 ++++++++
 cxl/lib/libcxl.sym             |   6 ++
 cxl/libcxl.h                   |   2 +
 cxl/list.c                     |   2 +
 test/cxl-get-poison.sh         |  78 ++++++++++++++
 test/meson.build               |   2 +
 12 files changed, 392 insertions(+)
 create mode 100644 test/cxl-get-poison.sh

-- 
2.37.3



* [ndctl PATCH 1/5] libcxl: add interfaces for GET_POISON_LIST mailbox commands
  2022-11-11  3:20 [ndctl PATCH 0/3] Support poison list retrieval alison.schofield
@ 2022-11-11  3:20 ` alison.schofield
  2022-11-16 12:56   ` Jonathan Cameron
  2022-11-11  3:20 ` [ndctl PATCH 2/5] cxl: add an optional pid check to event parsing alison.schofield
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: alison.schofield @ 2022-11-11  3:20 UTC (permalink / raw)
  To: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky
  Cc: Alison Schofield, nvdimm, linux-cxl

From: Alison Schofield <alison.schofield@intel.com>

CXL devices maintain a list of locations that are poisoned or result
in poison if the addresses are accessed by the host.

Per the spec (CXL 3.0 8.2.9.8.4.1), the device returns this Poison
list as a set of Media Error Records that include the source of the
error, the starting device physical address and length.
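
For illustration only, a minimal sketch of how one Media Error Record
could be unpacked into the three fields named above. The layout used
here (source in bits[2:0] of the address field, DPA in bits[63:6],
length counted in 64-byte units) is an assumption based on a reading of
the spec, and the struct and helper names are hypothetical, not libcxl
API. Byte-order conversion from the little-endian device payload is left
out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout of one Media Error Record, already in host byte order.
 * Names are hypothetical and exist only for this sketch.
 */
struct media_error_record {
	uint64_t address;  /* bits[2:0]: source, bits[63:6]: DPA[63:6] */
	uint32_t length;   /* number of 64-byte units */
	uint32_t reserved;
};

static_assert(sizeof(struct media_error_record) == 16,
	      "record is expected to be 16 bytes");

#define POISON_SOURCE_MASK 0x7ULL
#define POISON_DPA_MASK    (~0x3fULL)
#define POISON_LEN_UNIT    64ULL

/* Error source encoded in the low three bits of the address field */
static inline unsigned int record_source(const struct media_error_record *r)
{
	return (unsigned int)(r->address & POISON_SOURCE_MASK);
}

/* Starting device physical address, 64-byte aligned */
static inline uint64_t record_dpa(const struct media_error_record *r)
{
	return r->address & POISON_DPA_MASK;
}

/* Length of the poisoned range in bytes */
static inline uint64_t record_len_bytes(const struct media_error_record *r)
{
	return (uint64_t)r->length * POISON_LEN_UNIT;
}
```

Under these assumptions, a record with address 0x43 and length 2 decodes
to source 3, dpa 64, and a length of 128 bytes.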

Trigger the retrieval of the poison list by writing to the device
sysfs attribute: trigger_poison_list.

Retrieval is offered by memdev or by region:
int cxl_memdev_trigger_poison_list(struct cxl_memdev *memdev);
int cxl_region_trigger_poison_list(struct cxl_region *region);

This interface triggers the retrieval of the poison list from the
devices and logs the error records as kernel trace events named
'cxl_poison'.
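
As a rough sketch of what the trigger amounts to, the helper below
builds the attribute path and writes "1\n" to it using plain POSIX
calls. It is not the libcxl implementation: trigger_poison_list() here
is a hypothetical stand-alone function, the caller is assumed to pass
the sysfs directory of the memdev or region, and the attribute is
assumed to exist already (the file is never created).

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write "1\n" to <dev_path>/trigger_poison_list */
static int trigger_poison_list(const char *dev_path)
{
	const char *val = "1\n";
	char path[4096];
	ssize_t n;
	int fd, len;

	assert(dev_path != NULL);

	/* snprintf() returns the untruncated length; detect truncation */
	len = snprintf(path, sizeof(path), "%s/trigger_poison_list",
		       dev_path);
	if (len < 0 || (size_t)len >= sizeof(path))
		return -ENXIO;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, val, strlen(val));
	if (n < 0) {
		int err = errno;

		close(fd);
		return -err;
	}
	close(fd);

	/* A short write is treated as an I/O error in this sketch */
	return n == (ssize_t)strlen(val) ? 0 : -EIO;
}
```

The return convention (0 on success, negative errno on failure) follows
the functions in the patch. The error records themselves still have to
be collected from the 'cxl_poison' trace events.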

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
 cxl/lib/libcxl.c   | 44 ++++++++++++++++++++++++++++++++++++++++++++
 cxl/lib/libcxl.sym |  6 ++++++
 cxl/libcxl.h       |  2 ++
 3 files changed, 52 insertions(+)

diff --git a/cxl/lib/libcxl.c b/cxl/lib/libcxl.c
index e8c5d4444dd0..1a8a8eb0ffcb 100644
--- a/cxl/lib/libcxl.c
+++ b/cxl/lib/libcxl.c
@@ -1331,6 +1331,50 @@ CXL_EXPORT int cxl_memdev_disable_invalidate(struct cxl_memdev *memdev)
 	return 0;
 }
 
+CXL_EXPORT int cxl_memdev_trigger_poison_list(struct cxl_memdev *memdev)
+{
+	struct cxl_ctx *ctx = cxl_memdev_get_ctx(memdev);
+	char *path = memdev->dev_buf;
+	int len = memdev->buf_len, rc;
+
+	if (snprintf(path, len, "%s/trigger_poison_list", memdev->dev_path) >=
+	    len) {
+		err(ctx, "%s: buffer too small\n",
+		    cxl_memdev_get_devname(memdev));
+		return -ENXIO;
+	}
+	rc = sysfs_write_attr(ctx, path, "1\n");
+	if (rc < 0) {
+		fprintf(stderr,
+			"%s: Failed write sysfs attr trigger_poison_list\n",
+			cxl_memdev_get_devname(memdev));
+		return rc;
+	}
+	return 0;
+}
+
+CXL_EXPORT int cxl_region_trigger_poison_list(struct cxl_region *region)
+{
+	struct cxl_ctx *ctx = cxl_region_get_ctx(region);
+	char *path = region->dev_buf;
+	int len = region->buf_len, rc;
+
+	if (snprintf(path, len, "%s/trigger_poison_list", region->dev_path) >=
+	    len) {
+		err(ctx, "%s: buffer too small\n",
+		    cxl_region_get_devname(region));
+		return -ENXIO;
+	}
+	rc = sysfs_write_attr(ctx, path, "1\n");
+	if (rc < 0) {
+		fprintf(stderr,
+			"%s: Failed write sysfs attr trigger_poison_list\n",
+			cxl_region_get_devname(region));
+		return rc;
+	}
+	return 0;
+}
+
 CXL_EXPORT int cxl_memdev_enable(struct cxl_memdev *memdev)
 {
 	struct cxl_ctx *ctx = cxl_memdev_get_ctx(memdev);
diff --git a/cxl/lib/libcxl.sym b/cxl/lib/libcxl.sym
index 8bb91e05638b..ecf98e6c7af2 100644
--- a/cxl/lib/libcxl.sym
+++ b/cxl/lib/libcxl.sym
@@ -217,3 +217,9 @@ global:
 	cxl_decoder_get_max_available_extent;
 	cxl_decoder_get_region;
 } LIBCXL_2;
+
+LIBCXL_4 {
+global:
+	cxl_memdev_trigger_poison_list;
+	cxl_region_trigger_poison_list;
+} LIBCXL_3;
diff --git a/cxl/libcxl.h b/cxl/libcxl.h
index 9fe4e99263dd..5ebdf0879325 100644
--- a/cxl/libcxl.h
+++ b/cxl/libcxl.h
@@ -375,6 +375,8 @@ enum cxl_setpartition_mode {
 
 int cxl_cmd_partition_set_mode(struct cxl_cmd *cmd,
 		enum cxl_setpartition_mode mode);
+int cxl_memdev_trigger_poison_list(struct cxl_memdev *memdev);
+int cxl_region_trigger_poison_list(struct cxl_region *region);
 
 #ifdef __cplusplus
 } /* extern "C" */
-- 
2.37.3



* [ndctl PATCH 2/5] cxl: add an optional pid check to event parsing
  2022-11-11  3:20 [ndctl PATCH 0/3] Support poison list retrieval alison.schofield
  2022-11-11  3:20 ` [ndctl PATCH 1/5] libcxl: add interfaces for GET_POISON_LIST mailbox commands alison.schofield
@ 2022-11-11  3:20 ` alison.schofield
  2022-11-16 12:57   ` Jonathan Cameron
  2022-11-11  3:20 ` [ndctl PATCH 3/5] cxl/list: collect and parse the poison list records alison.schofield
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: alison.schofield @ 2022-11-11  3:20 UTC (permalink / raw)
  To: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky
  Cc: Alison Schofield, nvdimm, linux-cxl

From: Alison Schofield <alison.schofield@intel.com>

When parsing CXL events, callers may only be interested in events
that originate from the current process. Introduce an optional
argument to the event trace context: event_pid. When event_pid is
present, only include events with a matching pid in the returned
JSON list. It is not a failure to see other, non-matching results.
Simply skip those.

The initial use case for this is the listing of media errors,
where only the media errors requested by this process are wanted.
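
The filter rule described above can be sketched in isolation as
follows. struct pid_filter and event_wanted() are hypothetical names
used only for this illustration; in the patch the check lives in
cxl_event_parse() and the pid comes from tep_data_pid().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct event_ctx */
struct pid_filter {
	int event_pid; /* optional: 0 means no pid filtering */
};

/* Return true when an event recorded for record_pid should be kept */
static bool event_wanted(const struct pid_filter *f, int record_pid)
{
	assert(f != NULL);

	/* No pid requested: every event passes */
	if (f->event_pid == 0)
		return true;

	/* Otherwise keep only events from the requested pid */
	return f->event_pid == record_pid;
}
```

A caller interested only in its own events would set event_pid to
getpid(), as patch 3 does. A non-matching event is skipped, not treated
as an error.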

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
 cxl/event_trace.c | 5 +++++
 cxl/event_trace.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/cxl/event_trace.c b/cxl/event_trace.c
index a973a1f62d35..70ab892bbfcb 100644
--- a/cxl/event_trace.c
+++ b/cxl/event_trace.c
@@ -207,6 +207,11 @@ static int cxl_event_parse(struct tep_event *event, struct tep_record *record,
 			return 0;
 	}
 
+	if (event_ctx->event_pid) {
+		if (event_ctx->event_pid != tep_data_pid(event->tep, record))
+			return 0;
+	}
+
 	if (event_ctx->parse_event)
 		return event_ctx->parse_event(event, record,
 					      &event_ctx->jlist_head);
diff --git a/cxl/event_trace.h b/cxl/event_trace.h
index ec6267202c8b..7f7773b2201f 100644
--- a/cxl/event_trace.h
+++ b/cxl/event_trace.h
@@ -15,6 +15,7 @@ struct event_ctx {
 	const char *system;
 	struct list_head jlist_head;
 	const char *event_name; /* optional */
+	int event_pid; /* optional */
 	int (*parse_event)(struct tep_event *event, struct tep_record *record,
 			   struct list_head *jlist_head); /* optional */
 };
-- 
2.37.3



* [ndctl PATCH 3/5] cxl/list: collect and parse the poison list records
  2022-11-11  3:20 [ndctl PATCH 0/3] Support poison list retrieval alison.schofield
  2022-11-11  3:20 ` [ndctl PATCH 1/5] libcxl: add interfaces for GET_POISON_LIST mailbox commands alison.schofield
  2022-11-11  3:20 ` [ndctl PATCH 2/5] cxl: add an optional pid check to event parsing alison.schofield
@ 2022-11-11  3:20 ` alison.schofield
  2022-11-11  3:20 ` [ndctl PATCH 4/5] cxl/list: add --media-errors option to cxl list alison.schofield
  2022-11-11  3:20 ` [ndctl PATCH 5/5] test: add a cxl-get-poison test alison.schofield
  4 siblings, 0 replies; 12+ messages in thread
From: alison.schofield @ 2022-11-11  3:20 UTC (permalink / raw)
  To: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky
  Cc: Alison Schofield, nvdimm, linux-cxl

From: Alison Schofield <alison.schofield@intel.com>

When triggered, poison list error records are logged as events
in the kernel tracing subsystem. Trace, trigger, and parse the
events when the --media-errors option is selected in cxl list.

Include the total number of media-errors, even when zero.

The media-error records match the definition in the CXL 3.0
Spec Table 8.107.
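
To make the record translation concrete, here is a stand-alone sketch
of the source and flag decoding, using the same values and strings as
the diff below. The function names are hypothetical and json-c is left
out so that the example stays self-contained. The trailing comma in the
flag string (for example "Overflow,") mirrors the output shown in the
cover letter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Payload-out flag bits, same values as in the patch */
#define POISON_FLAG_MORE     (1u << 0)
#define POISON_FLAG_OVERFLOW (1u << 1)
#define POISON_FLAG_SCANNING (1u << 2)

/* Map the 3-bit poison source value to the string used in the JSON */
static const char *poison_source_name(int source)
{
	switch (source) {
	case 0: return "Unknown";
	case 1: return "External";
	case 2: return "Internal";
	case 3: return "Injected";
	case 7: return "Vendor";
	default: return "Reserved";
	}
}

/* Build the flag string, each set flag followed by a comma */
static void poison_flags_str(unsigned int flags, char *buf, size_t size)
{
	static const struct {
		unsigned int bit;
		const char *name;
	} names[] = {
		{ POISON_FLAG_MORE, "More," },
		{ POISON_FLAG_OVERFLOW, "Overflow," },
		{ POISON_FLAG_SCANNING, "Scanning," },
	};
	size_t i;

	assert(buf != NULL && size > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (!(flags & names[i].bit))
			continue;
		/* Append only if the name and the terminator still fit */
		if (strlen(buf) + strlen(names[i].name) < size)
			strcat(buf, names[i].name);
	}
}
```

The explicit size check is a choice made for this sketch. The patch
itself uses a fixed 32-byte buffer, which is large enough for all three
flag names.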

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
 cxl/json.c | 185 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 185 insertions(+)

diff --git a/cxl/json.c b/cxl/json.c
index 63c17519aba1..1b3c0bda6bda 100644
--- a/cxl/json.c
+++ b/cxl/json.c
@@ -2,14 +2,18 @@
 // Copyright (C) 2015-2021 Intel Corporation. All rights reserved.
 #include <limits.h>
 #include <util/json.h>
+#include <util/bitmap.h>
 #include <uuid/uuid.h>
 #include <cxl/libcxl.h>
 #include <json-c/json.h>
 #include <json-c/printbuf.h>
 #include <ccan/short_types/short_types.h>
+#include <traceevent/event-parse.h>
+#include <tracefs/tracefs.h>
 
 #include "filter.h"
 #include "json.h"
+#include "event_trace.h"
 
 static struct json_object *util_cxl_memdev_health_to_json(
 		struct cxl_memdev *memdev, unsigned long flags)
@@ -300,6 +304,167 @@ err_jobj:
 	return NULL;
 }
 
+/* CXL 8.2.9.5.4.1 Get Poison List: Poison Source */
+#define CXL_POISON_SOURCE_UNKNOWN 0
+#define CXL_POISON_SOURCE_EXTERNAL 1
+#define CXL_POISON_SOURCE_INTERNAL 2
+#define CXL_POISON_SOURCE_INJECTED 3
+#define CXL_POISON_SOURCE_VENDOR 7
+
+/* CXL 8.2.9.5.4.1 Get Poison List: Payload out flags */
+#define CXL_POISON_FLAG_MORE BIT(0)
+#define CXL_POISON_FLAG_OVERFLOW BIT(1)
+#define CXL_POISON_FLAG_SCANNING BIT(2)
+
+static struct json_object *
+util_cxl_poison_events_to_json(struct tracefs_instance *inst, bool is_region,
+			       unsigned long flags)
+{
+	struct json_object *jerrors, *jmedia, *jobj = NULL;
+	struct jlist_node *jnode, *next;
+	struct event_ctx ectx = {
+		.event_name = "cxl_poison",
+		.event_pid = getpid(),
+		.system = "cxl",
+	};
+	int rc, count = 0;
+
+	list_head_init(&ectx.jlist_head);
+	rc = cxl_parse_events(inst, &ectx);
+	if (rc < 0) {
+		fprintf(stderr, "Failed to parse events: %d\n", rc);
+		return NULL;
+	}
+	if (list_empty(&ectx.jlist_head))
+		return NULL;
+
+	jerrors = json_object_new_array();
+	if (!jerrors)
+		return NULL;
+
+	list_for_each_safe (&ectx.jlist_head, jnode, next, list) {
+		struct json_object *jval = NULL;
+		struct json_object *jp = NULL;
+		int source, pflags;
+		u64 addr, len;
+
+		jp = json_object_new_object();
+		if (!jp)
+			return NULL;
+
+		if (is_region) {
+			/* Per-region JSON includes memdev names */
+			if (json_object_object_get_ex(jnode->jobj, "memdev",
+						      &jval))
+				json_object_object_add(jp, "memdev", jval);
+		}
+		if (json_object_object_get_ex(jnode->jobj, "dpa", &jval)) {
+			addr = json_object_get_int64(jval);
+			jobj = util_json_object_hex(addr, flags);
+			json_object_object_add(jp, "dpa", jobj);
+		}
+		if (json_object_object_get_ex(jnode->jobj, "length", &jval)) {
+			len = json_object_get_int64(jval);
+			jobj = util_json_object_size(len, flags);
+			json_object_object_add(jp, "length", jobj);
+		}
+		if (json_object_object_get_ex(jnode->jobj, "source", &jval)) {
+			source = json_object_get_int(jval);
+			if (source == CXL_POISON_SOURCE_UNKNOWN)
+				jobj = json_object_new_string("Unknown");
+			else if (source == CXL_POISON_SOURCE_EXTERNAL)
+				jobj = json_object_new_string("External");
+			else if (source == CXL_POISON_SOURCE_INTERNAL)
+				jobj = json_object_new_string("Internal");
+			else if (source == CXL_POISON_SOURCE_INJECTED)
+				jobj = json_object_new_string("Injected");
+			else if (source == CXL_POISON_SOURCE_VENDOR)
+				jobj = json_object_new_string("Vendor");
+			else
+				jobj = json_object_new_string("Reserved");
+			json_object_object_add(jp, "source", jobj);
+		}
+		if (json_object_object_get_ex(jnode->jobj, "flags", &jval)) {
+			char flag_str[32] = { '\0' };
+
+			pflags = json_object_get_int(jval);
+			if (pflags & CXL_POISON_FLAG_MORE)
+				strcat(flag_str, "More,");
+			if (pflags & CXL_POISON_FLAG_OVERFLOW)
+				strcat(flag_str, "Overflow,");
+			if (pflags & CXL_POISON_FLAG_SCANNING)
+				strcat(flag_str, "Scanning,");
+			jobj = json_object_new_string(flag_str);
+			if (jobj)
+				json_object_object_add(jp, "flags", jobj);
+		}
+		if (json_object_object_get_ex(jnode->jobj, "overflow_t", &jval))
+			json_object_object_add(jp, "overflow_time", jval);
+
+		json_object_array_add(jerrors, jp);
+		count++;
+	} /* list_for_each_safe */
+
+	jmedia = json_object_new_object();
+	if (!jmedia)
+		return NULL;
+
+	/* Always return the count. If count is zero, no records follow. */
+	jobj = json_object_new_int(count);
+	if (jobj)
+		json_object_object_add(jmedia, "nr_media_errors", jobj);
+	if (count)
+		json_object_object_add(jmedia, "media_error_records", jerrors);
+
+	return jmedia;
+}
+
+struct cxl_media_err_ctx {
+	void *dev;
+	bool is_region;
+};
+
+static struct json_object *
+util_cxl_media_errors_to_json(struct cxl_media_err_ctx *mectx,
+			      unsigned long flags)
+{
+	struct json_object *jmedia = NULL;
+	struct tracefs_instance *inst;
+	int rc;
+
+	inst = tracefs_instance_create("cxl list");
+	if (!inst) {
+		fprintf(stderr, "tracefs_instance_create() failed\n");
+		return NULL;
+	}
+
+	rc = cxl_event_tracing_enable(inst, "cxl", "cxl_poison");
+	if (rc < 0) {
+		fprintf(stderr, "Failed to enable trace: %d\n", rc);
+		goto err_free;
+	}
+
+	if (mectx->is_region)
+		rc = cxl_region_trigger_poison_list(mectx->dev);
+	else
+		rc = cxl_memdev_trigger_poison_list(mectx->dev);
+	if (rc) {
+		fprintf(stderr, "Failed write of sysfs attribute: %d\n", rc);
+		goto err_free;
+	}
+
+	rc = cxl_event_tracing_disable(inst);
+	if (rc < 0) {
+		fprintf(stderr, "Failed to disable trace: %d\n", rc);
+		goto err_free;
+	}
+
+	jmedia = util_cxl_poison_events_to_json(inst, mectx->is_region, flags);
+err_free:
+	tracefs_instance_free(inst);
+	return jmedia;
+}
+
 struct json_object *util_cxl_memdev_to_json(struct cxl_memdev *memdev,
 		unsigned long flags)
 {
@@ -359,6 +524,16 @@ struct json_object *util_cxl_memdev_to_json(struct cxl_memdev *memdev,
 		if (jobj)
 			json_object_object_add(jdev, "partition_info", jobj);
 	}
+
+	if (flags & UTIL_JSON_MEDIA_ERRORS) {
+		struct cxl_media_err_ctx mectx = {
+			.dev = memdev,
+			.is_region = false,
+		};
+		jobj = util_cxl_media_errors_to_json(&mectx, flags);
+		if (jobj)
+			json_object_object_add(jdev, "media_errors", jobj);
+	}
 	return jdev;
 }
 
@@ -678,6 +853,16 @@ struct json_object *util_cxl_region_to_json(struct cxl_region *region,
 			json_object_object_add(jregion, "state", jobj);
 	}
 
+	if (flags & UTIL_JSON_MEDIA_ERRORS) {
+		struct cxl_media_err_ctx mectx = {
+			.dev = region,
+			.is_region = true,
+		};
+		jobj = util_cxl_media_errors_to_json(&mectx, flags);
+		if (jobj)
+			json_object_object_add(jregion, "media_errors", jobj);
+	}
+
 	util_cxl_mappings_append_json(jregion, region, flags);
 
 	return jregion;
-- 
2.37.3



* [ndctl PATCH 4/5] cxl/list: add --media-errors option to cxl list
  2022-11-11  3:20 [ndctl PATCH 0/3] Support poison list retrieval alison.schofield
                   ` (2 preceding siblings ...)
  2022-11-11  3:20 ` [ndctl PATCH 3/5] cxl/list: collect and parse the poison list records alison.schofield
@ 2022-11-11  3:20 ` alison.schofield
  2022-11-16 13:03   ` Jonathan Cameron
  2022-11-11  3:20 ` [ndctl PATCH 5/5] test: add a cxl-get-poison test alison.schofield
  4 siblings, 1 reply; 12+ messages in thread
From: alison.schofield @ 2022-11-11  3:20 UTC (permalink / raw)
  To: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky
  Cc: Alison Schofield, nvdimm, linux-cxl

From: Alison Schofield <alison.schofield@intel.com>

The --media-errors option to 'cxl list' retrieves poison lists
from memory devices (supporting the capability) and displays
the returned media-error records in the cxl list json. This
option can apply to memdevs or regions.
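
The plumbing for the option is small: a boolean set by the command line
parser becomes one bit in the flags word that the JSON helpers test.
The sketch below shows that pattern only. The SKETCH_* bit values and
the struct are hypothetical; the real UTIL_JSON_* definitions are not
part of this series and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values, chosen only for this illustration */
#define SKETCH_JSON_PARTITION    (1UL << 0)
#define SKETCH_JSON_MEDIA_ERRORS (1UL << 1)

/* Hypothetical subset of the filter parameters */
struct sketch_params {
	bool partition;
	bool media_errors;
};

/* Translate parsed options into a flags word, one bit per option */
static unsigned long sketch_params_to_flags(const struct sketch_params *p)
{
	unsigned long flags = 0;

	assert(p != NULL);
	if (p->partition)
		flags |= SKETCH_JSON_PARTITION;
	if (p->media_errors)
		flags |= SKETCH_JSON_MEDIA_ERRORS;
	return flags;
}
```

The listing code then only has to test the bit, as the memdev and
region paths in patch 3 do with UTIL_JSON_MEDIA_ERRORS.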

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
 Documentation/cxl/cxl-list.txt | 64 ++++++++++++++++++++++++++++++++++
 cxl/filter.c                   |  2 ++
 cxl/filter.h                   |  1 +
 cxl/list.c                     |  2 ++
 4 files changed, 69 insertions(+)

diff --git a/Documentation/cxl/cxl-list.txt b/Documentation/cxl/cxl-list.txt
index 14a2b4bb5c2a..24a0cf97cef2 100644
--- a/Documentation/cxl/cxl-list.txt
+++ b/Documentation/cxl/cxl-list.txt
@@ -344,6 +344,70 @@ OPTIONS
 --region::
 	Specify CXL region device name(s), or device id(s), to filter the listing.
 
+-a::
+--media-errors::
+	Include media-error information. The poison list is retrieved
+	from the device(s) and media error records are added to the
+	listing. This option applies to memdevs and regions where
+	devices support the poison list capability.
+
+----
+# cxl list -m mem11 --media-errors
+[
+  {
+    "memdev":"mem11",
+    "pmem_size":268435456,
+    "ram_size":0,
+    "serial":0,
+    "host":"0000:37:00.0",
+    "media_errors":{
+      "nr_media_errors":1,
+      "media_error_records":[
+        {
+          "dpa":0,
+          "length":64,
+          "source":"Internal",
+          "flags":"",
+          "overflow_time":0
+        }
+      ]
+    }
+  }
+]
+# cxl list -r region5 --media-errors
+[
+  {
+    "region":"region5",
+    "resource":1035623989248,
+    "size":2147483648,
+    "interleave_ways":2,
+    "interleave_granularity":4096,
+    "decode_state":"commit",
+    "media_errors":{
+      "nr_media_errors":2,
+      "media_error_records":[
+        {
+          "memdev":"mem2",
+          "dpa":0,
+          "length":64,
+          "source":"Internal",
+          "flags":"",
+          "overflow_time":0
+        },
+        {
+          "memdev":"mem5",
+          "dpa":0,
+          "length":512,
+          "source":"Vendor",
+          "flags":"",
+          "overflow_time":0
+        }
+      ]
+    }
+  }
+]
+----
+
 -v::
 --verbose::
 	Increase verbosity of the output. This can be specified
diff --git a/cxl/filter.c b/cxl/filter.c
index 56c659965891..fe6c29148fb4 100644
--- a/cxl/filter.c
+++ b/cxl/filter.c
@@ -686,6 +686,8 @@ static unsigned long params_to_flags(struct cxl_filter_params *param)
 		flags |= UTIL_JSON_TARGETS;
 	if (param->partition)
 		flags |= UTIL_JSON_PARTITION;
+	if (param->media_errors)
+		flags |= UTIL_JSON_MEDIA_ERRORS;
 	return flags;
 }
 
diff --git a/cxl/filter.h b/cxl/filter.h
index 256df49c3d0c..a92295fe2511 100644
--- a/cxl/filter.h
+++ b/cxl/filter.h
@@ -26,6 +26,7 @@ struct cxl_filter_params {
 	bool human;
 	bool health;
 	bool partition;
+	bool media_errors;
 	int verbose;
 	struct log_ctx ctx;
 };
diff --git a/cxl/list.c b/cxl/list.c
index 8c48fbbaaec3..df2ae5a3fec0 100644
--- a/cxl/list.c
+++ b/cxl/list.c
@@ -52,6 +52,8 @@ static const struct option options[] = {
 		    "include memory device health information"),
 	OPT_BOOLEAN('I', "partition", &param.partition,
 		    "include memory device partition information"),
+	OPT_BOOLEAN('a', "media-errors", &param.media_errors,
+		    "include media error information "),
 	OPT_INCR('v', "verbose", &param.verbose,
 		 "increase output detail"),
 #ifdef ENABLE_DEBUG
-- 
2.37.3



* [ndctl PATCH 5/5] test: add a cxl-get-poison test
  2022-11-11  3:20 [ndctl PATCH 0/3] Support poison list retrieval alison.schofield
                   ` (3 preceding siblings ...)
  2022-11-11  3:20 ` [ndctl PATCH 4/5] cxl/list: add --media-errors option to cxl list alison.schofield
@ 2022-11-11  3:20 ` alison.schofield
  4 siblings, 0 replies; 12+ messages in thread
From: alison.schofield @ 2022-11-11  3:20 UTC (permalink / raw)
  To: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky
  Cc: Alison Schofield, nvdimm, linux-cxl

From: Alison Schofield <alison.schofield@intel.com>

Exercise cxl list, libcxl, and driver pieces of the get poison
pathway. The poison records themselves are mocked by cxl_test,
but the work of triggering the poison read, logging as trace events,
and then collecting and parsing is all for real.

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
 test/cxl-get-poison.sh | 78 ++++++++++++++++++++++++++++++++++++++++++
 test/meson.build       |  2 ++
 2 files changed, 80 insertions(+)
 create mode 100644 test/cxl-get-poison.sh

diff --git a/test/cxl-get-poison.sh b/test/cxl-get-poison.sh
new file mode 100644
index 000000000000..fe93a67a5240
--- /dev/null
+++ b/test/cxl-get-poison.sh
@@ -0,0 +1,78 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2022 Intel Corporation. All rights reserved.
+
+. $(dirname $0)/common
+
+rc=1
+
+set -ex
+
+trap 'err $LINENO' ERR
+
+check_prereq "jq"
+
+modprobe -r cxl_test
+modprobe cxl_test
+udevadm settle
+
+# The number of errors that cxl_test mocks is subject to change.
+NR_ERRS=2
+
+# THEORY OF OPERATION: Exercise cxl-cli and cxl driver capabilities wrt
+# retrieving poison lists. The poison list is maintained by the device.
+# It may be requested per memdev or per region.
+
+create_region()
+{
+	region=$($CXL create-region -d $decoder -m $memdevs | jq -r ".region")
+
+	if [[ ! $region ]]; then
+		echo "create-region failed for $decoder"
+		err "$LINENO"
+	fi
+}
+
+setup_x2_region()
+{
+        # Find an x2 decoder
+        decoder=$($CXL list -b cxl_test -D -d root | jq -r ".[] |
+          select(.pmem_capable == true) |
+          select(.nr_targets == 2) |
+          .decoder")
+
+        # Find a memdev for each host-bridge interleave position
+        port_dev0=$($CXL list -T -d $decoder | jq -r ".[] |
+            .targets | .[] | select(.position == 0) | .target")
+        port_dev1=$($CXL list -T -d $decoder | jq -r ".[] |
+            .targets | .[] | select(.position == 1) | .target")
+        mem0=$($CXL list -M -p $port_dev0 | jq -r ".[0].memdev")
+        mem1=$($CXL list -M -p $port_dev1 | jq -r ".[0].memdev")
+        memdevs="$mem0 $mem1"
+}
+
+find_media_errors()
+{
+	nr=$(echo $json | jq -r ".nr_media_errors")
+	if [[ $nr -ne $NR_ERRS ]]; then
+		echo "$mem: $NR_ERRS media errors expected, $nr found"
+		err "$LINENO"
+	fi
+}
+
+# Read poison from each available memdev
+readarray -t mems < <("$CXL" list -b cxl_test -Mi | jq -r '.[].memdev')
+for mem in ${mems[@]}; do
+	json=$("$CXL" list -m "$mem" --media-errors | jq -r '.[].media_errors')
+	find_media_errors
+done
+
+# Read poison from one region
+setup_x2_region
+create_region
+json=$("$CXL" list -r "$region" --media-errors | jq -r '.[].media_errors')
+find_media_errors
+cxl disable-region $region
+cxl destroy-region $region
+
+modprobe -r cxl_test
diff --git a/test/meson.build b/test/meson.build
index 5953c286d13f..721c69e79f5e 100644
--- a/test/meson.build
+++ b/test/meson.build
@@ -154,6 +154,7 @@ cxl_topo = find_program('cxl-topology.sh')
 cxl_sysfs = find_program('cxl-region-sysfs.sh')
 cxl_labels = find_program('cxl-labels.sh')
 cxl_create_region = find_program('cxl-create-region.sh')
+cxl_get_poison = find_program('cxl-get-poison.sh')
 
 tests = [
   [ 'libndctl',               libndctl,		  'ndctl' ],
@@ -182,6 +183,7 @@ tests = [
   [ 'cxl-region-sysfs.sh',    cxl_sysfs,	  'cxl'   ],
   [ 'cxl-labels.sh',          cxl_labels,	  'cxl'   ],
   [ 'cxl-create-region.sh',   cxl_create_region,  'cxl'   ],
+  [ 'cxl-get-poison.sh',      cxl_get_poison,     'cxl'   ],
 ]
 
 if get_option('destructive').enabled()
-- 
2.37.3



* Re: [ndctl PATCH 1/5] libcxl: add interfaces for GET_POISON_LIST mailbox commands
  2022-11-11  3:20 ` [ndctl PATCH 1/5] libcxl: add interfaces for GET_POISON_LIST mailbox commands alison.schofield
@ 2022-11-16 12:56   ` Jonathan Cameron
  2022-11-17 23:45     ` Alison Schofield
  0 siblings, 1 reply; 12+ messages in thread
From: Jonathan Cameron @ 2022-11-16 12:56 UTC (permalink / raw)
  To: alison.schofield
  Cc: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky,
	nvdimm, linux-cxl

On Thu, 10 Nov 2022 19:20:04 -0800
alison.schofield@intel.com wrote:

> From: Alison Schofield <alison.schofield@intel.com>
> 
> CXL devices maintain a list of locations that are poisoned or result
> in poison if the addresses are accessed by the host.
> 
> Per the spec (CXL 3.0 8.2.9.8.4.1), the device returns this Poison
> list as a set of Media Error Records that include the source of the
> error, the starting device physical address and length.
> 
> Trigger the retrieval of the poison list by writing to the device
> sysfs attribute: trigger_poison_list.
> 
> Retrieval is offered by memdev or by region:
> int cxl_memdev_trigger_poison_list(struct cxl_memdev *memdev);
> int cxl_region_trigger_poison_list(struct cxl_region *region);
> 
> This interface triggers the retrieval of the poison list from the
> devices and logs the error records as kernel trace events named
> 'cxl_poison'.
> 
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Trivial comment inline. I haven't been closely tracking development
of this tool, so hopefully this will get other eyes on it from people
who are more familiar.  With that in mind:

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
>  cxl/lib/libcxl.c   | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  cxl/lib/libcxl.sym |  6 ++++++
>  cxl/libcxl.h       |  2 ++
>  3 files changed, 52 insertions(+)
> 
> diff --git a/cxl/lib/libcxl.c b/cxl/lib/libcxl.c
> index e8c5d4444dd0..1a8a8eb0ffcb 100644
> --- a/cxl/lib/libcxl.c
> +++ b/cxl/lib/libcxl.c
> @@ -1331,6 +1331,50 @@ CXL_EXPORT int cxl_memdev_disable_invalidate(struct cxl_memdev *memdev)
>  	return 0;
>  }
>  
> +CXL_EXPORT int cxl_memdev_trigger_poison_list(struct cxl_memdev *memdev)
> +{
> +	struct cxl_ctx *ctx = cxl_memdev_get_ctx(memdev);
> +	char *path = memdev->dev_buf;
> +	int len = memdev->buf_len, rc;
> +
> +	if (snprintf(path, len, "%s/trigger_poison_list", memdev->dev_path) >=
> +	    len) {

Ugly line break choice to break mid-argument.
	if (snprintf(path, len, "%s/trigger_poison_list",
		     memdev->dev_path) >= len) {
would be better.

> +		err(ctx, "%s: buffer too small\n",
> +		    cxl_memdev_get_devname(memdev));
> +		return -ENXIO;
> +	}
> +	rc = sysfs_write_attr(ctx, path, "1\n");
> +	if (rc < 0) {
> +		fprintf(stderr,
> +			"%s: Failed write sysfs attr trigger_poison_list\n",
> +			cxl_memdev_get_devname(memdev));
> +		return rc;
> +	}
> +	return 0;
> +}
> +
> +CXL_EXPORT int cxl_region_trigger_poison_list(struct cxl_region *region)
> +{
> +	struct cxl_ctx *ctx = cxl_region_get_ctx(region);
> +	char *path = region->dev_buf;
> +	int len = region->buf_len, rc;
> +
> +	if (snprintf(path, len, "%s/trigger_poison_list", region->dev_path) >=
> +	    len) {
as above.

> +		err(ctx, "%s: buffer too small\n",
> +		    cxl_region_get_devname(region));
> +		return -ENXIO;
> +	}
> +	rc = sysfs_write_attr(ctx, path, "1\n");
> +	if (rc < 0) {
> +		fprintf(stderr,
> +			"%s: Failed write sysfs attr trigger_poison_list\n",
> +			cxl_region_get_devname(region));
> +		return rc;
> +	}
> +	return 0;
> +}
> +
>  CXL_EXPORT int cxl_memdev_enable(struct cxl_memdev *memdev)
>  {
>  	struct cxl_ctx *ctx = cxl_memdev_get_ctx(memdev);
> diff --git a/cxl/lib/libcxl.sym b/cxl/lib/libcxl.sym
> index 8bb91e05638b..ecf98e6c7af2 100644
> --- a/cxl/lib/libcxl.sym
> +++ b/cxl/lib/libcxl.sym
> @@ -217,3 +217,9 @@ global:
>  	cxl_decoder_get_max_available_extent;
>  	cxl_decoder_get_region;
>  } LIBCXL_2;
> +
> +LIBCXL_4 {
> +global:
> +	cxl_memdev_trigger_poison_list;
> +	cxl_region_trigger_poison_list;
> +} LIBCXL_3;
> diff --git a/cxl/libcxl.h b/cxl/libcxl.h
> index 9fe4e99263dd..5ebdf0879325 100644
> --- a/cxl/libcxl.h
> +++ b/cxl/libcxl.h
> @@ -375,6 +375,8 @@ enum cxl_setpartition_mode {
>  
>  int cxl_cmd_partition_set_mode(struct cxl_cmd *cmd,
>  		enum cxl_setpartition_mode mode);
> +int cxl_memdev_trigger_poison_list(struct cxl_memdev *memdev);
> +int cxl_region_trigger_poison_list(struct cxl_region *region);
>  
>  #ifdef __cplusplus
>  } /* extern "C" */



* Re: [ndctl PATCH 2/5] cxl: add an optional pid check to event parsing
  2022-11-11  3:20 ` [ndctl PATCH 2/5] cxl: add an optional pid check to event parsing alison.schofield
@ 2022-11-16 12:57   ` Jonathan Cameron
  0 siblings, 0 replies; 12+ messages in thread
From: Jonathan Cameron @ 2022-11-16 12:57 UTC (permalink / raw)
  To: alison.schofield
  Cc: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky,
	nvdimm, linux-cxl

On Thu, 10 Nov 2022 19:20:05 -0800
alison.schofield@intel.com wrote:

> From: Alison Schofield <alison.schofield@intel.com>
> 
> When parsing CXL events, callers may only be interested in events
> that originate from the current process. Introduce an optional
> argument to the event trace context: event_pid. When event_pid is
> present, only include events with a matching pid in the returned
> JSON list. It is not a failure to see other, non-matching results.
> Simply skip those.
> 
> The initial use case for this is the listing of media errors,
> where only the media errors requested by this process are wanted.
> 
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Makes sense to me

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
>  cxl/event_trace.c | 5 +++++
>  cxl/event_trace.h | 1 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/cxl/event_trace.c b/cxl/event_trace.c
> index a973a1f62d35..70ab892bbfcb 100644
> --- a/cxl/event_trace.c
> +++ b/cxl/event_trace.c
> @@ -207,6 +207,11 @@ static int cxl_event_parse(struct tep_event *event, struct tep_record *record,
>  			return 0;
>  	}
>  
> +	if (event_ctx->event_pid) {
> +		if (event_ctx->event_pid != tep_data_pid(event->tep, record))
> +			return 0;
> +	}
> +
>  	if (event_ctx->parse_event)
>  		return event_ctx->parse_event(event, record,
>  					      &event_ctx->jlist_head);
> diff --git a/cxl/event_trace.h b/cxl/event_trace.h
> index ec6267202c8b..7f7773b2201f 100644
> --- a/cxl/event_trace.h
> +++ b/cxl/event_trace.h
> @@ -15,6 +15,7 @@ struct event_ctx {
>  	const char *system;
>  	struct list_head jlist_head;
>  	const char *event_name; /* optional */
> +	int event_pid; /* optional */
>  	int (*parse_event)(struct tep_event *event, struct tep_record *record,
>  			   struct list_head *jlist_head); /* optional */
>  };
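
The optional pid check can be illustrated outside of libtraceevent with a small sketch. Everything here is an assumption for illustration: `struct sample_record`, `record_matches_pid` and `count_matching` are hypothetical names, and the real code obtains the pid from tep_data_pid() rather than from a struct field. As in the patch, a pid of 0 means no filtering, and non-matching records are skipped rather than treated as errors.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one parsed trace record; not a libtraceevent type. */
struct sample_record {
	int pid;
	unsigned long long dpa;
};

/*
 * Same shape as the optional check in the patch: a filter_pid of 0
 * means "no filter", otherwise only records from that pid are kept.
 */
static bool record_matches_pid(int filter_pid, const struct sample_record *rec)
{
	if (filter_pid && filter_pid != rec->pid)
		return false;
	return true;
}

/* Count the records that survive the filter; the rest are silently skipped. */
static size_t count_matching(int filter_pid, const struct sample_record *recs,
			     size_t nr)
{
	size_t i, kept = 0;

	for (i = 0; i < nr; i++)
		if (record_matches_pid(filter_pid, &recs[i]))
			kept++;
	return kept;
}
```

A caller that only wants its own events would pass the value of getpid() as the filter.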



* Re: [ndctl PATCH 4/5] cxl/list: add --media-errors option to cxl list
  2022-11-11  3:20 ` [ndctl PATCH 4/5] cxl/list: add --media-errors option to cxl list alison.schofield
@ 2022-11-16 13:03   ` Jonathan Cameron
  2022-11-17 23:42     ` Alison Schofield
  0 siblings, 1 reply; 12+ messages in thread
From: Jonathan Cameron @ 2022-11-16 13:03 UTC (permalink / raw)
  To: alison.schofield
  Cc: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky,
	nvdimm, linux-cxl

On Thu, 10 Nov 2022 19:20:07 -0800
alison.schofield@intel.com wrote:

> From: Alison Schofield <alison.schofield@intel.com>
> 
> The --media-errors option to 'cxl list' retrieves poison lists
> from memory devices (supporting the capability) and displays
> the returned media-error records in the cxl list json. This
> option can apply to memdevs or regions.
> 
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> ---
>  Documentation/cxl/cxl-list.txt | 64 ++++++++++++++++++++++++++++++++++
>  cxl/filter.c                   |  2 ++
>  cxl/filter.h                   |  1 +
>  cxl/list.c                     |  2 ++
>  4 files changed, 69 insertions(+)
> 
> diff --git a/Documentation/cxl/cxl-list.txt b/Documentation/cxl/cxl-list.txt
> index 14a2b4bb5c2a..24a0cf97cef2 100644
> --- a/Documentation/cxl/cxl-list.txt
> +++ b/Documentation/cxl/cxl-list.txt
> @@ -344,6 +344,70 @@ OPTIONS
>  --region::
>  	Specify CXL region device name(s), or device id(s), to filter the listing.
>  
> +-a::
> +--media-errors::
> +	Include media-error information. The poison list is retrieved
> +	from the device(s) and media error records are added to the
> +	listing. This option applies to memdevs and regions where
> +	devices support the poison list capability.

I'm not sure media errors is a good name.  The poison doesn't have to originate
in the device.  Given we are logging poison with "external" as the source
those definitely don't come from the device and may have nothing to do
with 'media' as such.

Why not just call it poison?

> +
> +----
> +# cxl list -m mem11 --media-errors
> +[
> +  {
> +    "memdev":"mem11",
> +    "pmem_size":268435456,
> +    "ram_size":0,
> +    "serial":0,
> +    "host":"0000:37:00.0",
> +    "media_errors":{
> +      "nr_media_errors":1,
> +      "media_error_records":[
> +        {
> +          "dpa":0,
> +          "length":64,
> +          "source":"Internal",
> +          "flags":"",
> +          "overflow_time":0
> +        }
> +      ]
> +    }
> +  }
> +]
> +# cxl list -r region5 --media-errors
> +[
> +  {
> +    "region":"region5",
> +    "resource":1035623989248,
> +    "size":2147483648,
> +    "interleave_ways":2,
> +    "interleave_granularity":4096,
> +    "decode_state":"commit",
> +    "media_errors":{
> +      "nr_media_errors":2,
> +      "media_error_records":[
> +        {
> +          "memdev":"mem2",
> +          "dpa":0,
> +          "length":64,
> +          "source":"Internal",
> +          "flags":"",
> +          "overflow_time":0
> +        },
> +        {
> +          "memdev":"mem5",
> +          "dpa":0,
> +          "length":512,
> +          "source":"Vendor",
> +          "flags":"",
> +          "overflow_time":0
> +        }
> +      ]
> +    }
> +  }
> +]
> +----
> +
>  -v::
>  --verbose::
>  	Increase verbosity of the output. This can be specified
> diff --git a/cxl/filter.c b/cxl/filter.c
> index 56c659965891..fe6c29148fb4 100644
> --- a/cxl/filter.c
> +++ b/cxl/filter.c
> @@ -686,6 +686,8 @@ static unsigned long params_to_flags(struct cxl_filter_params *param)
>  		flags |= UTIL_JSON_TARGETS;
>  	if (param->partition)
>  		flags |= UTIL_JSON_PARTITION;
> +	if (param->media_errors)
> +		flags |= UTIL_JSON_MEDIA_ERRORS;
>  	return flags;
>  }
>  
> diff --git a/cxl/filter.h b/cxl/filter.h
> index 256df49c3d0c..a92295fe2511 100644
> --- a/cxl/filter.h
> +++ b/cxl/filter.h
> @@ -26,6 +26,7 @@ struct cxl_filter_params {
>  	bool human;
>  	bool health;
>  	bool partition;
> +	bool media_errors;
>  	int verbose;
>  	struct log_ctx ctx;
>  };
> diff --git a/cxl/list.c b/cxl/list.c
> index 8c48fbbaaec3..df2ae5a3fec0 100644
> --- a/cxl/list.c
> +++ b/cxl/list.c
> @@ -52,6 +52,8 @@ static const struct option options[] = {
>  		    "include memory device health information"),
>  	OPT_BOOLEAN('I', "partition", &param.partition,
>  		    "include memory device partition information"),
> +	OPT_BOOLEAN('a', "media-errors", &param.media_errors,
> +		    "include media error information "),
>  	OPT_INCR('v', "verbose", &param.verbose,
>  		 "increase output detail"),
>  #ifdef ENABLE_DEBUG
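
How the new boolean ends up as a listing flag can be sketched in isolation. The bit values and the `sketch_` names below are assumptions for illustration only; the real UTIL_JSON_* constants and struct cxl_filter_params are defined in ndctl and are not reproduced here.

```c
#include <stdbool.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SKETCH_JSON_PARTITION    (1UL << 0)
#define SKETCH_JSON_MEDIA_ERRORS (1UL << 1)

/* Hypothetical subset of the command-line parameters. */
struct sketch_params {
	bool partition;
	bool media_errors;
};

/* Translate each boolean option into its flag bit, as params_to_flags() does. */
static unsigned long sketch_params_to_flags(const struct sketch_params *p)
{
	unsigned long flags = 0;

	if (p->partition)
		flags |= SKETCH_JSON_PARTITION;
	if (p->media_errors)
		flags |= SKETCH_JSON_MEDIA_ERRORS;
	return flags;
}
```

Because each option owns one bit, renaming the option (for example to --poison) would only touch the option table and the field name, not the flag plumbing.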



* Re: [ndctl PATCH 4/5] cxl/list: add --media-errors option to cxl list
  2022-11-16 13:03   ` Jonathan Cameron
@ 2022-11-17 23:42     ` Alison Schofield
  2022-11-21 10:57       ` Jonathan Cameron
  0 siblings, 1 reply; 12+ messages in thread
From: Alison Schofield @ 2022-11-17 23:42 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky,
	nvdimm, linux-cxl

On Wed, Nov 16, 2022 at 01:03:45PM +0000, Jonathan Cameron wrote:
> On Thu, 10 Nov 2022 19:20:07 -0800
> alison.schofield@intel.com wrote:
> 
> > From: Alison Schofield <alison.schofield@intel.com>
> > 
> > The --media-errors option to 'cxl list' retrieves poison lists
> > from memory devices (supporting the capability) and displays
> > the returned media-error records in the cxl list json. This
> > option can apply to memdevs or regions.
> > 
> > Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> > ---
> >  Documentation/cxl/cxl-list.txt | 64 ++++++++++++++++++++++++++++++++++
> >  cxl/filter.c                   |  2 ++
> >  cxl/filter.h                   |  1 +
> >  cxl/list.c                     |  2 ++
> >  4 files changed, 69 insertions(+)
> > 
> > diff --git a/Documentation/cxl/cxl-list.txt b/Documentation/cxl/cxl-list.txt
> > index 14a2b4bb5c2a..24a0cf97cef2 100644
> > --- a/Documentation/cxl/cxl-list.txt
> > +++ b/Documentation/cxl/cxl-list.txt
> > @@ -344,6 +344,70 @@ OPTIONS
> >  --region::
> >  	Specify CXL region device name(s), or device id(s), to filter the listing.
> >  
> > +-a::
> > +--media-errors::
> > +	Include media-error information. The poison list is retrieved
> > +	from the device(s) and media error records are added to the
> > +	listing. This option applies to memdevs and regions where
> > +	devices support the poison list capability.
> 
> I'm not sure media errors is a good name.  The poison doesn't have to originate
> in the device.  Given we are logging poison with "external" as the source
> those definitely don't come from the device and may have nothing to do
> with 'media' as such.
> 
> Why not just call it poison?
> 
--media-errors probably originated from ndctl tool which used
that same option name, but it fits in with the CXL Spec language.

The CXL Spec calls the records returned from the 'Get Poison List'
command Media Error Records. It refers to poison as media errors.
So, here, in a command that lists things - the thing(s) being listed
is(are) 'media error record(s)'. 

I see what you're saying about 'External' source. Does that mean
an 'External' source caused an actual media error?

So, that 'Why not poison?' answer. I'm easily swayed either way.
Would you suggest:
> > +
> > +----
> > +# cxl list -m mem11 --media-errors

cxl list -m mem1 --poison

> > +    "media_errors":{
> > +      "nr_media_errors":1,
> > +      "media_error_records":[

and rename the fields above:
	"poison_errors"
	"nr_poison_errors"
	"poison_error_records"




* Re: [ndctl PATCH 1/5] libcxl: add interfaces for GET_POISON_LIST mailbox commands
  2022-11-16 12:56   ` Jonathan Cameron
@ 2022-11-17 23:45     ` Alison Schofield
  0 siblings, 0 replies; 12+ messages in thread
From: Alison Schofield @ 2022-11-17 23:45 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky,
	nvdimm, linux-cxl

On Wed, Nov 16, 2022 at 12:56:40PM +0000, Jonathan Cameron wrote:
> On Thu, 10 Nov 2022 19:20:04 -0800
> alison.schofield@intel.com wrote:
> 
> > From: Alison Schofield <alison.schofield@intel.com>
> > 
> > CXL devices maintain a list of locations that are poisoned or result
> > in poison if the addresses are accessed by the host.
> > 
> > Per the spec (CXL 3.0 8.2.9.8.4.1), the device returns this Poison
> > list as a set of Media Error Records that include the source of the
> > error, the starting device physical address and length.
> > 
> > Trigger the retrieval of the poison list by writing to the device
> > sysfs attribute: trigger_poison_list.
> > 
> > Retrieval is offered by memdev or by region:
> > int cxl_memdev_trigger_poison_list(struct cxl_memdev *memdev);
> > int cxl_region_trigger_poison_list(struct cxl_region *region);
> > 
> > This interface triggers the retrieval of the poison list from the
> > devices and logs the error records as kernel trace events named
> > 'cxl_poison'.
> > 
> > Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> Trivial comment inline + I haven't been tracking development
> of this tool closely so hopefully this will get other eyes on it who
> are more familiar.  With that in mind:
> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Thanks Jonathan!
I cleaned up the ugly line breaks and I'll carry your Reviewed-by
forward.

> 
> > ---
> >  cxl/lib/libcxl.c   | 44 ++++++++++++++++++++++++++++++++++++++++++++
> >  cxl/lib/libcxl.sym |  6 ++++++
> >  cxl/libcxl.h       |  2 ++
> >  3 files changed, 52 insertions(+)
> > 
> > diff --git a/cxl/lib/libcxl.c b/cxl/lib/libcxl.c
> > index e8c5d4444dd0..1a8a8eb0ffcb 100644
> > --- a/cxl/lib/libcxl.c
> > +++ b/cxl/lib/libcxl.c
> > @@ -1331,6 +1331,50 @@ CXL_EXPORT int cxl_memdev_disable_invalidate(struct cxl_memdev *memdev)
> >  	return 0;
> >  }
> >  
> > +CXL_EXPORT int cxl_memdev_trigger_poison_list(struct cxl_memdev *memdev)
> > +{
> > +	struct cxl_ctx *ctx = cxl_memdev_get_ctx(memdev);
> > +	char *path = memdev->dev_buf;
> > +	int len = memdev->buf_len, rc;
> > +
> > +	if (snprintf(path, len, "%s/trigger_poison_list", memdev->dev_path) >=
> > +	    len) {
> 
> Ugly line break choice to break mid argument..
> 	if (snprintf(path, len, "%s/trigger_poison_list",
> 		     memdev->dev_path) >= len) {
> would be better.
> 
> > +		err(ctx, "%s: buffer too small\n",
> > +		    cxl_memdev_get_devname(memdev));
> > +		return -ENXIO;
> > +	}
> > +	rc = sysfs_write_attr(ctx, path, "1\n");
> > +	if (rc < 0) {
> > +		fprintf(stderr,
> > +			"%s: Failed write sysfs attr trigger_poison_list\n",
> > +			cxl_memdev_get_devname(memdev));
> > +		return rc;
> > +	}
> > +	return 0;
> > +}
> > +
> > +CXL_EXPORT int cxl_region_trigger_poison_list(struct cxl_region *region)
> > +{
> > +	struct cxl_ctx *ctx = cxl_region_get_ctx(region);
> > +	char *path = region->dev_buf;
> > +	int len = region->buf_len, rc;
> > +
> > +	if (snprintf(path, len, "%s/trigger_poison_list", region->dev_path) >=
> > +	    len) {
> as above.
> 
> > +		err(ctx, "%s: buffer too small\n",
> > +		    cxl_region_get_devname(region));
> > +		return -ENXIO;
> > +	}
> > +	rc = sysfs_write_attr(ctx, path, "1\n");
> > +	if (rc < 0) {
> > +		fprintf(stderr,
> > +			"%s: Failed write sysfs attr trigger_poison_list\n",
> > +			cxl_region_get_devname(region));
> > +		return rc;
> > +	}
> > +	return 0;
> > +}
> > +
> >  CXL_EXPORT int cxl_memdev_enable(struct cxl_memdev *memdev)
> >  {
> >  	struct cxl_ctx *ctx = cxl_memdev_get_ctx(memdev);
> > diff --git a/cxl/lib/libcxl.sym b/cxl/lib/libcxl.sym
> > index 8bb91e05638b..ecf98e6c7af2 100644
> > --- a/cxl/lib/libcxl.sym
> > +++ b/cxl/lib/libcxl.sym
> > @@ -217,3 +217,9 @@ global:
> >  	cxl_decoder_get_max_available_extent;
> >  	cxl_decoder_get_region;
> >  } LIBCXL_2;
> > +
> > +LIBCXL_4 {
> > +global:
> > +	cxl_memdev_trigger_poison_list;
> > +	cxl_region_trigger_poison_list;
> > +} LIBCXL_3;
> > diff --git a/cxl/libcxl.h b/cxl/libcxl.h
> > index 9fe4e99263dd..5ebdf0879325 100644
> > --- a/cxl/libcxl.h
> > +++ b/cxl/libcxl.h
> > @@ -375,6 +375,8 @@ enum cxl_setpartition_mode {
> >  
> >  int cxl_cmd_partition_set_mode(struct cxl_cmd *cmd,
> >  		enum cxl_setpartition_mode mode);
> > +int cxl_memdev_trigger_poison_list(struct cxl_memdev *memdev);
> > +int cxl_region_trigger_poison_list(struct cxl_region *region);
> >  
> >  #ifdef __cplusplus
> >  } /* extern "C" */
> 


* Re: [ndctl PATCH 4/5] cxl/list: add --media-errors option to cxl list
  2022-11-17 23:42     ` Alison Schofield
@ 2022-11-21 10:57       ` Jonathan Cameron
  0 siblings, 0 replies; 12+ messages in thread
From: Jonathan Cameron @ 2022-11-21 10:57 UTC (permalink / raw)
  To: Alison Schofield
  Cc: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky,
	nvdimm, linux-cxl

On Thu, 17 Nov 2022 15:42:43 -0800
Alison Schofield <alison.schofield@intel.com> wrote:

> On Wed, Nov 16, 2022 at 01:03:45PM +0000, Jonathan Cameron wrote:
> > On Thu, 10 Nov 2022 19:20:07 -0800
> > alison.schofield@intel.com wrote:
> >   
> > > From: Alison Schofield <alison.schofield@intel.com>
> > > 
> > > The --media-errors option to 'cxl list' retrieves poison lists
> > > from memory devices (supporting the capability) and displays
> > > the returned media-error records in the cxl list json. This
> > > option can apply to memdevs or regions.
> > > 
> > > Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> > > ---
> > >  Documentation/cxl/cxl-list.txt | 64 ++++++++++++++++++++++++++++++++++
> > >  cxl/filter.c                   |  2 ++
> > >  cxl/filter.h                   |  1 +
> > >  cxl/list.c                     |  2 ++
> > >  4 files changed, 69 insertions(+)
> > > 
> > > diff --git a/Documentation/cxl/cxl-list.txt b/Documentation/cxl/cxl-list.txt
> > > index 14a2b4bb5c2a..24a0cf97cef2 100644
> > > --- a/Documentation/cxl/cxl-list.txt
> > > +++ b/Documentation/cxl/cxl-list.txt
> > > @@ -344,6 +344,70 @@ OPTIONS
> > >  --region::
> > >  	Specify CXL region device name(s), or device id(s), to filter the listing.
> > >  
> > > +-a::
> > > +--media-errors::
> > > +	Include media-error information. The poison list is retrieved
> > > +	from the device(s) and media error records are added to the
> > > +	listing. This option applies to memdevs and regions where
> > > +	devices support the poison list capability.  
> > 
> > I'm not sure media errors is a good name.  The poison doesn't have to originate
> > in the device.  Given we are logging poison with "external" as the source
> > those definitely don't come from the device and may have nothing to do
> > with 'media' as such.
> > 
> > Why not just call it poison?
> >   
> --media-errors probably originated from ndctl tool which used
> that same option name, but it fits in with the CXL Spec language.
> 
> The CXL Spec calls the records returned from the 'Get Poison List'
> command Media Error Records. It refers to poison as media errors.
> So, here, in a command that lists things - the thing(s) being listed
> is(are) 'media error record(s)'. 
> 
> I see what you're saying about 'External' source. Does that mean
> an 'External' source caused an actual media error?

Hmm. I suspect this all evolved.  An External source need not have
anything to do with media (could be corruption in some random cache
or on interconnect or even that a link collapsed potentially).

Ah well, I'm fine with any naming you prefer.  No idea if the NVDIMM
equivalent has the same issue with externally generated poison.

> 
> So, that 'Why not poison?' answer. I'm easily swayed either way.
> Would you suggest:
> > > +
> > > +----
> > > +# cxl list -m mem11 --media-errors  
> 
> cxl list -m mem1 --poison
> 
> > > +    "media_errors":{
> > > +      "nr_media_errors":1,
> > > +      "media_error_records":[  
> 
> and rename the fields above:
> 	"poison_errors"
> 	"nr_poison_errors"
> 	"poison_error_records"
> 
> 
That works for me, but if it's going to confuse people familiar with
other similar cases, then I don't mind the original naming that much.

Jonathan




end of thread, other threads:[~2022-11-21 10:57 UTC | newest]

Thread overview: 12+ messages
2022-11-11  3:20 [ndctl PATCH 0/3] Support poison list retrieval alison.schofield
2022-11-11  3:20 ` [ndctl PATCH 1/5] libcxl: add interfaces for GET_POISON_LIST mailbox commands alison.schofield
2022-11-16 12:56   ` Jonathan Cameron
2022-11-17 23:45     ` Alison Schofield
2022-11-11  3:20 ` [ndctl PATCH 2/5] cxl: add an optional pid check to event parsing alison.schofield
2022-11-16 12:57   ` Jonathan Cameron
2022-11-11  3:20 ` [ndctl PATCH 3/5] cxl/list: collect and parse the poison list records alison.schofield
2022-11-11  3:20 ` [ndctl PATCH 4/5] cxl/list: add --media-errors option to cxl list alison.schofield
2022-11-16 13:03   ` Jonathan Cameron
2022-11-17 23:42     ` Alison Schofield
2022-11-21 10:57       ` Jonathan Cameron
2022-11-11  3:20 ` [ndctl PATCH 5/5] test: add a cxl-get-poison test alison.schofield
