* [PATCH v6 1/7] libcxl: add interfaces for GET_POISON_LIST mailbox commands
2024-01-18 0:27 [ndctl PATCH v6 0/7] Support poison list retrieval alison.schofield
@ 2024-01-18 0:28 ` alison.schofield
2024-01-19 18:32 ` Dave Jiang
2024-01-18 0:28 ` [PATCH v6 2/7] cxl: add an optional pid check to event parsing alison.schofield
` (6 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: alison.schofield @ 2024-01-18 0:28 UTC (permalink / raw)
To: Vishal Verma; +Cc: Alison Schofield, nvdimm, linux-cxl
From: Alison Schofield <alison.schofield@intel.com>
CXL devices maintain a list of locations that are poisoned or result
in poison if the addresses are accessed by the host.
Per the spec (CXL 3.1 8.2.9.9.4.1), the device returns the Poison
List as a set of Media Error Records that include the source of the
error, the starting device physical address and length.
Trigger the retrieval of the poison list by writing to the memory
device sysfs attribute: trigger_poison_list. The CXL driver only
offers triggering per memdev, so the trigger by region interface
offered here is a convenience API that triggers a poison list
retrieval for each memdev contributing to a region.
int cxl_memdev_trigger_poison_list(struct cxl_memdev *memdev);
int cxl_region_trigger_poison_list(struct cxl_region *region);
The resulting poison records are logged as kernel trace events
named 'cxl_poison'.
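
For illustration, the sysfs trigger described above can be sketched as a
stand-alone function. This is a minimal sketch, not the libcxl code: the
name trigger_poison_list_at() and the fixed 256-byte buffer are
assumptions, and plain stdio is used in place of the library's
sysfs_write_attr() helper.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical stand-alone helper (not the libcxl implementation):
 * build "<dev_path>/trigger_poison_list" and write "1\n" to it.
 * Returns 0 on success or a negative errno value on failure.
 */
static int trigger_poison_list_at(const char *dev_path)
{
	char path[256];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/trigger_poison_list", dev_path);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENXIO;	/* path was truncated: buffer too small */

	f = fopen(path, "w");
	if (!f)
		return -errno;

	if (fputs("1\n", f) == EOF) {
		int err = errno ? errno : EIO;

		fclose(f);
		return -err;
	}
	/* stdio buffers the write, so a rejected write may surface here */
	if (fclose(f) != 0)
		return -errno;
	return 0;
}
```

A by-region caller would invoke such a helper once per contributing
memdev and stop at the first failure, as the region interface in this
patch does.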
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
cxl/lib/libcxl.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++
cxl/lib/libcxl.sym | 6 ++++++
cxl/libcxl.h | 2 ++
3 files changed, 55 insertions(+)
diff --git a/cxl/lib/libcxl.c b/cxl/lib/libcxl.c
index af4ca44eae19..cc95c2d7c94a 100644
--- a/cxl/lib/libcxl.c
+++ b/cxl/lib/libcxl.c
@@ -1647,6 +1647,53 @@ CXL_EXPORT int cxl_memdev_disable_invalidate(struct cxl_memdev *memdev)
return 0;
}
+CXL_EXPORT int cxl_memdev_trigger_poison_list(struct cxl_memdev *memdev)
+{
+ struct cxl_ctx *ctx = cxl_memdev_get_ctx(memdev);
+ char *path = memdev->dev_buf;
+ int len = memdev->buf_len, rc;
+
+ if (snprintf(path, len, "%s/trigger_poison_list",
+ memdev->dev_path) >= len) {
+ err(ctx, "%s: buffer too small\n",
+ cxl_memdev_get_devname(memdev));
+ return -ENXIO;
+ }
+ rc = sysfs_write_attr(ctx, path, "1\n");
+ if (rc < 0) {
+ fprintf(stderr,
+ "%s: Failed write sysfs attr trigger_poison_list\n",
+ cxl_memdev_get_devname(memdev));
+ return rc;
+ }
+ return 0;
+}
+
+CXL_EXPORT int cxl_region_trigger_poison_list(struct cxl_region *region)
+{
+ struct cxl_memdev_mapping *mapping;
+ int rc;
+
+ cxl_mapping_foreach(region, mapping) {
+ struct cxl_decoder *decoder;
+ struct cxl_memdev *memdev;
+
+ decoder = cxl_mapping_get_decoder(mapping);
+ if (!decoder)
+ continue;
+
+ memdev = cxl_decoder_get_memdev(decoder);
+ if (!memdev)
+ continue;
+
+ rc = cxl_memdev_trigger_poison_list(memdev);
+ if (rc)
+ return rc;
+ }
+
+ return 0;
+}
+
CXL_EXPORT int cxl_memdev_enable(struct cxl_memdev *memdev)
{
struct cxl_ctx *ctx = cxl_memdev_get_ctx(memdev);
diff --git a/cxl/lib/libcxl.sym b/cxl/lib/libcxl.sym
index 8fa1cca3d0d7..277b7e21d6a6 100644
--- a/cxl/lib/libcxl.sym
+++ b/cxl/lib/libcxl.sym
@@ -264,3 +264,9 @@ global:
cxl_memdev_update_fw;
cxl_memdev_cancel_fw_update;
} LIBCXL_5;
+
+LIBCXL_7 {
+global:
+ cxl_memdev_trigger_poison_list;
+ cxl_region_trigger_poison_list;
+} LIBCXL_6;
diff --git a/cxl/libcxl.h b/cxl/libcxl.h
index 0f4f4b2648fb..ecdffe36df2c 100644
--- a/cxl/libcxl.h
+++ b/cxl/libcxl.h
@@ -460,6 +460,8 @@ enum cxl_setpartition_mode {
int cxl_cmd_partition_set_mode(struct cxl_cmd *cmd,
enum cxl_setpartition_mode mode);
+int cxl_memdev_trigger_poison_list(struct cxl_memdev *memdev);
+int cxl_region_trigger_poison_list(struct cxl_region *region);
#ifdef __cplusplus
} /* extern "C" */
--
2.37.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v6 1/7] libcxl: add interfaces for GET_POISON_LIST mailbox commands
2024-01-18 0:28 ` [PATCH v6 1/7] libcxl: add interfaces for GET_POISON_LIST mailbox commands alison.schofield
@ 2024-01-19 18:32 ` Dave Jiang
0 siblings, 0 replies; 16+ messages in thread
From: Dave Jiang @ 2024-01-19 18:32 UTC (permalink / raw)
To: alison.schofield, Vishal Verma; +Cc: nvdimm, linux-cxl
On 1/17/24 17:28, alison.schofield@intel.com wrote:
> From: Alison Schofield <alison.schofield@intel.com>
>
> CXL devices maintain a list of locations that are poisoned or result
> in poison if the addresses are accessed by the host.
>
> Per the spec (CXL 3.1 8.2.9.9.4.1), the device returns the Poison
> List as a set of Media Error Records that include the source of the
> error, the starting device physical address and length.
>
> Trigger the retrieval of the poison list by writing to the memory
> device sysfs attribute: trigger_poison_list. The CXL driver only
> offers triggering per memdev, so the trigger by region interface
> offered here is a convenience API that triggers a poison list
> retrieval for each memdev contributing to a region.
>
> int cxl_memdev_trigger_poison_list(struct cxl_memdev *memdev);
> int cxl_region_trigger_poison_list(struct cxl_region *region);
>
> The resulting poison records are logged as kernel trace events
> named 'cxl_poison'.
>
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> cxl/lib/libcxl.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++
> cxl/lib/libcxl.sym | 6 ++++++
> cxl/libcxl.h | 2 ++
> 3 files changed, 55 insertions(+)
>
> diff --git a/cxl/lib/libcxl.c b/cxl/lib/libcxl.c
> index af4ca44eae19..cc95c2d7c94a 100644
> --- a/cxl/lib/libcxl.c
> +++ b/cxl/lib/libcxl.c
> @@ -1647,6 +1647,53 @@ CXL_EXPORT int cxl_memdev_disable_invalidate(struct cxl_memdev *memdev)
> return 0;
> }
>
> +CXL_EXPORT int cxl_memdev_trigger_poison_list(struct cxl_memdev *memdev)
> +{
> + struct cxl_ctx *ctx = cxl_memdev_get_ctx(memdev);
> + char *path = memdev->dev_buf;
> + int len = memdev->buf_len, rc;
> +
> + if (snprintf(path, len, "%s/trigger_poison_list",
> + memdev->dev_path) >= len) {
> + err(ctx, "%s: buffer too small\n",
> + cxl_memdev_get_devname(memdev));
> + return -ENXIO;
> + }
> + rc = sysfs_write_attr(ctx, path, "1\n");
> + if (rc < 0) {
> + fprintf(stderr,
> + "%s: Failed write sysfs attr trigger_poison_list\n",
> + cxl_memdev_get_devname(memdev));
> + return rc;
> + }
> + return 0;
> +}
> +
> +CXL_EXPORT int cxl_region_trigger_poison_list(struct cxl_region *region)
> +{
> + struct cxl_memdev_mapping *mapping;
> + int rc;
> +
> + cxl_mapping_foreach(region, mapping) {
> + struct cxl_decoder *decoder;
> + struct cxl_memdev *memdev;
> +
> + decoder = cxl_mapping_get_decoder(mapping);
> + if (!decoder)
> + continue;
> +
> + memdev = cxl_decoder_get_memdev(decoder);
> + if (!memdev)
> + continue;
> +
> + rc = cxl_memdev_trigger_poison_list(memdev);
> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> CXL_EXPORT int cxl_memdev_enable(struct cxl_memdev *memdev)
> {
> struct cxl_ctx *ctx = cxl_memdev_get_ctx(memdev);
> diff --git a/cxl/lib/libcxl.sym b/cxl/lib/libcxl.sym
> index 8fa1cca3d0d7..277b7e21d6a6 100644
> --- a/cxl/lib/libcxl.sym
> +++ b/cxl/lib/libcxl.sym
> @@ -264,3 +264,9 @@ global:
> cxl_memdev_update_fw;
> cxl_memdev_cancel_fw_update;
> } LIBCXL_5;
> +
> +LIBCXL_7 {
> +global:
> + cxl_memdev_trigger_poison_list;
> + cxl_region_trigger_poison_list;
> +} LIBCXL_6;
> diff --git a/cxl/libcxl.h b/cxl/libcxl.h
> index 0f4f4b2648fb..ecdffe36df2c 100644
> --- a/cxl/libcxl.h
> +++ b/cxl/libcxl.h
> @@ -460,6 +460,8 @@ enum cxl_setpartition_mode {
>
> int cxl_cmd_partition_set_mode(struct cxl_cmd *cmd,
> enum cxl_setpartition_mode mode);
> +int cxl_memdev_trigger_poison_list(struct cxl_memdev *memdev);
> +int cxl_region_trigger_poison_list(struct cxl_region *region);
>
> #ifdef __cplusplus
> } /* extern "C" */
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v6 2/7] cxl: add an optional pid check to event parsing
2024-01-18 0:27 [ndctl PATCH v6 0/7] Support poison list retrieval alison.schofield
2024-01-18 0:28 ` [PATCH v6 1/7] libcxl: add interfaces for GET_POISON_LIST mailbox commands alison.schofield
@ 2024-01-18 0:28 ` alison.schofield
2024-01-19 18:35 ` Dave Jiang
2024-01-18 0:28 ` [PATCH v6 3/7] cxl/event_trace: add a private context for private parsers alison.schofield
` (5 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: alison.schofield @ 2024-01-18 0:28 UTC (permalink / raw)
To: Vishal Verma; +Cc: Alison Schofield, nvdimm, linux-cxl, Jonathan Cameron
From: Alison Schofield <alison.schofield@intel.com>
When parsing CXL events, callers may only be interested in events
that originate from the current process. Introduce an optional
argument to the event trace context: event_pid. When event_pid is
present, simply skip the parsing of events without a matching pid.
It is not a failure to see other, non-matching events.
The initial use case for this is device poison listings where
only the media-error records requested by this process are wanted.
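
The filter semantics can be sketched in isolation as follows. This is a
minimal sketch under stated assumptions: struct pid_filter and
pid_filter_match() are hypothetical names, and the record pid is passed
in directly rather than read with tep_data_pid().

```c
#include <stdbool.h>

/* Mock of the relevant part of struct event_ctx; 0 means "no pid filter". */
struct pid_filter {
	int event_pid;
};

/* Return true when a record logged by record_pid should be parsed. */
static bool pid_filter_match(const struct pid_filter *f, int record_pid)
{
	if (f->event_pid && f->event_pid != record_pid)
		return false;	/* skipped, and not treated as an error */
	return true;
}
```

Treating 0 as "unset" keeps the field optional: existing users that
zero-initialize the context see no change in behavior.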
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
cxl/event_trace.c | 5 +++++
cxl/event_trace.h | 1 +
2 files changed, 6 insertions(+)
diff --git a/cxl/event_trace.c b/cxl/event_trace.c
index db8cc85f0b6f..269060898118 100644
--- a/cxl/event_trace.c
+++ b/cxl/event_trace.c
@@ -208,6 +208,11 @@ static int cxl_event_parse(struct tep_event *event, struct tep_record *record,
return 0;
}
+ if (event_ctx->event_pid) {
+ if (event_ctx->event_pid != tep_data_pid(event->tep, record))
+ return 0;
+ }
+
if (event_ctx->parse_event)
return event_ctx->parse_event(event, record,
&event_ctx->jlist_head);
diff --git a/cxl/event_trace.h b/cxl/event_trace.h
index ec6267202c8b..7f7773b2201f 100644
--- a/cxl/event_trace.h
+++ b/cxl/event_trace.h
@@ -15,6 +15,7 @@ struct event_ctx {
const char *system;
struct list_head jlist_head;
const char *event_name; /* optional */
+ int event_pid; /* optional */
int (*parse_event)(struct tep_event *event, struct tep_record *record,
struct list_head *jlist_head); /* optional */
};
--
2.37.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v6 2/7] cxl: add an optional pid check to event parsing
2024-01-18 0:28 ` [PATCH v6 2/7] cxl: add an optional pid check to event parsing alison.schofield
@ 2024-01-19 18:35 ` Dave Jiang
0 siblings, 0 replies; 16+ messages in thread
From: Dave Jiang @ 2024-01-19 18:35 UTC (permalink / raw)
To: alison.schofield, Vishal Verma; +Cc: nvdimm, linux-cxl, Jonathan Cameron
On 1/17/24 17:28, alison.schofield@intel.com wrote:
> From: Alison Schofield <alison.schofield@intel.com>
>
> When parsing CXL events, callers may only be interested in events
> that originate from the current process. Introduce an optional
> argument to the event trace context: event_pid. When event_pid is
> present, simply skip the parsing of events without a matching pid.
> It is not a failure to see other, non-matching events.
>
> The initial use case for this is device poison listings where
> only the media-error records requested by this process are wanted.
>
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> cxl/event_trace.c | 5 +++++
> cxl/event_trace.h | 1 +
> 2 files changed, 6 insertions(+)
>
> diff --git a/cxl/event_trace.c b/cxl/event_trace.c
> index db8cc85f0b6f..269060898118 100644
> --- a/cxl/event_trace.c
> +++ b/cxl/event_trace.c
> @@ -208,6 +208,11 @@ static int cxl_event_parse(struct tep_event *event, struct tep_record *record,
> return 0;
> }
>
> + if (event_ctx->event_pid) {
> + if (event_ctx->event_pid != tep_data_pid(event->tep, record))
> + return 0;
> + }
> +
> if (event_ctx->parse_event)
> return event_ctx->parse_event(event, record,
> &event_ctx->jlist_head);
> diff --git a/cxl/event_trace.h b/cxl/event_trace.h
> index ec6267202c8b..7f7773b2201f 100644
> --- a/cxl/event_trace.h
> +++ b/cxl/event_trace.h
> @@ -15,6 +15,7 @@ struct event_ctx {
> const char *system;
> struct list_head jlist_head;
> const char *event_name; /* optional */
> + int event_pid; /* optional */
> int (*parse_event)(struct tep_event *event, struct tep_record *record,
> struct list_head *jlist_head); /* optional */
> };
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v6 3/7] cxl/event_trace: add a private context for private parsers
2024-01-18 0:27 [ndctl PATCH v6 0/7] Support poison list retrieval alison.schofield
2024-01-18 0:28 ` [PATCH v6 1/7] libcxl: add interfaces for GET_POISON_LIST mailbox commands alison.schofield
2024-01-18 0:28 ` [PATCH v6 2/7] cxl: add an optional pid check to event parsing alison.schofield
@ 2024-01-18 0:28 ` alison.schofield
2024-01-19 21:08 ` Dave Jiang
2024-01-18 0:28 ` [PATCH v6 4/7] cxl/event_trace: add helpers get_field_[string|data]() alison.schofield
` (4 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: alison.schofield @ 2024-01-18 0:28 UTC (permalink / raw)
To: Vishal Verma; +Cc: Alison Schofield, nvdimm, linux-cxl
From: Alison Schofield <alison.schofield@intel.com>
CXL event tracing provides helpers to iterate through a trace
buffer and extract events of interest. It offers two parsing
options: a default parser that adds every field of an event to
a json object, and a private parsing option where the caller can
parse each event as it wishes.
Although the private parser can do some conditional parsing based
on field values, it has no method to receive additional information
needed to make parsing decisions in the callback.
Add a private_ctx field to the existing 'struct event_ctx'.
Replace the jlist_head parameter of the parse_event() callback with
private_ctx. The default parser continues to use jlist_head.
This is in preparation for adding a private parser requiring
additional context for cxl_poison events.
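
The callback pattern can be sketched without libtraceevent as shown
here. This is a minimal sketch: mock_record, mock_event_ctx, count_ctx,
count_region() and mock_parse() are hypothetical stand-ins for
illustration, not names from this series.

```c
#include <string.h>

/* Hypothetical stand-in for a trace record, for illustration only. */
struct mock_record {
	const char *region;
};

struct mock_event_ctx {
	void *private_ctx;	/* handed back to parse_event() untouched */
	int (*parse_event)(const struct mock_record *record,
			   void *private_ctx);
};

/* Caller-defined context: what to look for, plus a running count. */
struct count_ctx {
	const char *region_name;
	int matches;
};

/* Private parser: uses its own context to decide what to do. */
static int count_region(const struct mock_record *record, void *private_ctx)
{
	struct count_ctx *c = private_ctx;

	if (record->region && strcmp(record->region, c->region_name) == 0)
		c->matches++;
	return 0;
}

/* Dispatch in the style of cxl_event_parse(): pass private_ctx through. */
static int mock_parse(struct mock_event_ctx *ectx,
		      const struct mock_record *record)
{
	if (ectx->parse_event)
		return ectx->parse_event(record, ectx->private_ctx);
	return 0;
}
```

Because the context is an opaque void pointer, the tracing helpers need
no knowledge of what a given private parser tracks.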
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
cxl/event_trace.c | 2 +-
cxl/event_trace.h | 3 ++-
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/cxl/event_trace.c b/cxl/event_trace.c
index 269060898118..fbf7a77235ff 100644
--- a/cxl/event_trace.c
+++ b/cxl/event_trace.c
@@ -215,7 +215,7 @@ static int cxl_event_parse(struct tep_event *event, struct tep_record *record,
if (event_ctx->parse_event)
return event_ctx->parse_event(event, record,
- &event_ctx->jlist_head);
+ event_ctx->private_ctx);
return cxl_event_to_json(event, record, &event_ctx->jlist_head);
}
diff --git a/cxl/event_trace.h b/cxl/event_trace.h
index 7f7773b2201f..ec61962abbc6 100644
--- a/cxl/event_trace.h
+++ b/cxl/event_trace.h
@@ -16,8 +16,9 @@ struct event_ctx {
struct list_head jlist_head;
const char *event_name; /* optional */
int event_pid; /* optional */
+ void *private_ctx; /* required with parse_event() */
int (*parse_event)(struct tep_event *event, struct tep_record *record,
- struct list_head *jlist_head); /* optional */
+ void *private_ctx);/* optional */
};
int cxl_parse_events(struct tracefs_instance *inst, struct event_ctx *ectx);
--
2.37.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v6 3/7] cxl/event_trace: add a private context for private parsers
2024-01-18 0:28 ` [PATCH v6 3/7] cxl/event_trace: add a private context for private parsers alison.schofield
@ 2024-01-19 21:08 ` Dave Jiang
0 siblings, 0 replies; 16+ messages in thread
From: Dave Jiang @ 2024-01-19 21:08 UTC (permalink / raw)
To: alison.schofield, Vishal Verma; +Cc: nvdimm, linux-cxl
On 1/17/24 17:28, alison.schofield@intel.com wrote:
> From: Alison Schofield <alison.schofield@intel.com>
>
> CXL event tracing provides helpers to iterate through a trace
> buffer and extract events of interest. It offers two parsing
> options: a default parser that adds every field of an event to
> a json object, and a private parsing option where the caller can
> parse each event as it wishes.
>
> Although the private parser can do some conditional parsing based
> on field values, it has no method to receive additional information
> needed to make parsing decisions in the callback.
>
> Add a private_ctx field to the existing 'struct event_ctx'.
> Replace the jlist_head parameter of the parse_event() callback with
> private_ctx. The default parser continues to use jlist_head.
>
> This is in preparation for adding a private parser requiring
> additional context for cxl_poison events.
>
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> cxl/event_trace.c | 2 +-
> cxl/event_trace.h | 3 ++-
> 2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/cxl/event_trace.c b/cxl/event_trace.c
> index 269060898118..fbf7a77235ff 100644
> --- a/cxl/event_trace.c
> +++ b/cxl/event_trace.c
> @@ -215,7 +215,7 @@ static int cxl_event_parse(struct tep_event *event, struct tep_record *record,
>
> if (event_ctx->parse_event)
> return event_ctx->parse_event(event, record,
> - &event_ctx->jlist_head);
> + event_ctx->private_ctx);
>
> return cxl_event_to_json(event, record, &event_ctx->jlist_head);
> }
> diff --git a/cxl/event_trace.h b/cxl/event_trace.h
> index 7f7773b2201f..ec61962abbc6 100644
> --- a/cxl/event_trace.h
> +++ b/cxl/event_trace.h
> @@ -16,8 +16,9 @@ struct event_ctx {
> struct list_head jlist_head;
> const char *event_name; /* optional */
> int event_pid; /* optional */
> + void *private_ctx; /* required with parse_event() */
> int (*parse_event)(struct tep_event *event, struct tep_record *record,
> - struct list_head *jlist_head); /* optional */
> + void *private_ctx);/* optional */
> };
>
> int cxl_parse_events(struct tracefs_instance *inst, struct event_ctx *ectx);
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v6 4/7] cxl/event_trace: add helpers get_field_[string|data]()
2024-01-18 0:27 [ndctl PATCH v6 0/7] Support poison list retrieval alison.schofield
` (2 preceding siblings ...)
2024-01-18 0:28 ` [PATCH v6 3/7] cxl/event_trace: add a private context for private parsers alison.schofield
@ 2024-01-18 0:28 ` alison.schofield
2024-01-19 21:18 ` Dave Jiang
2024-01-18 0:28 ` [PATCH v6 5/7] cxl/list: collect and parse media_error records alison.schofield
` (3 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: alison.schofield @ 2024-01-18 0:28 UTC (permalink / raw)
To: Vishal Verma; +Cc: Alison Schofield, nvdimm, linux-cxl
From: Alison Schofield <alison.schofield@intel.com>
Add helpers to extract the value of an event record field given the
field name. This is useful when the user knows the name and format
of the field and simply needs to get it. Add char * and unsigned
char * versions to support string and u64 data fields.
This is in preparation for adding a private parser of cxl_poison
events.
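
The lookup by name can be sketched against a mock field table as shown
here. This is a minimal sketch: struct mock_field and find_field() are
hypothetical names, and the NULL-terminated array only imitates the
shape of what tep_event_fields() returns.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct tep_format_field. */
struct mock_field {
	const char *name;
};

/* Walk a NULL-terminated array of fields and return the first name match. */
static const struct mock_field *
find_field(const struct mock_field *const *fields, const char *name)
{
	if (!fields)
		return NULL;

	for (int i = 0; fields[i]; i++) {
		if (strcmp(fields[i]->name, name) == 0)
			return fields[i];
	}
	return NULL;
}
```

Callers must check for NULL before dereferencing the result, since an
unknown field name is reported only through the return value.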
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
cxl/event_trace.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
cxl/event_trace.h | 5 ++++-
2 files changed, 50 insertions(+), 1 deletion(-)
diff --git a/cxl/event_trace.c b/cxl/event_trace.c
index fbf7a77235ff..8d04d8d34194 100644
--- a/cxl/event_trace.c
+++ b/cxl/event_trace.c
@@ -15,6 +15,52 @@
#define _GNU_SOURCE
#include <string.h>
+static struct tep_format_field *__find_field(struct tep_event *event,
+ const char *name)
+{
+ struct tep_format_field **fields;
+
+ fields = tep_event_fields(event);
+ if (!fields)
+ return NULL;
+
+ for (int i = 0; fields[i]; i++) {
+ struct tep_format_field *f = fields[i];
+
+ if (strcmp(f->name, name) != 0)
+ continue;
+
+ return f;
+ }
+ return NULL;
+}
+
+unsigned char *cxl_get_field_data(struct tep_event *event,
+ struct tep_record *record, const char *name)
+{
+ struct tep_format_field *f;
+ int len;
+
+ f = __find_field(event, name);
+ if (!f)
+ return NULL;
+
+ return tep_get_field_raw(NULL, event, f->name, record, &len, 0);
+}
+
+char *cxl_get_field_string(struct tep_event *event, struct tep_record *record,
+ const char *name)
+{
+ struct tep_format_field *f;
+ int len;
+
+ f = __find_field(event, name);
+ if (!f)
+ return NULL;
+
+ return tep_get_field_raw(NULL, event, f->name, record, &len, 0);
+}
+
static struct json_object *num_to_json(void *num, int elem_size, unsigned long flags)
{
bool sign = flags & TEP_FIELD_IS_SIGNED;
diff --git a/cxl/event_trace.h b/cxl/event_trace.h
index ec61962abbc6..6252f583097a 100644
--- a/cxl/event_trace.h
+++ b/cxl/event_trace.h
@@ -25,5 +25,8 @@ int cxl_parse_events(struct tracefs_instance *inst, struct event_ctx *ectx);
int cxl_event_tracing_enable(struct tracefs_instance *inst, const char *system,
const char *event);
int cxl_event_tracing_disable(struct tracefs_instance *inst);
-
+char *cxl_get_field_string(struct tep_event *event, struct tep_record *record,
+ const char *name);
+unsigned char *cxl_get_field_data(struct tep_event *event,
+ struct tep_record *record, const char *name);
#endif
--
2.37.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v6 4/7] cxl/event_trace: add helpers get_field_[string|data]()
2024-01-18 0:28 ` [PATCH v6 4/7] cxl/event_trace: add helpers get_field_[string|data]() alison.schofield
@ 2024-01-19 21:18 ` Dave Jiang
0 siblings, 0 replies; 16+ messages in thread
From: Dave Jiang @ 2024-01-19 21:18 UTC (permalink / raw)
To: alison.schofield, Vishal Verma; +Cc: nvdimm, linux-cxl
On 1/17/24 17:28, alison.schofield@intel.com wrote:
> From: Alison Schofield <alison.schofield@intel.com>
>
> Add helpers to extract the value of an event record field given the
> field name. This is useful when the user knows the name and format
> of the field and simply needs to get it. Add char * and unsigned
> char * versions to support string and u64 data fields.
>
> This is in preparation for adding a private parser of cxl_poison
> events.
>
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> cxl/event_trace.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> cxl/event_trace.h | 5 ++++-
> 2 files changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/cxl/event_trace.c b/cxl/event_trace.c
> index fbf7a77235ff..8d04d8d34194 100644
> --- a/cxl/event_trace.c
> +++ b/cxl/event_trace.c
> @@ -15,6 +15,52 @@
> #define _GNU_SOURCE
> #include <string.h>
>
> +static struct tep_format_field *__find_field(struct tep_event *event,
> + const char *name)
> +{
> + struct tep_format_field **fields;
> +
> + fields = tep_event_fields(event);
> + if (!fields)
> + return NULL;
> +
> + for (int i = 0; fields[i]; i++) {
> + struct tep_format_field *f = fields[i];
> +
> + if (strcmp(f->name, name) != 0)
> + continue;
> +
> + return f;
> + }
> + return NULL;
> +}
> +
> +unsigned char *cxl_get_field_data(struct tep_event *event,
> + struct tep_record *record, const char *name)
> +{
> + struct tep_format_field *f;
> + int len;
> +
> + f = __find_field(event, name);
> + if (!f)
> + return NULL;
> +
> + return tep_get_field_raw(NULL, event, f->name, record, &len, 0);
> +}
> +
> +char *cxl_get_field_string(struct tep_event *event, struct tep_record *record,
> + const char *name)
> +{
> + struct tep_format_field *f;
> + int len;
> +
> + f = __find_field(event, name);
> + if (!f)
> + return NULL;
> +
> + return tep_get_field_raw(NULL, event, f->name, record, &len, 0);
> +}
> +
> static struct json_object *num_to_json(void *num, int elem_size, unsigned long flags)
> {
> bool sign = flags & TEP_FIELD_IS_SIGNED;
> diff --git a/cxl/event_trace.h b/cxl/event_trace.h
> index ec61962abbc6..6252f583097a 100644
> --- a/cxl/event_trace.h
> +++ b/cxl/event_trace.h
> @@ -25,5 +25,8 @@ int cxl_parse_events(struct tracefs_instance *inst, struct event_ctx *ectx);
> int cxl_event_tracing_enable(struct tracefs_instance *inst, const char *system,
> const char *event);
> int cxl_event_tracing_disable(struct tracefs_instance *inst);
> -
> +char *cxl_get_field_string(struct tep_event *event, struct tep_record *record,
> + const char *name);
> +unsigned char *cxl_get_field_data(struct tep_event *event,
> + struct tep_record *record, const char *name);
> #endif
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v6 5/7] cxl/list: collect and parse media_error records
2024-01-18 0:27 [ndctl PATCH v6 0/7] Support poison list retrieval alison.schofield
` (3 preceding siblings ...)
2024-01-18 0:28 ` [PATCH v6 4/7] cxl/event_trace: add helpers get_field_[string|data]() alison.schofield
@ 2024-01-18 0:28 ` alison.schofield
2024-01-18 0:28 ` [PATCH v6 6/7] cxl/list: add --media-errors option to cxl list alison.schofield
` (2 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: alison.schofield @ 2024-01-18 0:28 UTC (permalink / raw)
To: Vishal Verma; +Cc: Alison Schofield, nvdimm, linux-cxl
From: Alison Schofield <alison.schofield@intel.com>
Media_error records are logged as events in the kernel tracing
subsystem. To prepare the media_error records for cxl list, enable
tracing, trigger the poison list read, and parse the generated
cxl_poison events into a json representation.
Use the event_trace private parsing option to customize the json
representation based on cxl-list calling options and event field
settings.
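
The decoding of the source and flags fields can be sketched on its own
as shown here. This is a minimal sketch: poison_source_name() and
poison_flags_str() are hypothetical helpers, the numeric values mirror
the defines added by this patch, and snprintf() stands in for the
strcat() sequence used in the patch.

```c
#include <stdint.h>
#include <stdio.h>

/* Same bit assignments as the CXL_POISON_FLAG_* defines in this patch. */
#define POISON_FLAG_MORE	(1u << 0)
#define POISON_FLAG_OVERFLOW	(1u << 1)
#define POISON_FLAG_SCANNING	(1u << 2)

/* Map a Media Error Record source value to a display string. */
static const char *poison_source_name(uint8_t source)
{
	switch (source) {
	case 0: return "Unknown";
	case 1: return "External";
	case 2: return "Internal";
	case 3: return "Injected";
	case 7: return "Vendor";
	default: return "Reserved";
	}
}

/* Write a comma-terminated list of set flags into buf and return buf. */
static char *poison_flags_str(uint8_t flags, char *buf, size_t len)
{
	snprintf(buf, len, "%s%s%s",
		 (flags & POISON_FLAG_MORE) ? "More," : "",
		 (flags & POISON_FLAG_SCANNING) ? "Scanning," : "",
		 (flags & POISON_FLAG_OVERFLOW) ? "Overflow," : "");
	return buf;
}
```

Using snprintf() bounds the write to the buffer length, so adding a
flag name later cannot overrun the buffer.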
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
cxl/json.c | 218 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 218 insertions(+)
diff --git a/cxl/json.c b/cxl/json.c
index 7678d02020b6..abe77e1f86d3 100644
--- a/cxl/json.c
+++ b/cxl/json.c
@@ -1,16 +1,20 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright (C) 2015-2021 Intel Corporation. All rights reserved.
#include <limits.h>
+#include <errno.h>
#include <util/json.h>
+#include <util/bitmap.h>
#include <uuid/uuid.h>
#include <cxl/libcxl.h>
#include <json-c/json.h>
#include <json-c/printbuf.h>
#include <ccan/short_types/short_types.h>
+#include <tracefs/tracefs.h>
#include "filter.h"
#include "json.h"
#include "../daxctl/json.h"
+#include "event_trace.h"
#define CXL_FW_VERSION_STR_LEN 16
#define CXL_FW_MAX_SLOTS 4
@@ -571,6 +575,208 @@ err_jobj:
return NULL;
}
+/* CXL Spec 3.1 Table 8-140 Media Error Record */
+#define CXL_POISON_SOURCE_UNKNOWN 0
+#define CXL_POISON_SOURCE_EXTERNAL 1
+#define CXL_POISON_SOURCE_INTERNAL 2
+#define CXL_POISON_SOURCE_INJECTED 3
+#define CXL_POISON_SOURCE_VENDOR 7
+
+/* CXL Spec 3.1 Table 8-139 Get Poison List Output Payload */
+#define CXL_POISON_FLAG_MORE BIT(0)
+#define CXL_POISON_FLAG_OVERFLOW BIT(1)
+#define CXL_POISON_FLAG_SCANNING BIT(2)
+
+struct poison_event_ctx {
+ struct json_object *jpoison;
+ const char *region_name;
+ unsigned long flags;
+};
+
+int poison_event_to_json(struct tep_event *event, struct tep_record *record,
+ void *ctx)
+{
+ struct poison_event_ctx *p_ctx = (struct poison_event_ctx *)ctx;
+ struct json_object *jobj, *jp, *jpoison = p_ctx->jpoison;
+ const char *region_name = p_ctx->region_name;
+ unsigned long flags = p_ctx->flags;
+ bool overflow = false;
+ unsigned char *data;
+ int pflags;
+ char *str;
+
+ jp = json_object_new_object();
+ if (!jp)
+ return -ENOMEM;
+
+ str = cxl_get_field_string(event, record, "region");
+
+ /* Skip records not in this region when listing by region */
+ if ((region_name) && (strcmp(region_name, str) != 0)) {
+ json_object_put(jp);
+ return 0;
+ }
+ /* Only display region name in by memdev listings */
+ if (!region_name && strlen(str)) {
+ jobj = json_object_new_string(str);
+ if (jobj)
+ json_object_object_add(jp, "region", jobj);
+ }
+ /* Only display memdev name in by region listings */
+ if (region_name) {
+ str = cxl_get_field_string(event, record, "memdev");
+ jobj = json_object_new_string(str);
+ if (jobj)
+ json_object_object_add(jp, "memdev", jobj);
+ }
+
+ data = cxl_get_field_data(event, record, "dpa");
+ jobj = util_json_object_hex(*(uint64_t *)data, flags);
+ if (jobj)
+ json_object_object_add(jp, "dpa", jobj);
+
+ data = cxl_get_field_data(event, record, "dpa_length");
+ jobj = util_json_object_size(*(uint32_t *)data, flags);
+ if (jobj)
+ json_object_object_add(jp, "dpa_length", jobj);
+
+ data = cxl_get_field_data(event, record, "hpa");
+ if (*(uint64_t *)data != ULLONG_MAX) {
+ jobj = util_json_object_hex(*(uint64_t *)data, flags);
+ if (jobj)
+ json_object_object_add(jp, "hpa", jobj);
+ }
+
+ str = cxl_get_field_string(event, record, "source");
+ switch (*(uint8_t *)str) {
+ case CXL_POISON_SOURCE_UNKNOWN:
+ jobj = json_object_new_string("Unknown");
+ break;
+ case CXL_POISON_SOURCE_EXTERNAL:
+ jobj = json_object_new_string("External");
+ break;
+ case CXL_POISON_SOURCE_INTERNAL:
+ jobj = json_object_new_string("Internal");
+ break;
+ case CXL_POISON_SOURCE_INJECTED:
+ jobj = json_object_new_string("Injected");
+ break;
+ case CXL_POISON_SOURCE_VENDOR:
+ jobj = json_object_new_string("Vendor");
+ break;
+ default:
+ jobj = json_object_new_string("Reserved");
+ }
+ json_object_object_add(jp, "source", jobj);
+
+ str = cxl_get_field_string(event, record, "flags");
+ pflags = *(uint8_t *)str;
+ if (pflags) {
+ char flag_str[32] = { '\0' };
+
+ if (pflags & CXL_POISON_FLAG_MORE)
+ strcat(flag_str, "More,");
+ if (pflags & CXL_POISON_FLAG_SCANNING)
+ strcat(flag_str, "Scanning,");
+ if (pflags & CXL_POISON_FLAG_OVERFLOW) {
+ strcat(flag_str, "Overflow,");
+ overflow = true;
+ }
+ jobj = json_object_new_string(flag_str);
+ if (jobj)
+ json_object_object_add(jp, "flags", jobj);
+ }
+ if (overflow) {
+ data = cxl_get_field_data(event, record, "overflow_ts");
+ jobj = util_json_object_hex(*(uint64_t *)data, flags);
+ if (jobj)
+ json_object_object_add(jp, "overflow_t", jobj);
+ }
+
+ json_object_array_add(jpoison, jp);
+
+ return 0;
+}
+
+static struct json_object *
+util_cxl_poison_events_to_json(struct tracefs_instance *inst,
+ const char *region_name, unsigned long flags)
+{
+ struct poison_event_ctx p_ctx = {
+ .region_name = region_name,
+ .flags = flags,
+ };
+ struct event_ctx ectx = {
+ .event_name = "cxl_poison",
+ .event_pid = getpid(),
+ .system = "cxl",
+ .private_ctx = &p_ctx,
+ .parse_event = poison_event_to_json,
+ };
+ int rc = 0;
+
+ p_ctx.jpoison = json_object_new_array();
+ if (!p_ctx.jpoison)
+ return NULL;
+
+ rc = cxl_parse_events(inst, &ectx);
+ if (rc < 0) {
+ fprintf(stderr, "Failed to parse events: %d\n", rc);
+ json_object_put(p_ctx.jpoison);
+ return NULL;
+ }
+
+ if (json_object_array_length(p_ctx.jpoison) == 0) {
+ json_object_put(p_ctx.jpoison);
+ return NULL;
+ }
+
+ return p_ctx.jpoison;
+}
+
+static struct json_object *
+util_cxl_poison_list_to_json(struct cxl_region *region,
+ struct cxl_memdev *memdev,
+ unsigned long flags)
+{
+ struct json_object *jpoison = NULL;
+ struct tracefs_instance *inst;
+ const char *region_name;
+ int rc;
+
+ inst = tracefs_instance_create("cxl list");
+ if (!inst) {
+ fprintf(stderr, "tracefs_instance_create() failed\n");
+ return NULL;
+ }
+
+ rc = cxl_event_tracing_enable(inst, "cxl", "cxl_poison");
+ if (rc < 0) {
+ fprintf(stderr, "Failed to enable trace: %d\n", rc);
+ goto err_free;
+ }
+
+ if (region)
+ rc = cxl_region_trigger_poison_list(region);
+ else
+ rc = cxl_memdev_trigger_poison_list(memdev);
+ if (rc)
+ goto err_free;
+
+ rc = cxl_event_tracing_disable(inst);
+ if (rc < 0) {
+ fprintf(stderr, "Failed to disable trace: %d\n", rc);
+ goto err_free;
+ }
+
+ region_name = region ? cxl_region_get_devname(region) : NULL;
+ jpoison = util_cxl_poison_events_to_json(inst, region_name, flags);
+
+err_free:
+ tracefs_instance_free(inst);
+ return jpoison;
+}
+
struct json_object *util_cxl_memdev_to_json(struct cxl_memdev *memdev,
unsigned long flags)
{
@@ -649,6 +855,12 @@ struct json_object *util_cxl_memdev_to_json(struct cxl_memdev *memdev,
json_object_object_add(jdev, "firmware", jobj);
}
+ if (flags & UTIL_JSON_MEDIA_ERRORS) {
+ jobj = util_cxl_poison_list_to_json(NULL, memdev, flags);
+ if (jobj)
+ json_object_object_add(jdev, "media_errors", jobj);
+ }
+
json_object_set_userdata(jdev, memdev, NULL);
return jdev;
}
@@ -987,6 +1199,12 @@ struct json_object *util_cxl_region_to_json(struct cxl_region *region,
json_object_object_add(jregion, "state", jobj);
}
+ if (flags & UTIL_JSON_MEDIA_ERRORS) {
+ jobj = util_cxl_poison_list_to_json(region, NULL, flags);
+ if (jobj)
+ json_object_object_add(jregion, "media_errors", jobj);
+ }
+
util_cxl_mappings_append_json(jregion, region, flags);
if (flags & UTIL_JSON_DAX) {
--
2.37.3
* [PATCH v6 6/7] cxl/list: add --media-errors option to cxl list
2024-01-18 0:27 [ndctl PATCH v6 0/7] Support poison list retrieval alison.schofield
` (4 preceding siblings ...)
2024-01-18 0:28 ` [PATCH v6 5/7] cxl/list: collect and parse media_error records alison.schofield
@ 2024-01-18 0:28 ` alison.schofield
2024-01-18 0:28 ` [PATCH v6 7/7] cxl/test: add cxl-poison.sh unit test alison.schofield
2024-01-18 21:56 ` [ndctl PATCH v6 0/7] Support poison list retrieval Dan Williams
7 siblings, 0 replies; 16+ messages in thread
From: alison.schofield @ 2024-01-18 0:28 UTC (permalink / raw)
To: Vishal Verma; +Cc: Alison Schofield, nvdimm, linux-cxl
From: Alison Schofield <alison.schofield@intel.com>
The --media-errors option to 'cxl list' retrieves poison lists from
memory devices supporting the capability and displays the returned
media_error records in the cxl list json. This option can apply to
memdevs or regions.
Example usage is included in the Documentation/cxl/cxl-list.txt update.
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
Documentation/cxl/cxl-list.txt | 71 ++++++++++++++++++++++++++++++++++
cxl/filter.h | 3 ++
cxl/list.c | 2 +
3 files changed, 76 insertions(+)
diff --git a/Documentation/cxl/cxl-list.txt b/Documentation/cxl/cxl-list.txt
index 838de4086678..6105c896938c 100644
--- a/Documentation/cxl/cxl-list.txt
+++ b/Documentation/cxl/cxl-list.txt
@@ -415,6 +415,77 @@ OPTIONS
--region::
Specify CXL region device name(s), or device id(s), to filter the listing.
+-L::
+--media-errors::
+ Include media-error information. The poison list is retrieved from the
+ device(s) and media_error records are added to the listing. Apply this
+ option to memdevs and regions where devices support the poison list
+ capability.
+
+----
+# cxl list -m mem1 --media-errors
+[
+ {
+ "memdev":"mem1",
+ "pmem_size":1073741824,
+ "ram_size":1073741824,
+ "serial":1,
+ "numa_node":1,
+ "host":"cxl_mem.1",
+ "media_errors":[
+ {
+ "dpa":0,
+ "dpa_length":64,
+ "source":"Injected"
+ },
+ {
+ "region":"region5",
+ "dpa":1073741824,
+ "dpa_length":64,
+ "hpa":1035355557888,
+ "source":"Injected"
+ },
+ {
+ "region":"region5",
+ "dpa":1073745920,
+ "dpa_length":64,
+ "hpa":1035355566080,
+ "source":"Injected"
+ }
+ ]
+ }
+]
+
+# cxl list -r region5 --media-errors
+[
+ {
+ "region":"region5",
+ "resource":1035355553792,
+ "size":2147483648,
+ "type":"pmem",
+ "interleave_ways":2,
+ "interleave_granularity":4096,
+ "decode_state":"commit",
+ "media_errors":[
+ {
+ "memdev":"mem1",
+ "dpa":1073741824,
+ "dpa_length":64,
+ "hpa":1035355557888,
+ "source":"Injected"
+ },
+ {
+ "memdev":"mem1",
+ "dpa":1073745920,
+ "dpa_length":64,
+ "hpa":1035355566080,
+ "source":"Injected"
+ }
+ ]
+ }
+]
+----
+
-v::
--verbose::
Increase verbosity of the output. This can be specified
diff --git a/cxl/filter.h b/cxl/filter.h
index 3f65990f835a..956a46e0c7a9 100644
--- a/cxl/filter.h
+++ b/cxl/filter.h
@@ -30,6 +30,7 @@ struct cxl_filter_params {
bool fw;
bool alert_config;
bool dax;
+ bool media_errors;
int verbose;
struct log_ctx ctx;
};
@@ -88,6 +89,8 @@ static inline unsigned long cxl_filter_to_flags(struct cxl_filter_params *param)
flags |= UTIL_JSON_ALERT_CONFIG;
if (param->dax)
flags |= UTIL_JSON_DAX | UTIL_JSON_DAX_DEVS;
+ if (param->media_errors)
+ flags |= UTIL_JSON_MEDIA_ERRORS;
return flags;
}
diff --git a/cxl/list.c b/cxl/list.c
index 93ba51ef895c..bcdee0afd405 100644
--- a/cxl/list.c
+++ b/cxl/list.c
@@ -57,6 +57,8 @@ static const struct option options[] = {
"include memory device firmware information"),
OPT_BOOLEAN('A', "alert-config", &param.alert_config,
"include alert configuration information"),
+ OPT_BOOLEAN('L', "media-errors", &param.media_errors,
+ "include media-error information "),
OPT_INCR('v', "verbose", &param.verbose, "increase output detail"),
#ifdef ENABLE_DEBUG
OPT_BOOLEAN(0, "debug", &debug, "debug list walk"),
--
2.37.3
* [PATCH v6 7/7] cxl/test: add cxl-poison.sh unit test
2024-01-18 0:27 [ndctl PATCH v6 0/7] Support poison list retrieval alison.schofield
` (5 preceding siblings ...)
2024-01-18 0:28 ` [PATCH v6 6/7] cxl/list: add --media-errors option to cxl list alison.schofield
@ 2024-01-18 0:28 ` alison.schofield
2024-01-18 21:56 ` [ndctl PATCH v6 0/7] Support poison list retrieval Dan Williams
7 siblings, 0 replies; 16+ messages in thread
From: alison.schofield @ 2024-01-18 0:28 UTC (permalink / raw)
To: Vishal Verma; +Cc: Alison Schofield, nvdimm, linux-cxl
From: Alison Schofield <alison.schofield@intel.com>
Exercise cxl list, libcxl, and driver pieces of the get poison list
pathway. Inject and clear poison using debugfs and use cxl-cli to
read the poison list by memdev and by region.
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
test/cxl-poison.sh | 133 +++++++++++++++++++++++++++++++++++++++++++++
test/meson.build | 2 +
2 files changed, 135 insertions(+)
create mode 100644 test/cxl-poison.sh
diff --git a/test/cxl-poison.sh b/test/cxl-poison.sh
new file mode 100644
index 000000000000..91c5c0bed1c2
--- /dev/null
+++ b/test/cxl-poison.sh
@@ -0,0 +1,133 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2023 Intel Corporation. All rights reserved.
+
+. "$(dirname "$0")"/common
+
+rc=77
+
+set -ex
+
+trap 'err $LINENO' ERR
+
+check_prereq "jq"
+
+modprobe -r cxl_test
+modprobe cxl_test
+
+rc=1
+
+# THEORY OF OPERATION: Exercise cxl-cli and cxl driver ability to
+# inject, clear, and get the poison list. Do it by memdev and by region.
+
+find_memdev()
+{
+ readarray -t capable_mems < <("$CXL" list -b "$CXL_TEST_BUS" -M |
+ jq -r ".[] | select(.pmem_size != null) |
+ select(.ram_size != null) | .memdev")
+
+ if [ ${#capable_mems[@]} == 0 ]; then
+ echo "no memdevs found for test"
+ err "$LINENO"
+ fi
+
+ memdev=${capable_mems[0]}
+}
+
+create_x2_region()
+{
+ # Find an x2 decoder
+ decoder="$($CXL list -b "$CXL_TEST_BUS" -D -d root | jq -r ".[] |
+ select(.pmem_capable == true) |
+ select(.nr_targets == 2) |
+ .decoder")"
+
+ # Find a memdev for each host-bridge interleave position
+ port_dev0="$($CXL list -T -d "$decoder" | jq -r ".[] |
+ .targets | .[] | select(.position == 0) | .target")"
+ port_dev1="$($CXL list -T -d "$decoder" | jq -r ".[] |
+ .targets | .[] | select(.position == 1) | .target")"
+ mem0="$($CXL list -M -p "$port_dev0" | jq -r ".[0].memdev")"
+ mem1="$($CXL list -M -p "$port_dev1" | jq -r ".[0].memdev")"
+
+ region="$($CXL create-region -d "$decoder" -m "$mem0" "$mem1" |
+ jq -r ".region")"
+ if [[ ! $region ]]; then
+ echo "create-region failed for $decoder"
+ err "$LINENO"
+ fi
+ echo "$region"
+}
+
+# When cxl-cli support for inject and clear arrives, replace
+# the writes to /sys/kernel/debug with the new cxl commands.
+
+inject_poison_sysfs()
+{
+ memdev="$1"
+ addr="$2"
+
+ echo "$addr" > /sys/kernel/debug/cxl/"$memdev"/inject_poison
+}
+
+clear_poison_sysfs()
+{
+ memdev="$1"
+ addr="$2"
+
+ echo "$addr" > /sys/kernel/debug/cxl/"$memdev"/clear_poison
+}
+
+validate_poison_found()
+{
+ list_by="$1"
+ nr_expect="$2"
+
+ poison_list="$($CXL list $list_by --media-errors |
+ jq -r '.[].media_errors')"
+ nr_found=$(jq "length" <<< "$poison_list")
+ if [ "$nr_found" -ne "$nr_expect" ]; then
+ echo "$nr_expect poison records expected, $nr_found found"
+ err "$LINENO"
+ fi
+}
+
+test_poison_by_memdev()
+{
+ find_memdev
+ inject_poison_sysfs "$memdev" "0x40000000"
+ inject_poison_sysfs "$memdev" "0x40001000"
+ inject_poison_sysfs "$memdev" "0x600"
+ inject_poison_sysfs "$memdev" "0x0"
+ validate_poison_found "-m $memdev" 4
+
+ clear_poison_sysfs "$memdev" "0x40000000"
+ clear_poison_sysfs "$memdev" "0x40001000"
+ clear_poison_sysfs "$memdev" "0x600"
+ clear_poison_sysfs "$memdev" "0x0"
+ validate_poison_found "-m $memdev" 0
+}
+
+test_poison_by_region()
+{
+ create_x2_region
+ inject_poison_sysfs "$mem0" "0x40000000"
+ inject_poison_sysfs "$mem1" "0x40000000"
+ validate_poison_found "-r $region" 2
+
+ clear_poison_sysfs "$mem0" "0x40000000"
+ clear_poison_sysfs "$mem1" "0x40000000"
+ validate_poison_found "-r $region" 0
+}
+
+# Turn tracing on. Note that 'cxl list --media-errors' does toggle the tracing.
+# Turning it on here allows the test user to also view inject and clear
+# trace events.
+echo 1 > /sys/kernel/tracing/events/cxl/cxl_poison/enable
+
+test_poison_by_memdev
+test_poison_by_region
+
+check_dmesg "$LINENO"
+
+modprobe -r cxl-test
diff --git a/test/meson.build b/test/meson.build
index 224adaf41fcc..2706fa5d633c 100644
--- a/test/meson.build
+++ b/test/meson.build
@@ -157,6 +157,7 @@ cxl_create_region = find_program('cxl-create-region.sh')
cxl_xor_region = find_program('cxl-xor-region.sh')
cxl_update_firmware = find_program('cxl-update-firmware.sh')
cxl_events = find_program('cxl-events.sh')
+cxl_poison = find_program('cxl-poison.sh')
tests = [
[ 'libndctl', libndctl, 'ndctl' ],
@@ -186,6 +187,7 @@ tests = [
[ 'cxl-create-region.sh', cxl_create_region, 'cxl' ],
[ 'cxl-xor-region.sh', cxl_xor_region, 'cxl' ],
[ 'cxl-events.sh', cxl_events, 'cxl' ],
+ [ 'cxl-poison.sh', cxl_poison, 'cxl' ],
]
if get_option('destructive').enabled()
--
2.37.3
* Re: [ndctl PATCH v6 0/7] Support poison list retrieval
2024-01-18 0:27 [ndctl PATCH v6 0/7] Support poison list retrieval alison.schofield
` (6 preceding siblings ...)
2024-01-18 0:28 ` [PATCH v6 7/7] cxl/test: add cxl-poison.sh unit test alison.schofield
@ 2024-01-18 21:56 ` Dan Williams
2024-01-18 23:34 ` Alison Schofield
7 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2024-01-18 21:56 UTC (permalink / raw)
To: alison.schofield, Vishal Verma; +Cc: Alison Schofield, nvdimm, linux-cxl
alison.schofield@ wrote:
> From: Alison Schofield <alison.schofield@intel.com>
>
> Changes since v5:
> - Use a private parser for cxl_poison events. (Dan)
> Previously used the default parser and re-parsed per the cxl-list
> needs. Replace that with a private parsing method for cxl_poison.
> - Add a private context to support private parsers.
> - Add helpers to use with the cxl_poison parser.
> - cxl list json: drop nr_records field (Dan)
> - cxl list option: replace "poison" w "media-errors" (Dan)
> - cxl list json: replace "poison" w "media_errors" (Dan)
> - Link to v5: https://lore.kernel.org/linux-cxl/cover.1700615159.git.alison.schofield@intel.com/
>
>
> Begin cover letter:
>
> Add the option to add a memory device's poison list to the cxl-list
> json output. Offer the option by memdev and by region. Sample usage:
>
> # cxl list -m mem1 --media-errors
> [
> {
> "memdev":"mem1",
> "pmem_size":1073741824,
> "ram_size":1073741824,
> "serial":1,
> "numa_node":1,
> "host":"cxl_mem.1",
> "media_errors":[
> {
> "dpa":0,
> "dpa_length":64,
> "source":"Injected"
> },
> {
> "region":"region5",
It feels odd to list the region here. I feel like what really matters is
to list the endpoint decoder and if someone wants to associate endpoint
decoder to region, or endpoint decoder to memdev there are other queries
for that.
Then this format does not change between the "region" listing and
"memdev" listing, they both just output the endpoint decoder and leave
the rest to follow-on queries.
For example I expect operations software has already recorded the
endpoint decoder to region mapping, so when this data comes in the
endpoint decoder is a key to make that association. Otherwise:
cxl list -RT -e $endpoint_decoder
...can answer follow up questions about what is impacted by a given
media error record.
> "dpa":1073741824,
> "dpa_length":64,
The dpa_length is also the hpa_length, right? So maybe just call the
field "length".
> "hpa":1035355557888,
> "source":"Injected"
> },
> {
> "region":"region5",
> "dpa":1073745920,
> "dpa_length":64,
> "hpa":1035355566080,
> "source":"Injected"
This "source" field feels like debug data. In production nobody is going
to be doing poison injection, and if the administrator injected it then
it's implied they know that status. Otherwise a media-error is a
media-error regardless of the source.
> }
> ]
> }
> ]
>
> # cxl list -r region5 --media-errors
> [
> {
> "region":"region5",
> "resource":1035355553792,
> "size":2147483648,
> "type":"pmem",
> "interleave_ways":2,
> "interleave_granularity":4096,
> "decode_state":"commit",
> "media_errors":[
> {
> "memdev":"mem1",
> "dpa":1073741824,
> "dpa_length":64,
> "hpa":1035355557888,
> "source":"Injected"
> },
> {
> "memdev":"mem1",
> "dpa":1073745920,
> "dpa_length":64,
> "hpa":1035355566080,
> "source":"Injected"
> }
> ]
> }
> ]
>
> Alison Schofield (7):
> libcxl: add interfaces for GET_POISON_LIST mailbox commands
> cxl: add an optional pid check to event parsing
> cxl/event_trace: add a private context for private parsers
> cxl/event_trace: add helpers get_field_[string|data]()
> cxl/list: collect and parse media_error records
> cxl/list: add --media-errors option to cxl list
> cxl/test: add cxl-poison.sh unit test
>
> Documentation/cxl/cxl-list.txt | 71 +++++++++++
> cxl/event_trace.c | 53 +++++++-
> cxl/event_trace.h | 9 +-
> cxl/filter.h | 3 +
> cxl/json.c | 218 +++++++++++++++++++++++++++++++++
> cxl/lib/libcxl.c | 47 +++++++
> cxl/lib/libcxl.sym | 6 +
> cxl/libcxl.h | 2 +
> cxl/list.c | 2 +
> test/cxl-poison.sh | 133 ++++++++++++++++++++
> test/meson.build | 2 +
> 11 files changed, 543 insertions(+), 3 deletions(-)
> create mode 100644 test/cxl-poison.sh
>
>
> base-commit: a871e6153b11fe63780b37cdcb1eb347b296095c
> --
> 2.37.3
>
>
* Re: [ndctl PATCH v6 0/7] Support poison list retrieval
2024-01-18 21:56 ` [ndctl PATCH v6 0/7] Support poison list retrieval Dan Williams
@ 2024-01-18 23:34 ` Alison Schofield
2024-01-18 23:55 ` Dan Williams
0 siblings, 1 reply; 16+ messages in thread
From: Alison Schofield @ 2024-01-18 23:34 UTC (permalink / raw)
To: Dan Williams; +Cc: Vishal Verma, nvdimm, linux-cxl
On Thu, Jan 18, 2024 at 01:56:51PM -0800, Dan Williams wrote:
> alison.schofield@ wrote:
> > From: Alison Schofield <alison.schofield@intel.com>
> >
> > Changes since v5:
> > - Use a private parser for cxl_poison events. (Dan)
> > Previously used the default parser and re-parsed per the cxl-list
> > needs. Replace that with a private parsing method for cxl_poison.
> > - Add a private context to support private parsers.
> > - Add helpers to use with the cxl_poison parser.
> > - cxl list json: drop nr_records field (Dan)
> > - cxl list option: replace "poison" w "media-errors" (Dan)
> > - cxl list json: replace "poison" w "media_errors" (Dan)
> > - Link to v5: https://lore.kernel.org/linux-cxl/cover.1700615159.git.alison.schofield@intel.com/
> >
> >
> > Begin cover letter:
> >
> > Add the option to add a memory device's poison list to the cxl-list
> > json output. Offer the option by memdev and by region. Sample usage:
> >
> > # cxl list -m mem1 --media-errors
> > [
> > {
> > "memdev":"mem1",
> > "pmem_size":1073741824,
> > "ram_size":1073741824,
> > "serial":1,
> > "numa_node":1,
> > "host":"cxl_mem.1",
> > "media_errors":[
> > {
> > "dpa":0,
> > "dpa_length":64,
> > "source":"Injected"
> > },
> > {
> > "region":"region5",
>
> It feels odd to list the region here. I feel like what really matters is
> to list the endpoint decoder and if someone wants to associate endpoint
> decoder to region, or endpoint decoder to memdev there are other queries
> for that.
>
> Then this format does not change between the "region" listing and
> "memdev" listing, they both just output the endpoint decoder and leave
> the rest to follow-on queries.
>
> For example I expect operations software has already recorded the
> endpoint decoder to region mapping, so when this data comes in the
> endpoint decoder is a key to make that association. Otherwise:
>
> cxl list -RT -e $endpoint_decoder
>
> ...can answer follow up questions about what is impacted by a given
> media error record.
I see it as a convenience offering, but I'm starting to see that your
stance is probably that a cxl-list option should only list additional
info provided by the option, and not include info that can be retrieved
elsewhere w cxl-list.
I plan to make this change to endpoint as you suggest.
>
> > "dpa":1073741824,
> > "dpa_length":64,
>
> The dpa_length is also the hpa_length, right? So maybe just call the
> field "length".
>
No, the length only refers to the device address space. I don't think
the hpa is guaranteed to be contiguous, so only the starting hpa addr
is offered.
hmm..should we call it 'size' because that seems to imply less
contiguous-ness than length?
Which should it be: 'dpa_length', 'size', or 'length'?
> > "hpa":1035355557888,
> > "source":"Injected"
> > },
> > {
> > "region":"region5",
> > "dpa":1073745920,
> > "dpa_length":64,
> > "hpa":1035355566080,
> > "source":"Injected"
>
> This "source" field feels like debug data. In production nobody is going
> to be doing poison injection, and if the administrator injected it then
> it's implied they know that status. Otherwise a media-error is a
> media-error regardless of the source.
From CXL Spec Table 8-140 Sources can be:
Unknown.
External. Poison received from a source external to the device.
Internal. The device generated poison from an internal source.
Injected. The error was injected into the device for testing purposes.
Vendor Specific.
On the v5 review, Erwin commented:
>> This is how I would use source.
>> "external" = don't expect to see a cxl media error, look elsewhere like a UCNA or a mem_data error in the RP's CXL.CM RAS.
>> "internal" = expect to see a media error for more information.
>> "injected" = somebody injected the error, no service action needed except to maybe tighten up your security.
>> "vendor" = see vendor
If it's not presented here, user can look it up in the cxl_poison trace
event directly.
I think we should keep this as is.
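To make the triage guidance quoted above concrete, here is a minimal shell sketch that maps a media_errors "source" string to a hint. The function name and the hint wording are hypothetical, paraphrased from Erwin's comment; nothing like this exists in cxl-cli, and matching on the strings "External", "Internal", "Injected" and "Vendor" is an assumption based on the sample output in this thread.

```shell
# Hypothetical helper, not part of cxl-cli: translate the "source" field
# of a media_errors record into a rough triage hint.
poison_source_hint() (
    # Lower-case the input so "Injected" and "injected" both match.
    src=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$src" in
    external) echo "look outside the device, e.g. UCNA or root port RAS logs" ;;
    internal) echo "expect a CXL media error event with more detail" ;;
    injected) echo "test injection, no service action needed" ;;
    vendor*)  echo "consult vendor documentation" ;;
    *)        echo "unknown source" ;;
    esac
)

poison_source_hint Injected
# prints: test injection, no service action needed
```

Anything the helper does not recognize, including the spec's "Unknown" source, falls through to the default branch.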
>
> > }
> > ]
> > }
> > ]
> >
> > # cxl list -r region5 --media-errors
> > [
> > {
> > "region":"region5",
> > "resource":1035355553792,
> > "size":2147483648,
> > "type":"pmem",
> > "interleave_ways":2,
> > "interleave_granularity":4096,
> > "decode_state":"commit",
> > "media_errors":[
> > {
> > "memdev":"mem1",
> > "dpa":1073741824,
> > "dpa_length":64,
> > "hpa":1035355557888,
> > "source":"Injected"
> > },
> > {
> > "memdev":"mem1",
> > "dpa":1073745920,
> > "dpa_length":64,
> > "hpa":1035355566080,
> > "source":"Injected"
> > }
> > ]
> > }
> > ]
> >
> > Alison Schofield (7):
> > libcxl: add interfaces for GET_POISON_LIST mailbox commands
> > cxl: add an optional pid check to event parsing
> > cxl/event_trace: add a private context for private parsers
> > cxl/event_trace: add helpers get_field_[string|data]()
> > cxl/list: collect and parse media_error records
> > cxl/list: add --media-errors option to cxl list
> > cxl/test: add cxl-poison.sh unit test
> >
> > Documentation/cxl/cxl-list.txt | 71 +++++++++++
> > cxl/event_trace.c | 53 +++++++-
> > cxl/event_trace.h | 9 +-
> > cxl/filter.h | 3 +
> > cxl/json.c | 218 +++++++++++++++++++++++++++++++++
> > cxl/lib/libcxl.c | 47 +++++++
> > cxl/lib/libcxl.sym | 6 +
> > cxl/libcxl.h | 2 +
> > cxl/list.c | 2 +
> > test/cxl-poison.sh | 133 ++++++++++++++++++++
> > test/meson.build | 2 +
> > 11 files changed, 543 insertions(+), 3 deletions(-)
> > create mode 100644 test/cxl-poison.sh
> >
> >
> > base-commit: a871e6153b11fe63780b37cdcb1eb347b296095c
> > --
> > 2.37.3
> >
> >
>
>
* Re: [ndctl PATCH v6 0/7] Support poison list retrieval
2024-01-18 23:34 ` Alison Schofield
@ 2024-01-18 23:55 ` Dan Williams
2024-02-07 22:54 ` Alison Schofield
0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2024-01-18 23:55 UTC (permalink / raw)
To: Alison Schofield, Dan Williams; +Cc: Vishal Verma, nvdimm, linux-cxl
Alison Schofield wrote:
[..]
> > > "dpa":1073741824,
> > > "dpa_length":64,
> >
> > The dpa_length is also the hpa_length, right? So maybe just call the
> > field "length".
> >
>
> No, the length only refers to the device address space. I don't think
> the hpa is guaranteed to be contiguous, so only the starting hpa addr
> is offered.
>
> hmm..should we call it 'size' because that seems to imply less
> contiguous-ness than length?
The only way the length could be discontiguous in HPA space is if the
error length is greater than the interleave granularity. Given poison is
tracked in cachelines and the smallest granularity is 4 cachelines it is
unlikely to hit the multiple HPA case.
However, I think the kernel side should aim to preclude that from
happening. Given that this is relying on the kernel's translation I
would make it so that the kernel never leaves the impacted HPAs as
ambiguous. For example, if the interleave_granularity of the region is
256 and the DPA length is 512, it would be helpful if the *kernel* split
that into multiple trace events to communicate the multiple impacted
HPAs rather than leave it as an exercise to userspace.
> Which should it be: 'dpa_length', 'size', or 'length'?
I recall we used "length" for the number of badblocks in "ndctl list
--media-errors", might as well keep it consistent.
> > > "hpa":1035355557888,
> > > "source":"Injected"
> > > },
> > > {
> > > "region":"region5",
> > > "dpa":1073745920,
> > > "dpa_length":64,
> > > "hpa":1035355566080,
> > > "source":"Injected"
> >
> > This "source" field feels like debug data. In production nobody is going
> > to be doing poison injection, and if the administrator injected it then
> > it's implied they know that status. Otherwise a media-error is a
> > media-error regardless of the source.
>
> From CXL Spec Table 8-140 Sources can be:
>
> Unknown.
> External. Poison received from a source external to the device.
> Internal. The device generated poison from an internal source.
> Injected. The error was injected into the device for testing purposes.
> Vendor Specific.
>
> On the v5 review, Erwin commented:
> >> This is how I would use source.
> >> "external" = don't expect to see a cxl media error, look elsewhere like a UCNA or a mem_data error in the RP's CXL.CM RAS.
> >> "internal" = expect to see a media error for more information.
> >> "injected" = somebody injected the error, no service action needed except to maybe tighten up your security.
> >> "vendor" = see vendor
>
> If it's not presented here, user can look it up in the cxl_poison trace
> event directly.
>
> I think we should keep this as is.
Ah, I had forgotten Erwin's comment, yeah, showing "external" vs
"internal" looks useful, "injected" gets to come along for the ride, and
if any vendor actually ships that "vendor" status that's a good
indication to the end user to go shopping for a device that plays better
with open standards.
Might be useful to capture Erwin's analysis of how to use that field in
the man page, if it's not there already.
* Re: [ndctl PATCH v6 0/7] Support poison list retrieval
2024-01-18 23:55 ` Dan Williams
@ 2024-02-07 22:54 ` Alison Schofield
0 siblings, 0 replies; 16+ messages in thread
From: Alison Schofield @ 2024-02-07 22:54 UTC (permalink / raw)
To: Dan Williams; +Cc: Vishal Verma, nvdimm, linux-cxl
On Thu, Jan 18, 2024 at 03:55:00PM -0800, Dan Williams wrote:
> Alison Schofield wrote:
> [..]
> > > > "dpa":1073741824,
> > > > "dpa_length":64,
> > >
> > > The dpa_length is also the hpa_length, right? So maybe just call the
> > > field "length".
> > >
> >
> > No, the length only refers to the device address space. I don't think
> > the hpa is guaranteed to be contiguous, so only the starting hpa addr
> > is offered.
> >
> > hmm..should we call it 'size' because that seems to imply less
> > contiguous-ness than length?
>
> The only way the length could be discontiguous in HPA space is if the
> error length is greater than the interleave granularity. Given poison is
> tracked in cachelines and the smallest granularity is 4 cachelines it is
> unlikely to hit the multiple HPA case.
Hi Dan,
Circling back to this issue, as I'm posting an updated rev.
I'm not getting how *only* an error length greater than IG can lead to
discontiguous HPA. If the poison starts on the last 64 bytes of an IG and
has a length greater than 64 bytes, we go beyond the endpoint's mapping,
even if that length is less than IG.
In the layout below, if the device underlying endpoint2 reports
^poison^ as shown, it is discontiguous in HPA space.
HPA  0..........................................................N
ep1  ..........          ..........          ..........
ep2            ..........          ..........          ..........
bad                   ^poison^
good                  ^po          ison^
'bad' is what happens today if length is applied to HPA
'good' is what is right
Am I missing something wrt cachelines you mention?
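As a rough illustration of the layout above, the following shell sketch splits a DPA range into the HPA extents it touches. It assumes a plain modulo interleave (no XOR math) and takes offsets relative to the start of the region and of the endpoint's mapped DPA range. The function name and argument order are made up for this example; this is not what the driver or cxl-cli does today.

```shell
# Hypothetical sketch: print one "hpa_offset length" line per contiguous
# HPA extent covered by a DPA range on one endpoint.
#   $1 dpa offset  $2 length  $3 interleave ways
#   $4 interleave granularity (must be non-zero)  $5 endpoint position
split_dpa_range() (
    dpa=$1 len=$2 ways=$3 gran=$4 pos=$5
    while [ "$len" -gt 0 ]; do
        chunk=$((dpa / gran))    # which granule of this endpoint
        off=$((dpa % gran))      # offset inside that granule
        n=$((gran - off))        # bytes left before the granule ends
        if [ "$n" -gt "$len" ]; then
            n=$len
        fi
        # Granule 'chunk' of position 'pos' sits at this HPA offset.
        hpa=$(((chunk * ways + pos) * gran + off))
        printf '%d %d\n' "$hpa" "$n"
        dpa=$((dpa + n))
        len=$((len - n))
    done
)

split_dpa_range 0 512 2 256 0
# prints:
# 0 256
# 512 256

split_dpa_range 192 128 2 256 1
# prints:
# 448 64
# 768 64
```

The first call is the case of a 256-byte granularity with a 512-byte DPA length. The second is a 128-byte record that starts in the last 64 bytes of a granule, so it splits into two HPA extents even though it is shorter than the granularity.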
>
> However, I think the kernel side should aim to preclude that from
> happening. Given that this is relying on the kernel's translation I
> would make it so that the kernel never leaves the impacted HPAs as
> ambiguous. For example, if the interleave_granularity of the region is
> 256 and the DPA length is 512, it would be helpful if the *kernel* split
> that into multiple trace events to communicate the multiple impacted
> HPAs rather than leave it as an exercise to userspace.
>
That's a familiar plan that we rejected in the driver implementation.
As defined, a cxl_poison event reports a starting dpa, a dpa_length,
and the starting hpa if the address is mapped. That left userspace to do
the HPA translation work.
We can move that work to the driver independent of this ndctl work.
>
> Might be useful to capture Erwin's analysis of how to use that field in
> the man page, if it's not there already.
The man page now has the definitions of the source field and a spec
reference. I don't see the cxl list man page as the place to offer
media-error trouble-shooting tips.