From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Riana Tauro <riana.tauro@intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <anshuman.gupta@intel.com>,
<badal.nilawar@intel.com>
Subject: Re: [PATCH v2 1/2] drm/xe/xe_survivability: Redesign survivability mode
Date: Mon, 1 Dec 2025 12:48:31 -0500 [thread overview]
Message-ID: <aS3U7x5mabFPF23e@intel.com> (raw)
In-Reply-To: <20251201143420.3158372-5-riana.tauro@intel.com>
On Mon, Dec 01, 2025 at 08:04:22PM +0530, Riana Tauro wrote:
> Redesign survivability mode to have only one value per file.
>
> 1) Retain the survivability_mode sysfs to indicate the type
>
> cat /sys/bus/pci/devices/0000\:03\:00.0/survivability_mode
> (Boot / Runtime)
>
> 2) Add survivability_info directory to expose boot breadcrumbs.
> Entries in survivability mode sysfs are only visible when
> boot breadcrumb registers are populated.
>
> /sys/bus/pci/devices/0000:03:00.0/survivability_info
> ├── aux_info0
> ├── aux_info1
> ├── aux_info2
> ├── aux_info3
> ├── aux_info4
> ├── capability_info
> ├── postcode_trace
> └── postcode_trace_overflow
>
> Capability_info:
>
> Provides data about boot status and has bits that
> indicate the support for the other breadcrumbs
>
> Postcode Trace / Postcode Trace Overflow :
>
> Each postcode is represented as an 8-bit value and represents
> a boot failure event. When a new failure event is logged by Pcode
> the existing postcodes are shifted left. These entries provide a
> history of 8 postcodes.
>
> Auxiliary Info:
>
> Some failures have additional debug information.
>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> v2: Add array map for names (Rodrigo)
> ---
> drivers/gpu/drm/xe/xe_survivability_mode.c | 199 +++++++++++-------
> .../gpu/drm/xe/xe_survivability_mode_types.h | 22 +-
> 2 files changed, 135 insertions(+), 86 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
> index 1662bfddd4bc..b5b582442637 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
> @@ -19,8 +19,6 @@
> #include "xe_pcode_api.h"
> #include "xe_vsec.h"
>
> -#define MAX_SCRATCH_MMIO 8
> -
> /**
> * DOC: Survivability Mode
> *
> @@ -48,19 +46,25 @@
> *
> * Refer :ref:`xe_configfs` for more details on how to use configfs
> *
> - * Survivability mode is indicated by the below admin-only readable sysfs which provides additional
> - * debug information::
> + * Survivability mode is indicated by the below admin-only readable sysfs entry::
> *
> * /sys/bus/pci/devices/<device>/survivability_mode
> *
> - * Capability Information:
> - * Provides boot status
> - * Postcode Information:
> - * Provides information about the failure
> - * Overflow Information
> - * Provides history of previous failures
> - * Auxiliary Information
> - * Certain failures may have information in addition to postcode information
> + * Survivability mode sysfs provides information about the type of survivability mode.
Perhaps expand here that it is ether 'Runtime' or 'Boot'...
> + * Any additional debug information if present will be visible under the directory
> + * ``survivability_info``::
> + *
> + * /sys/bus/pci/devices/<device>/survivability_info/
> + *
> + * This directory has the following attributes
> + *
> + * - ``capability_info`` : Indicates Boot status and support for additional information
> + *
> + * - ``postcode_trace``, ``postcode_trace_overflow`` : Each postcode is a 8bit value and
> + * represents a boot failure event. When a new failure event is logged by PCODE the
> + * existing postcodes are shifted left. These entries provide a history of 8 postcodes.
> + *
> + * - ``aux_info<n>`` : Some failures have additional debug information
> *
> * Runtime Survivability
> * =====================
> @@ -79,49 +83,63 @@
> * to restore device to normal operation.
> */
>
> +static const char * const reg_map[] = {
> + [CAPABILITY_INFO] = "Capability_info",
Why do we need '_' between the words here instead of space like
the entries below?
> + [POSTCODE_TRACE] = "Postcode trace",
> + [POSTCODE_TRACE_OVERFLOW] = "Postcode trace overflow",
> + [AUX_INFO0] = "Auxiliary Info 0",
> + [AUX_INFO1] = "Auxiliary Info 1",
> + [AUX_INFO2] = "Auxiliary Info 2",
> + [AUX_INFO3] = "Auxiliary Info 3",
> + [AUX_INFO4] = "Auxiliary Info 4",
> +};
> +
> +struct xe_survivability_attribute {
> + struct device_attribute attr;
> + u8 index;
> +};
> +
> +static struct
> +xe_survivability_attribute *dev_attr_to_survivability_attr(struct device_attribute *attr)
> +{
> + return container_of(attr, struct xe_survivability_attribute, attr);
> +}
> +
> static u32 aux_history_offset(u32 reg_value)
> {
> return REG_FIELD_GET(AUXINFO_HISTORY_OFFSET, reg_value);
> }
>
> -static void set_survivability_info(struct xe_mmio *mmio, struct xe_survivability_info *info,
> - int id, char *name)
> +static void set_survivability_info(struct xe_mmio *mmio, u32 *info, int id)
> {
> - strscpy(info[id].name, name, sizeof(info[id].name));
> - info[id].reg = PCODE_SCRATCH(id).raw;
> - info[id].value = xe_mmio_read32(mmio, PCODE_SCRATCH(id));
> + info[id] = xe_mmio_read32(mmio, PCODE_SCRATCH(id));
> }
>
> static void populate_survivability_info(struct xe_device *xe)
> {
> struct xe_survivability *survivability = &xe->survivability;
> - struct xe_survivability_info *info = survivability->info;
> + u32 *info = survivability->info;
> struct xe_mmio *mmio;
> u32 id = 0, reg_value;
> - char name[NAME_MAX];
> int index;
>
> mmio = xe_root_tile_mmio(xe);
> - set_survivability_info(mmio, info, id, "Capability Info");
> - reg_value = info[id].value;
> + set_survivability_info(mmio, info, CAPABILITY_INFO);
> + reg_value = info[CAPABILITY_INFO];
>
> if (reg_value & HISTORY_TRACKING) {
> - id++;
> - set_survivability_info(mmio, info, id, "Postcode Info");
> + set_survivability_info(mmio, info, POSTCODE_TRACE);
>
> - if (reg_value & OVERFLOW_SUPPORT) {
> - id = REG_FIELD_GET(OVERFLOW_REG_OFFSET, reg_value);
> - set_survivability_info(mmio, info, id, "Overflow Info");
> - }
> + if (reg_value & OVERFLOW_SUPPORT)
> + set_survivability_info(mmio, info, POSTCODE_TRACE_OVERFLOW);
> }
>
> if (reg_value & AUXINFO_SUPPORT) {
> id = REG_FIELD_GET(AUXINFO_REG_OFFSET, reg_value);
>
> - for (index = 0; id && reg_value; index++, reg_value = info[id].value,
> - id = aux_history_offset(reg_value)) {
> - snprintf(name, NAME_MAX, "Auxiliary Info %d", index);
> - set_survivability_info(mmio, info, id, name);
> + for (index = 0; id >= AUX_INFO0 && id < MAX_SCRATCH_REG; index++) {
> + set_survivability_info(mmio, info, id);
> + id = aux_history_offset(info[id]);
> }
> }
> }
> @@ -130,15 +148,14 @@ static void log_survivability_info(struct pci_dev *pdev)
> {
> struct xe_device *xe = pdev_to_xe_device(pdev);
> struct xe_survivability *survivability = &xe->survivability;
> - struct xe_survivability_info *info = survivability->info;
> + u32 *info = survivability->info;
> int id;
>
> dev_info(&pdev->dev, "Survivability Boot Status : Critical Failure (%d)\n",
> survivability->boot_status);
> - for (id = 0; id < MAX_SCRATCH_MMIO; id++) {
> - if (info[id].reg)
> - dev_info(&pdev->dev, "%s: 0x%x - 0x%x\n", info[id].name,
> - info[id].reg, info[id].value);
> + for (id = 0; id < MAX_SCRATCH_REG; id++) {
> + if (info[id])
> + dev_info(&pdev->dev, "%s: 0x%x\n", reg_map[id], info[id]);
> }
> }
>
> @@ -156,25 +173,38 @@ static ssize_t survivability_mode_show(struct device *dev,
> struct pci_dev *pdev = to_pci_dev(dev);
> struct xe_device *xe = pdev_to_xe_device(pdev);
> struct xe_survivability *survivability = &xe->survivability;
> - struct xe_survivability_info *info = survivability->info;
> - int index = 0, count = 0;
>
> - count += sysfs_emit_at(buff, count, "Survivability mode type: %s\n",
> - survivability->type ? "Runtime" : "Boot");
> + return sysfs_emit(buff, "%s\n", survivability->type ? "Runtime" : "Boot");
> +}
>
> - if (!check_boot_failure(xe))
> - return count;
> +static DEVICE_ATTR_ADMIN_RO(survivability_mode);
>
> - for (index = 0; index < MAX_SCRATCH_MMIO; index++) {
> - if (info[index].reg)
> - count += sysfs_emit_at(buff, count, "%s: 0x%x - 0x%x\n", info[index].name,
> - info[index].reg, info[index].value);
> - }
> +static ssize_t survivability_info_show(struct device *dev,
> + struct device_attribute *attr, char *buff)
> +{
> + struct xe_survivability_attribute *sa = dev_attr_to_survivability_attr(attr);
> + struct pci_dev *pdev = to_pci_dev(dev);
> + struct xe_device *xe = pdev_to_xe_device(pdev);
> + struct xe_survivability *survivability = &xe->survivability;
> + u32 *info = survivability->info;
>
> - return count;
> + return sysfs_emit(buff, "0x%x\n", info[sa->index]);
> }
>
> -static DEVICE_ATTR_ADMIN_RO(survivability_mode);
> +#define SURVIVABILITY_ATTR_RO(name, _index) \
> + struct xe_survivability_attribute attr_##name = { \
> + .attr = __ATTR(name, 0400, survivability_info_show, NULL), \
> + .index = _index, \
> + }
> +
> +SURVIVABILITY_ATTR_RO(capability_info, CAPABILITY_INFO);
> +SURVIVABILITY_ATTR_RO(postcode_trace, POSTCODE_TRACE);
> +SURVIVABILITY_ATTR_RO(postcode_trace_overflow, POSTCODE_TRACE_OVERFLOW);
> +SURVIVABILITY_ATTR_RO(aux_info0, AUX_INFO0);
> +SURVIVABILITY_ATTR_RO(aux_info1, AUX_INFO1);
> +SURVIVABILITY_ATTR_RO(aux_info2, AUX_INFO2);
> +SURVIVABILITY_ATTR_RO(aux_info3, AUX_INFO3);
> +SURVIVABILITY_ATTR_RO(aux_info4, AUX_INFO4);
>
> static void xe_survivability_mode_fini(void *arg)
> {
> @@ -182,17 +212,48 @@ static void xe_survivability_mode_fini(void *arg)
> struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> struct device *dev = &pdev->dev;
>
> - sysfs_remove_file(&dev->kobj, &dev_attr_survivability_mode.attr);
> + device_remove_file(dev, &dev_attr_survivability_mode);
> +}
> +
> +static umode_t survivability_info_attrs_visible(struct kobject *kobj, struct attribute *attr,
> + int idx)
> +{
> + struct xe_device *xe = kdev_to_xe_device(kobj_to_dev(kobj));
> + struct xe_survivability *survivability = &xe->survivability;
> + u32 *info = survivability->info;
> +
> + if (info[idx])
> + return 0400;
> +
> + return 0;
> }
>
> +/* Attributes are ordered according to enum scratch_reg */
> +static struct attribute *survivability_info_attrs[] = {
> + &attr_capability_info.attr.attr,
> + &attr_postcode_trace.attr.attr,
> + &attr_postcode_trace_overflow.attr.attr,
> + &attr_aux_info0.attr.attr,
> + &attr_aux_info1.attr.attr,
> + &attr_aux_info2.attr.attr,
> + &attr_aux_info3.attr.attr,
> + &attr_aux_info4.attr.attr,
> + NULL,
> +};
> +
> +static const struct attribute_group survivability_info_group = {
> + .name = "survivability_info",
> + .attrs = survivability_info_attrs,
> + .is_visible = survivability_info_attrs_visible,
> +};
> +
> static int create_survivability_sysfs(struct pci_dev *pdev)
> {
> struct device *dev = &pdev->dev;
> struct xe_device *xe = pdev_to_xe_device(pdev);
> int ret;
>
> - /* create survivability mode sysfs */
> - ret = sysfs_create_file(&dev->kobj, &dev_attr_survivability_mode.attr);
> + ret = device_create_file(dev, &dev_attr_survivability_mode);
> if (ret) {
> dev_warn(dev, "Failed to create survivability sysfs files\n");
> return ret;
> @@ -203,6 +264,12 @@ static int create_survivability_sysfs(struct pci_dev *pdev)
> if (ret)
> return ret;
>
> + if (check_boot_failure(xe)) {
> + ret = devm_device_add_group(dev, &survivability_info_group);
> + if (ret)
> + return ret;
> + }
> +
> return 0;
> }
>
> @@ -239,25 +306,6 @@ static int enable_boot_survivability_mode(struct pci_dev *pdev)
> return ret;
> }
>
> -static int init_survivability_mode(struct xe_device *xe)
> -{
> - struct xe_survivability *survivability = &xe->survivability;
> - struct xe_survivability_info *info;
> -
> - survivability->size = MAX_SCRATCH_MMIO;
> -
> - info = devm_kcalloc(xe->drm.dev, survivability->size, sizeof(*info),
> - GFP_KERNEL);
> - if (!info)
> - return -ENOMEM;
> -
> - survivability->info = info;
> -
> - populate_survivability_info(xe);
> -
> - return 0;
> -}
> -
> /**
> * xe_survivability_mode_is_boot_enabled- check if boot survivability mode is enabled
> * @xe: xe device instance
> @@ -325,9 +373,7 @@ int xe_survivability_mode_runtime_enable(struct xe_device *xe)
> return -EINVAL;
> }
>
> - ret = init_survivability_mode(xe);
> - if (ret)
> - return ret;
> + populate_survivability_info(xe);
>
> ret = create_survivability_sysfs(pdev);
> if (ret)
> @@ -356,14 +402,11 @@ int xe_survivability_mode_boot_enable(struct xe_device *xe)
> {
> struct xe_survivability *survivability = &xe->survivability;
> struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> - int ret;
>
> if (!xe_survivability_mode_is_requested(xe))
> return 0;
>
> - ret = init_survivability_mode(xe);
> - if (ret)
> - return ret;
> + populate_survivability_info(xe);
>
> /* Log breadcrumbs but do not enter survivability mode for Critical boot errors */
> if (survivability->boot_status == CRITICAL_FAILURE) {
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> index cd65a5d167c9..f31b3907d933 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> @@ -9,23 +9,29 @@
> #include <linux/limits.h>
> #include <linux/types.h>
>
> +enum scratch_reg {
> + CAPABILITY_INFO,
> + POSTCODE_TRACE,
> + POSTCODE_TRACE_OVERFLOW,
> + AUX_INFO0,
> + AUX_INFO1,
> + AUX_INFO2,
> + AUX_INFO3,
> + AUX_INFO4,
> + MAX_SCRATCH_REG,
> +};
> +
> enum xe_survivability_type {
> XE_SURVIVABILITY_TYPE_BOOT,
> XE_SURVIVABILITY_TYPE_RUNTIME,
> };
>
> -struct xe_survivability_info {
> - char name[NAME_MAX];
> - u32 reg;
> - u32 value;
> -};
> -
> /**
> * struct xe_survivability: Contains survivability mode information
> */
> struct xe_survivability {
> - /** @info: struct that holds survivability info from scratch registers */
> - struct xe_survivability_info *info;
> + /** @info: survivability debug info */
> + u32 info[MAX_SCRATCH_REG];
>
> /** @size: number of scratch registers */
> u32 size;
> --
> 2.47.1
>
next prev parent reply other threads:[~2025-12-01 17:48 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-01 14:34 [PATCH v2 0/2] Redesign survivability mode Riana Tauro
2025-12-01 14:34 ` [PATCH v2 1/2] drm/xe/xe_survivability: " Riana Tauro
2025-12-01 17:48 ` Rodrigo Vivi [this message]
2025-12-02 3:10 ` Riana Tauro
2025-12-01 14:34 ` [PATCH v2 2/2] drm/xe/xe_survivability: Add support for survivability mode v2 Riana Tauro
2025-12-01 17:43 ` Rodrigo Vivi
2025-12-02 3:07 ` Riana Tauro
2025-12-02 20:31 ` Rodrigo Vivi
2025-12-01 17:05 ` ✓ CI.KUnit: success for Redesign survivability mode (rev2) Patchwork
2025-12-01 17:45 ` ✓ Xe.CI.BAT: " Patchwork
2025-12-01 20:12 ` ✗ Xe.CI.Full: failure " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aS3U7x5mabFPF23e@intel.com \
--to=rodrigo.vivi@intel.com \
--cc=anshuman.gupta@intel.com \
--cc=badal.nilawar@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=riana.tauro@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox