From: Riana Tauro <riana.tauro@intel.com>
To: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <anshuman.gupta@intel.com>,
<badal.nilawar@intel.com>
Subject: Re: [PATCH v2 1/2] drm/xe/xe_survivability: Redesign survivability mode
Date: Tue, 2 Dec 2025 08:40:01 +0530
Message-ID: <5438fd55-1d65-49e2-be16-1623da7646da@intel.com>
In-Reply-To: <aS3U7x5mabFPF23e@intel.com>
On 12/1/2025 11:18 PM, Rodrigo Vivi wrote:
> On Mon, Dec 01, 2025 at 08:04:22PM +0530, Riana Tauro wrote:
>> Redesign survivability mode to have only one value per file.
>>
>> 1) Retain the survivability_mode sysfs to indicate the type
>>
>> cat /sys/bus/pci/devices/0000\:03\:00.0/survivability_mode
>> (Boot / Runtime)
>>
>> 2) Add survivability_info directory to expose boot breadcrumbs.
>> Entries in survivability mode sysfs are only visible when
>> boot breadcrumb registers are populated.
>>
>> /sys/bus/pci/devices/0000:03:00.0/survivability_info
>> ├── aux_info0
>> ├── aux_info1
>> ├── aux_info2
>> ├── aux_info3
>> ├── aux_info4
>> ├── capability_info
>> ├── postcode_trace
>> └── postcode_trace_overflow
>>
>> Capability_info:
>>
>> Provides data about boot status and has bits that
>> indicate the support for the other breadcrumbs
>>
>> Postcode Trace / Postcode Trace Overflow :
>>
>> Each postcode is represented as an 8-bit value and represents
>> a boot failure event. When a new failure event is logged by Pcode
>> the existing postcodes are shifted left. These entries provide a
>> history of 8 postcodes.
>>
>> Auxiliary Info:
>>
>> Some failures have additional debug information.
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> v2: Add array map for names (Rodrigo)
>> ---
>> drivers/gpu/drm/xe/xe_survivability_mode.c | 199 +++++++++++-------
>> .../gpu/drm/xe/xe_survivability_mode_types.h | 22 +-
>> 2 files changed, 135 insertions(+), 86 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> index 1662bfddd4bc..b5b582442637 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> @@ -19,8 +19,6 @@
>> #include "xe_pcode_api.h"
>> #include "xe_vsec.h"
>>
>> -#define MAX_SCRATCH_MMIO 8
>> -
>> /**
>> * DOC: Survivability Mode
>> *
>> @@ -48,19 +46,25 @@
>> *
>> * Refer :ref:`xe_configfs` for more details on how to use configfs
>> *
>> - * Survivability mode is indicated by the below admin-only readable sysfs which provides additional
>> - * debug information::
>> + * Survivability mode is indicated by the below admin-only readable sysfs entry::
>> *
>> * /sys/bus/pci/devices/<device>/survivability_mode
>> *
>> - * Capability Information:
>> - * Provides boot status
>> - * Postcode Information:
>> - * Provides information about the failure
>> - * Overflow Information
>> - * Provides history of previous failures
>> - * Auxiliary Information
>> - * Certain failures may have information in addition to postcode information
>> + * Survivability mode sysfs provides information about the type of survivability mode.
>
> Perhaps expand here that it is either 'Runtime' or 'Boot'...
Sure, will add this.
>
>> + * Any additional debug information if present will be visible under the directory
>> + * ``survivability_info``::
>> + *
>> + * /sys/bus/pci/devices/<device>/survivability_info/
>> + *
>> + * This directory has the following attributes
>> + *
>> + * - ``capability_info`` : Indicates Boot status and support for additional information
>> + *
>> + * - ``postcode_trace``, ``postcode_trace_overflow`` : Each postcode is a 8bit value and
>> + * represents a boot failure event. When a new failure event is logged by PCODE the
>> + * existing postcodes are shifted left. These entries provide a history of 8 postcodes.
>> + *
>> + * - ``aux_info<n>`` : Some failures have additional debug information
>> *
>> * Runtime Survivability
>> * =====================
>> @@ -79,49 +83,63 @@
>> * to restore device to normal operation.
>> */
>>
>> +static const char * const reg_map[] = {
>> + [CAPABILITY_INFO] = "Capability_info",
>
> Why do we need '_' between the words here instead of space like
> the entries below?
Sorry, I missed removing the '_'.
Thanks for catching this, will fix it.

Thank you,
Riana
>
>> + [POSTCODE_TRACE] = "Postcode trace",
>> + [POSTCODE_TRACE_OVERFLOW] = "Postcode trace overflow",
>> + [AUX_INFO0] = "Auxiliary Info 0",
>> + [AUX_INFO1] = "Auxiliary Info 1",
>> + [AUX_INFO2] = "Auxiliary Info 2",
>> + [AUX_INFO3] = "Auxiliary Info 3",
>> + [AUX_INFO4] = "Auxiliary Info 4",
>> +};
>> +
>> +struct xe_survivability_attribute {
>> + struct device_attribute attr;
>> + u8 index;
>> +};
>> +
>> +static struct
>> +xe_survivability_attribute *dev_attr_to_survivability_attr(struct device_attribute *attr)
>> +{
>> + return container_of(attr, struct xe_survivability_attribute, attr);
>> +}
>> +
>> static u32 aux_history_offset(u32 reg_value)
>> {
>> return REG_FIELD_GET(AUXINFO_HISTORY_OFFSET, reg_value);
>> }
>>
>> -static void set_survivability_info(struct xe_mmio *mmio, struct xe_survivability_info *info,
>> - int id, char *name)
>> +static void set_survivability_info(struct xe_mmio *mmio, u32 *info, int id)
>> {
>> - strscpy(info[id].name, name, sizeof(info[id].name));
>> - info[id].reg = PCODE_SCRATCH(id).raw;
>> - info[id].value = xe_mmio_read32(mmio, PCODE_SCRATCH(id));
>> + info[id] = xe_mmio_read32(mmio, PCODE_SCRATCH(id));
>> }
>>
>> static void populate_survivability_info(struct xe_device *xe)
>> {
>> struct xe_survivability *survivability = &xe->survivability;
>> - struct xe_survivability_info *info = survivability->info;
>> + u32 *info = survivability->info;
>> struct xe_mmio *mmio;
>> u32 id = 0, reg_value;
>> - char name[NAME_MAX];
>> int index;
>>
>> mmio = xe_root_tile_mmio(xe);
>> - set_survivability_info(mmio, info, id, "Capability Info");
>> - reg_value = info[id].value;
>> + set_survivability_info(mmio, info, CAPABILITY_INFO);
>> + reg_value = info[CAPABILITY_INFO];
>>
>> if (reg_value & HISTORY_TRACKING) {
>> - id++;
>> - set_survivability_info(mmio, info, id, "Postcode Info");
>> + set_survivability_info(mmio, info, POSTCODE_TRACE);
>>
>> - if (reg_value & OVERFLOW_SUPPORT) {
>> - id = REG_FIELD_GET(OVERFLOW_REG_OFFSET, reg_value);
>> - set_survivability_info(mmio, info, id, "Overflow Info");
>> - }
>> + if (reg_value & OVERFLOW_SUPPORT)
>> + set_survivability_info(mmio, info, POSTCODE_TRACE_OVERFLOW);
>> }
>>
>> if (reg_value & AUXINFO_SUPPORT) {
>> id = REG_FIELD_GET(AUXINFO_REG_OFFSET, reg_value);
>>
>> - for (index = 0; id && reg_value; index++, reg_value = info[id].value,
>> - id = aux_history_offset(reg_value)) {
>> - snprintf(name, NAME_MAX, "Auxiliary Info %d", index);
>> - set_survivability_info(mmio, info, id, name);
>> + for (index = 0; id >= AUX_INFO0 && id < MAX_SCRATCH_REG; index++) {
>> + set_survivability_info(mmio, info, id);
>> + id = aux_history_offset(info[id]);
>> }
>> }
>> }
>> @@ -130,15 +148,14 @@ static void log_survivability_info(struct pci_dev *pdev)
>> {
>> struct xe_device *xe = pdev_to_xe_device(pdev);
>> struct xe_survivability *survivability = &xe->survivability;
>> - struct xe_survivability_info *info = survivability->info;
>> + u32 *info = survivability->info;
>> int id;
>>
>> dev_info(&pdev->dev, "Survivability Boot Status : Critical Failure (%d)\n",
>> survivability->boot_status);
>> - for (id = 0; id < MAX_SCRATCH_MMIO; id++) {
>> - if (info[id].reg)
>> - dev_info(&pdev->dev, "%s: 0x%x - 0x%x\n", info[id].name,
>> - info[id].reg, info[id].value);
>> + for (id = 0; id < MAX_SCRATCH_REG; id++) {
>> + if (info[id])
>> + dev_info(&pdev->dev, "%s: 0x%x\n", reg_map[id], info[id]);
>> }
>> }
>>
>> @@ -156,25 +173,38 @@ static ssize_t survivability_mode_show(struct device *dev,
>> struct pci_dev *pdev = to_pci_dev(dev);
>> struct xe_device *xe = pdev_to_xe_device(pdev);
>> struct xe_survivability *survivability = &xe->survivability;
>> - struct xe_survivability_info *info = survivability->info;
>> - int index = 0, count = 0;
>>
>> - count += sysfs_emit_at(buff, count, "Survivability mode type: %s\n",
>> - survivability->type ? "Runtime" : "Boot");
>> + return sysfs_emit(buff, "%s\n", survivability->type ? "Runtime" : "Boot");
>> +}
>>
>> - if (!check_boot_failure(xe))
>> - return count;
>> +static DEVICE_ATTR_ADMIN_RO(survivability_mode);
>>
>> - for (index = 0; index < MAX_SCRATCH_MMIO; index++) {
>> - if (info[index].reg)
>> - count += sysfs_emit_at(buff, count, "%s: 0x%x - 0x%x\n", info[index].name,
>> - info[index].reg, info[index].value);
>> - }
>> +static ssize_t survivability_info_show(struct device *dev,
>> + struct device_attribute *attr, char *buff)
>> +{
>> + struct xe_survivability_attribute *sa = dev_attr_to_survivability_attr(attr);
>> + struct pci_dev *pdev = to_pci_dev(dev);
>> + struct xe_device *xe = pdev_to_xe_device(pdev);
>> + struct xe_survivability *survivability = &xe->survivability;
>> + u32 *info = survivability->info;
>>
>> - return count;
>> + return sysfs_emit(buff, "0x%x\n", info[sa->index]);
>> }
>>
>> -static DEVICE_ATTR_ADMIN_RO(survivability_mode);
>> +#define SURVIVABILITY_ATTR_RO(name, _index) \
>> + struct xe_survivability_attribute attr_##name = { \
>> + .attr = __ATTR(name, 0400, survivability_info_show, NULL), \
>> + .index = _index, \
>> + }
>> +
>> +SURVIVABILITY_ATTR_RO(capability_info, CAPABILITY_INFO);
>> +SURVIVABILITY_ATTR_RO(postcode_trace, POSTCODE_TRACE);
>> +SURVIVABILITY_ATTR_RO(postcode_trace_overflow, POSTCODE_TRACE_OVERFLOW);
>> +SURVIVABILITY_ATTR_RO(aux_info0, AUX_INFO0);
>> +SURVIVABILITY_ATTR_RO(aux_info1, AUX_INFO1);
>> +SURVIVABILITY_ATTR_RO(aux_info2, AUX_INFO2);
>> +SURVIVABILITY_ATTR_RO(aux_info3, AUX_INFO3);
>> +SURVIVABILITY_ATTR_RO(aux_info4, AUX_INFO4);
>>
>> static void xe_survivability_mode_fini(void *arg)
>> {
>> @@ -182,17 +212,48 @@ static void xe_survivability_mode_fini(void *arg)
>> struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>> struct device *dev = &pdev->dev;
>>
>> - sysfs_remove_file(&dev->kobj, &dev_attr_survivability_mode.attr);
>> + device_remove_file(dev, &dev_attr_survivability_mode);
>> +}
>> +
>> +static umode_t survivability_info_attrs_visible(struct kobject *kobj, struct attribute *attr,
>> + int idx)
>> +{
>> + struct xe_device *xe = kdev_to_xe_device(kobj_to_dev(kobj));
>> + struct xe_survivability *survivability = &xe->survivability;
>> + u32 *info = survivability->info;
>> +
>> + if (info[idx])
>> + return 0400;
>> +
>> + return 0;
>> }
>>
>> +/* Attributes are ordered according to enum scratch_reg */
>> +static struct attribute *survivability_info_attrs[] = {
>> + &attr_capability_info.attr.attr,
>> + &attr_postcode_trace.attr.attr,
>> + &attr_postcode_trace_overflow.attr.attr,
>> + &attr_aux_info0.attr.attr,
>> + &attr_aux_info1.attr.attr,
>> + &attr_aux_info2.attr.attr,
>> + &attr_aux_info3.attr.attr,
>> + &attr_aux_info4.attr.attr,
>> + NULL,
>> +};
>> +
>> +static const struct attribute_group survivability_info_group = {
>> + .name = "survivability_info",
>> + .attrs = survivability_info_attrs,
>> + .is_visible = survivability_info_attrs_visible,
>> +};
>> +
>> static int create_survivability_sysfs(struct pci_dev *pdev)
>> {
>> struct device *dev = &pdev->dev;
>> struct xe_device *xe = pdev_to_xe_device(pdev);
>> int ret;
>>
>> - /* create survivability mode sysfs */
>> - ret = sysfs_create_file(&dev->kobj, &dev_attr_survivability_mode.attr);
>> + ret = device_create_file(dev, &dev_attr_survivability_mode);
>> if (ret) {
>> dev_warn(dev, "Failed to create survivability sysfs files\n");
>> return ret;
>> @@ -203,6 +264,12 @@ static int create_survivability_sysfs(struct pci_dev *pdev)
>> if (ret)
>> return ret;
>>
>> + if (check_boot_failure(xe)) {
>> + ret = devm_device_add_group(dev, &survivability_info_group);
>> + if (ret)
>> + return ret;
>> + }
>> +
>> return 0;
>> }
>>
>> @@ -239,25 +306,6 @@ static int enable_boot_survivability_mode(struct pci_dev *pdev)
>> return ret;
>> }
>>
>> -static int init_survivability_mode(struct xe_device *xe)
>> -{
>> - struct xe_survivability *survivability = &xe->survivability;
>> - struct xe_survivability_info *info;
>> -
>> - survivability->size = MAX_SCRATCH_MMIO;
>> -
>> - info = devm_kcalloc(xe->drm.dev, survivability->size, sizeof(*info),
>> - GFP_KERNEL);
>> - if (!info)
>> - return -ENOMEM;
>> -
>> - survivability->info = info;
>> -
>> - populate_survivability_info(xe);
>> -
>> - return 0;
>> -}
>> -
>> /**
>> * xe_survivability_mode_is_boot_enabled- check if boot survivability mode is enabled
>> * @xe: xe device instance
>> @@ -325,9 +373,7 @@ int xe_survivability_mode_runtime_enable(struct xe_device *xe)
>> return -EINVAL;
>> }
>>
>> - ret = init_survivability_mode(xe);
>> - if (ret)
>> - return ret;
>> + populate_survivability_info(xe);
>>
>> ret = create_survivability_sysfs(pdev);
>> if (ret)
>> @@ -356,14 +402,11 @@ int xe_survivability_mode_boot_enable(struct xe_device *xe)
>> {
>> struct xe_survivability *survivability = &xe->survivability;
>> struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>> - int ret;
>>
>> if (!xe_survivability_mode_is_requested(xe))
>> return 0;
>>
>> - ret = init_survivability_mode(xe);
>> - if (ret)
>> - return ret;
>> + populate_survivability_info(xe);
>>
>> /* Log breadcrumbs but do not enter survivability mode for Critical boot errors */
>> if (survivability->boot_status == CRITICAL_FAILURE) {
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> index cd65a5d167c9..f31b3907d933 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> @@ -9,23 +9,29 @@
>> #include <linux/limits.h>
>> #include <linux/types.h>
>>
>> +enum scratch_reg {
>> + CAPABILITY_INFO,
>> + POSTCODE_TRACE,
>> + POSTCODE_TRACE_OVERFLOW,
>> + AUX_INFO0,
>> + AUX_INFO1,
>> + AUX_INFO2,
>> + AUX_INFO3,
>> + AUX_INFO4,
>> + MAX_SCRATCH_REG,
>> +};
>> +
>> enum xe_survivability_type {
>> XE_SURVIVABILITY_TYPE_BOOT,
>> XE_SURVIVABILITY_TYPE_RUNTIME,
>> };
>>
>> -struct xe_survivability_info {
>> - char name[NAME_MAX];
>> - u32 reg;
>> - u32 value;
>> -};
>> -
>> /**
>> * struct xe_survivability: Contains survivability mode information
>> */
>> struct xe_survivability {
>> - /** @info: struct that holds survivability info from scratch registers */
>> - struct xe_survivability_info *info;
>> + /** @info: survivability debug info */
>> + u32 info[MAX_SCRATCH_REG];
>>
>> /** @size: number of scratch registers */
>> u32 size;
>> --
>> 2.47.1
>>
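
The commit message describes postcode_trace and postcode_trace_overflow as
together holding a history of 8 postcodes of 8 bits each, shifted left as new
events are logged. A rough userspace sketch of splitting the two values into
individual postcodes follows. It assumes the newest postcode sits in the low
byte of postcode_trace and that older entries shift left into
postcode_trace_overflow; that byte order is an interpretation of "shifted
left" and is not stated in the patch. decode_postcodes() and the two macros
are hypothetical names used only for illustration.

```c
#include <stdint.h>

#define POSTCODES_PER_REG 4	/* four 8-bit postcodes fit in one 32-bit value */
#define POSTCODE_HISTORY  (2 * POSTCODES_PER_REG)	/* trace + overflow */

/*
 * Hypothetical helper: split the two 32-bit values into single postcodes.
 *
 * Assumption (not confirmed by the patch): out[0] is the newest postcode,
 * taken from the low byte of @trace, and out[7] is the oldest, taken from
 * the high byte of @overflow.
 */
static void decode_postcodes(uint32_t trace, uint32_t overflow,
			     uint8_t out[POSTCODE_HISTORY])
{
	/* Treat the pair as one 64-bit shift register, overflow on top. */
	uint64_t history = ((uint64_t)overflow << 32) | trace;
	int i;

	for (i = 0; i < POSTCODE_HISTORY; i++)
		out[i] = (history >> (8 * i)) & 0xff;
}
```

A caller would parse the hex strings read from the two sysfs files (for
example with strtoul() and base 16) and pass the resulting values in. If the
hardware orders the bytes the other way round, only the indexing in the loop
needs to change.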