Intel-XE Archive on lore.kernel.org
From: Riana Tauro <riana.tauro@intel.com>
To: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <anshuman.gupta@intel.com>,
	<badal.nilawar@intel.com>
Subject: Re: [PATCH v2 1/2] drm/xe/xe_survivability: Redesign survivability mode
Date: Tue, 2 Dec 2025 08:40:01 +0530
Message-ID: <5438fd55-1d65-49e2-be16-1623da7646da@intel.com>
In-Reply-To: <aS3U7x5mabFPF23e@intel.com>



On 12/1/2025 11:18 PM, Rodrigo Vivi wrote:
> On Mon, Dec 01, 2025 at 08:04:22PM +0530, Riana Tauro wrote:
>> Redesign survivability mode to have only one value per file.
>>
>> 1) Retain the survivability_mode sysfs to indicate the type
>>
>> 	cat /sys/bus/pci/devices/0000\:03\:00.0/survivability_mode
>> 	(Boot / Runtime)
>>
>> 2) Add survivability_info directory to expose boot breadcrumbs.
>> Entries in survivability mode sysfs are only visible when
>> boot breadcrumb registers are populated.
>>
>> 	/sys/bus/pci/devices/0000:03:00.0/survivability_info
>> 	├── aux_info0
>> 	├── aux_info1
>> 	├── aux_info2
>> 	├── aux_info3
>> 	├── aux_info4
>> 	├── capability_info
>> 	├── postcode_trace
>> 	└── postcode_trace_overflow
>>
>> Capability_info:
>>
>> 	Provides data about boot status and has bits that
>> 	indicate the support for the other breadcrumbs
>>
>> Postcode Trace / Postcode Trace Overflow :
>>
>> 	Each postcode is represented as an 8-bit value and represents
>> 	a boot failure event. When a new failure event is logged by Pcode
>> 	the existing postcodes are shifted left. These entries provide a
>> 	history of 8 postcodes.
>>
>> Auxiliary Info:
>>
>> 	Some failures have additional debug information.
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> v2: Add array map for names (Rodrigo)
>> ---
>>   drivers/gpu/drm/xe/xe_survivability_mode.c    | 199 +++++++++++-------
>>   .../gpu/drm/xe/xe_survivability_mode_types.h  |  22 +-
>>   2 files changed, 135 insertions(+), 86 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> index 1662bfddd4bc..b5b582442637 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> @@ -19,8 +19,6 @@
>>   #include "xe_pcode_api.h"
>>   #include "xe_vsec.h"
>>   
>> -#define MAX_SCRATCH_MMIO 8
>> -
>>   /**
>>    * DOC: Survivability Mode
>>    *
>> @@ -48,19 +46,25 @@
>>    *
>>    * Refer :ref:`xe_configfs` for more details on how to use configfs
>>    *
>> - * Survivability mode is indicated by the below admin-only readable sysfs which provides additional
>> - * debug information::
>> + * Survivability mode is indicated by the below admin-only readable sysfs entry::
>>    *
>>    *	/sys/bus/pci/devices/<device>/survivability_mode
>>    *
>> - * Capability Information:
>> - *	Provides boot status
>> - * Postcode Information:
>> - *	Provides information about the failure
>> - * Overflow Information
>> - *	Provides history of previous failures
>> - * Auxiliary Information
>> - *	Certain failures may have information in addition to postcode information
>> + * Survivability mode sysfs provides information about the type of survivability mode.
> 
> Perhaps expand here that it is either 'Runtime' or 'Boot'...

Sure, will add this.

> 
>> + * Any additional debug information if present will be visible under the directory
>> + * ``survivability_info``::
>> + *
>> + *     /sys/bus/pci/devices/<device>/survivability_info/
>> + *
>> + * This directory has the following attributes
>> + *
>> + * - ``capability_info`` : Indicates Boot status and support for additional information
>> + *
>> + * - ``postcode_trace``, ``postcode_trace_overflow`` : Each postcode is an 8-bit value and
>> + *   represents a boot failure event. When a new failure event is logged by PCODE the
>> + *   existing postcodes are shifted left. These entries provide a history of 8 postcodes.
>> + *
>> + * - ``aux_info<n>`` : Some failures have additional debug information
>>    *
>>    * Runtime Survivability
>>    * =====================
>> @@ -79,49 +83,63 @@
>>    * to restore device to normal operation.
>>    */
>>   
>> +static const char * const reg_map[] = {
>> +	[CAPABILITY_INFO]         = "Capability_info",
> 
> Why do we need '_' between the words here instead of space like
> the entries below?

Sorry, I missed removing the '_'.
Thanks for catching this. Will fix it.

Thank you
Riana

> 
>> +	[POSTCODE_TRACE]          = "Postcode trace",
>> +	[POSTCODE_TRACE_OVERFLOW] = "Postcode trace overflow",
>> +	[AUX_INFO0]               = "Auxiliary Info 0",
>> +	[AUX_INFO1]               = "Auxiliary Info 1",
>> +	[AUX_INFO2]               = "Auxiliary Info 2",
>> +	[AUX_INFO3]               = "Auxiliary Info 3",
>> +	[AUX_INFO4]               = "Auxiliary Info 4",
>> +};
>> +
>> +struct xe_survivability_attribute {
>> +	struct device_attribute attr;
>> +	u8 index;
>> +};
>> +
>> +static struct
>> +xe_survivability_attribute *dev_attr_to_survivability_attr(struct device_attribute *attr)
>> +{
>> +	return container_of(attr, struct xe_survivability_attribute, attr);
>> +}
>> +
>>   static u32 aux_history_offset(u32 reg_value)
>>   {
>>   	return REG_FIELD_GET(AUXINFO_HISTORY_OFFSET, reg_value);
>>   }
>>   
>> -static void set_survivability_info(struct xe_mmio *mmio, struct xe_survivability_info *info,
>> -				   int id, char *name)
>> +static void set_survivability_info(struct xe_mmio *mmio, u32  *info, int id)
>>   {
>> -	strscpy(info[id].name, name, sizeof(info[id].name));
>> -	info[id].reg = PCODE_SCRATCH(id).raw;
>> -	info[id].value = xe_mmio_read32(mmio, PCODE_SCRATCH(id));
>> +	info[id] = xe_mmio_read32(mmio, PCODE_SCRATCH(id));
>>   }
>>   
>>   static void populate_survivability_info(struct xe_device *xe)
>>   {
>>   	struct xe_survivability *survivability = &xe->survivability;
>> -	struct xe_survivability_info *info = survivability->info;
>> +	u32 *info = survivability->info;
>>   	struct xe_mmio *mmio;
>>   	u32 id = 0, reg_value;
>> -	char name[NAME_MAX];
>>   	int index;
>>   
>>   	mmio = xe_root_tile_mmio(xe);
>> -	set_survivability_info(mmio, info, id, "Capability Info");
>> -	reg_value = info[id].value;
>> +	set_survivability_info(mmio, info, CAPABILITY_INFO);
>> +	reg_value = info[CAPABILITY_INFO];
>>   
>>   	if (reg_value & HISTORY_TRACKING) {
>> -		id++;
>> -		set_survivability_info(mmio, info, id, "Postcode Info");
>> +		set_survivability_info(mmio, info, POSTCODE_TRACE);
>>   
>> -		if (reg_value & OVERFLOW_SUPPORT) {
>> -			id = REG_FIELD_GET(OVERFLOW_REG_OFFSET, reg_value);
>> -			set_survivability_info(mmio, info, id, "Overflow Info");
>> -		}
>> +		if (reg_value & OVERFLOW_SUPPORT)
>> +			set_survivability_info(mmio, info, POSTCODE_TRACE_OVERFLOW);
>>   	}
>>   
>>   	if (reg_value & AUXINFO_SUPPORT) {
>>   		id = REG_FIELD_GET(AUXINFO_REG_OFFSET, reg_value);
>>   
>> -		for (index = 0; id && reg_value; index++, reg_value = info[id].value,
>> -		     id = aux_history_offset(reg_value)) {
>> -			snprintf(name, NAME_MAX, "Auxiliary Info %d", index);
>> -			set_survivability_info(mmio, info, id, name);
>> +		for (index = 0; id >= AUX_INFO0 && id < MAX_SCRATCH_REG; index++) {
>> +			set_survivability_info(mmio, info, id);
>> +			id = aux_history_offset(info[id]);
>>   		}
>>   	}
>>   }
>> @@ -130,15 +148,14 @@ static void log_survivability_info(struct pci_dev *pdev)
>>   {
>>   	struct xe_device *xe = pdev_to_xe_device(pdev);
>>   	struct xe_survivability *survivability = &xe->survivability;
>> -	struct xe_survivability_info *info = survivability->info;
>> +	u32 *info = survivability->info;
>>   	int id;
>>   
>>   	dev_info(&pdev->dev, "Survivability Boot Status : Critical Failure (%d)\n",
>>   		 survivability->boot_status);
>> -	for (id = 0; id < MAX_SCRATCH_MMIO; id++) {
>> -		if (info[id].reg)
>> -			dev_info(&pdev->dev, "%s: 0x%x - 0x%x\n", info[id].name,
>> -				 info[id].reg, info[id].value);
>> +	for (id = 0; id < MAX_SCRATCH_REG; id++) {
>> +		if (info[id])
>> +			dev_info(&pdev->dev, "%s: 0x%x\n", reg_map[id], info[id]);
>>   	}
>>   }
>>   
>> @@ -156,25 +173,38 @@ static ssize_t survivability_mode_show(struct device *dev,
>>   	struct pci_dev *pdev = to_pci_dev(dev);
>>   	struct xe_device *xe = pdev_to_xe_device(pdev);
>>   	struct xe_survivability *survivability = &xe->survivability;
>> -	struct xe_survivability_info *info = survivability->info;
>> -	int index = 0, count = 0;
>>   
>> -	count += sysfs_emit_at(buff, count, "Survivability mode type: %s\n",
>> -			       survivability->type ? "Runtime" : "Boot");
>> +	return sysfs_emit(buff, "%s\n", survivability->type ? "Runtime" : "Boot");
>> +}
>>   
>> -	if (!check_boot_failure(xe))
>> -		return count;
>> +static DEVICE_ATTR_ADMIN_RO(survivability_mode);
>>   
>> -	for (index = 0; index < MAX_SCRATCH_MMIO; index++) {
>> -		if (info[index].reg)
>> -			count += sysfs_emit_at(buff, count, "%s: 0x%x - 0x%x\n", info[index].name,
>> -					       info[index].reg, info[index].value);
>> -	}
>> +static ssize_t survivability_info_show(struct device *dev,
>> +				       struct device_attribute *attr, char *buff)
>> +{
>> +	struct xe_survivability_attribute *sa = dev_attr_to_survivability_attr(attr);
>> +	struct pci_dev *pdev = to_pci_dev(dev);
>> +	struct xe_device *xe = pdev_to_xe_device(pdev);
>> +	struct xe_survivability *survivability = &xe->survivability;
>> +	u32 *info = survivability->info;
>>   
>> -	return count;
>> +	return sysfs_emit(buff, "0x%x\n", info[sa->index]);
>>   }
>>   
>> -static DEVICE_ATTR_ADMIN_RO(survivability_mode);
>> +#define SURVIVABILITY_ATTR_RO(name, _index)					\
>> +	struct xe_survivability_attribute attr_##name =	{			\
>> +		.attr =  __ATTR(name, 0400, survivability_info_show, NULL),	\
>> +		.index = _index,						\
>> +	}
>> +
>> +SURVIVABILITY_ATTR_RO(capability_info, CAPABILITY_INFO);
>> +SURVIVABILITY_ATTR_RO(postcode_trace, POSTCODE_TRACE);
>> +SURVIVABILITY_ATTR_RO(postcode_trace_overflow, POSTCODE_TRACE_OVERFLOW);
>> +SURVIVABILITY_ATTR_RO(aux_info0, AUX_INFO0);
>> +SURVIVABILITY_ATTR_RO(aux_info1, AUX_INFO1);
>> +SURVIVABILITY_ATTR_RO(aux_info2, AUX_INFO2);
>> +SURVIVABILITY_ATTR_RO(aux_info3, AUX_INFO3);
>> +SURVIVABILITY_ATTR_RO(aux_info4, AUX_INFO4);
>>   
>>   static void xe_survivability_mode_fini(void *arg)
>>   {
>> @@ -182,17 +212,48 @@ static void xe_survivability_mode_fini(void *arg)
>>   	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>>   	struct device *dev = &pdev->dev;
>>   
>> -	sysfs_remove_file(&dev->kobj, &dev_attr_survivability_mode.attr);
>> +	device_remove_file(dev, &dev_attr_survivability_mode);
>> +}
>> +
>> +static umode_t survivability_info_attrs_visible(struct kobject *kobj, struct attribute *attr,
>> +						int idx)
>> +{
>> +	struct xe_device *xe = kdev_to_xe_device(kobj_to_dev(kobj));
>> +	struct xe_survivability *survivability = &xe->survivability;
>> +	u32 *info = survivability->info;
>> +
>> +	if (info[idx])
>> +		return 0400;
>> +
>> +	return 0;
>>   }
>>   
>> +/* Attributes are ordered according to enum scratch_reg */
>> +static struct attribute *survivability_info_attrs[] = {
>> +	&attr_capability_info.attr.attr,
>> +	&attr_postcode_trace.attr.attr,
>> +	&attr_postcode_trace_overflow.attr.attr,
>> +	&attr_aux_info0.attr.attr,
>> +	&attr_aux_info1.attr.attr,
>> +	&attr_aux_info2.attr.attr,
>> +	&attr_aux_info3.attr.attr,
>> +	&attr_aux_info4.attr.attr,
>> +	NULL,
>> +};
>> +
>> +static const struct attribute_group survivability_info_group = {
>> +	.name = "survivability_info",
>> +	.attrs = survivability_info_attrs,
>> +	.is_visible = survivability_info_attrs_visible,
>> +};
>> +
>>   static int create_survivability_sysfs(struct pci_dev *pdev)
>>   {
>>   	struct device *dev = &pdev->dev;
>>   	struct xe_device *xe = pdev_to_xe_device(pdev);
>>   	int ret;
>>   
>> -	/* create survivability mode sysfs */
>> -	ret = sysfs_create_file(&dev->kobj, &dev_attr_survivability_mode.attr);
>> +	ret = device_create_file(dev, &dev_attr_survivability_mode);
>>   	if (ret) {
>>   		dev_warn(dev, "Failed to create survivability sysfs files\n");
>>   		return ret;
>> @@ -203,6 +264,12 @@ static int create_survivability_sysfs(struct pci_dev *pdev)
>>   	if (ret)
>>   		return ret;
>>   
>> +	if (check_boot_failure(xe)) {
>> +		ret = devm_device_add_group(dev, &survivability_info_group);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>>   	return 0;
>>   }
>>   
>> @@ -239,25 +306,6 @@ static int enable_boot_survivability_mode(struct pci_dev *pdev)
>>   	return ret;
>>   }
>>   
>> -static int init_survivability_mode(struct xe_device *xe)
>> -{
>> -	struct xe_survivability *survivability = &xe->survivability;
>> -	struct xe_survivability_info *info;
>> -
>> -	survivability->size = MAX_SCRATCH_MMIO;
>> -
>> -	info = devm_kcalloc(xe->drm.dev, survivability->size, sizeof(*info),
>> -			    GFP_KERNEL);
>> -	if (!info)
>> -		return -ENOMEM;
>> -
>> -	survivability->info = info;
>> -
>> -	populate_survivability_info(xe);
>> -
>> -	return 0;
>> -}
>> -
>>   /**
>>    * xe_survivability_mode_is_boot_enabled- check if boot survivability mode is enabled
>>    * @xe: xe device instance
>> @@ -325,9 +373,7 @@ int xe_survivability_mode_runtime_enable(struct xe_device *xe)
>>   		return -EINVAL;
>>   	}
>>   
>> -	ret = init_survivability_mode(xe);
>> -	if (ret)
>> -		return ret;
>> +	populate_survivability_info(xe);
>>   
>>   	ret = create_survivability_sysfs(pdev);
>>   	if (ret)
>> @@ -356,14 +402,11 @@ int xe_survivability_mode_boot_enable(struct xe_device *xe)
>>   {
>>   	struct xe_survivability *survivability = &xe->survivability;
>>   	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>> -	int ret;
>>   
>>   	if (!xe_survivability_mode_is_requested(xe))
>>   		return 0;
>>   
>> -	ret = init_survivability_mode(xe);
>> -	if (ret)
>> -		return ret;
>> +	populate_survivability_info(xe);
>>   
>>   	/* Log breadcrumbs but do not enter survivability mode for Critical boot errors */
>>   	if (survivability->boot_status == CRITICAL_FAILURE) {
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> index cd65a5d167c9..f31b3907d933 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> @@ -9,23 +9,29 @@
>>   #include <linux/limits.h>
>>   #include <linux/types.h>
>>   
>> +enum scratch_reg {
>> +	CAPABILITY_INFO,
>> +	POSTCODE_TRACE,
>> +	POSTCODE_TRACE_OVERFLOW,
>> +	AUX_INFO0,
>> +	AUX_INFO1,
>> +	AUX_INFO2,
>> +	AUX_INFO3,
>> +	AUX_INFO4,
>> +	MAX_SCRATCH_REG,
>> +};
>> +
>>   enum xe_survivability_type {
>>   	XE_SURVIVABILITY_TYPE_BOOT,
>>   	XE_SURVIVABILITY_TYPE_RUNTIME,
>>   };
>>   
>> -struct xe_survivability_info {
>> -	char name[NAME_MAX];
>> -	u32 reg;
>> -	u32 value;
>> -};
>> -
>>   /**
>>    * struct xe_survivability: Contains survivability mode information
>>    */
>>   struct xe_survivability {
>> -	/** @info: struct that holds survivability info from scratch registers */
>> -	struct xe_survivability_info *info;
>> +	/** @info: survivability debug info */
>> +	u32 info[MAX_SCRATCH_REG];
>>   
>>   	/** @size: number of scratch registers */
>>   	u32 size;
>> -- 
>> 2.47.1
>>



Thread overview: 11+ messages
2025-12-01 14:34 [PATCH v2 0/2] Redesign survivability mode Riana Tauro
2025-12-01 14:34 ` [PATCH v2 1/2] drm/xe/xe_survivability: " Riana Tauro
2025-12-01 17:48   ` Rodrigo Vivi
2025-12-02  3:10     ` Riana Tauro [this message]
2025-12-01 14:34 ` [PATCH v2 2/2] drm/xe/xe_survivability: Add support for survivability mode v2 Riana Tauro
2025-12-01 17:43   ` Rodrigo Vivi
2025-12-02  3:07     ` Riana Tauro
2025-12-02 20:31       ` Rodrigo Vivi
2025-12-01 17:05 ` ✓ CI.KUnit: success for Redesign survivability mode (rev2) Patchwork
2025-12-01 17:45 ` ✓ Xe.CI.BAT: " Patchwork
2025-12-01 20:12 ` ✗ Xe.CI.Full: failure " Patchwork
