Intel-XE Archive on lore.kernel.org
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Riana Tauro <riana.tauro@intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <anshuman.gupta@intel.com>,
	<badal.nilawar@intel.com>
Subject: Re: [PATCH v2 1/2] drm/xe/xe_survivability: Redesign survivability mode
Date: Mon, 1 Dec 2025 12:48:31 -0500
Message-ID: <aS3U7x5mabFPF23e@intel.com>
In-Reply-To: <20251201143420.3158372-5-riana.tauro@intel.com>

On Mon, Dec 01, 2025 at 08:04:22PM +0530, Riana Tauro wrote:
> Redesign survivability mode to have only one value per file.
> 
> 1) Retain the survivability_mode sysfs to indicate the type
> 
> 	cat /sys/bus/pci/devices/0000\:03\:00.0/survivability_mode
> 	(Boot / Runtime)
> 
> 2) Add survivability_info directory to expose boot breadcrumbs.
> Entries in survivability mode sysfs are only visible when
> boot breadcrumb registers are populated.
> 
> 	/sys/bus/pci/devices/0000:03:00.0/survivability_info
> 	├── aux_info0
> 	├── aux_info1
> 	├── aux_info2
> 	├── aux_info3
> 	├── aux_info4
> 	├── capability_info
> 	├── postcode_trace
> 	└── postcode_trace_overflow
> 
> Capability_info:
> 
> 	Provides data about boot status and has bits that
> 	indicate the support for the other breadcrumbs
> 
> Postcode Trace / Postcode Trace Overflow :
> 
> 	Each postcode is represented as an 8-bit value and represents
> 	a boot failure event. When a new failure event is logged by Pcode
> 	the existing postcodes are shifted left. These entries provide a
> 	history of 8 postcodes.
> 
> Auxiliary Info:
> 
> 	Some failures have additional debug information.
> 
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> v2: Add array map for names (Rodrigo)
> ---
>  drivers/gpu/drm/xe/xe_survivability_mode.c    | 199 +++++++++++-------
>  .../gpu/drm/xe/xe_survivability_mode_types.h  |  22 +-
>  2 files changed, 135 insertions(+), 86 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
> index 1662bfddd4bc..b5b582442637 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
> @@ -19,8 +19,6 @@
>  #include "xe_pcode_api.h"
>  #include "xe_vsec.h"
>  
> -#define MAX_SCRATCH_MMIO 8
> -
>  /**
>   * DOC: Survivability Mode
>   *
> @@ -48,19 +46,25 @@
>   *
>   * Refer :ref:`xe_configfs` for more details on how to use configfs
>   *
> - * Survivability mode is indicated by the below admin-only readable sysfs which provides additional
> - * debug information::
> + * Survivability mode is indicated by the below admin-only readable sysfs entry::
>   *
>   *	/sys/bus/pci/devices/<device>/survivability_mode
>   *
> - * Capability Information:
> - *	Provides boot status
> - * Postcode Information:
> - *	Provides information about the failure
> - * Overflow Information
> - *	Provides history of previous failures
> - * Auxiliary Information
> - *	Certain failures may have information in addition to postcode information
> + * Survivability mode sysfs provides information about the type of survivability mode.

Perhaps expand here that it is either 'Runtime' or 'Boot'...

> + * Any additional debug information if present will be visible under the directory
> + * ``survivability_info``::
> + *
> + *     /sys/bus/pci/devices/<device>/survivability_info/
> + *
> + * This directory has the following attributes
> + *
> + * - ``capability_info`` : Indicates Boot status and support for additional information
> + *
> + * - ``postcode_trace``, ``postcode_trace_overflow`` : Each postcode is a 8bit value and
> + *   represents a boot failure event. When a new failure event is logged by PCODE the
> + *   existing postcodes are shifted left. These entries provide a history of 8 postcodes.
> + *
> + * - ``aux_info<n>`` : Some failures have additional debug information
>   *
>   * Runtime Survivability
>   * =====================
> @@ -79,49 +83,63 @@
>   * to restore device to normal operation.
>   */
>  
> +static const char * const reg_map[] = {
> +	[CAPABILITY_INFO]         = "Capability_info",

Why do we need '_' between the words here instead of a space, like
the entries below?
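
For illustration only, here is a standalone userspace sketch of the same
enum-indexed name table with every label spelled with spaces. The enum
mirrors enum scratch_reg from the patch; the "Capability Info" wording and
the reg_name() helper are assumptions made for this example, not something
the patch defines.

```c
/* Mirrors enum scratch_reg from the patch. */
enum scratch_reg {
	CAPABILITY_INFO,
	POSTCODE_TRACE,
	POSTCODE_TRACE_OVERFLOW,
	AUX_INFO0,
	AUX_INFO1,
	AUX_INFO2,
	AUX_INFO3,
	AUX_INFO4,
	MAX_SCRATCH_REG,
};

/*
 * Designated initializers keep each label tied to its enum value, so
 * reordering the enum cannot silently mislabel a register.
 * The "Capability Info" spelling is an assumption of this sketch.
 */
static const char * const reg_map[MAX_SCRATCH_REG] = {
	[CAPABILITY_INFO]         = "Capability Info",
	[POSTCODE_TRACE]          = "Postcode trace",
	[POSTCODE_TRACE_OVERFLOW] = "Postcode trace overflow",
	[AUX_INFO0]               = "Auxiliary Info 0",
	[AUX_INFO1]               = "Auxiliary Info 1",
	[AUX_INFO2]               = "Auxiliary Info 2",
	[AUX_INFO3]               = "Auxiliary Info 3",
	[AUX_INFO4]               = "Auxiliary Info 4",
};

/* Hypothetical helper: bounds-checked lookup with a fallback label. */
static const char *reg_name(unsigned int id)
{
	if (id >= MAX_SCRATCH_REG || !reg_map[id])
		return "Unknown";

	return reg_map[id];
}
```

With this table, reg_name(CAPABILITY_INFO) yields "Capability Info" and any
out-of-range index falls back to "Unknown" instead of reading past the array.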

> +	[POSTCODE_TRACE]          = "Postcode trace",
> +	[POSTCODE_TRACE_OVERFLOW] = "Postcode trace overflow",
> +	[AUX_INFO0]               = "Auxiliary Info 0",
> +	[AUX_INFO1]               = "Auxiliary Info 1",
> +	[AUX_INFO2]               = "Auxiliary Info 2",
> +	[AUX_INFO3]               = "Auxiliary Info 3",
> +	[AUX_INFO4]               = "Auxiliary Info 4",
> +};
> +
> +struct xe_survivability_attribute {
> +	struct device_attribute attr;
> +	u8 index;
> +};
> +
> +static struct
> +xe_survivability_attribute *dev_attr_to_survivability_attr(struct device_attribute *attr)
> +{
> +	return container_of(attr, struct xe_survivability_attribute, attr);
> +}
> +
>  static u32 aux_history_offset(u32 reg_value)
>  {
>  	return REG_FIELD_GET(AUXINFO_HISTORY_OFFSET, reg_value);
>  }
>  
> -static void set_survivability_info(struct xe_mmio *mmio, struct xe_survivability_info *info,
> -				   int id, char *name)
> +static void set_survivability_info(struct xe_mmio *mmio, u32  *info, int id)
>  {
> -	strscpy(info[id].name, name, sizeof(info[id].name));
> -	info[id].reg = PCODE_SCRATCH(id).raw;
> -	info[id].value = xe_mmio_read32(mmio, PCODE_SCRATCH(id));
> +	info[id] = xe_mmio_read32(mmio, PCODE_SCRATCH(id));
>  }
>  
>  static void populate_survivability_info(struct xe_device *xe)
>  {
>  	struct xe_survivability *survivability = &xe->survivability;
> -	struct xe_survivability_info *info = survivability->info;
> +	u32 *info = survivability->info;
>  	struct xe_mmio *mmio;
>  	u32 id = 0, reg_value;
> -	char name[NAME_MAX];
>  	int index;
>  
>  	mmio = xe_root_tile_mmio(xe);
> -	set_survivability_info(mmio, info, id, "Capability Info");
> -	reg_value = info[id].value;
> +	set_survivability_info(mmio, info, CAPABILITY_INFO);
> +	reg_value = info[CAPABILITY_INFO];
>  
>  	if (reg_value & HISTORY_TRACKING) {
> -		id++;
> -		set_survivability_info(mmio, info, id, "Postcode Info");
> +		set_survivability_info(mmio, info, POSTCODE_TRACE);
>  
> -		if (reg_value & OVERFLOW_SUPPORT) {
> -			id = REG_FIELD_GET(OVERFLOW_REG_OFFSET, reg_value);
> -			set_survivability_info(mmio, info, id, "Overflow Info");
> -		}
> +		if (reg_value & OVERFLOW_SUPPORT)
> +			set_survivability_info(mmio, info, POSTCODE_TRACE_OVERFLOW);
>  	}
>  
>  	if (reg_value & AUXINFO_SUPPORT) {
>  		id = REG_FIELD_GET(AUXINFO_REG_OFFSET, reg_value);
>  
> -		for (index = 0; id && reg_value; index++, reg_value = info[id].value,
> -		     id = aux_history_offset(reg_value)) {
> -			snprintf(name, NAME_MAX, "Auxiliary Info %d", index);
> -			set_survivability_info(mmio, info, id, name);
> +		for (index = 0; id >= AUX_INFO0 && id < MAX_SCRATCH_REG; index++) {
> +			set_survivability_info(mmio, info, id);
> +			id = aux_history_offset(info[id]);
>  		}
>  	}
>  }
> @@ -130,15 +148,14 @@ static void log_survivability_info(struct pci_dev *pdev)
>  {
>  	struct xe_device *xe = pdev_to_xe_device(pdev);
>  	struct xe_survivability *survivability = &xe->survivability;
> -	struct xe_survivability_info *info = survivability->info;
> +	u32 *info = survivability->info;
>  	int id;
>  
>  	dev_info(&pdev->dev, "Survivability Boot Status : Critical Failure (%d)\n",
>  		 survivability->boot_status);
> -	for (id = 0; id < MAX_SCRATCH_MMIO; id++) {
> -		if (info[id].reg)
> -			dev_info(&pdev->dev, "%s: 0x%x - 0x%x\n", info[id].name,
> -				 info[id].reg, info[id].value);
> +	for (id = 0; id < MAX_SCRATCH_REG; id++) {
> +		if (info[id])
> +			dev_info(&pdev->dev, "%s: 0x%x\n", reg_map[id], info[id]);
>  	}
>  }
>  
> @@ -156,25 +173,38 @@ static ssize_t survivability_mode_show(struct device *dev,
>  	struct pci_dev *pdev = to_pci_dev(dev);
>  	struct xe_device *xe = pdev_to_xe_device(pdev);
>  	struct xe_survivability *survivability = &xe->survivability;
> -	struct xe_survivability_info *info = survivability->info;
> -	int index = 0, count = 0;
>  
> -	count += sysfs_emit_at(buff, count, "Survivability mode type: %s\n",
> -			       survivability->type ? "Runtime" : "Boot");
> +	return sysfs_emit(buff, "%s\n", survivability->type ? "Runtime" : "Boot");
> +}
>  
> -	if (!check_boot_failure(xe))
> -		return count;
> +static DEVICE_ATTR_ADMIN_RO(survivability_mode);
>  
> -	for (index = 0; index < MAX_SCRATCH_MMIO; index++) {
> -		if (info[index].reg)
> -			count += sysfs_emit_at(buff, count, "%s: 0x%x - 0x%x\n", info[index].name,
> -					       info[index].reg, info[index].value);
> -	}
> +static ssize_t survivability_info_show(struct device *dev,
> +				       struct device_attribute *attr, char *buff)
> +{
> +	struct xe_survivability_attribute *sa = dev_attr_to_survivability_attr(attr);
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct xe_device *xe = pdev_to_xe_device(pdev);
> +	struct xe_survivability *survivability = &xe->survivability;
> +	u32 *info = survivability->info;
>  
> -	return count;
> +	return sysfs_emit(buff, "0x%x\n", info[sa->index]);
>  }
>  
> -static DEVICE_ATTR_ADMIN_RO(survivability_mode);
> +#define SURVIVABILITY_ATTR_RO(name, _index)					\
> +	struct xe_survivability_attribute attr_##name =	{			\
> +		.attr =  __ATTR(name, 0400, survivability_info_show, NULL),	\
> +		.index = _index,						\
> +	}
> +
> +SURVIVABILITY_ATTR_RO(capability_info, CAPABILITY_INFO);
> +SURVIVABILITY_ATTR_RO(postcode_trace, POSTCODE_TRACE);
> +SURVIVABILITY_ATTR_RO(postcode_trace_overflow, POSTCODE_TRACE_OVERFLOW);
> +SURVIVABILITY_ATTR_RO(aux_info0, AUX_INFO0);
> +SURVIVABILITY_ATTR_RO(aux_info1, AUX_INFO1);
> +SURVIVABILITY_ATTR_RO(aux_info2, AUX_INFO2);
> +SURVIVABILITY_ATTR_RO(aux_info3, AUX_INFO3);
> +SURVIVABILITY_ATTR_RO(aux_info4, AUX_INFO4);
>  
>  static void xe_survivability_mode_fini(void *arg)
>  {
> @@ -182,17 +212,48 @@ static void xe_survivability_mode_fini(void *arg)
>  	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>  	struct device *dev = &pdev->dev;
>  
> -	sysfs_remove_file(&dev->kobj, &dev_attr_survivability_mode.attr);
> +	device_remove_file(dev, &dev_attr_survivability_mode);
> +}
> +
> +static umode_t survivability_info_attrs_visible(struct kobject *kobj, struct attribute *attr,
> +						int idx)
> +{
> +	struct xe_device *xe = kdev_to_xe_device(kobj_to_dev(kobj));
> +	struct xe_survivability *survivability = &xe->survivability;
> +	u32 *info = survivability->info;
> +
> +	if (info[idx])
> +		return 0400;
> +
> +	return 0;
>  }
>  
> +/* Attributes are ordered according to enum scratch_reg */
> +static struct attribute *survivability_info_attrs[] = {
> +	&attr_capability_info.attr.attr,
> +	&attr_postcode_trace.attr.attr,
> +	&attr_postcode_trace_overflow.attr.attr,
> +	&attr_aux_info0.attr.attr,
> +	&attr_aux_info1.attr.attr,
> +	&attr_aux_info2.attr.attr,
> +	&attr_aux_info3.attr.attr,
> +	&attr_aux_info4.attr.attr,
> +	NULL,
> +};
> +
> +static const struct attribute_group survivability_info_group = {
> +	.name = "survivability_info",
> +	.attrs = survivability_info_attrs,
> +	.is_visible = survivability_info_attrs_visible,
> +};
> +
>  static int create_survivability_sysfs(struct pci_dev *pdev)
>  {
>  	struct device *dev = &pdev->dev;
>  	struct xe_device *xe = pdev_to_xe_device(pdev);
>  	int ret;
>  
> -	/* create survivability mode sysfs */
> -	ret = sysfs_create_file(&dev->kobj, &dev_attr_survivability_mode.attr);
> +	ret = device_create_file(dev, &dev_attr_survivability_mode);
>  	if (ret) {
>  		dev_warn(dev, "Failed to create survivability sysfs files\n");
>  		return ret;
> @@ -203,6 +264,12 @@ static int create_survivability_sysfs(struct pci_dev *pdev)
>  	if (ret)
>  		return ret;
>  
> +	if (check_boot_failure(xe)) {
> +		ret = devm_device_add_group(dev, &survivability_info_group);
> +		if (ret)
> +			return ret;
> +	}
> +
>  	return 0;
>  }
>  
> @@ -239,25 +306,6 @@ static int enable_boot_survivability_mode(struct pci_dev *pdev)
>  	return ret;
>  }
>  
> -static int init_survivability_mode(struct xe_device *xe)
> -{
> -	struct xe_survivability *survivability = &xe->survivability;
> -	struct xe_survivability_info *info;
> -
> -	survivability->size = MAX_SCRATCH_MMIO;
> -
> -	info = devm_kcalloc(xe->drm.dev, survivability->size, sizeof(*info),
> -			    GFP_KERNEL);
> -	if (!info)
> -		return -ENOMEM;
> -
> -	survivability->info = info;
> -
> -	populate_survivability_info(xe);
> -
> -	return 0;
> -}
> -
>  /**
>   * xe_survivability_mode_is_boot_enabled- check if boot survivability mode is enabled
>   * @xe: xe device instance
> @@ -325,9 +373,7 @@ int xe_survivability_mode_runtime_enable(struct xe_device *xe)
>  		return -EINVAL;
>  	}
>  
> -	ret = init_survivability_mode(xe);
> -	if (ret)
> -		return ret;
> +	populate_survivability_info(xe);
>  
>  	ret = create_survivability_sysfs(pdev);
>  	if (ret)
> @@ -356,14 +402,11 @@ int xe_survivability_mode_boot_enable(struct xe_device *xe)
>  {
>  	struct xe_survivability *survivability = &xe->survivability;
>  	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> -	int ret;
>  
>  	if (!xe_survivability_mode_is_requested(xe))
>  		return 0;
>  
> -	ret = init_survivability_mode(xe);
> -	if (ret)
> -		return ret;
> +	populate_survivability_info(xe);
>  
>  	/* Log breadcrumbs but do not enter survivability mode for Critical boot errors */
>  	if (survivability->boot_status == CRITICAL_FAILURE) {
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> index cd65a5d167c9..f31b3907d933 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> @@ -9,23 +9,29 @@
>  #include <linux/limits.h>
>  #include <linux/types.h>
>  
> +enum scratch_reg {
> +	CAPABILITY_INFO,
> +	POSTCODE_TRACE,
> +	POSTCODE_TRACE_OVERFLOW,
> +	AUX_INFO0,
> +	AUX_INFO1,
> +	AUX_INFO2,
> +	AUX_INFO3,
> +	AUX_INFO4,
> +	MAX_SCRATCH_REG,
> +};
> +
>  enum xe_survivability_type {
>  	XE_SURVIVABILITY_TYPE_BOOT,
>  	XE_SURVIVABILITY_TYPE_RUNTIME,
>  };
>  
> -struct xe_survivability_info {
> -	char name[NAME_MAX];
> -	u32 reg;
> -	u32 value;
> -};
> -
>  /**
>   * struct xe_survivability: Contains survivability mode information
>   */
>  struct xe_survivability {
> -	/** @info: struct that holds survivability info from scratch registers */
> -	struct xe_survivability_info *info;
> +	/** @info: survivability debug info */
> +	u32 info[MAX_SCRATCH_REG];
>  
>  	/** @size: number of scratch registers */
>  	u32 size;
> -- 
> 2.47.1
> 
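
As a side note on the postcode description in the commit message, the
"shifted left, history of 8 postcodes" behaviour can be modelled in plain C.
This is a sketch under assumptions the patch does not state: that the newest
postcode lands in bits 7:0 of postcode_trace, and that bytes shifted out of
postcode_trace move into postcode_trace_overflow. The function names are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define POSTCODES_PER_REG 4
#define POSTCODE_HISTORY  (2 * POSTCODES_PER_REG)

/*
 * Split the two 32-bit values into eight 8-bit postcodes.
 * Assumed layout: out[0] is the newest event (bits 7:0 of trace),
 * out[7] is the oldest (bits 31:24 of overflow).
 */
static void decode_postcodes(uint32_t trace, uint32_t overflow,
			     uint8_t out[POSTCODE_HISTORY])
{
	uint64_t history = ((uint64_t)overflow << 32) | trace;

	for (size_t i = 0; i < POSTCODE_HISTORY; i++)
		out[i] = (history >> (8 * i)) & 0xff;
}

/*
 * Model of logging a new failure event: the existing postcodes shift
 * left by one byte and the oldest one drops off the end.
 */
static void log_postcode(uint32_t *trace, uint32_t *overflow, uint8_t postcode)
{
	uint64_t history = ((uint64_t)*overflow << 32) | *trace;

	history = (history << 8) | postcode;
	*trace = (uint32_t)history;
	*overflow = (uint32_t)(history >> 32);
}
```

Under that assumed layout, logging postcodes 1 through 9 in order leaves
trace at 0x06070809 and overflow at 0x02030405: the ninth event pushed the
first one out of the 8-entry history.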

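The reworked auxiliary-info loop in populate_survivability_info() follows a
chain: each aux register names the offset of the next one, and the walk stops
once the offset leaves the AUX_INFO0..AUX_INFO4 range. Below is a userspace
sketch of that walk over a simulated register array. The bit position used
for the "next offset" field is a placeholder chosen for this example (the
real mask is AUXINFO_HISTORY_OFFSET in xe_pcode_api.h), and the iteration cap
is an addition of this sketch, not something the patch contains.

```c
#include <stdint.h>

/* Mirrors enum scratch_reg from the patch. */
enum scratch_reg {
	CAPABILITY_INFO,
	POSTCODE_TRACE,
	POSTCODE_TRACE_OVERFLOW,
	AUX_INFO0,
	AUX_INFO1,
	AUX_INFO2,
	AUX_INFO3,
	AUX_INFO4,
	MAX_SCRATCH_REG,
};

/* Placeholder field layout for this sketch only. */
#define SKETCH_HISTORY_SHIFT 29
#define SKETCH_HISTORY_MASK  0x7u

static uint32_t sketch_history_offset(uint32_t value)
{
	return (value >> SKETCH_HISTORY_SHIFT) & SKETCH_HISTORY_MASK;
}

/*
 * Copy the chained aux registers from regs[] into info[], starting at
 * offset 'first'. Returns the number of registers copied.
 * The count limit stops a register that points back into the chain
 * from looping forever.
 */
static int walk_aux_chain(const uint32_t regs[MAX_SCRATCH_REG],
			  uint32_t info[MAX_SCRATCH_REG], uint32_t first)
{
	uint32_t id = first;
	int count = 0;

	while (id >= AUX_INFO0 && id < MAX_SCRATCH_REG &&
	       count < MAX_SCRATCH_REG - AUX_INFO0) {
		info[id] = regs[id];
		id = sketch_history_offset(info[id]);
		count++;
	}

	return count;
}
```

A next-offset of 0 (or anything below AUX_INFO0) ends the walk, which matches
the range check in the patch's for loop; entries that are never reached stay
zero and would therefore remain hidden by the is_visible callback.
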
Thread overview: 11+ messages
2025-12-01 14:34 [PATCH v2 0/2] Redesign survivability mode Riana Tauro
2025-12-01 14:34 ` [PATCH v2 1/2] drm/xe/xe_survivability: " Riana Tauro
2025-12-01 17:48   ` Rodrigo Vivi [this message]
2025-12-02  3:10     ` Riana Tauro
2025-12-01 14:34 ` [PATCH v2 2/2] drm/xe/xe_survivability: Add support for survivability mode v2 Riana Tauro
2025-12-01 17:43   ` Rodrigo Vivi
2025-12-02  3:07     ` Riana Tauro
2025-12-02 20:31       ` Rodrigo Vivi
2025-12-01 17:05 ` ✓ CI.KUnit: success for Redesign survivability mode (rev2) Patchwork
2025-12-01 17:45 ` ✓ Xe.CI.BAT: " Patchwork
2025-12-01 20:12 ` ✗ Xe.CI.Full: failure " Patchwork
