Intel-XE Archive on lore.kernel.org
From: Riana Tauro <riana.tauro@intel.com>
To: "Summers, Stuart" <stuart.summers@intel.com>,
	"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>
Cc: "Jadav, Raag" <raag.jadav@intel.com>,
	"Anirban, Sk" <sk.anirban@intel.com>,
	"Vivi, Rodrigo" <rodrigo.vivi@intel.com>,
	"Scarbrough, Frank" <frank.scarbrough@intel.com>,
	"aravind.iddamsetty@linux.intel.com"
	<aravind.iddamsetty@linux.intel.com>,
	"Gupta, Anshuman" <anshuman.gupta@intel.com>,
	"Nerlige Ramappa, Umesh" <umesh.nerlige.ramappa@intel.com>,
	"De Marchi, Lucas" <lucas.demarchi@intel.com>
Subject: Re: [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode
Date: Thu, 10 Jul 2025 10:57:54 +0530
Message-ID: <de50743e-3d5c-4765-aa1f-4a61329c85a4@intel.com>
In-Reply-To: <0d56280d48bc707917bd1e11e3d93683a9de98f1.camel@intel.com>

Hi Stuart

On 7/9/2025 11:34 PM, Summers, Stuart wrote:
> On Wed, 2025-07-02 at 19:41 +0530, Riana Tauro wrote:
>> Certain runtime firmware errors can cause the device to be wedged
>> requiring a firmware flash to restore normal operation.
>> Runtime Survivability Mode indicates that a firmware flash is
>> necessary to
>> recover the device.
> 
> I'm not understanding why we need to overload survivability mode here
> in the case of a CSC (or other hardware error) failure. I see there is
> some vesc initialization that happens there and GSC initialization
> (need to look further, but presumably this puts GSC in a survivability
> state also?). But we already have the vendor specific wedge. Do we
> really need the extra hook to survivability mode which was really built
> as a boot time config.

A vendor-specific recovery method without a stated reason is vague and
could be reused for a different action in the future. There needs to be an
indication that this wedged uevent means a firmware flash is required, so
the survivability mode sysfs serves as that indication.

This patch will be extended further to handle d3cold resume pcode
failures, which will send a similar wedged event and use survivability mode.
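
To illustrate how userspace might consume this, here is a minimal sketch,
assuming the first line of the sysfs attribute has the
"Survivability mode: <Boot|Runtime>" form emitted by
survivability_mode_show() in this patch. The surv_* names and the userspace
enum are hypothetical and not part of the series; the attribute path is left
to the caller.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace mirror of enum xe_survivability_type. */
enum surv_type {
	SURV_TYPE_UNKNOWN = -1,
	SURV_TYPE_BOOT,
	SURV_TYPE_RUNTIME,
};

/*
 * Parse the first line of the attribute, assumed to look like
 * "Survivability mode: Runtime\n" or "Survivability mode: Boot\n".
 */
static enum surv_type surv_parse_type(const char *buf)
{
	static const char prefix[] = "Survivability mode: ";
	const char *val;

	if (!buf || strncmp(buf, prefix, sizeof(prefix) - 1))
		return SURV_TYPE_UNKNOWN;

	val = buf + sizeof(prefix) - 1;
	if (!strncmp(val, "Runtime", 7))
		return SURV_TYPE_RUNTIME;
	if (!strncmp(val, "Boot", 4))
		return SURV_TYPE_BOOT;

	return SURV_TYPE_UNKNOWN;
}

/* Read the attribute at @path; SURV_TYPE_UNKNOWN if it cannot be opened. */
static enum surv_type surv_read_type(const char *path)
{
	char line[128];
	enum surv_type type = SURV_TYPE_UNKNOWN;
	FILE *f = fopen(path, "r");

	if (!f)
		return SURV_TYPE_UNKNOWN;

	if (fgets(line, sizeof(line), f))
		type = surv_parse_type(line);

	fclose(f);
	return type;
}
```

A handler for the wedged uevent could call surv_read_type() on the device's
survivability_mode attribute and treat SURV_TYPE_RUNTIME as "firmware flash
required", falling back to its default handling when the attribute is absent.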

Thanks
Riana
>
> Thanks,
> Stuart
> 
>>
>> The below sysfs is an indication that device is in survivability mode
>>
>> /sys/bus/pci/devices/<device>/survivability_mode
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_device.c                |  2 +-
>>   drivers/gpu/drm/xe/xe_survivability_mode.c    | 26 ++++++++++++++++---
>>   drivers/gpu/drm/xe/xe_survivability_mode.h    |  4 ++-
>>   .../gpu/drm/xe/xe_survivability_mode_types.h  |  8 ++++++
>>   4 files changed, 35 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>> index 4a38486dccc8..5defa54ccd26 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -716,7 +716,7 @@ int xe_device_probe_early(struct xe_device *xe)
>>                   * possible, but still return the previous error for error
>>                   * propagation
>>                   */
>> -               err = xe_survivability_mode_enable(xe);
>> +               err = xe_survivability_mode_enable(xe, XE_SURVIVABILITY_TYPE_BOOT);
>>                  if (err)
>>                          return err;
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> index 1f710b3fc599..e1adcb33c9b0 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> @@ -129,7 +129,10 @@ static ssize_t survivability_mode_show(struct device *dev,
>>          struct xe_survivability_info *info = survivability->info;
>>          int index = 0, count = 0;
>>   
>> -       for (index = 0; index < MAX_SCRATCH_MMIO; index++) {
>> +       count += sysfs_emit_at(buff, count, "Survivability mode: %s\n",
>> +                              survivability->type ? "Runtime" : "Boot");
>> +
>> +       for (index = 0; survivability->boot_status && index < MAX_SCRATCH_MMIO; index++) {
>>                  if (info[index].reg)
>>                          count += sysfs_emit_at(buff, count, "%s: 0x%x - 0x%x\n", info[index].name,
>>                                                 info[index].reg, info[index].value);
>> @@ -169,6 +172,10 @@ static int enable_survivability_mode(struct pci_dev *pdev)
>>          if (ret)
>>                  return ret;
>>   
>> +       /* Only create sysfs for runtime survivability mode */
>> +       if (xe_survivability_mode_is_runtime(xe))
>> +               return 0;
>> +
>>          /* Make sure xe_heci_gsc_init() knows about survivability mode */
>>          survivability->mode = true;
>>   
>> @@ -189,6 +196,17 @@ static int enable_survivability_mode(struct pci_dev *pdev)
>>          return 0;
>>   }
>>   
>> +/**
>> + * xe_survivability_mode_is_runtime - check if survivability mode is runtime
>> + * @xe: xe device instance
>> + *
>> + * Returns true if in runtime survivability mode, false otherwise
>> + */
>> +bool xe_survivability_mode_is_runtime(struct xe_device *xe)
>> +{
>> +       return xe->survivability.type == XE_SURVIVABILITY_TYPE_RUNTIME;
>> +}
>> +
>>   /**
>>    * xe_survivability_mode_is_enabled - check if survivability mode is enabled
>>    * @xe: xe device instance
>> @@ -251,16 +269,18 @@ bool xe_survivability_mode_is_requested(struct xe_device *xe)
>>    * Return: 0 if survivability mode is enabled or not requested; negative error
>>    * code otherwise.
>>    */
>> -int xe_survivability_mode_enable(struct xe_device *xe)
>> +int xe_survivability_mode_enable(struct xe_device *xe, const enum xe_survivability_type type)
>>   {
>>          struct xe_survivability *survivability = &xe->survivability;
>>          struct xe_survivability_info *info;
>>          struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>>   
>> -       if (!xe_survivability_mode_is_requested(xe))
>> +       if (!xe_survivability_mode_is_requested(xe) &&
>> +           type != XE_SURVIVABILITY_TYPE_RUNTIME)
>>                  return 0;
>>   
>>          survivability->size = MAX_SCRATCH_MMIO;
>> +       survivability->type = type;
>>   
>>          info = devm_kcalloc(xe->drm.dev, survivability->size, sizeof(*info),
>>                              GFP_KERNEL);
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h b/drivers/gpu/drm/xe/xe_survivability_mode.h
>> index 02231c2bf008..559d1e99b03a 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.h
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
>> @@ -9,9 +9,11 @@
>>   #include <linux/types.h>
>>   
>>   struct xe_device;
>> +enum xe_survivability_type;
>>   
>> -int xe_survivability_mode_enable(struct xe_device *xe);
>> +int xe_survivability_mode_enable(struct xe_device *xe, const enum xe_survivability_type);
>>   bool xe_survivability_mode_is_enabled(struct xe_device *xe);
>> +bool xe_survivability_mode_is_runtime(struct xe_device *xe);
>>   bool xe_survivability_mode_is_requested(struct xe_device *xe);
>>   
>>   #endif /* _XE_SURVIVABILITY_MODE_H_ */
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> index 19d433e253df..01f07d9c4124 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> @@ -9,6 +9,11 @@
>>   #include <linux/limits.h>
>>   #include <linux/types.h>
>>   
>> +enum xe_survivability_type {
>> +       XE_SURVIVABILITY_TYPE_BOOT,
>> +       XE_SURVIVABILITY_TYPE_RUNTIME,
>> +};
>> +
>>   struct xe_survivability_info {
>>          char name[NAME_MAX];
>>          u32 reg;
>> @@ -30,6 +35,9 @@ struct xe_survivability {
>>   
>>          /** @mode: boolean to indicate survivability mode */
>>          bool mode;
>> +
>> +       /** @type: survivability mode type (boot or runtime) */
>> +       enum xe_survivability_type type;
>>   };
>>   
>>   #endif /* _XE_SURVIVABILITY_MODE_TYPES_H_ */
> 