Intel-XE Archive on lore.kernel.org
From: "Summers, Stuart" <stuart.summers@intel.com>
To: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	"Tauro,  Riana" <riana.tauro@intel.com>
Cc: "Anirban, Sk" <sk.anirban@intel.com>,
	"Jadav, Raag" <raag.jadav@intel.com>,
	"Vivi, Rodrigo" <rodrigo.vivi@intel.com>,
	"Scarbrough, Frank" <frank.scarbrough@intel.com>,
	"aravind.iddamsetty@linux.intel.com"
	<aravind.iddamsetty@linux.intel.com>,
	"Gupta, Anshuman" <anshuman.gupta@intel.com>,
	"De Marchi, Lucas" <lucas.demarchi@intel.com>,
	"Nerlige Ramappa, Umesh" <umesh.nerlige.ramappa@intel.com>
Subject: Re: [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode
Date: Tue, 15 Jul 2025 17:30:46 +0000
Message-ID: <67abc9c91458903bc6f5014ac70f5208c0dfd1b2.camel@intel.com>
In-Reply-To: <de50743e-3d5c-4765-aa1f-4a61329c85a4@intel.com>

On Thu, 2025-07-10 at 10:57 +0530, Riana Tauro wrote:
> Hi Stuart
> 
> On 7/9/2025 11:34 PM, Summers, Stuart wrote:
> > On Wed, 2025-07-02 at 19:41 +0530, Riana Tauro wrote:
> > > Certain runtime firmware errors can cause the device to be wedged
> > > requiring a firmware flash to restore normal operation.
> > > Runtime Survivability Mode indicates that a firmware flash is
> > > necessary to
> > > recover the device.
> > 
> > I'm not understanding why we need to overload survivability mode
> > here
> > in the case of a CSC (or other hardware error) failure. I see there
> > is
> > > some vsec initialization that happens there and GSC initialization
> > (need to look further, but presumably this puts GSC in a
> > survivability
> > state also?). But we already have the vendor specific wedge. Do we
> > really need the extra hook to survivability mode which was really
> > built
> > > as a boot time config?
> 
> vendor-specific without a reason is vague and could be reused for a 
> different action in the future. There needs to be an indication that
> this 
> wedged uevent indicates firmware flash. So the survivability mode
> sysfs
> 
> This patch will further be extended to handle d3cold resume pcode 
> failures which will send a similar wedged event and survivability
> mode

I know you have the new series up, but just coming back to confirm...
ack from me here. And just for my understanding, basically the idea is
we send the uevent, then set the "state" of the driver/hardware by
triggering survivability runtime mode. The user can then use this sysfs
state to determine that a recovery of some kind is needed - in this
case firmware flash and... re-enumeration? soft reset? or just a driver
reload?
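
For illustration, here is a minimal user-space sketch of that check. It
assumes the first line of the attribute has the form emitted by
survivability_mode_show() in this patch ("Survivability mode: Boot" or
"Survivability mode: Runtime"). The enum surv_type and both helper names
are hypothetical and not part of any kernel interface, and the caller
supplies the attribute path for the device.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space enum for this sketch; not a kernel UAPI. */
enum surv_type {
	SURV_UNKNOWN = -1,
	SURV_BOOT,
	SURV_RUNTIME,
};

/*
 * Parse the first line of the sysfs attribute. The format is assumed
 * from the sysfs_emit_at() call added by this patch:
 *   "Survivability mode: Boot\n" or "Survivability mode: Runtime\n"
 */
static enum surv_type parse_survivability_type(const char *text)
{
	static const char prefix[] = "Survivability mode: ";
	const char *p;

	if (strncmp(text, prefix, sizeof(prefix) - 1) != 0)
		return SURV_UNKNOWN;

	p = text + sizeof(prefix) - 1;
	if (strncmp(p, "Runtime", 7) == 0)
		return SURV_RUNTIME;
	if (strncmp(p, "Boot", 4) == 0)
		return SURV_BOOT;

	return SURV_UNKNOWN;
}

/* Read the attribute at @path and classify its first line. */
static enum surv_type read_survivability_type(const char *path)
{
	char line[128];
	enum surv_type type = SURV_UNKNOWN;
	FILE *f = fopen(path, "r");

	if (!f)
		return SURV_UNKNOWN;	/* attribute absent or unreadable */

	if (fgets(line, sizeof(line), f))
		type = parse_survivability_type(line);

	fclose(f);
	return type;
}
```

A tool listening for the wedged uevent could call
read_survivability_type() on the device's attribute and treat
SURV_RUNTIME as "firmware flash required". What has to follow the flash
is the open question above.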

Thanks,
Stuart

> 
> Thanks
> Riana
> > Thanks,
> > Stuart
> > 
> > > 
> > > The below sysfs is an indication that device is in survivability
> > > mode
> > > 
> > > > /sys/bus/pci/devices/<device>/survivability_mode
> > > 
> > > Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > ---
> > >   drivers/gpu/drm/xe/xe_device.c                |  2 +-
> > > >   drivers/gpu/drm/xe/xe_survivability_mode.c    | 26 ++++++++++++++++---
> > >   drivers/gpu/drm/xe/xe_survivability_mode.h    |  4 ++-
> > >   .../gpu/drm/xe/xe_survivability_mode_types.h  |  8 ++++++
> > >   4 files changed, 35 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > > b/drivers/gpu/drm/xe/xe_device.c
> > > index 4a38486dccc8..5defa54ccd26 100644
> > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > @@ -716,7 +716,7 @@ int xe_device_probe_early(struct xe_device
> > > *xe)
> > >                   * possible, but still return the previous error
> > > for
> > > error
> > >                   * propagation
> > >                   */
> > > -               err = xe_survivability_mode_enable(xe);
> > > +               err = xe_survivability_mode_enable(xe,
> > > XE_SURVIVABILITY_TYPE_BOOT);
> > >                  if (err)
> > >                          return err;
> > >   
> > > diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c
> > > b/drivers/gpu/drm/xe/xe_survivability_mode.c
> > > index 1f710b3fc599..e1adcb33c9b0 100644
> > > --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
> > > +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
> > > @@ -129,7 +129,10 @@ static ssize_t
> > > survivability_mode_show(struct
> > > device *dev,
> > >          struct xe_survivability_info *info = survivability-
> > > >info;
> > >          int index = 0, count = 0;
> > >   
> > > -       for (index = 0; index < MAX_SCRATCH_MMIO; index++) {
> > > +       count += sysfs_emit_at(buff, count, "Survivability mode:
> > > %s\n",
> > > +                              survivability->type ? "Runtime" :
> > > "Boot");
> > > +
> > > +       for (index = 0; survivability->boot_status && index <
> > > MAX_SCRATCH_MMIO; index++) {
> > >                  if (info[index].reg)
> > >                          count += sysfs_emit_at(buff, count, "%s:
> > > 0x%x
> > > - 0x%x\n", info[index].name,
> > >                                                 info[index].reg,
> > > info[index].value);
> > > @@ -169,6 +172,10 @@ static int enable_survivability_mode(struct
> > > pci_dev *pdev)
> > >          if (ret)
> > >                  return ret;
> > >   
> > > +       /* Only create sysfs for runtime survivability mode */
> > > +       if (xe_survivability_mode_is_runtime(xe))
> > > +               return 0;
> > > +
> > >          /* Make sure xe_heci_gsc_init() knows about
> > > survivability
> > > mode */
> > >          survivability->mode = true;
> > >   
> > > @@ -189,6 +196,17 @@ static int enable_survivability_mode(struct
> > > pci_dev *pdev)
> > >          return 0;
> > >   }
> > >   
> > > +/**
> > > + * xe_survivability_mode_is_runtime - check if survivability
> > > mode is
> > > runtime
> > > + * @xe: xe device instance
> > > + *
> > > + * Returns true if in runtime survivability mode, false
> > > otherwise
> > > + */
> > > +bool xe_survivability_mode_is_runtime(struct xe_device *xe)
> > > +{
> > > +       return xe->survivability.type ==
> > > XE_SURVIVABILITY_TYPE_RUNTIME;
> > > +}
> > > +
> > >   /**
> > >    * xe_survivability_mode_is_enabled - check if survivability
> > > mode is
> > > enabled
> > >    * @xe: xe device instance
> > > @@ -251,16 +269,18 @@ bool
> > > xe_survivability_mode_is_requested(struct
> > > xe_device *xe)
> > >    * Return: 0 if survivability mode is enabled or not requested;
> > > negative error
> > >    * code otherwise.
> > >    */
> > > -int xe_survivability_mode_enable(struct xe_device *xe)
> > > +int xe_survivability_mode_enable(struct xe_device *xe, const
> > > enum
> > > xe_survivability_type type)
> > >   {
> > >          struct xe_survivability *survivability = &xe-
> > > >survivability;
> > >          struct xe_survivability_info *info;
> > >          struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> > >   
> > > -       if (!xe_survivability_mode_is_requested(xe))
> > > +       if (!xe_survivability_mode_is_requested(xe) &&
> > > +           type != XE_SURVIVABILITY_TYPE_RUNTIME)
> > >                  return 0;
> > >   
> > >          survivability->size = MAX_SCRATCH_MMIO;
> > > +       survivability->type = type;
> > >   
> > >          info = devm_kcalloc(xe->drm.dev, survivability->size,
> > > sizeof(*info),
> > >                              GFP_KERNEL);
> > > diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h
> > > b/drivers/gpu/drm/xe/xe_survivability_mode.h
> > > index 02231c2bf008..559d1e99b03a 100644
> > > --- a/drivers/gpu/drm/xe/xe_survivability_mode.h
> > > +++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
> > > @@ -9,9 +9,11 @@
> > >   #include <linux/types.h>
> > >   
> > >   struct xe_device;
> > > +enum xe_survivability_type;
> > >   
> > > -int xe_survivability_mode_enable(struct xe_device *xe);
> > > +int xe_survivability_mode_enable(struct xe_device *xe, const
> > > enum
> > > xe_survivability_type);
> > >   bool xe_survivability_mode_is_enabled(struct xe_device *xe);
> > > +bool xe_survivability_mode_is_runtime(struct xe_device *xe);
> > >   bool xe_survivability_mode_is_requested(struct xe_device *xe);
> > >   
> > >   #endif /* _XE_SURVIVABILITY_MODE_H_ */
> > > diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> > > b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> > > index 19d433e253df..01f07d9c4124 100644
> > > --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> > > @@ -9,6 +9,11 @@
> > >   #include <linux/limits.h>
> > >   #include <linux/types.h>
> > >   
> > > +enum xe_survivability_type {
> > > +       XE_SURVIVABILITY_TYPE_BOOT,
> > > +       XE_SURVIVABILITY_TYPE_RUNTIME,
> > > +};
> > > +
> > >   struct xe_survivability_info {
> > >          char name[NAME_MAX];
> > >          u32 reg;
> > > @@ -30,6 +35,9 @@ struct xe_survivability {
> > >   
> > >          /** @mode: boolean to indicate survivability mode */
> > >          bool mode;
> > > +
> > > +       /** @type: survivability mode type (boot or runtime) */
> > > +       enum xe_survivability_type type;
> > >   };
> > >   
> > >   #endif /* _XE_SURVIVABILITY_MODE_TYPES_H_ */
> > 
> 
> 
> 
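
As a reading aid for the quoted patch, the two conditions it adds can be
modelled as a standalone sketch. The enum is copied from the patch. The
two helper functions are hypothetical and only restate the early-return
in xe_survivability_mode_enable() and the runtime short-circuit in
enable_survivability_mode(); they are not driver code.

```c
#include <stdbool.h>

/* Copied from the patch (xe_survivability_mode_types.h). */
enum xe_survivability_type {
	XE_SURVIVABILITY_TYPE_BOOT,
	XE_SURVIVABILITY_TYPE_RUNTIME,
};

/*
 * Hypothetical helper: true when xe_survivability_mode_enable() would
 * continue past its early "return 0". The patch returns early only if
 * boot survivability was not requested AND the type is not runtime.
 */
static bool survivability_enable_proceeds(bool boot_requested,
					  enum xe_survivability_type type)
{
	return boot_requested || type == XE_SURVIVABILITY_TYPE_RUNTIME;
}

/*
 * Hypothetical helper: true when enable_survivability_mode() would go
 * on to set survivability->mode. For the runtime type the patch returns
 * right after creating the sysfs attribute, so the flag stays unset.
 */
static bool survivability_sets_mode_flag(enum xe_survivability_type type)
{
	return type != XE_SURVIVABILITY_TYPE_RUNTIME;
}
```

Under this reading, a runtime request always creates the sysfs
attribute, regardless of the boot-time request state, and leaves the
boot-only setup untouched.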


