From: "Summers, Stuart" <stuart.summers@intel.com>
To: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
"Tauro, Riana" <riana.tauro@intel.com>
Cc: "Jadav, Raag" <raag.jadav@intel.com>,
"Anirban, Sk" <sk.anirban@intel.com>,
"Vivi, Rodrigo" <rodrigo.vivi@intel.com>,
"Scarbrough, Frank" <frank.scarbrough@intel.com>,
"aravind.iddamsetty@linux.intel.com"
<aravind.iddamsetty@linux.intel.com>,
"Gupta, Anshuman" <anshuman.gupta@intel.com>,
"Nerlige Ramappa, Umesh" <umesh.nerlige.ramappa@intel.com>,
"De Marchi, Lucas" <lucas.demarchi@intel.com>
Subject: Re: [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode
Date: Wed, 9 Jul 2025 18:04:36 +0000
Message-ID: <0d56280d48bc707917bd1e11e3d93683a9de98f1.camel@intel.com>
In-Reply-To: <20250702141118.3564242-4-riana.tauro@intel.com>
On Wed, 2025-07-02 at 19:41 +0530, Riana Tauro wrote:
> Certain runtime firmware errors can cause the device to be wedged
> requiring a firmware flash to restore normal operation.
> Runtime Survivability Mode indicates that a firmware flash is
> necessary to
> recover the device.
I don't understand why we need to overload survivability mode here
in the case of a CSC (or other hardware error) failure. I see there is
some vsec initialization that happens there and GSC initialization
(need to look further, but presumably this puts GSC in a survivability
state also?). But we already have the vendor-specific wedge. Do we
really need the extra hook into survivability mode, which was really
built as a boot-time config?
Thanks,
Stuart
>
> The below sysfs is an indication that device is in survivability mode
>
> /sys/bus/pci/devices/<device>/survivability_mode
>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> drivers/gpu/drm/xe/xe_device.c | 2 +-
> drivers/gpu/drm/xe/xe_survivability_mode.c | 26 ++++++++++++++++-
> --
> drivers/gpu/drm/xe/xe_survivability_mode.h | 4 ++-
> .../gpu/drm/xe/xe_survivability_mode_types.h | 8 ++++++
> 4 files changed, 35 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_device.c
> b/drivers/gpu/drm/xe/xe_device.c
> index 4a38486dccc8..5defa54ccd26 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -716,7 +716,7 @@ int xe_device_probe_early(struct xe_device *xe)
> * possible, but still return the previous error for
> error
> * propagation
> */
> - err = xe_survivability_mode_enable(xe);
> + err = xe_survivability_mode_enable(xe,
> XE_SURVIVABILITY_TYPE_BOOT);
> if (err)
> return err;
>
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c
> b/drivers/gpu/drm/xe/xe_survivability_mode.c
> index 1f710b3fc599..e1adcb33c9b0 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
> @@ -129,7 +129,10 @@ static ssize_t survivability_mode_show(struct
> device *dev,
> struct xe_survivability_info *info = survivability->info;
> int index = 0, count = 0;
>
> - for (index = 0; index < MAX_SCRATCH_MMIO; index++) {
> + count += sysfs_emit_at(buff, count, "Survivability mode:
> %s\n",
> + survivability->type ? "Runtime" :
> "Boot");
> +
> + for (index = 0; survivability->boot_status && index <
> MAX_SCRATCH_MMIO; index++) {
> if (info[index].reg)
> count += sysfs_emit_at(buff, count, "%s: 0x%x
> - 0x%x\n", info[index].name,
> info[index].reg,
> info[index].value);
> @@ -169,6 +172,10 @@ static int enable_survivability_mode(struct
> pci_dev *pdev)
> if (ret)
> return ret;
>
> + /* Only create sysfs for runtime survivability mode */
> + if (xe_survivability_mode_is_runtime(xe))
> + return 0;
> +
> /* Make sure xe_heci_gsc_init() knows about survivability
> mode */
> survivability->mode = true;
>
> @@ -189,6 +196,17 @@ static int enable_survivability_mode(struct
> pci_dev *pdev)
> return 0;
> }
>
> +/**
> + * xe_survivability_mode_is_runtime - check if survivability mode is
> runtime
> + * @xe: xe device instance
> + *
> + * Returns true if in runtime survivability mode, false otherwise
> + */
> +bool xe_survivability_mode_is_runtime(struct xe_device *xe)
> +{
> + return xe->survivability.type ==
> XE_SURVIVABILITY_TYPE_RUNTIME;
> +}
> +
> /**
> * xe_survivability_mode_is_enabled - check if survivability mode is
> enabled
> * @xe: xe device instance
> @@ -251,16 +269,18 @@ bool xe_survivability_mode_is_requested(struct
> xe_device *xe)
> * Return: 0 if survivability mode is enabled or not requested;
> negative error
> * code otherwise.
> */
> -int xe_survivability_mode_enable(struct xe_device *xe)
> +int xe_survivability_mode_enable(struct xe_device *xe, const enum
> xe_survivability_type type)
> {
> struct xe_survivability *survivability = &xe->survivability;
> struct xe_survivability_info *info;
> struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>
> - if (!xe_survivability_mode_is_requested(xe))
> + if (!xe_survivability_mode_is_requested(xe) &&
> + type != XE_SURVIVABILITY_TYPE_RUNTIME)
> return 0;
>
> survivability->size = MAX_SCRATCH_MMIO;
> + survivability->type = type;
>
> info = devm_kcalloc(xe->drm.dev, survivability->size,
> sizeof(*info),
> GFP_KERNEL);
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h
> b/drivers/gpu/drm/xe/xe_survivability_mode.h
> index 02231c2bf008..559d1e99b03a 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode.h
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
> @@ -9,9 +9,11 @@
> #include <linux/types.h>
>
> struct xe_device;
> +enum xe_survivability_type;
>
> -int xe_survivability_mode_enable(struct xe_device *xe);
> +int xe_survivability_mode_enable(struct xe_device *xe, const enum
> xe_survivability_type);
> bool xe_survivability_mode_is_enabled(struct xe_device *xe);
> +bool xe_survivability_mode_is_runtime(struct xe_device *xe);
> bool xe_survivability_mode_is_requested(struct xe_device *xe);
>
> #endif /* _XE_SURVIVABILITY_MODE_H_ */
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> index 19d433e253df..01f07d9c4124 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> @@ -9,6 +9,11 @@
> #include <linux/limits.h>
> #include <linux/types.h>
>
> +enum xe_survivability_type {
> + XE_SURVIVABILITY_TYPE_BOOT,
> + XE_SURVIVABILITY_TYPE_RUNTIME,
> +};
> +
> struct xe_survivability_info {
> char name[NAME_MAX];
> u32 reg;
> @@ -30,6 +35,9 @@ struct xe_survivability {
>
> /** @mode: boolean to indicate survivability mode */
> bool mode;
> +
> + /** @type: survivability mode type (boot or runtime) */
> + enum xe_survivability_type type;
> };
>
> #endif /* _XE_SURVIVABILITY_MODE_TYPES_H_ */
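
For reference, the gating this patch adds can be modelled outside the kernel. The sketch below is a rough user-space approximation only, not driver code: `model_survivability`, `model_enable` and `model_show_header` are hypothetical names, the `requested` flag stands in for xe_survivability_mode_is_requested(), and the info allocation, scratch register dump and GSC/vsec setup are left out. It assumes the control flow is exactly what the quoted hunks show: boot mode proceeds only when requested, runtime mode always proceeds but stops after the sysfs step.

```c
#include <stdbool.h>
#include <stdio.h>

/* Same enumerators as the patch; BOOT is 0, RUNTIME is 1. */
enum xe_survivability_type {
	XE_SURVIVABILITY_TYPE_BOOT,
	XE_SURVIVABILITY_TYPE_RUNTIME,
};

/* Hypothetical stand-in for struct xe_survivability (illustration only). */
struct model_survivability {
	bool requested;      /* models xe_survivability_mode_is_requested() */
	bool sysfs_created;  /* models the sysfs file having been created */
	bool mode;           /* models survivability->mode (boot path only) */
	enum xe_survivability_type type;
};

/* Models the combined gate of xe_survivability_mode_enable() and
 * enable_survivability_mode() as they appear in the quoted diff. */
static int model_enable(struct model_survivability *s,
			enum xe_survivability_type type)
{
	/* Boot type is a no-op unless requested; runtime type always proceeds. */
	if (!s->requested && type != XE_SURVIVABILITY_TYPE_RUNTIME)
		return 0;

	s->type = type;
	s->sysfs_created = true;

	/* Runtime type stops here: only the sysfs file is created. */
	if (type == XE_SURVIVABILITY_TYPE_RUNTIME)
		return 0;

	/* Boot type continues and flags the mode for later init code. */
	s->mode = true;
	return 0;
}

/* Models the first line emitted by survivability_mode_show(). The ternary
 * relies on XE_SURVIVABILITY_TYPE_BOOT being 0, as in the patch. */
static int model_show_header(const struct model_survivability *s,
			     char *buf, size_t len)
{
	return snprintf(buf, len, "Survivability mode: %s\n",
			s->type ? "Runtime" : "Boot");
}
```

In this model a runtime call on a device that never requested survivability mode still creates the sysfs file but leaves `mode` false, which is the asymmetry between the two types that the question above is about.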
Thread overview: 36+ messages
2025-07-02 14:11 [PATCH v3 0/7] Handle Firmware reported Hardware Errors Riana Tauro
2025-07-02 14:11 ` [PATCH v3 1/7] drm: Add a vendor-specific recovery method to device wedged uevent Riana Tauro
2025-07-03 4:06 ` Raag Jadav
2025-07-03 5:20 ` Riana Tauro
2025-07-03 6:40 ` Raag Jadav
2025-07-03 6:50 ` Riana Tauro
2025-07-02 14:11 ` [PATCH v3 2/7] drm/xe: Set GT as wedged before sending " Riana Tauro
2025-07-02 21:41 ` Rodrigo Vivi
2025-07-03 4:18 ` Raag Jadav
2025-07-03 5:18 ` Riana Tauro
2025-07-03 6:45 ` Raag Jadav
2025-07-07 6:44 ` Riana Tauro
2025-07-02 14:11 ` [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode Riana Tauro
2025-07-02 21:40 ` Rodrigo Vivi
2025-07-03 5:16 ` Riana Tauro
2025-07-02 23:33 ` kernel test robot
2025-07-09 18:04 ` Summers, Stuart [this message]
2025-07-10 5:27 ` Riana Tauro
2025-07-15 17:30 ` Summers, Stuart
2025-07-02 14:11 ` [PATCH v3 4/7] drm/xe/doc: Document device wedged and runtime survivability Riana Tauro
2025-07-02 13:55 ` Riana Tauro
2025-07-03 7:19 ` Raag Jadav
2025-07-02 14:11 ` [PATCH v3 5/7] drm/xe: Add support to handle hardware errors Riana Tauro
2025-07-09 17:27 ` Summers, Stuart
2025-07-10 5:54 ` Riana Tauro
2025-07-02 14:11 ` [PATCH v3 6/7] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
2025-07-02 21:35 ` Rodrigo Vivi
2025-07-03 5:28 ` Riana Tauro
2025-07-09 17:57 ` Summers, Stuart
2025-07-10 5:38 ` Riana Tauro
2025-07-02 14:11 ` [PATCH v3 7/7] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler Riana Tauro
2025-07-02 15:53 ` ✗ CI.checkpatch: warning for Handle Firmware reported Hardware Errors (rev3) Patchwork
2025-07-02 15:54 ` ✓ CI.KUnit: success " Patchwork
2025-07-02 16:17 ` ✗ CI.checksparse: warning " Patchwork
2025-07-02 16:39 ` ✓ Xe.CI.BAT: success " Patchwork
2025-07-04 6:45 ` ✗ Xe.CI.Full: failure " Patchwork