Intel-XE Archive on lore.kernel.org
From: Riana Tauro <riana.tauro@intel.com>
To: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <anshuman.gupta@intel.com>,
	<rodrigo.vivi@intel.com>, <lucas.demarchi@intel.com>,
	<aravind.iddamsetty@linux.intel.com>, <raag.jadav@intel.com>,
	<frank.scarbrough@intel.com>, <sk.anirban@intel.com>
Subject: Re: [PATCH v4 5/9] drm/xe/xe_survivability: Add support for Runtime survivability mode
Date: Thu, 10 Jul 2025 11:29:44 +0530
Message-ID: <e2880cff-fcc8-4c9c-9ff2-e446327fbcb8@intel.com>
In-Reply-To: <aG7-7Lqw79Y8azK1@unerlige-desk.amr.corp.intel.com>

Hi Umesh

On 7/10/2025 5:14 AM, Umesh Nerlige Ramappa wrote:
> On Wed, Jul 09, 2025 at 04:50:17PM +0530, Riana Tauro wrote:
>> Certain runtime firmware errors can cause the device to be in an unusable
>> state requiring a firmware flash to restore normal operation.
>> Runtime Survivability Mode indicates firmware flash is necessary by
>> wedging the device and exposing survivability mode sysfs.
>>
>> The below sysfs is an indication that device is in survivability mode
>>
>> /sys/bus/pci/devices/<device>/survivability_mode
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_survivability_mode.c    | 42 ++++++++++++++++++-
>> drivers/gpu/drm/xe/xe_survivability_mode.h    |  1 +
>> .../gpu/drm/xe/xe_survivability_mode_types.h  |  1 +
>> 3 files changed, 43 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> index fefb027b1c84..ca1cfa13525a 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> @@ -137,7 +137,8 @@ static ssize_t survivability_mode_show(struct device *dev,
>>     struct xe_survivability_info *info = survivability->info;
>>     int index = 0, count = 0;
>>
>> -    count += sysfs_emit_at(buff, count, "Survivability mode type: Boot\n");
>> +    count += sysfs_emit_at(buff, count, "Survivability mode type: %s\n",
>> +                   survivability->type ? "Runtime" : "Boot");
>>
>>     if (!check_boot_failure(xe))
>>         return count;
>> @@ -288,6 +289,45 @@ bool xe_survivability_mode_is_requested(struct xe_device *xe)
>>     return check_boot_failure(xe);
>> }
>>
>> +/**
>> + * xe_survivability_mode_runtime_enable - Initialize and enable runtime survivability mode
>> + * @xe: xe device instance
>> + *
>> + * Initialize survivability information and enable runtime survivability mode.
>> + * Runtime survivability mode is enabled when certain errors cause the device to be
>> + * in non-recoverable state. The device is declared wedged with the appropriate
>> + * recovery method and survivability mode sysfs exposed to userspace
>> + *
>> + * Return: 0 if runtime survivability mode is enabled or not requested, negative error
> 
> is the "not requested" still applicable here?

I copied this from the boot survivability kernel-doc. It is not applicable
here, so I will remove it.

> 
> 
>> + * code otherwise.
>> + */
>> +int xe_survivability_mode_runtime_enable(struct xe_device *xe)
>> +{
>> +    struct xe_survivability *survivability = &xe->survivability;
>> +    struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>> +    int ret;
>> +
>> +    if (!IS_DGFX(xe) || IS_SRIOV_VF(xe) || xe->info.platform < XE_BATTLEMAGE) {
> 
> Do you think this condition can be better handled with a 
> has_runtime_survivability for platforms that support it?

The check is used only once, so I added it inline. It can be split out into
a separate helper function.
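
A rough, standalone sketch of how the check could be split out. The struct,
enum, and helper name below are illustrative mocks, not the driver's real
definitions; the helper only mirrors the inline condition from the patch
(discrete GPU, not an SR-IOV VF, Battlemage or newer).

```c
#include <stdbool.h>

/* Illustrative stand-ins for the driver's types (assumption, not xe code). */
enum mock_platform {
	MOCK_PLATFORM_OLDER,
	MOCK_PLATFORM_BATTLEMAGE,
	MOCK_PLATFORM_NEWER,
};

struct mock_xe_device {
	bool is_dgfx;		/* stands in for IS_DGFX(xe) */
	bool is_sriov_vf;	/* stands in for IS_SRIOV_VF(xe) */
	enum mock_platform platform;
};

/* Hypothetical helper: true when runtime survivability would be supported. */
static bool mock_has_runtime_survivability(const struct mock_xe_device *xe)
{
	return xe->is_dgfx && !xe->is_sriov_vf &&
	       xe->platform >= MOCK_PLATFORM_BATTLEMAGE;
}
```

The caller would then reduce to a single negated call to the helper before
returning -EINVAL.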

> 
>> +        dev_err(&pdev->dev, "Runtime Survivability Mode not supported\n");
>> +        return -EINVAL;
>> +    }
>> +
>> +    ret = init_survivability_mode(xe);
>> +    if (ret)
>> +        return ret;
>> +
>> +    ret = create_survivability_sysfs(pdev);
>> +    if (ret)
>> +        dev_err(&pdev->dev, "Failed to create survivability mode sysfs\n");
> 
> You do not return ret in the above if condition. Is that intentional?

Yes, this is intentional. The device must still be wedged even if sysfs
creation fails, since it is unusable after such errors.
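
To illustrate that flow, here is a minimal standalone sketch. All names are
hypothetical mocks rather than the driver code; it only shows that a sysfs
failure is logged and the device is wedged regardless.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical mock state; the real driver keeps this in struct xe_device. */
struct mock_device {
	bool sysfs_created;
	bool wedged;
};

/* Mock sysfs creation that can be forced to fail for demonstration. */
static int mock_create_sysfs(struct mock_device *dev, bool fail)
{
	if (fail)
		return -1;
	dev->sysfs_created = true;
	return 0;
}

/* Log a sysfs failure but do not return early: the device is always wedged. */
static int mock_runtime_enable(struct mock_device *dev, bool sysfs_fails)
{
	int ret = mock_create_sysfs(dev, sysfs_fails);

	if (ret)
		fprintf(stderr, "Failed to create survivability mode sysfs\n");

	dev->wedged = true;
	return 0;
}
```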

Thanks
Riana

> 
> Regards,
> Umesh
> 
>> +
>> +    survivability->type = XE_SURVIVABILITY_TYPE_RUNTIME;
>> +    xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_VENDOR);
>> +    xe_device_declare_wedged(xe);
>> +
>> +    dev_err(&pdev->dev, "Runtime Survivability mode enabled\n");
>> +    return 0;
>> +}
>> +
>> /**
>>  * xe_survivability_mode_boot_enable - Initialize and enable boot survivability mode
>>  * @xe: xe device instance
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h b/drivers/gpu/drm/xe/xe_survivability_mode.h
>> index f6ee283ea5e8..1cc94226aa82 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.h
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
>> @@ -11,6 +11,7 @@
>> struct xe_device;
>>
>> int xe_survivability_mode_boot_enable(struct xe_device *xe);
>> +int xe_survivability_mode_runtime_enable(struct xe_device *xe);
>> bool xe_survivability_mode_is_boot_enabled(struct xe_device *xe);
>> bool xe_survivability_mode_is_requested(struct xe_device *xe);
>>
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> index 5dce393498da..cd65a5d167c9 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> @@ -11,6 +11,7 @@
>>
>> enum xe_survivability_type {
>>     XE_SURVIVABILITY_TYPE_BOOT,
>> +    XE_SURVIVABILITY_TYPE_RUNTIME,
>> };
>>
>> struct xe_survivability_info {
>> -- 
>> 2.47.1
>>


