Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Riana Tauro <riana.tauro@intel.com>
To: Raag Jadav <raag.jadav@intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
	<dri-devel@lists.freedesktop.org>, <anshuman.gupta@intel.com>,
	<rodrigo.vivi@intel.com>, <lucas.demarchi@intel.com>,
	<aravind.iddamsetty@linux.intel.com>,
	<himal.prasad.ghimiray@intel.com>, <frank.scarbrough@intel.com>
Subject: Re: [PATCH 2/4] drm/xe: Add a helper function to set recovery method
Date: Thu, 19 Jun 2025 12:56:28 +0530	[thread overview]
Message-ID: <dca4697f-c76b-4023-8605-79f7d593826f@intel.com> (raw)
In-Reply-To: <aEMFcBSWL_jPMYKa@black.fi.intel.com>

Hi Raag

Thank you for the review comments

On 6/6/2025 8:42 PM, Raag Jadav wrote:
> On Tue, Jun 03, 2025 at 01:43:58PM +0530, Riana Tauro wrote:
>> Add a helper function to set recovery method. The recovery
>> method has to be set before declaring the device wedged and sending the
>> drm wedged uevent. If no method is set, default unbind/re-bind method
>> will be set
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_device.c       | 30 +++++++++++++++++++++-------
>>   drivers/gpu/drm/xe/xe_device.h       |  1 +
>>   drivers/gpu/drm/xe/xe_device_types.h |  2 ++
>>   3 files changed, 26 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>> index 660b0c5126dc..3fd604ebdc6e 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -1120,16 +1120,28 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>>   	xe_pm_runtime_put(xe);
>>   }
>>   
>> +/**
>> + * xe_device_set_wedged_method - Set wedged recovery method
>> + * @xe: xe device instance
> 
> Missing @method

Missed this. Will fix it>
>> + *
>> + * Set wedged recovery method to be sent using drm wedged uevent.
>> + */
>> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method)
>> +{
>> +	xe->wedged.method = method;
>> +}
>> +
>>   /**
>>    * xe_device_declare_wedged - Declare device wedged
>>    * @xe: xe device instance
>>    *
>> - * This is a final state that can only be cleared with a module
>> - * re-probe (unbind + bind).
>> - * In this state every IOCTL will be blocked so the GT cannot be used.
>> + * This is a final state that can only be cleared with the method specified
>> + * in the drm wedged uevent. The method needs to be set using xe_device_set_wedged_method
>> + * before declaring the device as wedged or the default method of reprobe (unbind/re-bind)
>> + * will be sent. In this state every IOCTL will be blocked so the GT cannot be used.
> 
> The file convention seems like 80 characters for kernel doc, so let's
> stick to it.

okay

> 
>>    * In general it will be called upon any critical error such as gt reset
>> - * failure or guc loading failure. Userspace will be notified of this state
>> - * through device wedged uevent.
>> + * failure or guc loading failure or firmware failure.
>> + * Userspace will be notified of this state through device wedged uevent.
>>    * If xe.wedged module parameter is set to 2, this function will be called
>>    * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
>>    * snapshot capture. In this mode, GT reset won't be attempted so the state of
>> @@ -1152,6 +1164,11 @@ void xe_device_declare_wedged(struct xe_device *xe)
>>   		return;
>>   	}
>>   
>> +	/* If no wedge recovery method is set, use default */
>> +	if (!xe->wedged.method)
>> +		xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND
>> +					    | DRM_WEDGE_RECOVERY_BUS_RESET);
> 
> Although there are no strict rules about this, we usually don't begin a
> new line with a symbol.

will fix this

> 
>> +
>>   	if (!atomic_xchg(&xe->wedged.flag, 1)) {
>>   		xe->needs_flr_on_fini = true;
>>   		drm_err(&xe->drm,
>> @@ -1161,8 +1178,7 @@ void xe_device_declare_wedged(struct xe_device *xe)
>>   			dev_name(xe->drm.dev));
>>   
>>   		/* Notify userspace of wedged device */
>> -		drm_dev_wedged_event(&xe->drm,
>> -				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
>> +		drm_dev_wedged_event(&xe->drm, xe->wedged.method);
> 
> I was a bit late to realize it when I originally added this. The event
> call should be after xe_gt_declare_wedged() to comply with wedging rules.
> We notify userspace *after* we're done with driver cleanup.

Will move gt_wedged before uevent

Thanks
Riana

> 
> Raag
> 
>>   	}
>>   
>>   	for_each_gt(gt, xe, id)
>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>> index 0bc3bc8e6803..06350740aac5 100644
>> --- a/drivers/gpu/drm/xe/xe_device.h
>> +++ b/drivers/gpu/drm/xe/xe_device.h
>> @@ -191,6 +191,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
>>   }
>>   
>>   void xe_device_declare_wedged(struct xe_device *xe);
>> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method);
>>   
>>   struct xe_file *xe_file_get(struct xe_file *xef);
>>   void xe_file_put(struct xe_file *xef);
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> index b93c04466637..fb3617956d63 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -559,6 +559,8 @@ struct xe_device {
>>   		atomic_t flag;
>>   		/** @wedged.mode: Mode controlled by kernel parameter and debugfs */
>>   		int mode;
>> +		/** @wedged.method: Recovery method to be sent in the drm device wedged uevent */
>> +		unsigned long method;
>>   	} wedged;
>>   
>>   	/** @bo_device: Struct to control async free of BOs */
>> -- 
>> 2.47.1
>>
w


  reply	other threads:[~2025-06-19  7:26 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-03  8:13 [PATCH 0/4] Handle Firmware reported Hardware Errors Riana Tauro
2025-06-03  7:55 ` ✓ CI.Patch_applied: success for " Patchwork
2025-06-03  7:56 ` ✗ CI.checkpatch: warning " Patchwork
2025-06-03  7:57 ` ✓ CI.KUnit: success " Patchwork
2025-06-03  8:07 ` ✓ CI.Build: " Patchwork
2025-06-03  8:10 ` ✗ CI.Hooks: failure " Patchwork
2025-06-03  8:12 ` ✗ CI.checksparse: warning " Patchwork
2025-06-03  8:13 ` [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent Riana Tauro
2025-06-04 10:43   ` Raag Jadav
2025-06-05 11:24     ` Riana Tauro
2025-06-16 20:39       ` Rodrigo Vivi
2025-06-03  8:13 ` [PATCH 2/4] drm/xe: Add a helper function to set recovery method Riana Tauro
2025-06-06 15:12   ` Raag Jadav
2025-06-19  7:26     ` Riana Tauro [this message]
2025-06-03  8:13 ` [PATCH 3/4] drm/xe: Add support to handle hardware errors Riana Tauro
2025-06-03  8:14 ` [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
2025-06-03  9:31   ` Ghimiray, Himal Prasad
2025-06-03  9:53     ` Riana Tauro
2025-06-03  8:57 ` ✓ Xe.CI.BAT: success for Handle Firmware reported Hardware Errors Patchwork
2025-06-04  8:55 ` ✓ Xe.CI.Full: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=dca4697f-c76b-4023-8605-79f7d593826f@intel.com \
    --to=riana.tauro@intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=aravind.iddamsetty@linux.intel.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=frank.scarbrough@intel.com \
    --cc=himal.prasad.ghimiray@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=raag.jadav@intel.com \
    --cc=rodrigo.vivi@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox