Re: [PATCH 2/4] drm/xe: Add a helper function to set recovery method

dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed

From: Raag Jadav <raag.jadav@intel.com>
To: Riana Tauro <riana.tauro@intel.com>
Cc: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	anshuman.gupta@intel.com, rodrigo.vivi@intel.com,
	lucas.demarchi@intel.com, aravind.iddamsetty@linux.intel.com,
	himal.prasad.ghimiray@intel.com, frank.scarbrough@intel.com
Subject: Re: [PATCH 2/4] drm/xe: Add a helper function to set recovery method
Date: Fri, 6 Jun 2025 18:12:48 +0300	[thread overview]
Message-ID: <aEMFcBSWL_jPMYKa@black.fi.intel.com> (raw)
In-Reply-To: <20250603081409.1509709-3-riana.tauro@intel.com>

On Tue, Jun 03, 2025 at 01:43:58PM +0530, Riana Tauro wrote:
> Add a helper function to set recovery method. The recovery
> method has to be set before declaring the device wedged and sending the
> drm wedged uevent. If no method is set, default unbind/re-bind method
> will be set
> 
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device.c       | 30 +++++++++++++++++++++-------
>  drivers/gpu/drm/xe/xe_device.h       |  1 +
>  drivers/gpu/drm/xe/xe_device_types.h |  2 ++
>  3 files changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 660b0c5126dc..3fd604ebdc6e 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1120,16 +1120,28 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>  	xe_pm_runtime_put(xe);
>  }
>  
> +/**
> + * xe_device_set_wedged_method - Set wedged recovery method
> + * @xe: xe device instance

Missing @method

> + *
> + * Set wedged recovery method to be sent using drm wedged uevent.
> + */
> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method)
> +{
> +	xe->wedged.method = method;
> +}
> +
>  /**
>   * xe_device_declare_wedged - Declare device wedged
>   * @xe: xe device instance
>   *
> - * This is a final state that can only be cleared with a module
> - * re-probe (unbind + bind).
> - * In this state every IOCTL will be blocked so the GT cannot be used.
> + * This is a final state that can only be cleared with the method specified
> + * in the drm wedged uevent. The method needs to be set using xe_device_set_wedged_method
> + * before declaring the device as wedged or the default method of reprobe (unbind/re-bind)
> + * will be sent. In this state every IOCTL will be blocked so the GT cannot be used.

The file convention seems like 80 characters for kernel doc, so let's
stick to it.

>   * In general it will be called upon any critical error such as gt reset
> - * failure or guc loading failure. Userspace will be notified of this state
> - * through device wedged uevent.
> + * failure or guc loading failure or firmware failure.
> + * Userspace will be notified of this state through device wedged uevent.
>   * If xe.wedged module parameter is set to 2, this function will be called
>   * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
>   * snapshot capture. In this mode, GT reset won't be attempted so the state of
> @@ -1152,6 +1164,11 @@ void xe_device_declare_wedged(struct xe_device *xe)
>  		return;
>  	}
>  
> +	/* If no wedge recovery method is set, use default */
> +	if (!xe->wedged.method)
> +		xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND
> +					    | DRM_WEDGE_RECOVERY_BUS_RESET);

Although there are no strict rules about this, we usually don't begin a
new line with a symbol.

> +
>  	if (!atomic_xchg(&xe->wedged.flag, 1)) {
>  		xe->needs_flr_on_fini = true;
>  		drm_err(&xe->drm,
> @@ -1161,8 +1178,7 @@ void xe_device_declare_wedged(struct xe_device *xe)
>  			dev_name(xe->drm.dev));
>  
>  		/* Notify userspace of wedged device */
> -		drm_dev_wedged_event(&xe->drm,
> -				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
> +		drm_dev_wedged_event(&xe->drm, xe->wedged.method);

I was a bit late to realize it when I originally added this. The event
call should be after xe_gt_declare_wedged() to comply with wedging rules.
We notify userspace *after* we're done with driver cleanup.

Raag

>  	}
>  
>  	for_each_gt(gt, xe, id)
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index 0bc3bc8e6803..06350740aac5 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -191,6 +191,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
>  }
>  
>  void xe_device_declare_wedged(struct xe_device *xe);
> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method);
>  
>  struct xe_file *xe_file_get(struct xe_file *xef);
>  void xe_file_put(struct xe_file *xef);
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index b93c04466637..fb3617956d63 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -559,6 +559,8 @@ struct xe_device {
>  		atomic_t flag;
>  		/** @wedged.mode: Mode controlled by kernel parameter and debugfs */
>  		int mode;
> +		/** @wedged.method: Recovery method to be sent in the drm device wedged uevent */
> +		unsigned long method;
>  	} wedged;
>  
>  	/** @bo_device: Struct to control async free of BOs */
> -- 
> 2.47.1
>

next prev parent reply	other threads:[~2025-06-06 15:12 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-03  8:13 [PATCH 0/4] Handle Firmware reported Hardware Errors Riana Tauro
2025-06-03  8:13 ` [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent Riana Tauro
2025-06-04 10:43   ` Raag Jadav
2025-06-05 11:24     ` Riana Tauro
2025-06-16 20:39       ` Rodrigo Vivi
2025-06-03  8:13 ` [PATCH 2/4] drm/xe: Add a helper function to set recovery method Riana Tauro
2025-06-06 15:12   ` Raag Jadav [this message]
2025-06-19  7:26     ` Riana Tauro
2025-06-03  8:13 ` [PATCH 3/4] drm/xe: Add support to handle hardware errors Riana Tauro
2025-06-03  8:14 ` [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
2025-06-03  9:31   ` Ghimiray, Himal Prasad
2025-06-03  9:53     ` Riana Tauro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aEMFcBSWL_jPMYKa@black.fi.intel.com \
    --to=raag.jadav@intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=aravind.iddamsetty@linux.intel.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=frank.scarbrough@intel.com \
    --cc=himal.prasad.ghimiray@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=riana.tauro@intel.com \
    --cc=rodrigo.vivi@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).