All of lore.kernel.org
 help / color / mirror / Atom feed
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Lukasz Laguna <lukasz.laguna@intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <michal.wajdeczko@intel.com>,
	<matthew.brost@intel.com>
Subject: Re: [PATCH v8 2/4] drm/xe: Don't update wedged mode in case of an error
Date: Tue, 5 Aug 2025 18:15:24 -0400	[thread overview]
Message-ID: <aJKCfFvucbl-7lKN@intel.com> (raw)
In-Reply-To: <20250731123756.29259-3-lukasz.laguna@intel.com>

On Thu, Jul 31, 2025 at 02:37:54PM +0200, Lukasz Laguna wrote:
> Change driver's internal wedged.mode state only in case of a success and
> update GuC's reset policy only when it's necessary.
> 
> Fixes: 6b8ef44cc0a9 ("drm/xe: Introduce the wedged_mode debugfs")
> Signed-off-by: Lukasz Laguna <lukasz.laguna@intel.com>
> ---
> v7: Don't introduce XE_WEDGED_MODE_MISCONFIGURED enum field (Michal)
>     Add needs_policy_update helper (Michal)
>     Rename wedged_mode_set_reset_policy to set_reset_policy (Lukasz)
> ---
>  drivers/gpu/drm/xe/xe_debugfs.c      | 72 ++++++++++++++++++++++------
>  drivers/gpu/drm/xe/xe_device_types.h |  2 +
>  drivers/gpu/drm/xe/xe_guc_ads.c      | 12 ++---
>  drivers/gpu/drm/xe/xe_guc_ads.h      |  4 +-
>  4 files changed, 68 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
> index 16b2e306559a..8aff93401eb9 100644
> --- a/drivers/gpu/drm/xe/xe_debugfs.c
> +++ b/drivers/gpu/drm/xe/xe_debugfs.c
> @@ -251,14 +251,64 @@ static ssize_t wedged_mode_show(struct file *f, char __user *ubuf,
>  	return simple_read_from_buffer(ubuf, size, pos, buf, len);
>  }
>  
> +static int __set_reset_policy(struct xe_gt *gt, enum xe_wedged_mode mode)
> +{
> +	int ret;
> +
> +	ret = xe_guc_ads_scheduler_policy_toggle_reset(&gt->uc.guc.ads,
> +						       !(mode == XE_WEDGED_MODE_UPON_ANY_HANG));

mode != XE_WEDGED_MODE_UPON_ANY_HANG please

> +	if (ret)
> +		xe_gt_err(gt, "Failed to update GuC ADS scheduler policy (%pe)\n", ERR_PTR(ret));
> +
> +	return ret;
> +}
> +
> +static int set_reset_policy(struct xe_device *xe, enum xe_wedged_mode mode)
> +{
> +	struct xe_gt *gt;
> +	int ret;
> +	u8 id;
> +
> +	xe_pm_runtime_get(xe);
> +	for_each_gt(gt, xe, id) {
> +		ret = __set_reset_policy(gt, mode);
> +		if (ret) {
> +			if (id > 0) {
> +				xe->wedged.inconsistent_reset = true;
> +				drm_err(&xe->drm, "Inconsistent reset policy state between GTs\n");
> +			}
> +
> +			xe_pm_runtime_put(xe);
> +			return ret;

Why to return on the first GT? perhaps we should continue and just leave one
behind?
perhaps we should have a 3 times retry attempt logic before giving up?
perhaps both combined?

> +		}
> +	}
> +	xe_pm_runtime_put(xe);
> +
> +	xe->wedged.inconsistent_reset = false;

then move this to the beginning of the function
also ret = 0; up there

> +
> +	return 0;

and simply return ret; here

> +}
> +
> +static bool needs_policy_update(struct xe_device *xe, enum xe_wedged_mode mode)
> +{
> +	if (xe->wedged.inconsistent_reset)
> +		return true;
> +
> +	if (xe->wedged.mode == mode)
> +		return false;
> +
> +	return !((xe->wedged.mode == XE_WEDGED_MODE_NEVER &&
> +		  mode == XE_WEDGED_MODE_UPON_CRITICAL_ERROR) ||
> +		 (xe->wedged.mode == XE_WEDGED_MODE_UPON_CRITICAL_ERROR &&
> +		  mode == XE_WEDGED_MODE_NEVER));

This is worst then the one above... please expand this to inside
the function instead of this overloaded return full of not-or-and logic.

Some wording about what cases are this in the commit message should be very
helpful

> +}
> +
>  static ssize_t wedged_mode_set(struct file *f, const char __user *ubuf,
>  			       size_t size, loff_t *pos)
>  {
>  	struct xe_device *xe = file_inode(f)->i_private;
> -	struct xe_gt *gt;
>  	u32 wedged_mode;
>  	ssize_t ret;
> -	u8 id;
>  
>  	ret = kstrtouint_from_user(ubuf, size, 0, &wedged_mode);
>  	if (ret)
> @@ -268,22 +318,14 @@ static ssize_t wedged_mode_set(struct file *f, const char __user *ubuf,
>  	if (ret)
>  		return ret;
>  
> -	if (xe->wedged.mode == wedged_mode)
> -		return size;
> +	if (needs_policy_update(xe, wedged_mode)) {
> +		ret = set_reset_policy(xe, wedged_mode);
> +		if (ret)
> +			return ret;
> +	}
>  
>  	xe->wedged.mode = wedged_mode;
>  
> -	xe_pm_runtime_get(xe);
> -	for_each_gt(gt, xe, id) {
> -		ret = xe_guc_ads_scheduler_policy_toggle_reset(&gt->uc.guc.ads);
> -		if (ret) {
> -			xe_gt_err(gt, "Failed to update GuC ADS scheduler policy. GuC may still cause engine reset even with wedged_mode=2\n");
> -			xe_pm_runtime_put(xe);
> -			return -EIO;
> -		}
> -	}
> -	xe_pm_runtime_put(xe);
> -
>  	return size;
>  }
>  
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 1e496845c91f..829252db3a47 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -546,6 +546,8 @@ struct xe_device {
>  			XE_WEDGED_MODE_UPON_ANY_HANG = 2,
>  			XE_WEDGED_MODE_DEFAULT = XE_WEDGED_MODE_UPON_CRITICAL_ERROR,
>  		} mode;
> +		/** @wedged.inconsistent_reset: Inconsistent reset policy state between GTs */
> +		bool inconsistent_reset;
>  	} wedged;
>  
>  	/** @bo_device: Struct to control async free of BOs */
> diff --git a/drivers/gpu/drm/xe/xe_guc_ads.c b/drivers/gpu/drm/xe/xe_guc_ads.c
> index c4ea4e6d82ce..7f58d77e0ab9 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ads.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ads.c
> @@ -1033,16 +1033,16 @@ static int guc_ads_action_update_policies(struct xe_guc_ads *ads, u32 policy_off
>  /**
>   * xe_guc_ads_scheduler_policy_toggle_reset - Toggle reset policy
>   * @ads: Additional data structures object
> + * @enable: true to enable engine resets, false otherwise
>   *
> - * This function update the GuC's engine reset policy based on wedged.mode.
> + * This function update the GuC's engine reset policy.
>   *
>   * Return: 0 on success, and negative error code otherwise.
>   */
> -int xe_guc_ads_scheduler_policy_toggle_reset(struct xe_guc_ads *ads)
> +int xe_guc_ads_scheduler_policy_toggle_reset(struct xe_guc_ads *ads, bool enable)
>  {
>  	struct guc_policies *policies;
>  	struct xe_guc *guc = ads_to_guc(ads);
> -	struct xe_device *xe = ads_to_xe(ads);
>  	CLASS(xe_guc_buf, buf)(&guc->buf, sizeof(*policies));
>  
>  	if (!xe_guc_buf_is_valid(buf))
> @@ -1054,10 +1054,10 @@ int xe_guc_ads_scheduler_policy_toggle_reset(struct xe_guc_ads *ads)
>  	policies->dpc_promote_time = ads_blob_read(ads, policies.dpc_promote_time);
>  	policies->max_num_work_items = ads_blob_read(ads, policies.max_num_work_items);
>  	policies->is_valid = 1;
> -	if (xe->wedged.mode == XE_WEDGED_MODE_UPON_ANY_HANG)
> -		policies->global_flags |= GLOBAL_POLICY_DISABLE_ENGINE_RESET;
> -	else
> +	if (enable)
>  		policies->global_flags &= ~GLOBAL_POLICY_DISABLE_ENGINE_RESET;
> +	else
> +		policies->global_flags |= GLOBAL_POLICY_DISABLE_ENGINE_RESET;
>  
>  	return guc_ads_action_update_policies(ads, xe_guc_buf_flush(buf));
>  }
> diff --git a/drivers/gpu/drm/xe/xe_guc_ads.h b/drivers/gpu/drm/xe/xe_guc_ads.h
> index 2e6674c760ff..9879aadd22d6 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ads.h
> +++ b/drivers/gpu/drm/xe/xe_guc_ads.h
> @@ -6,6 +6,8 @@
>  #ifndef _XE_GUC_ADS_H_
>  #define _XE_GUC_ADS_H_
>  
> +#include <linux/types.h>
> +
>  struct xe_guc_ads;
>  
>  int xe_guc_ads_init(struct xe_guc_ads *ads);
> @@ -13,6 +15,6 @@ int xe_guc_ads_init_post_hwconfig(struct xe_guc_ads *ads);
>  void xe_guc_ads_populate(struct xe_guc_ads *ads);
>  void xe_guc_ads_populate_minimal(struct xe_guc_ads *ads);
>  void xe_guc_ads_populate_post_load(struct xe_guc_ads *ads);
> -int xe_guc_ads_scheduler_policy_toggle_reset(struct xe_guc_ads *ads);
> +int xe_guc_ads_scheduler_policy_toggle_reset(struct xe_guc_ads *ads, bool enable);
>  
>  #endif
> -- 
> 2.40.0
> 

  reply	other threads:[~2025-08-05 22:16 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-31 12:37 [PATCH v8 0/4] drm/xe: Improve wedged mode handling Lukasz Laguna
2025-07-31 12:37 ` [PATCH v8 1/4] drm/xe: Validate wedged_mode parameter and define enum for modes Lukasz Laguna
2025-07-31 12:37 ` [PATCH v8 2/4] drm/xe: Don't update wedged mode in case of an error Lukasz Laguna
2025-08-05 22:15   ` Rodrigo Vivi [this message]
2025-08-06  8:37     ` Laguna, Lukasz
2025-08-06  9:42       ` Laguna, Lukasz
2025-07-31 12:37 ` [PATCH v8 3/4] drm/xe/vf: Disallow setting wedged mode to upon-any-hang Lukasz Laguna
2025-07-31 12:37 ` [PATCH v8 4/4] drm/xe/pf: Allow upon-any-hang wedged mode only in debug config Lukasz Laguna
2025-07-31 16:55 ` ✓ CI.KUnit: success for drm/xe: Improve wedged mode handling (rev8) Patchwork
2025-07-31 18:20 ` ✓ Xe.CI.BAT: " Patchwork
2025-07-31 20:20 ` ✓ Xe.CI.Full: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aJKCfFvucbl-7lKN@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lukasz.laguna@intel.com \
    --cc=matthew.brost@intel.com \
    --cc=michal.wajdeczko@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.