Re: [PATCH 1/2] drm/xe/guc_pc: Do not stop probe or resume if GuC PC fails

From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: "Cavitt, Jonathan" <jonathan.cavitt@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	"Belgaumkar, Vinay" <vinay.belgaumkar@intel.com>
Subject: Re: [PATCH 1/2] drm/xe/guc_pc: Do not stop probe or resume if GuC PC fails
Date: Tue, 11 Feb 2025 15:00:32 -0500
Message-ID: <Z6usYPd1o2l6S8cU@intel.com>
In-Reply-To: <CH0PR11MB5444C8CA507F0ED75E8BD033E5F22@CH0PR11MB5444.namprd11.prod.outlook.com>

On Mon, Feb 10, 2025 at 05:04:17PM -0500, Cavitt, Jonathan wrote:
> -----Original Message-----
> From: Intel-xe <intel-xe-bounces@lists.freedesktop.org> On Behalf Of Rodrigo Vivi
> Sent: Monday, February 10, 2025 1:07 PM
> To: intel-xe@lists.freedesktop.org
> Cc: Vivi, Rodrigo <rodrigo.vivi@intel.com>; Belgaumkar, Vinay <vinay.belgaumkar@intel.com>
> Subject: [PATCH 1/2] drm/xe/guc_pc: Do not stop probe or resume if GuC PC fails
> > 
> > In the rare case of hitting a thermal limit during resume, GuC can
> > be slow and run into delays like this:
> > 
> > xe 0000:00:02.0: [drm] GT1: excessive init time: 667ms! \
> >    		 [status = 0x8002F034, timeouts = 0]
> > xe 0000:00:02.0: [drm] GT1: excessive init time: \
> >    		 [freq = 100MHz (req = 800MHz), before = 100MHz, \
> >    		 perf_limit_reasons = 0x1C001000]
> > xe 0000:00:02.0: [drm] *ERROR* GT1: GuC PC Start failed
> > ------------[ cut here ]------------
> > xe 0000:00:02.0: [drm] GT1: Failed to start GuC PC: -EIO
> > 
> > If this happens, it can entirely block the GPU from being used.
> > However, the GPU can still be used, although the GT frequencies might be
> > messed up.
> > 
> > Let's report the error, but not block the flow.
> > But, instead of just giving up and moving on, let's re-attempt the wait
> > with a much longer second timeout.
> > 
> > Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> 
> Minor nit below, but you can safely ignore it:
> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
> 
> > ---
> >  drivers/gpu/drm/xe/xe_guc_pc.c | 20 ++++++++++++--------
> >  1 file changed, 12 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c b/drivers/gpu/drm/xe/xe_guc_pc.c
> > index 02409eedb914..aa58f9ddbf84 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_pc.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_pc.c
> > @@ -114,9 +114,10 @@ static struct iosys_map *pc_to_maps(struct xe_guc_pc *pc)
> >  	 FIELD_PREP(HOST2GUC_PC_SLPC_REQUEST_MSG_1_EVENT_ARGC, count))
> >  
> >  static int wait_for_pc_state(struct xe_guc_pc *pc,
> > -			     enum slpc_global_state state)
> > +			     enum slpc_global_state state,
> > +			     int timeout_ms)
> >  {
> > -	int timeout_us = 5000; /* rought 5ms, but no need for precision */
> > +	int timeout_us = 1000 * timeout_ms;
> 
> NIT:
> AFAICT from the comment above, this wait lacks some form of precision.  It might be worthwhile
> to keep the comment here remarking on the lack of precision (or how the precision is unimportant
> in this instance), but I won't block on it because it's not important.

Yeah, I honestly don't know where that comment originally came from...

On i915 the same value is in a define:
#define SLPC_RESET_TIMEOUT_MS 5

which seems like a cleaner approach than all these hardcoded '5's
spread around.

sending a v2...
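
To illustrate the named-constant idea, here is a minimal sketch in plain C. It is not the actual v2. SLPC_RESET_TIMEOUT_MS is the i915 name quoted above; SLPC_RESET_EXTENDED_TIMEOUT_MS and timeout_ms_to_us() are hypothetical names introduced only for this example.

```c
/* Name and value as quoted from i915 above. */
#define SLPC_RESET_TIMEOUT_MS 5

/* Hypothetical name for the long second wait (1000 ms in this patch). */
#define SLPC_RESET_EXTENDED_TIMEOUT_MS 1000

/*
 * Hypothetical helper mirroring the ms -> us conversion that the
 * patched wait_for_pc_state() performs on its new timeout_ms argument.
 */
static int timeout_ms_to_us(int timeout_ms)
{
	return 1000 * timeout_ms;
}
```

With something like this, the call sites would pass SLPC_RESET_TIMEOUT_MS instead of a bare 5, and the retry in xe_guc_pc_start() would pass the extended constant instead of a bare 1000.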

> -Jonathan Cavitt
> 
> >  	int slept, wait = 10;
> >  
> >  	xe_device_assert_mem_access(pc_to_xe(pc));
> > @@ -165,7 +166,7 @@ static int pc_action_query_task_state(struct xe_guc_pc *pc)
> >  	};
> >  	int ret;
> >  
> > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
> > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 5))
> >  		return -EAGAIN;
> >  
> >  	/* Blocking here to ensure the results are ready before reading them */
> > @@ -188,7 +189,7 @@ static int pc_action_set_param(struct xe_guc_pc *pc, u8 id, u32 value)
> >  	};
> >  	int ret;
> >  
> > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
> > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 5))
> >  		return -EAGAIN;
> >  
> >  	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
> > @@ -209,7 +210,7 @@ static int pc_action_unset_param(struct xe_guc_pc *pc, u8 id)
> >  	struct xe_guc_ct *ct = &pc_to_guc(pc)->ct;
> >  	int ret;
> >  
> > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
> > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 5))
> >  		return -EAGAIN;
> >  
> >  	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
> > @@ -1033,9 +1034,12 @@ int xe_guc_pc_start(struct xe_guc_pc *pc)
> >  	if (ret)
> >  		goto out;
> >  
> > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING)) {
> > -		xe_gt_err(gt, "GuC PC Start failed\n");
> > -		ret = -EIO;
> > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 5)) {
> > +		xe_gt_warn(gt, "GuC PC Start taking longer than expected\n");
> > +		if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000))
> > +			xe_gt_err(gt, "GuC PC Start failed\n");
> > +		/* Although GuC PC failed, do not block the usage of GPU */
> > +		ret = 0;
> >  		goto out;
> >  	}
> >  
> > -- 
> > 2.48.1
> > 
> >
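
The two-stage wait in the xe_guc_pc_start() hunk above can be modelled in userspace as follows. This is only a sketch under stated assumptions: struct fake_pc, enum start_outcome and every function name are hypothetical, the "firmware" is a simulated clock rather than real hardware, and the doubling poll interval is an assumption (the diff only shows an initial `wait = 10`). The 5 ms and 1000 ms values are the ones used in the patch.

```c
#include <stdbool.h>
#include <stdio.h>

#define FIRST_WAIT_MS  5	/* short first attempt, as in the patch */
#define SECOND_WAIT_MS 1000	/* long second attempt, as in the patch */

/* Hypothetical stand-in for GuC PC: "running" once enough time has passed. */
struct fake_pc {
	long elapsed_us;	/* simulated time spent waiting so far */
	long ready_after_us;	/* simulated time at which it becomes ready */
};

enum start_outcome { STARTED_FAST, STARTED_SLOW, START_FAILED };

static bool fake_pc_is_running(const struct fake_pc *pc)
{
	return pc->elapsed_us >= pc->ready_after_us;
}

/* Poll with a doubling interval; returns 0 on success, -1 on timeout. */
static int wait_for_running(struct fake_pc *pc, int timeout_ms)
{
	long timeout_us = 1000L * timeout_ms;
	long slept, wait = 10;

	for (slept = 0; slept < timeout_us;) {
		if (fake_pc_is_running(pc))
			return 0;
		pc->elapsed_us += wait;	/* simulate sleeping for 'wait' us */
		slept += wait;
		wait <<= 1;
		if (slept + wait > timeout_us)
			wait = timeout_us - slept;	/* clamp the last sleep */
	}
	/* One last check after the final simulated sleep. */
	return fake_pc_is_running(pc) ? 0 : -1;
}

/* Always returns 0: a slow or failed start is reported, not propagated. */
static int start_pc(struct fake_pc *pc, enum start_outcome *outcome)
{
	*outcome = STARTED_FAST;
	if (wait_for_running(pc, FIRST_WAIT_MS)) {
		fprintf(stderr, "warn: start taking longer than expected\n");
		*outcome = STARTED_SLOW;
		if (wait_for_running(pc, SECOND_WAIT_MS)) {
			fprintf(stderr, "error: start failed\n");
			*outcome = START_FAILED;
		}
	}
	return 0;
}
```

In this model, a device that becomes ready after 667 ms (the init time in the log above) misses the 5 ms window but is caught by the second wait, while one that never becomes ready still lets the caller continue with a return value of 0.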