From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: John Harrison <john.c.harrison@intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
	Vinay Belgaumkar <vinay.belgaumkar@intel.com>,
	Jonathan Cavitt <jonathan.cavitt@intel.com>
Subject: Re: [PATCH 1/2] drm/xe/guc_pc: Do not stop probe or resume if GuC PC fails
Date: Fri, 28 Feb 2025 15:32:54 -0500	[thread overview]
Message-ID: <Z8Iddm4HB1O1yY0Q@intel.com> (raw)
In-Reply-To: <f9ffb2c6-00a3-43f7-b7c3-68f8de064254@intel.com>

On Fri, Feb 28, 2025 at 12:13:24PM -0800, John Harrison wrote:
> On 2/28/2025 11:45, Rodrigo Vivi wrote:
> > On Fri, Feb 28, 2025 at 11:22:02AM -0800, John Harrison wrote:
> > > On 2/14/2025 09:25, Rodrigo Vivi wrote:
> > > > In a rare situation where a thermal limit is hit during resume, GuC
> > > > can be slow and run into delays like this:
> > > > 
> > > > xe 0000:00:02.0: [drm] GT1: excessive init time: 667ms! \
> > > >      		 [status = 0x8002F034, timeouts = 0]
> > > > xe 0000:00:02.0: [drm] GT1: excessive init time: \
> > > >      		 [freq = 100MHz (req = 800MHz), before = 100MHz, \
> > > >      		 perf_limit_reasons = 0x1C001000]
> > > > xe 0000:00:02.0: [drm] *ERROR* GT1: GuC PC Start failed
> > > > ------------[ cut here ]------------
> > > > xe 0000:00:02.0: [drm] GT1: Failed to start GuC PC: -EIO
> > > > 
> > > > If this happens, it can block the GPU from being used entirely.
> > > > However, the GPU can still be used, although the GT frequencies might
> > > > be messed up.
> > > > 
> > > > Let's report the error, but not block the flow. Instead of just
> > > > giving up and moving on, let's re-attempt the wait with a second,
> > > > much longer (one second) timeout.
> > > > 
> > > > v2: Keep the precision comment (Jonathan)
> > > >       Use a define for the regular SLPC reset timeout.
> > > > v3: Improve messages (Vinay)
> > > >       Only skip initialization if the second full-second wait failed.
> > > > 
> > > > Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
> > > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v2
> > > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > > ---
> > > >    drivers/gpu/drm/xe/xe_guc_pc.c | 46 ++++++++++++++++++++++++----------
> > > >    1 file changed, 33 insertions(+), 13 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c b/drivers/gpu/drm/xe/xe_guc_pc.c
> > > > index 02409eedb914..74cc13012532 100644
> > > > --- a/drivers/gpu/drm/xe/xe_guc_pc.c
> > > > +++ b/drivers/gpu/drm/xe/xe_guc_pc.c
> > > > @@ -20,6 +20,7 @@
> > > >    #include "xe_gt.h"
> > > >    #include "xe_gt_idle.h"
> > > >    #include "xe_gt_printk.h"
> > > > +#include "xe_gt_throttle.h"
> > > >    #include "xe_gt_types.h"
> > > >    #include "xe_guc.h"
> > > >    #include "xe_guc_ct.h"
> > > > @@ -50,6 +51,8 @@
> > > >    #define LNL_MERT_FREQ_CAP	800
> > > >    #define BMG_MERT_FREQ_CAP	2133
> > > > +#define SLPC_RESET_TIMEOUT_MS 5 /* roughly 5ms, but no need for precision */
> > > > +
> > > >    /**
> > > >     * DOC: GuC Power Conservation (PC)
> > > >     *
> > > > @@ -114,9 +117,10 @@ static struct iosys_map *pc_to_maps(struct xe_guc_pc *pc)
> > > >    	 FIELD_PREP(HOST2GUC_PC_SLPC_REQUEST_MSG_1_EVENT_ARGC, count))
> > > >    static int wait_for_pc_state(struct xe_guc_pc *pc,
> > > > -			     enum slpc_global_state state)
> > > > +			     enum slpc_global_state state,
> > > > +			     int timeout_ms)
> > > >    {
> > > > -	int timeout_us = 5000; /* rought 5ms, but no need for precision */
> > > > +	int timeout_us = 1000 * timeout_ms;
> > > >    	int slept, wait = 10;
> > > >    	xe_device_assert_mem_access(pc_to_xe(pc));
> > > > @@ -165,7 +169,8 @@ static int pc_action_query_task_state(struct xe_guc_pc *pc)
> > > >    	};
> > > >    	int ret;
> > > > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
> > > > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> > > > +			      SLPC_RESET_TIMEOUT_MS))
> > > >    		return -EAGAIN;
> > > >    	/* Blocking here to ensure the results are ready before reading them */
> > > > @@ -188,7 +193,8 @@ static int pc_action_set_param(struct xe_guc_pc *pc, u8 id, u32 value)
> > > >    	};
> > > >    	int ret;
> > > > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
> > > > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> > > > +			      SLPC_RESET_TIMEOUT_MS))
> > > >    		return -EAGAIN;
> > > >    	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
> > > > @@ -209,7 +215,8 @@ static int pc_action_unset_param(struct xe_guc_pc *pc, u8 id)
> > > >    	struct xe_guc_ct *ct = &pc_to_guc(pc)->ct;
> > > >    	int ret;
> > > > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
> > > > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> > > > +			      SLPC_RESET_TIMEOUT_MS))
> > > >    		return -EAGAIN;
> > > >    	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
> > > > @@ -443,6 +450,15 @@ u32 xe_guc_pc_get_act_freq(struct xe_guc_pc *pc)
> > > >    	return freq;
> > > >    }
> > > > +static u32 get_cur_freq(struct xe_gt *gt)
> > > > +{
> > > > +	u32 freq;
> > > > +
> > > > +	freq = xe_mmio_read32(&gt->mmio, RPNSWREQ);
> > > > +	freq = REG_FIELD_GET(REQ_RATIO_MASK, freq);
> > > > +	return decode_freq(freq);
> > > > +}
> > > > +
> > > >    /**
> > > >     * xe_guc_pc_get_cur_freq - Get Current requested frequency
> > > >     * @pc: The GuC PC
> > > > @@ -466,10 +482,7 @@ int xe_guc_pc_get_cur_freq(struct xe_guc_pc *pc, u32 *freq)
> > > >    		return -ETIMEDOUT;
> > > >    	}
> > > > -	*freq = xe_mmio_read32(&gt->mmio, RPNSWREQ);
> > > > -
> > > > -	*freq = REG_FIELD_GET(REQ_RATIO_MASK, *freq);
> > > > -	*freq = decode_freq(*freq);
> > > > +	*freq = get_cur_freq(gt);
> > > >    	xe_force_wake_put(gt_to_fw(gt), fw_ref);
> > > >    	return 0;
> > > > @@ -1033,10 +1046,17 @@ int xe_guc_pc_start(struct xe_guc_pc *pc)
> > > >    	if (ret)
> > > >    		goto out;
> > > > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING)) {
> > > > -		xe_gt_err(gt, "GuC PC Start failed\n");
> > > > -		ret = -EIO;
> > > > -		goto out;
> > > > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> > > > +			      SLPC_RESET_TIMEOUT_MS)) {
> > > > +		xe_gt_warn(gt, "GuC PC excessive start time: [freq = %dMHz (req = %dMHz), perf_limit_reasons = 0x%08X]\n",
> > > > +			   xe_guc_pc_get_act_freq(pc), get_cur_freq(gt),
> > > > +			   xe_gt_throttle_get_limit_reasons(gt));
> > > > +		if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000)) {
> > > Shouldn't this be a define as well - SLPC_RESET_EXTENDED_TIMEOUT_MS or
> > > something?
> > good idea! will do.
> > 
> > > More importantly, is 1ms enough of an extra wait?
> > The new timeout argument is in ms, so it is 1 second.
> Doh! Yes, I saw that but then completely spaced it out again!
> 
> > 
> > > If the GT freq is 100MHz
> > > instead of 2GHz or some such, then the expected max of 5ms could now be
> > > more like 100ms, if not even longer (the slowdown does not seem linear).
> > > As an example, the GuC load itself should be <10ms, but with clamped
> > > frequencies we generally see over 500ms, sometimes over 1s.
> > hmm... over 1s is possible? So, perhaps 1250ms to be on the safe side?
> > Other suggestions?
> I think a second should be good, but I don't know what is involved in the
> SLPC start-up. The long delay loading the GuC is due to doing decryption,
> which is a hugely CPU-intensive task, and the GuC is not a huge CPU! If SLPC
> is more about waiting for hardware to respond then maybe the slowdown won't
> be as severe? Plus the GuC load is inherently slower in the first place -
> our original timeout was 200ms with expected values in the 5-15ms range. If
> SLPC is starting from a 5ms timeout then presumably the expected time is
> actually more like 1ms or less?

Yeap, I put in a big wait somewhat randomly because I wasn't sure what was
actually needed.
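
For v4 I will also add the define you suggested. Roughly this (untested
sketch; the name comes from your suggestion and the 1000ms value is the
same guess as the literal in v3):

#define SLPC_RESET_TIMEOUT_MS 5 /* roughly 5ms, but no need for precision */
#define SLPC_RESET_EXTENDED_TIMEOUT_MS 1000 /* ~1s, a guess for extreme throttling */

	/* second, much longer attempt after warning about the slow start */
	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
			      SLPC_RESET_EXTENDED_TIMEOUT_MS)) {
		xe_gt_err(gt, "GuC PC Start failed: Dynamic GT frequency control and GT sleep states are now disabled.\n");
		/* Although GuC PC failed, do not block the usage of GPU */
		ret = 0;
		goto out;
	}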

> 
> You could try running with the frequency manually set to 300MHz and see how
> long it takes. I think that is the lowest we can explicitly request from the
> KMD?

Great idea! It can change a lot by platform and SKU, but we could at least
get a rough idea instead of a blind big guess.
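
For the measurement itself, I can instrument the wait. A rough debug
sketch (not meant for merging; gt and pc are the locals already in scope
in xe_guc_pc_start()):

	ktime_t start = ktime_get();
	int ret;

	ret = wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000);
	/* how long did SLPC take to reach RUNNING with the clamped freq? */
	xe_gt_info(gt, "SLPC reached RUNNING in %lldms (ret = %d)\n",
		   ktime_ms_delta(ktime_get(), start), ret);

with the min/max GT frequencies clamped to 300MHz via sysfs before the
suspend/resume cycle, as you suggested.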

> 
> > 
> > > > +			xe_gt_err(gt, "GuC PC Start failed: Dynamic GT frequency control and GT sleep states are now disabled.\n");
> > > > +			/* Although GuC PC failed, do not block the usage of GPU */
> > > > +			ret = 0;
> > > I thought the new policy was that any subsystem failure should now be
> > > considered fatal and abort the driver load? I recall a PXP start failure
> > > was recently upgraded to being fatal even though PXP is almost never used
> > > by any actual users. SLPC seems much more vital to the system than PXP!
> > Hmm... good point! I have to go back to the drawing board then and keep
> > this logic only for the resume?!
> > 
> > If this happens during probe, yeap, let's block, because the subsystem
> > is buggy. But the case I'm hunting here is a resume from S2idle that
> > entirely hangs the platform when this happens under thermal constraints.
> Hmm. What platform is the problem showing up on? There are a couple of other
> bug reports about systems coming up in an odd state after suspend - e.g. GuC
> image not loading due to memory corruption. I wonder if it is not actually a
> thermal problem but just something confused due to uninitialised state
> somewhere? Plus, how can you be in thermal meltdown on a resume? If the
> power was lost then the device should be cold!

Indeed. It was an LNL case on a very specific kernel version. The issue is
not reproducible anymore. But with that bug I realized we were actually
hanging the platform entirely on resume, and that is not a good approach,
even though the original issue was not ours.

> 
> > 
> > Thoughts? I'm open to suggestions here.
> My main thought is that if the frequency is clamped (by the hardware itself)
> at the absolute minimum then the system is not going to be very usable
> anyway. So is continuing to run by using huge timeouts actually beneficial?
> But I'm not sure what else we can do at this point? Maybe try an FLR? But
> yeah, it is probably good to try harder to keep going on a resume than on
> first driver load.

Well, with a resume in progress, an FLR could be a bad hammer. But it is
worth considering indeed. I will run some more experiments and see what our
options are. But the hang, as it currently stands, is the worst scenario.
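
For the probe vs resume split, the direction I have in mind is roughly
the following (sketch only; the 'strict' flag and how to plumb it from
the probe and resume call sites are hypothetical at this point):

	/*
	 * Sketch: decide how fatal a GuC PC start failure is, depending
	 * on whether we are probing or resuming.
	 */
	static int pc_handle_start_failure(struct xe_gt *gt, bool strict)
	{
		if (strict) {
			/* probe: a buggy subsystem should fail the driver load */
			xe_gt_err(gt, "GuC PC Start failed\n");
			return -EIO;
		}

		/* resume: report it, but do not block the usage of the GPU */
		xe_gt_err(gt, "GuC PC Start failed: Dynamic GT frequency control and GT sleep states are now disabled.\n");
		return 0;
	}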

Thanks a lot again,
Rodrigo.

> 
> John.
> 
> > 
> > Thanks a lot for raising these so far,
> > Rodrigo.
> > 
> > > John.
> > > 
> > > > +			goto out;
> > > > +		}
> > > >    	}
> > > >    	ret = pc_init_freqs(pc);
> 

