Re: [PATCH 1/2] drm/xe/guc_pc: Do not stop probe or resume if GuC PC fails

From: "Belgaumkar, Vinay" <vinay.belgaumkar@intel.com>
To: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
	Jonathan Cavitt <jonathan.cavitt@intel.com>
Subject: Re: [PATCH 1/2] drm/xe/guc_pc: Do not stop probe or resume if GuC PC fails
Date: Thu, 13 Feb 2025 17:37:34 -0800
Message-ID: <0c223a7e-7078-4905-abde-1e2924352937@intel.com>
In-Reply-To: <Z6zlUOQkRtA5RI7X@intel.com>


On 2/12/2025 10:15 AM, Rodrigo Vivi wrote:
> On Tue, Feb 11, 2025 at 05:19:14PM -0800, Belgaumkar, Vinay wrote:
>> On 2/11/2025 12:09 PM, Rodrigo Vivi wrote:
>>> In a rare situation of thermal limit during resume, GuC can
>>> be slow and run into delays like this:
>>>
>>> xe 0000:00:02.0: [drm] GT1: excessive init time: 667ms! \
>>>      		 [status = 0x8002F034, timeouts = 0]
>>> xe 0000:00:02.0: [drm] GT1: excessive init time: \
>>>      		 [freq = 100MHz (req = 800MHz), before = 100MHz, \
>>>      		 perf_limit_reasons = 0x1C001000]
>>> xe 0000:00:02.0: [drm] *ERROR* GT1: GuC PC Start failed
>>> ------------[ cut here ]------------
>>> xe 0000:00:02.0: [drm] GT1: Failed to start GuC PC: -EIO
>>>
>>> If this happens, it can entirely block the GPU from being used.
>>> However, the GPU can still be used, although the GT frequencies might be
>>> messed up.
>>>
>>> Let's report the error, but not block the flow.
>> Can we expect other random CI failures due to this? If GT is not getting
>> expected frequencies, certain tests which rely on this will likely fail,
>> causing a bunch of noise. Is that worse than driver load failing in this
>> case?
> This issue, whose log I pasted above, is blocking the resume of an
> LNL laptop. Everything goes blank, forcing the user to reboot the
> laptop.
>
> I prefer to have to deal with CI noise with bugs that we can work on
> than blocking users resume.
>
> But well, we are still waiting one entire extra second there.
> That should be more than enough even with the thermal limited
> condition there. So, I'm not expecting more bugs than we already
> have.
>
> Also, our IGT test cases are prepared to deal with some EAGAIN
> returns right? The probe and resume functions are not....
>
> But well, any suggestion here on a more robust approach?
> Or can we go with this one?

True, this will unblock resume. However, if this is a pcode bug, we will
allow boot to proceed despite a persistent failure to get anything above
Pmin. Maybe we can print the frequencies again here and explicitly warn
that dynamic frequencies and GuCRC (and all freq/C6-related interfaces)
are lost from here on?

>
> Thanks,
> Rodrigo.
>
>> Thanks,
>>
>> Vinay.
>>
>>> But, instead of just giving up and moving on, let's re-attempt a wait
>>> with a very long second timeout.
>>>
>>> v2: Keep the precision comment (Jonathan)
>>>       Use a define for the regular SLPC reset timeout.
>>>
>>> Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
>>> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
>>> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
>>> ---
>>>    drivers/gpu/drm/xe/xe_guc_pc.c | 26 ++++++++++++++++++--------
>>>    1 file changed, 18 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c b/drivers/gpu/drm/xe/xe_guc_pc.c
>>> index 02409eedb914..3b04b62937eb 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc_pc.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc_pc.c
>>> @@ -50,6 +50,8 @@
>>>    #define LNL_MERT_FREQ_CAP	800
>>>    #define BMG_MERT_FREQ_CAP	2133
>>> +#define SLPC_RESET_TIMEOUT_MS 5 /* rought 5ms, but no need for precision */
>>> +
>>>    /**
>>>     * DOC: GuC Power Conservation (PC)
>>>     *
>>> @@ -114,9 +116,10 @@ static struct iosys_map *pc_to_maps(struct xe_guc_pc *pc)
>>>    	 FIELD_PREP(HOST2GUC_PC_SLPC_REQUEST_MSG_1_EVENT_ARGC, count))
>>>    static int wait_for_pc_state(struct xe_guc_pc *pc,
>>> -			     enum slpc_global_state state)
>>> +			     enum slpc_global_state state,
>>> +			     int timeout_ms)
>>>    {
>>> -	int timeout_us = 5000; /* rought 5ms, but no need for precision */
>>> +	int timeout_us = 1000 * timeout_ms;
>>>    	int slept, wait = 10;
>>>    	xe_device_assert_mem_access(pc_to_xe(pc));
>>> @@ -165,7 +168,8 @@ static int pc_action_query_task_state(struct xe_guc_pc *pc)
>>>    	};
>>>    	int ret;
>>> -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
>>> +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
>>> +			      SLPC_RESET_TIMEOUT_MS))
>>>    		return -EAGAIN;
>>>    	/* Blocking here to ensure the results are ready before reading them */
>>> @@ -188,7 +192,8 @@ static int pc_action_set_param(struct xe_guc_pc *pc, u8 id, u32 value)
>>>    	};
>>>    	int ret;
>>> -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
>>> +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
>>> +			      SLPC_RESET_TIMEOUT_MS))
>>>    		return -EAGAIN;
>>>    	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
>>> @@ -209,7 +214,8 @@ static int pc_action_unset_param(struct xe_guc_pc *pc, u8 id)
>>>    	struct xe_guc_ct *ct = &pc_to_guc(pc)->ct;
>>>    	int ret;
>>> -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
>>> +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
>>> +			      SLPC_RESET_TIMEOUT_MS))
>>>    		return -EAGAIN;
>>>    	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
>>> @@ -1033,9 +1039,13 @@ int xe_guc_pc_start(struct xe_guc_pc *pc)
>>>    	if (ret)
>>>    		goto out;
>>> -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING)) {
>>> -		xe_gt_err(gt, "GuC PC Start failed\n");
>>> -		ret = -EIO;
>>> +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
>>> +			      SLPC_RESET_TIMEOUT_MS)) {
>>> +		xe_gt_warn(gt, "GuC PC Start taking longer than expected\n");
>>> +		if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000))
>>> +			xe_gt_err(gt, "GuC PC Start failed\n");
>>> +		/* Although GuC PC failed, do not block the usage of GPU */
>>> +		ret = 0;

Looks like we are skipping the rest of SLPC init even if the retry
succeeds in getting the right pc_state? We should continue with normal
init in that case (this needs an else).
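
To illustrate the flow I have in mind, here is a minimal standalone
sketch, not driver code. struct fake_pc, fake_wait_for_running(),
fake_pc_start() and SLPC_RETRY_TIMEOUT_MS are hypothetical names made up
for this example; the stub only simulates how long SLPC takes to reach
RUNNING. The point is that only the "both waits timed out" path bails
out early, while a successful retry falls through to the rest of init:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define SLPC_RESET_TIMEOUT_MS 5
/* Hypothetical name for the literal 1000 used on the retry in the patch. */
#define SLPC_RETRY_TIMEOUT_MS 1000

/* Hypothetical stand-in for struct xe_guc_pc, for simulation only. */
struct fake_pc {
	int ready_after_ms;   /* simulated time until RUNNING; negative = never */
	bool slpc_init_done;  /* set once the remaining start-up steps have run */
};

/* Stand-in for wait_for_pc_state(): 0 on success, -1 on timeout. */
static int fake_wait_for_running(const struct fake_pc *pc, int timeout_ms)
{
	if (pc->ready_after_ms >= 0 && pc->ready_after_ms <= timeout_ms)
		return 0;
	return -1;
}

static int fake_pc_start(struct fake_pc *pc)
{
	assert(pc);
	pc->slpc_init_done = false;

	if (fake_wait_for_running(pc, SLPC_RESET_TIMEOUT_MS)) {
		fprintf(stderr, "GuC PC Start taking longer than expected\n");
		if (fake_wait_for_running(pc, SLPC_RETRY_TIMEOUT_MS)) {
			fprintf(stderr, "GuC PC Start failed\n");
			/* Although GuC PC failed, do not block the usage of GPU */
			return 0;
		}
		/* Retry succeeded: fall through and finish init as usual. */
	}

	/* Placeholder for the remaining SLPC init steps. */
	pc->slpc_init_done = true;
	return 0;
}
```

With this shape, a simulated 667ms start-up still completes init, and
only the case where SLPC never reaches RUNNING skips it. The early
return plays the role of the else; an explicit else around the goto
would work just as well.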

Thanks,

Vinay.

>>>    		goto out;
>>>    	}
