From: "Belgaumkar, Vinay" <vinay.belgaumkar@intel.com>
To: Rodrigo Vivi <rodrigo.vivi@intel.com>, <intel-xe@lists.freedesktop.org>
Cc: Jonathan Cavitt <jonathan.cavitt@intel.com>
Subject: Re: [PATCH 1/2] drm/xe/guc_pc: Do not stop probe or resume if GuC PC fails
Date: Fri, 28 Feb 2025 08:33:42 -0800
Message-ID: <dc8a78ea-e61b-4820-ae4f-573951bcfcae@intel.com>
In-Reply-To: <20250214172503.502320-1-rodrigo.vivi@intel.com>

On 2/14/2025 9:25 AM, Rodrigo Vivi wrote:
> In a rare situation of thermal limit during resume, GuC can
> be slow and run into delays like this:
>
> xe 0000:00:02.0: [drm] GT1: excessive init time: 667ms! \
> [status = 0x8002F034, timeouts = 0]
> xe 0000:00:02.0: [drm] GT1: excessive init time: \
> [freq = 100MHz (req = 800MHz), before = 100MHz, \
> perf_limit_reasons = 0x1C001000]
> xe 0000:00:02.0: [drm] *ERROR* GT1: GuC PC Start failed
> ------------[ cut here ]------------
> xe 0000:00:02.0: [drm] GT1: Failed to start GuC PC: -EIO
>
> If this happens, it can entirely block the GPU from being used.
> However, the GPU can still be used, although the GT frequencies might be
> messed up.
>
> Let's report the error, but not block the flow.
> But, instead of just giving up and moving on, let's re-attempt a wait
> with a very long second timeout.
>
> v2: Keep the precision comment (Jonathan)
> Use a define for the regular SLPC reset timeout.
> v3: Improve messages (Vinay)
> Only skip initialization if the second full-second wait failed.
>
> Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v2
> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> ---
> drivers/gpu/drm/xe/xe_guc_pc.c | 46 ++++++++++++++++++++++++----------
> 1 file changed, 33 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c b/drivers/gpu/drm/xe/xe_guc_pc.c
> index 02409eedb914..74cc13012532 100644
> --- a/drivers/gpu/drm/xe/xe_guc_pc.c
> +++ b/drivers/gpu/drm/xe/xe_guc_pc.c
> @@ -20,6 +20,7 @@
> #include "xe_gt.h"
> #include "xe_gt_idle.h"
> #include "xe_gt_printk.h"
> +#include "xe_gt_throttle.h"
> #include "xe_gt_types.h"
> #include "xe_guc.h"
> #include "xe_guc_ct.h"
> @@ -50,6 +51,8 @@
> #define LNL_MERT_FREQ_CAP 800
> #define BMG_MERT_FREQ_CAP 2133
>
> +#define SLPC_RESET_TIMEOUT_MS 5 /* roughly 5ms, but no need for precision */
> +
> /**
> * DOC: GuC Power Conservation (PC)
> *
> @@ -114,9 +117,10 @@ static struct iosys_map *pc_to_maps(struct xe_guc_pc *pc)
> FIELD_PREP(HOST2GUC_PC_SLPC_REQUEST_MSG_1_EVENT_ARGC, count))
>
> static int wait_for_pc_state(struct xe_guc_pc *pc,
> - enum slpc_global_state state)
> + enum slpc_global_state state,
> + int timeout_ms)
> {
> - int timeout_us = 5000; /* rought 5ms, but no need for precision */
> + int timeout_us = 1000 * timeout_ms;
> int slept, wait = 10;
>
> xe_device_assert_mem_access(pc_to_xe(pc));
> @@ -165,7 +169,8 @@ static int pc_action_query_task_state(struct xe_guc_pc *pc)
> };
> int ret;
>
> - if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
> + if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> + SLPC_RESET_TIMEOUT_MS))
> return -EAGAIN;
>
> /* Blocking here to ensure the results are ready before reading them */
> @@ -188,7 +193,8 @@ static int pc_action_set_param(struct xe_guc_pc *pc, u8 id, u32 value)
> };
> int ret;
>
> - if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
> + if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> + SLPC_RESET_TIMEOUT_MS))
> return -EAGAIN;
>
> ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
> @@ -209,7 +215,8 @@ static int pc_action_unset_param(struct xe_guc_pc *pc, u8 id)
> struct xe_guc_ct *ct = &pc_to_guc(pc)->ct;
> int ret;
>
> - if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
> + if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> + SLPC_RESET_TIMEOUT_MS))
> return -EAGAIN;
>
> ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
> @@ -443,6 +450,15 @@ u32 xe_guc_pc_get_act_freq(struct xe_guc_pc *pc)
> return freq;
> }
>
> +static u32 get_cur_freq(struct xe_gt *gt)
> +{
> + u32 freq;
> +
> + freq = xe_mmio_read32(&gt->mmio, RPNSWREQ);
Now that this is split off into another function, do we need to add an
assert to ensure we are holding forcewake?
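
As a rough illustration of the question above, the helper could check the
forcewake state before touching the register. What follows is a user-space
model, not driver code: `fake_gt`, `fake_fw_get()`, `fake_fw_put()`,
`fake_fw_assert_held()` and the `*_model()` functions are hypothetical
stand-ins, and the register value is simulated. In the driver itself the
check would use whatever forcewake assertion helper xe already provides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a GT with a forcewake reference count. */
struct fake_gt {
	unsigned int fw_refs;	/* number of forcewake references held */
	uint32_t rpnswreq;	/* simulated register contents */
};

static void fake_fw_get(struct fake_gt *gt)
{
	gt->fw_refs++;
}

static void fake_fw_put(struct fake_gt *gt)
{
	assert(gt->fw_refs > 0);
	gt->fw_refs--;
}

/* Documents and enforces the caller's obligation to hold forcewake. */
static void fake_fw_assert_held(const struct fake_gt *gt)
{
	assert(gt->fw_refs > 0);
}

/* Model of the split-out helper: it never takes forcewake itself. */
static uint32_t get_cur_freq_model(struct fake_gt *gt)
{
	fake_fw_assert_held(gt);
	return gt->rpnswreq;
}

/* Model of the public entry point, which owns the get/put pair. */
static int get_cur_freq_locked_model(struct fake_gt *gt, uint32_t *freq)
{
	fake_fw_get(gt);
	*freq = get_cur_freq_model(gt);
	fake_fw_put(gt);
	return 0;
}
```

The point of the assert is only to catch a future caller that reaches the
helper without the reference; it does not change behaviour for the two
callers in this patch.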
> + freq = REG_FIELD_GET(REQ_RATIO_MASK, freq);
> + return decode_freq(freq);
> +}
> +
> /**
> * xe_guc_pc_get_cur_freq - Get Current requested frequency
> * @pc: The GuC PC
> @@ -466,10 +482,7 @@ int xe_guc_pc_get_cur_freq(struct xe_guc_pc *pc, u32 *freq)
> return -ETIMEDOUT;
> }
>
> - *freq = xe_mmio_read32(&gt->mmio, RPNSWREQ);
> -
> - *freq = REG_FIELD_GET(REQ_RATIO_MASK, *freq);
> - *freq = decode_freq(*freq);
> + *freq = get_cur_freq(gt);
>
> xe_force_wake_put(gt_to_fw(gt), fw_ref);
> return 0;
> @@ -1033,10 +1046,17 @@ int xe_guc_pc_start(struct xe_guc_pc *pc)
> if (ret)
> goto out;
>
> - if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING)) {
> - xe_gt_err(gt, "GuC PC Start failed\n");
> - ret = -EIO;
> - goto out;
> + if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> + SLPC_RESET_TIMEOUT_MS)) {
> + xe_gt_warn(gt, "GuC PC excessive start time: [freq = %dMHz (req = %dMHz), perf_limit_reasons = 0x%08X]\n",
> + xe_guc_pc_get_act_freq(pc), get_cur_freq(gt),
> + xe_gt_throttle_get_limit_reasons(gt));
> + if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000)) {
> + xe_gt_err(gt, "GuC PC Start failed: Dynamic GT frequency control and GT sleep states are now disabled.\n");
> + /* Although GuC PC failed, do not block the usage of GPU */
> + ret = 0;
> + goto out;
> + }
Other than the above nit, LGTM,
Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
> }
>
> ret = pc_init_freqs(pc);
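
For anyone following the control flow, the retry logic in the last hunk can
be modelled outside the kernel. The sketch below is a user-space
approximation, not the driver code: `pc_model`, `poll_state()`,
`start_pc_model()` and the `ready_after_ms` field are hypothetical, and time
is simulated rather than slept. Only the two timeouts (5 ms, then 1000 ms)
and the "warn, retry, then continue with ret = 0" behaviour are taken from
the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define SLPC_RESET_TIMEOUT_MS	5	/* regular timeout, as in the patch */
#define SLPC_LONG_TIMEOUT_MS	1000	/* second, much longer attempt */

/* Hypothetical firmware model: reaches RUNNING after ready_after_ms. */
struct pc_model {
	int ready_after_ms;
	int elapsed_ms;
	bool freq_control;	/* true once frequency init would have run */
};

/* Simulated polling wait: returns 0 on success, -1 on timeout. */
static int poll_state(struct pc_model *pc, int timeout_ms)
{
	int waited;

	assert(timeout_ms > 0);

	for (waited = 0; waited < timeout_ms; waited++) {
		if (pc->elapsed_ms >= pc->ready_after_ms)
			return 0;
		pc->elapsed_ms++;	/* stand-in for sleeping 1 ms */
	}

	return pc->elapsed_ms >= pc->ready_after_ms ? 0 : -1;
}

static int start_pc_model(struct pc_model *pc)
{
	if (poll_state(pc, SLPC_RESET_TIMEOUT_MS)) {
		fprintf(stderr, "excessive start time, retrying\n");
		if (poll_state(pc, SLPC_LONG_TIMEOUT_MS)) {
			fprintf(stderr, "start failed, frequency control disabled\n");
			return 0;	/* do not block use of the GPU */
		}
	}

	pc->freq_control = true;	/* stand-in for pc_init_freqs() and the rest */
	return 0;
}
```

In this model a firmware that needs 667 ms, as in the log from the commit
message, times out on the first wait but succeeds on the second, so
frequency control is still set up. Only a firmware that stays silent through
both waits ends up with frequency control disabled, and even then the
function returns 0.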