Intel-XE Archive on lore.kernel.org
* [PATCH 1/2] drm/xe/guc_pc: Do not stop probe or resume if GuC PC fails
@ 2025-02-11 20:09 Rodrigo Vivi
  2025-02-11 20:09 ` [PATCH 2/2] drm/xe/guc_pc: Remove duplicated pc_start call Rodrigo Vivi
                   ` (9 more replies)
  0 siblings, 10 replies; 26+ messages in thread
From: Rodrigo Vivi @ 2025-02-11 20:09 UTC (permalink / raw)
  To: intel-xe; +Cc: Rodrigo Vivi, Vinay Belgaumkar, Jonathan Cavitt

In the rare case of hitting a thermal limit during resume, GuC can
be slow and run into delays like this:

xe 0000:00:02.0: [drm] GT1: excessive init time: 667ms! \
   		 [status = 0x8002F034, timeouts = 0]
xe 0000:00:02.0: [drm] GT1: excessive init time: \
   		 [freq = 100MHz (req = 800MHz), before = 100MHz, \
   		 perf_limit_reasons = 0x1C001000]
xe 0000:00:02.0: [drm] *ERROR* GT1: GuC PC Start failed
------------[ cut here ]------------
xe 0000:00:02.0: [drm] GT1: Failed to start GuC PC: -EIO

If this happens, the failed probe or resume leaves the GPU entirely
unusable. However, the GPU itself can still be used, although the GT
frequencies might be messed up.

Let's report the error, but not block the flow.
But, instead of just giving up and moving on, let's re-attempt the wait
with a second, much longer timeout.

v2: Keep the precision comment (Jonathan)
    Use a define for the regular SLPC reset timeout.

Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_pc.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c b/drivers/gpu/drm/xe/xe_guc_pc.c
index 02409eedb914..3b04b62937eb 100644
--- a/drivers/gpu/drm/xe/xe_guc_pc.c
+++ b/drivers/gpu/drm/xe/xe_guc_pc.c
@@ -50,6 +50,8 @@
 #define LNL_MERT_FREQ_CAP	800
 #define BMG_MERT_FREQ_CAP	2133
 
+#define SLPC_RESET_TIMEOUT_MS 5 /* roughly 5ms, but no need for precision */
+
 /**
  * DOC: GuC Power Conservation (PC)
  *
@@ -114,9 +116,10 @@ static struct iosys_map *pc_to_maps(struct xe_guc_pc *pc)
 	 FIELD_PREP(HOST2GUC_PC_SLPC_REQUEST_MSG_1_EVENT_ARGC, count))
 
 static int wait_for_pc_state(struct xe_guc_pc *pc,
-			     enum slpc_global_state state)
+			     enum slpc_global_state state,
+			     int timeout_ms)
 {
-	int timeout_us = 5000; /* rought 5ms, but no need for precision */
+	int timeout_us = 1000 * timeout_ms;
 	int slept, wait = 10;
 
 	xe_device_assert_mem_access(pc_to_xe(pc));
@@ -165,7 +168,8 @@ static int pc_action_query_task_state(struct xe_guc_pc *pc)
 	};
 	int ret;
 
-	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
+	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
+			      SLPC_RESET_TIMEOUT_MS))
 		return -EAGAIN;
 
 	/* Blocking here to ensure the results are ready before reading them */
@@ -188,7 +192,8 @@ static int pc_action_set_param(struct xe_guc_pc *pc, u8 id, u32 value)
 	};
 	int ret;
 
-	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
+	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
+			      SLPC_RESET_TIMEOUT_MS))
 		return -EAGAIN;
 
 	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
@@ -209,7 +214,8 @@ static int pc_action_unset_param(struct xe_guc_pc *pc, u8 id)
 	struct xe_guc_ct *ct = &pc_to_guc(pc)->ct;
 	int ret;
 
-	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
+	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
+			      SLPC_RESET_TIMEOUT_MS))
 		return -EAGAIN;
 
 	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
@@ -1033,9 +1039,13 @@ int xe_guc_pc_start(struct xe_guc_pc *pc)
 	if (ret)
 		goto out;
 
-	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING)) {
-		xe_gt_err(gt, "GuC PC Start failed\n");
-		ret = -EIO;
+	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
+			      SLPC_RESET_TIMEOUT_MS)) {
+		xe_gt_warn(gt, "GuC PC Start taking longer than expected\n");
+		if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000))
+			xe_gt_err(gt, "GuC PC Start failed\n");
+		/* Although GuC PC failed, do not block the usage of GPU */
+		ret = 0;
 		goto out;
 	}
 
-- 
2.48.1
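
A note on the wait_for_pc_state() change: the hunks above only show the new
timeout_ms parameter and the "slept, wait = 10" locals, not the loop body.
Below is a minimal userspace sketch of one way a bounded poll with a doubling
sleep interval can be written. struct fake_fw, fake_fw_running() and
wait_for_state() are hypothetical names made up for illustration, time is
simulated rather than slept, and this should not be read as the driver's
actual loop.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the firmware: "running" once enough time passed. */
struct fake_fw {
	int elapsed_us;		/* simulated time spent waiting so far */
	int ready_at_us;	/* simulated time at which the state flips */
};

static bool fake_fw_running(const struct fake_fw *fw)
{
	return fw->elapsed_us >= fw->ready_at_us;
}

/*
 * Poll until the state is reached or the budget is used up. The sleep
 * interval starts at 10us and doubles each round, clamped so that the
 * total never exceeds timeout_ms.
 */
static int wait_for_state(struct fake_fw *fw, int timeout_ms)
{
	int timeout_us = 1000 * timeout_ms;
	int slept, wait = 10;

	for (slept = 0; slept < timeout_us;) {
		if (fake_fw_running(fw))
			return 0;

		fw->elapsed_us += wait;	/* stands in for a real sleep */
		slept += wait;
		wait <<= 1;
		if (slept + wait > timeout_us)
			wait = timeout_us - slept;
	}

	/* One last look after the final sleep (a choice made in this sketch). */
	return fake_fw_running(fw) ? 0 : -ETIMEDOUT;
}
```

In this sketch, a state that only flips after 300ms misses the 5ms budget but
is caught by a follow-up wait_for_state(&fw, 1000), which is the same pair of
timeouts the xe_guc_pc_start() hunk passes.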


* [PATCH 1/2] drm/xe/guc_pc: Do not stop probe or resume if GuC PC fails
@ 2025-02-14 17:25 Rodrigo Vivi
  2025-02-28 16:33 ` Belgaumkar, Vinay
  2025-02-28 19:22 ` John Harrison
  0 siblings, 2 replies; 26+ messages in thread
From: Rodrigo Vivi @ 2025-02-14 17:25 UTC (permalink / raw)
  To: intel-xe; +Cc: Rodrigo Vivi, Vinay Belgaumkar, Jonathan Cavitt

In the rare case of hitting a thermal limit during resume, GuC can
be slow and run into delays like this:

xe 0000:00:02.0: [drm] GT1: excessive init time: 667ms! \
   		 [status = 0x8002F034, timeouts = 0]
xe 0000:00:02.0: [drm] GT1: excessive init time: \
   		 [freq = 100MHz (req = 800MHz), before = 100MHz, \
   		 perf_limit_reasons = 0x1C001000]
xe 0000:00:02.0: [drm] *ERROR* GT1: GuC PC Start failed
------------[ cut here ]------------
xe 0000:00:02.0: [drm] GT1: Failed to start GuC PC: -EIO

If this happens, the failed probe or resume leaves the GPU entirely
unusable. However, the GPU itself can still be used, although the GT
frequencies might be messed up.

Let's report the error, but not block the flow.
But, instead of just giving up and moving on, let's re-attempt the wait
with a second, much longer timeout.

v2: Keep the precision comment (Jonathan)
    Use a define for the regular SLPC reset timeout.
v3: Improve messages (Vinay)
    Only skip initialization if the second full-second wait failed.

Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v2
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_pc.c | 46 ++++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c b/drivers/gpu/drm/xe/xe_guc_pc.c
index 02409eedb914..74cc13012532 100644
--- a/drivers/gpu/drm/xe/xe_guc_pc.c
+++ b/drivers/gpu/drm/xe/xe_guc_pc.c
@@ -20,6 +20,7 @@
 #include "xe_gt.h"
 #include "xe_gt_idle.h"
 #include "xe_gt_printk.h"
+#include "xe_gt_throttle.h"
 #include "xe_gt_types.h"
 #include "xe_guc.h"
 #include "xe_guc_ct.h"
@@ -50,6 +51,8 @@
 #define LNL_MERT_FREQ_CAP	800
 #define BMG_MERT_FREQ_CAP	2133
 
+#define SLPC_RESET_TIMEOUT_MS 5 /* roughly 5ms, but no need for precision */
+
 /**
  * DOC: GuC Power Conservation (PC)
  *
@@ -114,9 +117,10 @@ static struct iosys_map *pc_to_maps(struct xe_guc_pc *pc)
 	 FIELD_PREP(HOST2GUC_PC_SLPC_REQUEST_MSG_1_EVENT_ARGC, count))
 
 static int wait_for_pc_state(struct xe_guc_pc *pc,
-			     enum slpc_global_state state)
+			     enum slpc_global_state state,
+			     int timeout_ms)
 {
-	int timeout_us = 5000; /* rought 5ms, but no need for precision */
+	int timeout_us = 1000 * timeout_ms;
 	int slept, wait = 10;
 
 	xe_device_assert_mem_access(pc_to_xe(pc));
@@ -165,7 +169,8 @@ static int pc_action_query_task_state(struct xe_guc_pc *pc)
 	};
 	int ret;
 
-	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
+	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
+			      SLPC_RESET_TIMEOUT_MS))
 		return -EAGAIN;
 
 	/* Blocking here to ensure the results are ready before reading them */
@@ -188,7 +193,8 @@ static int pc_action_set_param(struct xe_guc_pc *pc, u8 id, u32 value)
 	};
 	int ret;
 
-	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
+	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
+			      SLPC_RESET_TIMEOUT_MS))
 		return -EAGAIN;
 
 	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
@@ -209,7 +215,8 @@ static int pc_action_unset_param(struct xe_guc_pc *pc, u8 id)
 	struct xe_guc_ct *ct = &pc_to_guc(pc)->ct;
 	int ret;
 
-	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
+	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
+			      SLPC_RESET_TIMEOUT_MS))
 		return -EAGAIN;
 
 	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
@@ -443,6 +450,15 @@ u32 xe_guc_pc_get_act_freq(struct xe_guc_pc *pc)
 	return freq;
 }
 
+static u32 get_cur_freq(struct xe_gt *gt)
+{
+	u32 freq;
+
+	freq = xe_mmio_read32(&gt->mmio, RPNSWREQ);
+	freq = REG_FIELD_GET(REQ_RATIO_MASK, freq);
+	return decode_freq(freq);
+}
+
 /**
  * xe_guc_pc_get_cur_freq - Get Current requested frequency
  * @pc: The GuC PC
@@ -466,10 +482,7 @@ int xe_guc_pc_get_cur_freq(struct xe_guc_pc *pc, u32 *freq)
 		return -ETIMEDOUT;
 	}
 
-	*freq = xe_mmio_read32(&gt->mmio, RPNSWREQ);
-
-	*freq = REG_FIELD_GET(REQ_RATIO_MASK, *freq);
-	*freq = decode_freq(*freq);
+	*freq = get_cur_freq(gt);
 
 	xe_force_wake_put(gt_to_fw(gt), fw_ref);
 	return 0;
@@ -1033,10 +1046,17 @@ int xe_guc_pc_start(struct xe_guc_pc *pc)
 	if (ret)
 		goto out;
 
-	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING)) {
-		xe_gt_err(gt, "GuC PC Start failed\n");
-		ret = -EIO;
-		goto out;
+	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
+			      SLPC_RESET_TIMEOUT_MS)) {
+		xe_gt_warn(gt, "GuC PC excessive start time: [freq = %dMHz (req = %dMHz), perf_limit_reasons = 0x%08X]\n",
+			   xe_guc_pc_get_act_freq(pc), get_cur_freq(gt),
+			   xe_gt_throttle_get_limit_reasons(gt));
+		if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000)) {
+			xe_gt_err(gt, "GuC PC Start failed: Dynamic GT frequency control and GT sleep states are now disabled.\n");
+			/* Although GuC PC failed, do not block the usage of GPU */
+			ret = 0;
+			goto out;
+		}
 	}
 
 	ret = pc_init_freqs(pc);
-- 
2.48.1
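
The control flow that v3 gives xe_guc_pc_start() can be sketched in userspace
as follows: a short wait, a warning, one long retry, and only if the retry
also fails is frequency initialization skipped while still returning 0. All
names here (struct fake_guc, fake_wait_running(), fake_pc_start()) are
hypothetical and the wait is simulated; the sketch only illustrates the
branching and is not the driver code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define SHORT_TIMEOUT_MS 5	/* mirrors SLPC_RESET_TIMEOUT_MS in the patch */
#define LONG_TIMEOUT_MS 1000	/* mirrors the literal 1000 in the patch */

/* Hypothetical stand-in for the GuC PC state. */
struct fake_guc {
	int elapsed_ms;		/* simulated time already waited */
	int ready_at_ms;	/* simulated time at which PC reports running */
	bool freqs_initialized;	/* stands in for the init done after the wait */
};

/* Simulated wait: succeeds if the state flips within the given budget. */
static int fake_wait_running(struct fake_guc *guc, int timeout_ms)
{
	if (guc->ready_at_ms <= guc->elapsed_ms + timeout_ms) {
		if (guc->ready_at_ms > guc->elapsed_ms)
			guc->elapsed_ms = guc->ready_at_ms;
		return 0;
	}
	guc->elapsed_ms += timeout_ms;
	return -ETIMEDOUT;
}

static int fake_pc_start(struct fake_guc *guc)
{
	if (fake_wait_running(guc, SHORT_TIMEOUT_MS)) {
		fprintf(stderr, "warn: excessive start time\n");
		if (fake_wait_running(guc, LONG_TIMEOUT_MS)) {
			fprintf(stderr, "err: start failed, frequency control disabled\n");
			/* Do not block GPU usage: report success, skip the init. */
			return 0;
		}
	}

	/* Reached on a fast start and on a slow start that recovered. */
	guc->freqs_initialized = true;
	return 0;
}
```

The difference from the earlier revisions is visible in the slow-but-recovering
case: there the "ret = 0; goto out" sat outside the inner if, so the code after
the wait was skipped even when the second wait succeeded.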


* [PATCH 1/2] drm/xe/guc_pc: Do not stop probe or resume if GuC PC fails
@ 2025-02-10 21:07 Rodrigo Vivi
  2025-02-10 22:04 ` Cavitt, Jonathan
  0 siblings, 1 reply; 26+ messages in thread
From: Rodrigo Vivi @ 2025-02-10 21:07 UTC (permalink / raw)
  To: intel-xe; +Cc: Rodrigo Vivi, Vinay Belgaumkar

In the rare case of hitting a thermal limit during resume, GuC can
be slow and run into delays like this:

xe 0000:00:02.0: [drm] GT1: excessive init time: 667ms! \
   		 [status = 0x8002F034, timeouts = 0]
xe 0000:00:02.0: [drm] GT1: excessive init time: \
   		 [freq = 100MHz (req = 800MHz), before = 100MHz, \
   		 perf_limit_reasons = 0x1C001000]
xe 0000:00:02.0: [drm] *ERROR* GT1: GuC PC Start failed
------------[ cut here ]------------
xe 0000:00:02.0: [drm] GT1: Failed to start GuC PC: -EIO

If this happens, the failed probe or resume leaves the GPU entirely
unusable. However, the GPU itself can still be used, although the GT
frequencies might be messed up.

Let's report the error, but not block the flow.
But, instead of just giving up and moving on, let's re-attempt the wait
with a second, much longer timeout.

Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_pc.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c b/drivers/gpu/drm/xe/xe_guc_pc.c
index 02409eedb914..aa58f9ddbf84 100644
--- a/drivers/gpu/drm/xe/xe_guc_pc.c
+++ b/drivers/gpu/drm/xe/xe_guc_pc.c
@@ -114,9 +114,10 @@ static struct iosys_map *pc_to_maps(struct xe_guc_pc *pc)
 	 FIELD_PREP(HOST2GUC_PC_SLPC_REQUEST_MSG_1_EVENT_ARGC, count))
 
 static int wait_for_pc_state(struct xe_guc_pc *pc,
-			     enum slpc_global_state state)
+			     enum slpc_global_state state,
+			     int timeout_ms)
 {
-	int timeout_us = 5000; /* rought 5ms, but no need for precision */
+	int timeout_us = 1000 * timeout_ms;
 	int slept, wait = 10;
 
 	xe_device_assert_mem_access(pc_to_xe(pc));
@@ -165,7 +166,7 @@ static int pc_action_query_task_state(struct xe_guc_pc *pc)
 	};
 	int ret;
 
-	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
+	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 5))
 		return -EAGAIN;
 
 	/* Blocking here to ensure the results are ready before reading them */
@@ -188,7 +189,7 @@ static int pc_action_set_param(struct xe_guc_pc *pc, u8 id, u32 value)
 	};
 	int ret;
 
-	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
+	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 5))
 		return -EAGAIN;
 
 	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
@@ -209,7 +210,7 @@ static int pc_action_unset_param(struct xe_guc_pc *pc, u8 id)
 	struct xe_guc_ct *ct = &pc_to_guc(pc)->ct;
 	int ret;
 
-	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
+	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 5))
 		return -EAGAIN;
 
 	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
@@ -1033,9 +1034,12 @@ int xe_guc_pc_start(struct xe_guc_pc *pc)
 	if (ret)
 		goto out;
 
-	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING)) {
-		xe_gt_err(gt, "GuC PC Start failed\n");
-		ret = -EIO;
+	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 5)) {
+		xe_gt_warn(gt, "GuC PC Start taking longer than expected\n");
+		if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000))
+			xe_gt_err(gt, "GuC PC Start failed\n");
+		/* Although GuC PC failed, do not block the usage of GPU */
+		ret = 0;
 		goto out;
 	}
 
-- 
2.48.1



Thread overview: 26+ messages
2025-02-11 20:09 [PATCH 1/2] drm/xe/guc_pc: Do not stop probe or resume if GuC PC fails Rodrigo Vivi
2025-02-11 20:09 ` [PATCH 2/2] drm/xe/guc_pc: Remove duplicated pc_start call Rodrigo Vivi
2025-02-14  0:31   ` Belgaumkar, Vinay
2025-02-11 20:17 ` ✓ CI.Patch_applied: success for series starting with [1/2] drm/xe/guc_pc: Do not stop probe or resume if GuC PC fails Patchwork
2025-02-11 20:18 ` ✓ CI.checkpatch: " Patchwork
2025-02-11 20:19 ` ✓ CI.KUnit: " Patchwork
2025-02-11 20:35 ` ✓ CI.Build: " Patchwork
2025-02-11 20:37 ` ✓ CI.Hooks: " Patchwork
2025-02-11 20:39 ` ✓ CI.checksparse: " Patchwork
2025-02-11 20:59 ` ✓ Xe.CI.BAT: " Patchwork
2025-02-12  1:19 ` [PATCH 1/2] " Belgaumkar, Vinay
2025-02-12 18:15   ` Rodrigo Vivi
2025-02-14  1:37     ` Belgaumkar, Vinay
2025-02-14 15:00       ` Rodrigo Vivi
2025-02-14 17:22         ` Belgaumkar, Vinay
2025-02-12  4:48 ` ✗ Xe.CI.Full: failure for series starting with [1/2] " Patchwork
  -- strict thread matches above, loose matches on Subject: below --
2025-02-14 17:25 [PATCH 1/2] " Rodrigo Vivi
2025-02-28 16:33 ` Belgaumkar, Vinay
2025-02-28 19:22 ` John Harrison
2025-02-28 19:45   ` Rodrigo Vivi
2025-02-28 20:13     ` John Harrison
2025-02-28 20:32       ` Rodrigo Vivi
2025-03-06 23:36         ` Rodrigo Vivi
2025-02-10 21:07 Rodrigo Vivi
2025-02-10 22:04 ` Cavitt, Jonathan
2025-02-11 20:00   ` Rodrigo Vivi
