From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>,
	John Harrison <John.C.Harrison@Intel.com>,
	Matthew Brost <matthew.brost@intel.com>,
	Stuart Summers <stuart.summers@intel.com>,
	Julia Filipchuk <julia.filipchuk@intel.com>,
	Sasha Levin <sashal@kernel.org>,
	lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com,
	rodrigo.vivi@intel.com, intel-xe@lists.freedesktop.org
Subject: [PATCH AUTOSEL 6.17-6.12] drm/xe/guc: Set upper limit of H2G retries over CTB
Date: Sat, 25 Oct 2025 11:57:21 -0400
Message-ID: <20251025160905.3857885-210-sashal@kernel.org>
In-Reply-To: <20251025160905.3857885-1-sashal@kernel.org>

From: Michal Wajdeczko <michal.wajdeczko@intel.com>

[ Upstream commit 2506af5f8109a387a5e8e9e3d7c498480b8033db ]

The GuC communication protocol allows the GuC to send a NO_RESPONSE_RETRY
reply message to indicate that, due to some interim condition, it
cannot handle an incoming H2G request and the host shall resend it.

But in some cases, due to errors, this unsatisfied condition might
be final and this could lead to endless retries as it was recently
seen on the CI:

 [drm] GT0: PF: VF1 FLR didn't finish in 5000 ms (-ETIMEDOUT)
 [drm] GT0: PF: VF1 resource sanitizing failed (-ETIMEDOUT)
 [drm] GT0: PF: VF1 FLR failed!
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0

To avoid such dangerous loops, allow only a limited number of retries
(for now 50) and add some delays (n * 5ms) to slow down the rate of
resending this repeated request.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>
Link: https://lore.kernel.org/r/20250903223330.6408-1-michal.wajdeczko@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

YES

**Why Backport**
- Prevents infinite retry loops on GuC “NO_RESPONSE_RETRY” replies that
  can occur when the underlying condition never clears (e.g., VF FLR
  stuck), which was observed in CI. This is a real-world hang/DoS bug
  impacting system stability and recoverability.
- Change is small, self-contained to the Xe GuC CT H2G/G2H send/recv
  path, and does not alter UAPI or broader architecture.
- Returns a clear error after bounded retries instead of looping
  forever; adds linear backoff to reduce busy looping.

**What Changes**
- Introduces bounded retries and backoff for GuC “retry” responses
  within the blocking send/receive helper:
  - Adds retry budget and delay constants: `GUC_SEND_RETRY_LIMIT` = 50
    and `GUC_SEND_RETRY_MSLEEP` = 5 ms
    (drivers/gpu/drm/xe/xe_guc_ct.c:1081,
    drivers/gpu/drm/xe/xe_guc_ct.c:1082).
  - Tracks the number of retries in `guc_ct_send_recv()` and applies
    increasing sleep before re-sending
    (drivers/gpu/drm/xe/xe_guc_ct.c:1084).
  - On each retry indication from GuC (`g2h_fence.retry`), after
    unlocking the mutex, either:
    - If the retry count exceeds the limit, log an error and return
      `-ELOOP`; or
    - Otherwise, sleep for n*5ms (n being the current retry count) and
      retry (drivers/gpu/drm/xe/xe_guc_ct.c:1151,
      drivers/gpu/drm/xe/xe_guc_ct.c:1154,
      drivers/gpu/drm/xe/xe_guc_ct.c:1156,
      drivers/gpu/drm/xe/xe_guc_ct.c:1159).
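
The control flow described above can be illustrated outside the kernel.
The following is a minimal user-space sketch, not the driver code: the
two macro values mirror the constants in the patch, while
`send_with_bounded_retry()`, `struct fake_peer`, `fake_send()` and
`fake_sleep()` are hypothetical names introduced only for illustration.
The sleep is recorded rather than performed so the behavior can be
checked quickly.

```c
#include <errno.h>
#include <stdbool.h>

/* Values mirror GUC_SEND_RETRY_LIMIT / GUC_SEND_RETRY_MSLEEP in the patch. */
#define RETRY_LIMIT	50
#define RETRY_MSLEEP	5

/* Hypothetical stand-in for the peer: not part of the xe driver. */
struct fake_peer {
	unsigned int busy_replies;	/* how many times the peer will ask for a resend */
	unsigned int sends;		/* how many requests were actually sent */
	unsigned int slept_ms;		/* accumulated backoff, recorded instead of slept */
};

/* Sends one request; returns true if the peer asked for a retry. */
static bool fake_send(void *ctx)
{
	struct fake_peer *p = ctx;

	p->sends++;
	if (p->busy_replies == 0)
		return false;
	p->busy_replies--;
	return true;
}

/* Records the requested delay; a real caller would sleep here. */
static void fake_sleep(void *ctx, unsigned int ms)
{
	struct fake_peer *p = ctx;

	p->slept_ms += ms;
}

/* Resend while the peer asks for it, with linear backoff and a hard cap. */
static int send_with_bounded_retry(bool (*send)(void *),
				   void (*sleep_ms)(void *, unsigned int),
				   void *ctx)
{
	unsigned int retries = 0;

	while (send(ctx)) {
		if (++retries > RETRY_LIMIT)
			return -ELOOP;	/* give up instead of looping forever */
		sleep_ms(ctx, RETRY_MSLEEP * retries);
	}
	return 0;
}
```

With a peer that never stops asking for a resend, this sketch sends the
request 51 times (the original plus 50 resends) and then returns
`-ELOOP`.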

**Key Code References**
- Retry limit and delay constants:
  - drivers/gpu/drm/xe/xe_guc_ct.c:1081
  - drivers/gpu/drm/xe/xe_guc_ct.c:1082
- Core change in `guc_ct_send_recv()` (retry handling/backoff/limit):
  - Function start: drivers/gpu/drm/xe/xe_guc_ct.c:1084
  - Retry debug log: drivers/gpu/drm/xe/xe_guc_ct.c:1151
  - Limit check and `-ELOOP`: drivers/gpu/drm/xe/xe_guc_ct.c:1154
  - Error log on limit reached: drivers/gpu/drm/xe/xe_guc_ct.c:1155
  - Backoff sleep: drivers/gpu/drm/xe/xe_guc_ct.c:1159

**Safety and Regression Risk**
- Concurrency correctness: `ct->lock` is released before the new sleep,
  so the CT lock is not held across the backoff delay and other users of
  the channel are not blocked while waiting
  (drivers/gpu/drm/xe/xe_guc_ct.c:1151).
- Blocking contract preserved: This helper is the blocking path; sleep
  is expected. The G2H-handler special path uses
  `xe_guc_ct_send_g2h_handler()` and does not call the blocking
  `send_recv()` (drivers/gpu/drm/xe/xe_guc_ct.h:63).
- Error propagation consistent: Callers already treat negative returns
  as failures and log/abort appropriately. For example:
  - `xe_guc_ct_send_block()` is a thin wrapper over
    `xe_guc_ct_send_recv()` (drivers/gpu/drm/xe/xe_guc_ct.h:57), and
    many users propagate errors directly (e.g.,
    drivers/gpu/drm/xe/xe_guc.c:303).
  - Relay path logs negative errors via `ERR_PTR(ret)` and returns
    failure (drivers/gpu/drm/xe/xe_guc_relay.c:298).
- Scope limited to Xe driver’s GuC CT path; no cross-subsystem impact,
  no API/ABI changes.
- The new `-ELOOP` code is a standard error value; replacing an
  unbounded loop with a bounded error is safer and more diagnosable. The
  linear backoff caps the total added sleep at about 6.375 seconds in
  the worst case (5 ms * (1 + 2 + ... + 50) = 6375 ms, not counting the
  time spent waiting for each reply), which is acceptable for a blocking
  control path and reduces log spam/CPU waste.
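
The worst-case figure can be checked with a few lines of C. This is only
a sketch of the arithmetic: `total_backoff_ms()` is a hypothetical
helper, not a function in the driver, and it counts only the delays
requested via `msleep()`, which may in practice sleep slightly longer
than asked.

```c
/* Sum of the requested delays: step_ms * (1 + 2 + ... + limit). */
static unsigned int total_backoff_ms(unsigned int limit, unsigned int step_ms)
{
	unsigned int total = 0;

	for (unsigned int n = 1; n <= limit; n++)
		total += step_ms * n;
	return total;
}
```

Calling `total_backoff_ms(50, 5)` gives 6375, matching the closed form
5 * 50 * 51 / 2.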

**Stable Criteria Assessment**
- Fixes an important bug that can hang the driver and spam logs
  indefinitely (user-visible stability issue).
- Small, localized change with clear intent and minimal risk.
- No architectural changes or new features.
- Aligns with stable rules: a defensive fix that prevents system-harming
  behavior.

Given the above, this is a strong candidate for backporting to stable
trees that ship the Xe driver and GuC CT infrastructure.

 drivers/gpu/drm/xe/xe_guc_ct.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 6d70dd1c106d4..ff622628d823f 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -1079,11 +1079,15 @@ static bool retry_failure(struct xe_guc_ct *ct, int ret)
 	return true;
 }
 
+#define GUC_SEND_RETRY_LIMIT	50
+#define GUC_SEND_RETRY_MSLEEP	5
+
 static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
 			    u32 *response_buffer, bool no_fail)
 {
 	struct xe_gt *gt = ct_to_gt(ct);
 	struct g2h_fence g2h_fence;
+	unsigned int retries = 0;
 	int ret = 0;
 
 	/*
@@ -1148,6 +1152,12 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
 		xe_gt_dbg(gt, "H2G action %#x retrying: reason %#x\n",
 			  action[0], g2h_fence.reason);
 		mutex_unlock(&ct->lock);
+		if (++retries > GUC_SEND_RETRY_LIMIT) {
+			xe_gt_err(gt, "H2G action %#x reached retry limit=%u, aborting\n",
+				  action[0], GUC_SEND_RETRY_LIMIT);
+			return -ELOOP;
+		}
+		msleep(GUC_SEND_RETRY_MSLEEP * retries);
 		goto retry;
 	}
 	if (g2h_fence.fail) {
-- 
2.51.0

