[PATCH AUTOSEL 6.11 28/32] drm/xe: Enlarge the invalidation timeout from 150 to 500

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH AUTOSEL 6.11 28/32] drm/xe: Enlarge the invalidation timeout from 150 to 500
       [not found] <20241028105050.3559169-1-sashal@kernel.org>
@ 2024-10-28 10:50 ` Sasha Levin
  2024-10-28 10:50 ` [PATCH AUTOSEL 6.11 29/32] drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout Sasha Levin
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 5+ messages in thread
From: Sasha Levin @ 2024-10-28 10:50 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Shuicheng Lin, Jia Yao, Lucas De Marchi, Matthew Auld, Nirmoy Das,
	Jonathan Cavitt, Zongyao Bai, Sasha Levin, thomas.hellstrom,
	rodrigo.vivi, maarten.lankhorst, mripard, tzimmermann, airlied,
	simona, intel-xe, dri-devel

From: Shuicheng Lin <shuicheng.lin@intel.com>

[ Upstream commit c8fb95e7a54315460b45090f0968167a332e1657 ]

There are error messages like below that are occurring during stress
testing: "[   31.004009] xe 0000:03:00.0: [drm] ERROR GT0: Global
invalidation timeout". Previously it was hitting this 3 out of 1000
executions of warm reboot.  After raising it to 500, 1000 warm reboot
executions passed and it didn't fail.

Due to the way xe_mmio_wait32() is implemented, the timeout is able to
expire early when the register matches the expected value due to the
wait increments starting small. So, the larger timeout value should have
no effect during normal use cases.

v2 (Jonathan):
  - rework the commit message
v3 (Lucas):
  - add conclusive message for the fail rate and test case
v4:
  - add suggested-by

Suggested-by: Jia Yao <jia.yao@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Tested-by: Zongyao Bai <zongyao.bai@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241015161207.1373401-1-shuicheng.lin@intel.com
(cherry picked from commit 2eb460ab9f4bc5b575f52568d17936da0af681d8)
[ Fix conflict with gt->mmio ]
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/gpu/drm/xe/xe_device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 8a44a2b6dcbb6..5226333cfdd6d 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -870,7 +870,7 @@ void xe_device_l2_flush(struct xe_device *xe)
 	spin_lock(&gt->global_invl_lock);
 	xe_mmio_write32(gt, XE2_GLOBAL_INVAL, 0x1);
 
-	if (xe_mmio_wait32(gt, XE2_GLOBAL_INVAL, 0x1, 0x0, 150, NULL, true))
+	if (xe_mmio_wait32(gt, XE2_GLOBAL_INVAL, 0x1, 0x0, 500, NULL, true))
 		xe_gt_err_once(gt, "Global invalidation timeout\n");
 	spin_unlock(&gt->global_invl_lock);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH AUTOSEL 6.11 29/32] drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout
       [not found] <20241028105050.3559169-1-sashal@kernel.org>
  2024-10-28 10:50 ` [PATCH AUTOSEL 6.11 28/32] drm/xe: Enlarge the invalidation timeout from 150 to 500 Sasha Levin
@ 2024-10-28 10:50 ` Sasha Levin
  2024-10-28 10:50 ` [PATCH AUTOSEL 6.11 30/32] drm/xe: Handle unreliable MMIO reads during forcewake Sasha Levin
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 5+ messages in thread
From: Sasha Levin @ 2024-10-28 10:50 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Badal Nilawar, Matthew Brost, Matthew Auld, John Harrison,
	Himal Prasad Ghimiray, Lucas De Marchi, Sasha Levin,
	thomas.hellstrom, rodrigo.vivi, maarten.lankhorst, mripard,
	tzimmermann, airlied, simona, intel-xe, dri-devel

From: Badal Nilawar <badal.nilawar@intel.com>

[ Upstream commit 22ef43c78647dd37b0dafe2182b8650b99dbbe59 ]

In case if g2h worker doesn't get opportunity to within specified
timeout delay then flush the g2h worker explicitly.

v2:
  - Describe change in the comment and add TODO (Matt B/John H)
  - Add xe_gt_warn on fence done after G2H flush (John H)
v3:
  - Updated the comment with root cause
  - Clean up xe_gt_warn message (John H)

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1620
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2902
Signed-off-by: Badal Nilawar <badal.nilawar@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Acked-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241017111410.2553784-2-badal.nilawar@intel.com
(cherry picked from commit e5152723380404acb8175e0777b1cea57f319a01)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/gpu/drm/xe/xe_guc_ct.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index cd9918e3896c0..ab24053f8766f 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -888,6 +888,24 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
 
 	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
 
+	/*
+	 * Occasionally it is seen that the G2H worker starts running after a delay of more than
+	 * a second even after being queued and activated by the Linux workqueue subsystem. This
+	 * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
+	 * Lunarlake Hybrid CPU. Issue dissappears if we disable Lunarlake atom cores from BIOS
+	 * and this is beyond xe kmd.
+	 *
+	 * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
+	 */
+	if (!ret) {
+		flush_work(&ct->g2h_worker);
+		if (g2h_fence.done) {
+			xe_gt_warn(gt, "G2H fence %u, action %04x, done\n",
+				   g2h_fence.seqno, action[0]);
+			ret = 1;
+		}
+	}
+
 	/*
 	 * Ensure we serialize with completion side to prevent UAF with fence going out of scope on
 	 * the stack, since we have no clue if it will fire after the timeout before we can erase
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH AUTOSEL 6.11 30/32] drm/xe: Handle unreliable MMIO reads during forcewake
       [not found] <20241028105050.3559169-1-sashal@kernel.org>
  2024-10-28 10:50 ` [PATCH AUTOSEL 6.11 28/32] drm/xe: Enlarge the invalidation timeout from 150 to 500 Sasha Levin
  2024-10-28 10:50 ` [PATCH AUTOSEL 6.11 29/32] drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout Sasha Levin
@ 2024-10-28 10:50 ` Sasha Levin
  2024-10-28 10:50 ` [PATCH AUTOSEL 6.11 31/32] drm/xe/ufence: Prefetch ufence addr to catch bogus address Sasha Levin
  2024-10-28 10:50 ` [PATCH AUTOSEL 6.11 32/32] drm/xe: Don't restart parallel queues multiple times on GT reset Sasha Levin
  4 siblings, 0 replies; 5+ messages in thread
From: Sasha Levin @ 2024-10-28 10:50 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Shuicheng Lin, Alex Zuo, Matthew Brost, Michal Wajdeczko,
	Himal Prasad Ghimiray, Badal Nilawar, Anshuman Gupta, Matt Roper,
	Rodrigo Vivi, Lucas De Marchi, Sasha Levin, thomas.hellstrom,
	maarten.lankhorst, mripard, tzimmermann, airlied, simona,
	intel-xe, dri-devel

From: Shuicheng Lin <shuicheng.lin@intel.com>

[ Upstream commit 69418db678567bdf9a4992c83d448da462ffa78c ]

In some cases, when the driver attempts to read an MMIO register,
the hardware may return 0xFFFFFFFF. The current force wake path
code treats this as a valid response, as it only checks the BIT.
However, 0xFFFFFFFF should be considered an invalid value, indicating
a potential issue. To address this, we should add a log entry to
highlight this condition and return failure.
The force wake failure log level is changed from notice to err
to match the failure return value.

v2 (Matt Brost):
  - set ret value (-EIO) to kick the error to upper layers
v3 (Rodrigo):
  - add commit message for the log level promotion from notice to err
v4:
  - update reviewed info

Suggested-by: Alex Zuo <alex.zuo@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Acked-by: Badal Nilawar <badal.nilawar@intel.com>
Cc: Anshuman Gupta <anshuman.gupta@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241017221547.1564029-1-shuicheng.lin@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit a9fbeabe7226a3bf90f82d0e28a02c18e3c67447)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/gpu/drm/xe/xe_force_wake.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_force_wake.c b/drivers/gpu/drm/xe/xe_force_wake.c
index b263fff152737..7d9fc489dcb81 100644
--- a/drivers/gpu/drm/xe/xe_force_wake.c
+++ b/drivers/gpu/drm/xe/xe_force_wake.c
@@ -115,9 +115,15 @@ static int __domain_wait(struct xe_gt *gt, struct xe_force_wake_domain *domain,
 			     XE_FORCE_WAKE_ACK_TIMEOUT_MS * USEC_PER_MSEC,
 			     &value, true);
 	if (ret)
-		xe_gt_notice(gt, "Force wake domain %d failed to ack %s (%pe) reg[%#x] = %#x\n",
-			     domain->id, str_wake_sleep(wake), ERR_PTR(ret),
-			     domain->reg_ack.addr, value);
+		xe_gt_err(gt, "Force wake domain %d failed to ack %s (%pe) reg[%#x] = %#x\n",
+			  domain->id, str_wake_sleep(wake), ERR_PTR(ret),
+			  domain->reg_ack.addr, value);
+	if (value == ~0) {
+		xe_gt_err(gt,
+			  "Force wake domain %d: %s. MMIO unreliable (forcewake register returns 0xFFFFFFFF)!\n",
+			  domain->id, str_wake_sleep(wake));
+		ret = -EIO;
+	}
 
 	return ret;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH AUTOSEL 6.11 31/32] drm/xe/ufence: Prefetch ufence addr to catch bogus address
       [not found] <20241028105050.3559169-1-sashal@kernel.org>
                   ` (2 preceding siblings ...)
  2024-10-28 10:50 ` [PATCH AUTOSEL 6.11 30/32] drm/xe: Handle unreliable MMIO reads during forcewake Sasha Levin
@ 2024-10-28 10:50 ` Sasha Levin
  2024-10-28 10:50 ` [PATCH AUTOSEL 6.11 32/32] drm/xe: Don't restart parallel queues multiple times on GT reset Sasha Levin
  4 siblings, 0 replies; 5+ messages in thread
From: Sasha Levin @ 2024-10-28 10:50 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Nirmoy Das, Francois Dugast, Maarten Lankhorst, Matthew Auld,
	Matthew Brost, Lucas De Marchi, Sasha Levin, thomas.hellstrom,
	rodrigo.vivi, mripard, tzimmermann, airlied, simona, intel-xe,
	dri-devel

From: Nirmoy Das <nirmoy.das@intel.com>

[ Upstream commit 9c1813b3253480b30604c680026c7dc721ce86d1 ]

access_ok() only checks for addr overflow so also try to read the addr
to catch invalid addr sent from userspace.

Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1630
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241016082304.66009-2-nirmoy.das@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
(cherry picked from commit 9408c4508483ffc60811e910a93d6425b8e63928)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/gpu/drm/xe/xe_sync.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_sync.c b/drivers/gpu/drm/xe/xe_sync.c
index de80c8b7c8913..9d77f2d4096f5 100644
--- a/drivers/gpu/drm/xe/xe_sync.c
+++ b/drivers/gpu/drm/xe/xe_sync.c
@@ -54,8 +54,9 @@ static struct xe_user_fence *user_fence_create(struct xe_device *xe, u64 addr,
 {
 	struct xe_user_fence *ufence;
 	u64 __user *ptr = u64_to_user_ptr(addr);
+	u64 __maybe_unused prefetch_val;
 
-	if (!access_ok(ptr, sizeof(*ptr)))
+	if (get_user(prefetch_val, ptr))
 		return ERR_PTR(-EFAULT);
 
 	ufence = kzalloc(sizeof(*ufence), GFP_KERNEL);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH AUTOSEL 6.11 32/32] drm/xe: Don't restart parallel queues multiple times on GT reset
       [not found] <20241028105050.3559169-1-sashal@kernel.org>
                   ` (3 preceding siblings ...)
  2024-10-28 10:50 ` [PATCH AUTOSEL 6.11 31/32] drm/xe/ufence: Prefetch ufence addr to catch bogus address Sasha Levin
@ 2024-10-28 10:50 ` Sasha Levin
  4 siblings, 0 replies; 5+ messages in thread
From: Sasha Levin @ 2024-10-28 10:50 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Nirmoy Das, Jonathan Cavitt, Himal Prasad Ghimiray, Matthew Auld,
	Matthew Brost, Tejas Upadhyay, Lucas De Marchi, Sasha Levin,
	thomas.hellstrom, rodrigo.vivi, maarten.lankhorst, mripard,
	tzimmermann, airlied, simona, intel-xe, dri-devel

From: Nirmoy Das <nirmoy.das@intel.com>

[ Upstream commit cdc21021f0351226a4845715564afd5dc50ed44b ]

In case of parallel submissions multiple GuC id will point to the
same exec queue and on GT reset such exec queues will get restarted
multiple times which is not desirable.

v2: don't use exec_queue_enabled() which could race,
    do the same for xe_guc_submit_stop (Matt B)

Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2295
Cc: Jonathan Cavitt <jonathan.cavitt@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241022103555.731557-1-nirmoy.das@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
(cherry picked from commit c8b0acd6d8745fd7e6450f5acc38f0227bd253b3)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/gpu/drm/xe/xe_guc_submit.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 690f821f8bf5a..8628fbfade5a8 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1762,8 +1762,13 @@ void xe_guc_submit_stop(struct xe_guc *guc)
 
 	mutex_lock(&guc->submission_state.lock);
 
-	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
+	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
+		/* Prevent redundant attempts to stop parallel queues */
+		if (q->guc->id != index)
+			continue;
+
 		guc_exec_queue_stop(guc, q);
+	}
 
 	mutex_unlock(&guc->submission_state.lock);
 
@@ -1801,8 +1806,13 @@ int xe_guc_submit_start(struct xe_guc *guc)
 
 	mutex_lock(&guc->submission_state.lock);
 	atomic_dec(&guc->submission_state.stopped);
-	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
+	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
+		/* Prevent redundant attempts to start parallel queues */
+		if (q->guc->id != index)
+			continue;
+
 		guc_exec_queue_start(q);
+	}
 	mutex_unlock(&guc->submission_state.lock);
 
 	wake_up_all(&guc->ct.wq);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-10-28 10:52 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20241028105050.3559169-1-sashal@kernel.org>
2024-10-28 10:50 ` [PATCH AUTOSEL 6.11 28/32] drm/xe: Enlarge the invalidation timeout from 150 to 500 Sasha Levin
2024-10-28 10:50 ` [PATCH AUTOSEL 6.11 29/32] drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout Sasha Levin
2024-10-28 10:50 ` [PATCH AUTOSEL 6.11 30/32] drm/xe: Handle unreliable MMIO reads during forcewake Sasha Levin
2024-10-28 10:50 ` [PATCH AUTOSEL 6.11 31/32] drm/xe/ufence: Prefetch ufence addr to catch bogus address Sasha Levin
2024-10-28 10:50 ` [PATCH AUTOSEL 6.11 32/32] drm/xe: Don't restart parallel queues multiple times on GT reset Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox