[PATCH v2 5/5] drm/xe/vf: Use marker to catch fixups during LRC creation

From: Tomasz Lis <tomasz.lis@intel.com>
To: intel-xe@lists.freedesktop.org
Cc: "Michał Winiarski" <michal.winiarski@intel.com>,
	"Michał Wajdeczko" <michal.wajdeczko@intel.com>,
	"Piotr Piórkowski" <piotr.piorkowski@intel.com>,
	"Matthew Brost" <matthew.brost@intel.com>
Subject: [PATCH v2 5/5] drm/xe/vf: Use marker to catch fixups during LRC creation
Date: Thu, 19 Feb 2026 00:21:58 +0100
Message-ID: <20260218232159.1726873-6-tomasz.lis@intel.com>
In-Reply-To: <20260218232159.1726873-1-tomasz.lis@intel.com>

When an LRC is created while post-migration fixups are running, it may
end up with invalid state. Ensure that all such situations are caught,
so that LRC creation can be repeated.

Because a VM can be given an arbitrary number of CPU cores, it is
possible to limit that number to 1. In that case the kernel may
schedule threads in a way that makes the previously used detection
methods miss a VF migration recovery running in parallel (simply by
not switching to the LRC creation thread during recovery).

This possibility is not merely theoretical: testing revealed that in
a small percentage of specially crafted test cases the resulting LRC
is damaged and causes a GPU hang.

With an additional atomic counter that is incremented after fixups
complete, any VF migration that evaded the usual detection during LRC
creation will be caught.

Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
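Note for readers: the marker scheme described in the commit message can
be sketched in plain userspace C as below. This is only an illustration
under stated assumptions, not the driver code: it uses C11 <stdatomic.h>
instead of the kernel's atomic_t, it is single-threaded, and the names
fixups_complete (as a global), pending_recoveries, run_fixups(),
create_object() and create_with_marker() are hypothetical. A "recovery"
that starts and finishes entirely inside object creation is injected by
a test hook, which mimics the single-CPU scheduling case.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for gt->sriov.vf.migration.fixups_complete. */
static atomic_int fixups_complete;

/* Hypothetical test hook: number of recoveries still to be injected. */
static int pending_recoveries;

/* Mimics the end of the fixups stage: bump the counter once done. */
static void run_fixups(void)
{
	atomic_fetch_add(&fixups_complete, 1);
}

/*
 * Mimics object creation that can be interleaved with a complete
 * recovery. Returns the counter value the object was built against.
 */
static int create_object(void)
{
	int built_against = atomic_load(&fixups_complete);

	/* A whole recovery runs while the creating thread is preempted. */
	if (pending_recoveries > 0) {
		pending_recoveries--;
		run_fixups();
	}
	return built_against;
}

/* Snapshot the counter, create, compare; retry while the counter moved. */
static int create_with_marker(int *attempts)
{
	for (;;) {
		int marker = atomic_load(&fixups_complete);
		int obj = create_object();

		(*attempts)++;
		if (marker == atomic_load(&fixups_complete))
			return obj;	/* no fixups completed in between */
		/* Stale object: drop it and build it again. */
	}
}
```

With pending_recoveries set to 2, a call to create_with_marker() takes
three attempts: the first two objects are discarded because the counter
changed between the snapshot and the final check, even though no
"recovery in progress" state was ever observed by the creating code.
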
 drivers/gpu/drm/xe/xe_exec_queue.c        | 6 +++++-
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 7 +++++++
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h       | 1 +
 drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 2 ++
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 2ebf25a35557..a8d26fece38a 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -308,15 +308,19 @@ static int __xe_exec_queue_init(struct xe_exec_queue *q, u32 exec_queue_flags)
 	 */
 	for (i = 0; i < q->width; ++i) {
 		struct xe_lrc *lrc;
+		int marker;
 
 		xe_gt_sriov_vf_wait_valid_default_lrc(q->gt);
+		marker = xe_vf_migration_fixups_complete_count(q->gt);
+
 		lrc = xe_lrc_create(q->hwe, q->vm, q->replay_state,
 				    xe_lrc_ring_size(), q->msix_vec, flags);
 		if (IS_ERR(lrc)) {
 			err = PTR_ERR(lrc);
 			goto err_lrc;
 		}
-		if (!xe_gt_vf_valid_default_lrc(q->gt)) {
+		if (!xe_gt_vf_valid_default_lrc(q->gt) ||
+		    marker != xe_vf_migration_fixups_complete_count(q->gt)) {
 			xe_lrc_put(lrc);
 			i--;
 			continue;
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index ff9fb9196486..240c53b07eb3 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -1254,6 +1254,11 @@ static size_t post_migration_scratch_size(struct xe_device *xe)
 	return max(xe_lrc_reg_size(xe), LRC_WA_BB_SIZE);
 }
 
+int xe_vf_migration_fixups_complete_count(struct xe_gt *gt)
+{
+	return atomic_read(&gt->sriov.vf.migration.fixups_complete);
+}
+
 static int vf_post_migration_fixups(struct xe_gt *gt)
 {
 	void *buf = gt->sriov.vf.migration.scratch;
@@ -1274,6 +1279,8 @@ static int vf_post_migration_fixups(struct xe_gt *gt)
 	if (err)
 		return err;
 
+	atomic_inc(&gt->sriov.vf.migration.fixups_complete);
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
index 8c21b8ab2f16..4651c7f3335c 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
@@ -41,5 +41,6 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p);
 
 bool xe_gt_vf_valid_default_lrc(struct xe_gt *gt);
 void xe_gt_sriov_vf_wait_valid_default_lrc(struct xe_gt *gt);
+int xe_vf_migration_fixups_complete_count(struct xe_gt *gt);
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
index 8be181bf3cf3..41d6199e3508 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
@@ -54,6 +54,8 @@ struct xe_gt_sriov_vf_migration {
 	wait_queue_head_t wq;
 	/** @scratch: Scratch memory for VF recovery */
 	void *scratch;
+	/** @fixups_complete: Counts completed fixups stages */
+	atomic_t fixups_complete;
 	/** @debug: Debug hooks for delaying migration */
 	struct {
 		/**
-- 
2.25.1

Thread overview: 19+ messages
2026-02-18 23:21 [PATCH v2 0/5] drm/xe/vf: Fix exec queue creation during post-migration recovery Tomasz Lis
2026-02-18 23:21 ` [PATCH v2 1/5] drm/xe/queue: Call fini on exec queue creation fail Tomasz Lis
2026-02-18 23:21 ` [PATCH v2 2/5] drm/xe/vf: Avoid LRC being freed while applying fixups Tomasz Lis
2026-02-19 19:00   ` Matthew Brost
2026-02-20 15:20     ` Lis, Tomasz
2026-02-20 16:20       ` Matthew Brost
2026-02-18 23:21 ` [PATCH v2 3/5] drm/xe/vf: Wait for default LRCs fixups before using Tomasz Lis
2026-02-19 20:16   ` Matthew Brost
2026-02-19 20:40     ` Matthew Brost
2026-02-20 17:20       ` Lis, Tomasz
2026-02-20 18:20         ` Matthew Brost
2026-02-18 23:21 ` [PATCH v2 4/5] drm/xe/vf: Redo LRC creation while in VF fixups Tomasz Lis
2026-02-18 23:21 ` Tomasz Lis [this message]
2026-02-19 20:33   ` [PATCH v2 5/5] drm/xe/vf: Use marker to catch fixups during LRC creation Matthew Brost
2026-02-20 16:43     ` Lis, Tomasz
2026-02-20 17:41       ` Matthew Brost
2026-02-18 23:34 ` ✓ CI.KUnit: success for drm/xe/vf: Fix exec queue creation during post-migration recovery (rev2) Patchwork
2026-02-19  0:35 ` ✓ Xe.CI.BAT: " Patchwork
2026-02-19  1:49 ` ✗ Xe.CI.FULL: failure " Patchwork
