On 10/16/2025 11:55 PM, Michal Wajdeczko wrote:

On 10/16/2025 2:05 PM, Tomasz Lis wrote:
The check was done before GuC ABI version could be acquired.
Comparing only to zeros provides very stable results, though
not the ones expected.
instead of above sentence, better say that this was triggering:

<4> [174.830604] xe 0000:00:02.1: [drm] Assertion `gt->sriov.vf.guc_version.major` failed!
...
ok.


This change dislodged part of the VF migration support check
and moved it to after GuC handshake.
and describe your changes in imperative mood

[1] https://docs.kernel.org/process/submitting-patches.html#describe-your-changes
will do.

v2: Use xe_sriov_vf_ccs_migration_bb_needed()
you can keep change log under ---
ack.

Tested-by: Matthew Brost <matthew.brost@intel.com>
I guess above was true for # rev1
ack.

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6349
Fixes: be5590c384f3 ("drm/xe/vf: Enable CCS save/restore only on supported GUC versions")
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 41 +++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_gt_sriov_vf.h |  1 +
 drivers/gpu/drm/xe/xe_guc.c         |  2 ++
 drivers/gpu/drm/xe/xe_sriov_vf.c    | 10 -------
 4 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
index 46518e629ba3..34c68de6e2f3 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
@@ -314,6 +314,47 @@ static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
 	return ret > 0 ? -EPROTO : ret;
 }
 
+static void vf_disable_migration(struct xe_gt *gt, const char *fmt, ...)
+{
+	struct xe_device *xe = gt_to_xe(gt);
+	struct va_format vaf;
+	va_list va_args;
+
+	xe_gt_assert(gt, IS_SRIOV_VF(xe));
+
+	va_start(va_args, fmt);
+	vaf.fmt = fmt;
+	vaf.va  = &va_args;
+	xe_gt_sriov_notice(gt, "migration disabled: %pV\n", &vaf);
+	va_end(va_args);
+
+	xe->sriov.vf.migration.enabled = false;
this looks like a layer violation

and we already have a function that wraps that at the device level

maybe just promote device-level vf_disable_migration(xe,...) from xe_sriov_vf.c
and call it from this gt-level place ?

hmm, but see below [2]

+}
+
+/**
+ * xe_gt_sriov_vf_guc_check_migration_support - Check for disable migration due to GuC.
+ * @gt: the &xe_gt struct instance linked to target GuC
+ *
+ * Performs late disable of VF migration feature in case GuC FW cannot support it.
+ */
+void xe_gt_sriov_vf_guc_check_migration_support(struct xe_gt *gt)
+{
+	struct xe_device *xe = gt_to_xe(gt);
+
+	if (!xe_sriov_vf_migration_supported(xe))
+		return;
+
+	if (xe_sriov_vf_ccs_migration_bb_needed(xe)) {
+		struct xe_uc_fw_version guc_version;
+
+		xe_gt_sriov_vf_guc_versions(gt, NULL, &guc_version);
+		if (MAKE_GUC_VER_STRUCT(guc_version) < MAKE_GUC_VER(1, 23, 0))
+			return vf_disable_migration(gt,
+				"CCS migration requires GuC ABI >= 1.23 but only %u.%u found",
+				guc_version.major, guc_version.minor);
since we split migration checks from one place,
this CCS GuC ABI condition shall be placed in sriov_vf_ccs.c subcomponent
will move.

+	}
+}
+
 /**
  * vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
  * @gt: the &xe_gt struct instance linked to target GuC
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
index af40276790fa..60a3b9b05b20 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
@@ -26,6 +26,7 @@ void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
 int xe_gt_sriov_vf_init_early(struct xe_gt *gt);
 int xe_gt_sriov_vf_init(struct xe_gt *gt);
 bool xe_gt_sriov_vf_recovery_pending(struct xe_gt *gt);
+void xe_gt_sriov_vf_guc_check_migration_support(struct xe_gt *gt);
 
 u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
 u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
index d94490979adc..3c4e64233b3a 100644
--- a/drivers/gpu/drm/xe/xe_guc.c
+++ b/drivers/gpu/drm/xe/xe_guc.c
@@ -713,6 +713,8 @@ static int vf_guc_init_noalloc(struct xe_guc *guc)
 	if (err)
 		return err;
 
+	xe_gt_sriov_vf_guc_check_migration_support(gt);
+
[2] so this is now going through these layers:

guc_vf    vf_guc_init_noalloc
gt_vf       xe_gt_sriov_vf_guc_check_migration_support
xe_vf         xe_sriov_vf_migration_supported
xe_vf_ccs     xe_sriov_vf_ccs_migration_bb_needed
gt_vf         xe_gt_sriov_vf_guc_versions
gt_vf         vf_disable_migration
xe_vf           xe->sriov.vf.migration.enabled

so maybe better leave this VF GuC init as-is and just make "late" checks
in xe_device_probe either in xe_sriov_init_late

xe        xe_sriov_init_late
xe_vf       xe_sriov_vf_init_late
xe_vf         xe_sriov_vf_migration_supported
xe_vf_ccs     xe_sriov_vf_ccs_init_late
xe_vf_ccs     xe_sriov_vf_ccs_migration_bb_needed
gt_vf           xe_gt_sriov_vf_guc_versions
xe_vf         vf_disable_migration

or after for_each_gt/xe_gt_init_early loop

xe        xe_device_probe
xe_vf       xe_sriov_vf_check_migration
xe_vf         xe_sriov_vf_migration_supported
xe_vf_ccs     xe_sriov_vf_ccs_init_late
xe_vf_ccs     xe_sriov_vf_ccs_migration_bb_needed
gt_vf           xe_gt_sriov_vf_guc_versions
xe_vf         vf_disable_migration

or just make it as part of the xe_sriov_vf_ccs_init()
since before that point CCS migration is not working either

The check can be done later than where I put it; but it needs to be before IRQs are enabled. Both kinds of these are enabled in `xe_gt_init()`:

MEMIRQs in gt_init_with_gt_forcewake->xe_uc_init->xe_guc_enable_communication

MMIO IRQs in gt_init_with_gt_forcewake->xe_irq_enable_hwe


The `xe_sriov_vf_ccs_init` is currently called after `xe_gt_init`. I'm not completely sure if this is correct placement .. it might be, if there are no CCS metadata set at that point.

GuC should have no problem assisting to migration without CCS metadata transfer, so that placement could be ok.

But that's definitely too late for figuring out whether we support migration at all.

So this eliminates the `xe_sriov_vf_init_late` and `xe_sriov_vf_ccs_init` options.

For the 2nd option - this can be done. But does it really make sense to put a single-platform workaround check directly in `xe_device_probe`?

I can do this - if you consider this option acceptable. Though personally I see no reason for trying to rip it out of `xe_gt_init_early`. It can be easily turned from per-gt check to single GT check (by checking only the primary GuC which will actually be responsible for scheduling the CCS save/restore BB execution), but that gives an option rather than a reason.

Any place between `xe_gt_init_early()` and `xe_gt_init()` is ok for me.



 	err = xe_gt_sriov_vf_query_config(gt);
 	if (err)
 		return err;
diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
index 911d5720917b..5fb042c05112 100644
--- a/drivers/gpu/drm/xe/xe_sriov_vf.c
+++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
@@ -163,16 +163,6 @@ static void vf_migration_init_early(struct xe_device *xe)
 		return vf_disable_migration(xe, "requires gfx version >= 20, but only %u found",
 					    GRAPHICS_VER(xe));
 
-	if (!IS_DGFX(xe)) {
-		struct xe_uc_fw_version guc_version;
-
-		xe_gt_sriov_vf_guc_versions(xe_device_get_gt(xe, 0), NULL, &guc_version);
-		if (MAKE_GUC_VER_STRUCT(guc_version) < MAKE_GUC_VER(1, 23, 0))
-			return vf_disable_migration(xe,
-						    "CCS migration requires GuC ABI >= 1.23 but only %u.%u found",
-						    guc_version.major, guc_version.minor);
-	}
-
 	xe->sriov.vf.migration.enabled = true;
 	xe_sriov_dbg(xe, "migration support enabled\n");
this would be non-reliable, as we might still disable migration later on

so we should either remove it completely (assuming its "enabled" until explicitly disabled)
or reverse the logic and use this flag instead:

	xe->sriov.vf.migration.disabled

will remove, it doesn't make sense here.

-Tomasz


 }