Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Michal Wajdeczko <michal.wajdeczko@intel.com>
To: "Lis, Tomasz" <tomasz.lis@intel.com>, <intel-xe@lists.freedesktop.org>
Cc: "Michał Winiarski" <michal.winiarski@intel.com>,
	"Piotr Piórkowski" <piotr.piorkowski@intel.com>,
	"Matthew Brost" <matthew.brost@intel.com>,
	"Satyanarayana K V P" <satyanarayana.k.v.p@intel.com>
Subject: Re: [PATCH v3 2/5] drm/xe/vf: Fix GuC FW check for VF migration support
Date: Fri, 17 Oct 2025 17:40:56 +0200	[thread overview]
Message-ID: <c6acc97c-7292-4a84-a4c7-a410350398e6@intel.com> (raw)
In-Reply-To: <a705eb88-0795-41b8-becb-cc41e2479785@intel.com>



On 10/17/2025 5:31 PM, Lis, Tomasz wrote:
> 
> On 10/16/2025 11:55 PM, Michal Wajdeczko wrote:
>> On 10/16/2025 2:05 PM, Tomasz Lis wrote:
>>> The check was done before GuC ABI version could be acquired.
>>> Comparing only to zeros provides very stable results, though
>>> not the ones expected.
>> instead of above sentence, better say that this was triggering:
>>
>> <4> [174.830604] xe 0000:00:02.1: [drm] Assertion `gt->sriov.vf.guc_version.major` failed!
>> ...
> ok.
>>> This change dislodged part of the VF migration support check
>>> and moved it to after GuC handshake.
>> and describe your changes in imperative mood
>>
>> [1] https://docs.kernel.org/process/submitting-patches.html#describe-your-changes
> will do.
>>> v2: Use xe_sriov_vf_ccs_migration_bb_needed()
>> you can keep change log under ---
> ack.
>>> Tested-by: Matthew Brost <matthew.brost@intel.com>
>> I guess above was true for # rev1
> ack.
>>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6349
>>> Fixes: be5590c384f3 ("drm/xe/vf: Enable CCS save/restore only on supported GUC versions")
>>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>>> ---
>>>  drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 41 +++++++++++++++++++++++++++++
>>>  drivers/gpu/drm/xe/xe_gt_sriov_vf.h |  1 +
>>>  drivers/gpu/drm/xe/xe_guc.c         |  2 ++
>>>  drivers/gpu/drm/xe/xe_sriov_vf.c    | 10 -------
>>>  4 files changed, 44 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>>> index 46518e629ba3..34c68de6e2f3 100644
>>> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>>> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
>>> @@ -314,6 +314,47 @@ static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
>>>  	return ret > 0 ? -EPROTO : ret;
>>>  }
>>>  
>>> +static void vf_disable_migration(struct xe_gt *gt, const char *fmt, ...)
>>> +{
>>> +	struct xe_device *xe = gt_to_xe(gt);
>>> +	struct va_format vaf;
>>> +	va_list va_args;
>>> +
>>> +	xe_gt_assert(gt, IS_SRIOV_VF(xe));
>>> +
>>> +	va_start(va_args, fmt);
>>> +	vaf.fmt = fmt;
>>> +	vaf.va  = &va_args;
>>> +	xe_gt_sriov_notice(gt, "migration disabled: %pV\n", &vaf);
>>> +	va_end(va_args);
>>> +
>>> +	xe->sriov.vf.migration.enabled = false;
>> this looks like a layer violation
>>
>> and we already have a function that wraps that at the device level
>>
>> maybe just promote device-level vf_disable_migration(xe,...) from xe_sriov_vf.c
>> and call it from this gt-level place ?
>>
>> hmm, but see below [2]
>>
>>> +}
>>> +
>>> +/**
>>> + * xe_gt_sriov_vf_guc_check_migration_support - Check for disable migration due to GuC.
>>> + * @gt: the &xe_gt struct instance linked to target GuC
>>> + *
>>> + * Performs late disable of VF migration feature in case GuC FW cannot support it.
>>> + */
>>> +void xe_gt_sriov_vf_guc_check_migration_support(struct xe_gt *gt)
>>> +{
>>> +	struct xe_device *xe = gt_to_xe(gt);
>>> +
>>> +	if (!xe_sriov_vf_migration_supported(xe))
>>> +		return;
>>> +
>>> +	if (xe_sriov_vf_ccs_migration_bb_needed(xe)) {
>>> +		struct xe_uc_fw_version guc_version;
>>> +
>>> +		xe_gt_sriov_vf_guc_versions(gt, NULL, &guc_version);
>>> +		if (MAKE_GUC_VER_STRUCT(guc_version) < MAKE_GUC_VER(1, 23, 0))
>>> +			return vf_disable_migration(gt,
>>> +				"CCS migration requires GuC ABI >= 1.23 but only %u.%u found",
>>> +				guc_version.major, guc_version.minor);
>> since we split migration checks from one place,
>> this CCS GuC ABI condition shall be placed in sriov_vf_ccs.c subcomponent
> will move.
>>> +	}
>>> +}
>>> +
>>>  /**
>>>   * vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
>>>   * @gt: the &xe_gt struct instance linked to target GuC
>>> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
>>> index af40276790fa..60a3b9b05b20 100644
>>> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
>>> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
>>> @@ -26,6 +26,7 @@ void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
>>>  int xe_gt_sriov_vf_init_early(struct xe_gt *gt);
>>>  int xe_gt_sriov_vf_init(struct xe_gt *gt);
>>>  bool xe_gt_sriov_vf_recovery_pending(struct xe_gt *gt);
>>> +void xe_gt_sriov_vf_guc_check_migration_support(struct xe_gt *gt);
>>>  
>>>  u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
>>>  u16 xe_gt_sriov_vf_guc_ids(struct xe_gt *gt);
>>> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
>>> index d94490979adc..3c4e64233b3a 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc.c
>>> @@ -713,6 +713,8 @@ static int vf_guc_init_noalloc(struct xe_guc *guc)
>>>  	if (err)
>>>  		return err;
>>>  
>>> +	xe_gt_sriov_vf_guc_check_migration_support(gt);
>>> +
>> [2] so this is now going through these layers:
>>
>> guc_vf    vf_guc_init_noalloc
>> gt_vf       xe_gt_sriov_vf_guc_check_migration_support
>> xe_vf         xe_sriov_vf_migration_supported
>> xe_vf_ccs     xe_sriov_vf_ccs_migration_bb_needed
>> gt_vf         xe_gt_sriov_vf_guc_versions
>> gt_vf         vf_disable_migration
>> xe_vf           xe->sriov.vf.migration.enabled
>>
>> so maybe better leave this VF GuC init as-is and just make "late" checks
>> in xe_device_probe either in xe_sriov_init_late
>>
>> xe        xe_sriov_init_late
>> xe_vf       xe_sriov_vf_init_late
>> xe_vf         xe_sriov_vf_migration_supported
>> xe_vf_ccs     xe_sriov_vf_ccs_init_late
>> xe_vf_ccs     xe_sriov_vf_ccs_migration_bb_needed
>> gt_vf           xe_gt_sriov_vf_guc_versions
>> xe_vf         vf_disable_migration
>>
>> or after for_each_gt/xe_gt_init_early loop
>>
>> xe        xe_device_probe
>> xe_vf       xe_sriov_vf_check_migration
>> xe_vf         xe_sriov_vf_migration_supported
>> xe_vf_ccs     xe_sriov_vf_ccs_init_late
>> xe_vf_ccs     xe_sriov_vf_ccs_migration_bb_needed
>> gt_vf           xe_gt_sriov_vf_guc_versions
>> xe_vf         vf_disable_migration
>>
>> or just make it as part of the xe_sriov_vf_ccs_init()
>> since before that point CCS migration is not working either
> 
> The check can be done later than where I put it; but it needs to be before IRQs are enabled. Both kinds of these are enabled in `xe_gt_init()`:
> 
> MEMIRQs in gt_init_with_gt_forcewake->xe_uc_init->xe_guc_enable_communication
> 
> MMIO IRQs in gt_init_with_gt_forcewake->xe_irq_enable_hwe
> 
> 
> The `xe_sriov_vf_ccs_init` is currently called after `xe_gt_init`. I'm not completely sure if this is correct placement .. it might be, if there are no CCS metadata set at that point.
> 
> GuC should have no problem assisting to migration without CCS metadata transfer, so that placement could be ok.
> 
> But that's definitely too late for figuring out whether we support migration at all.
> 
> So this eliminates the `xe_sriov_vf_init_late` and `xe_sriov_vf_ccs_init` options.
> 
> For the 2nd option - this can be done. But does it really make sense to put a single-platform workaround check directly in `xe_device_probe`?
> 
> I can do this - if you consider this option acceptable. Though personally I see no reason for trying to rip it out of `xe_gt_init_early`. It can be easily turned from per-gt check to single GT check (by checking only the primary GuC which will actually be responsible for scheduling the CCS save/restore BB execution), but that gives an option rather than a reason.
> 
> Any place between `xe_gt_init_early()` and `xe_gt_init()` is ok for me.
> 

on second thought, shouldn't we make GuC ABI check on the PF instead ?
then on the VF we could assume that minimal GuC ABI version is guaranteed 
and just have simple assert near the H2G that we expect to be available ?

> 
>>>  	err = xe_gt_sriov_vf_query_config(gt);
>>>  	if (err)
>>>  		return err;
>>> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
>>> index 911d5720917b..5fb042c05112 100644
>>> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
>>> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
>>> @@ -163,16 +163,6 @@ static void vf_migration_init_early(struct xe_device *xe)
>>>  		return vf_disable_migration(xe, "requires gfx version >= 20, but only %u found",
>>>  					    GRAPHICS_VER(xe));
>>>  
>>> -	if (!IS_DGFX(xe)) {
>>> -		struct xe_uc_fw_version guc_version;
>>> -
>>> -		xe_gt_sriov_vf_guc_versions(xe_device_get_gt(xe, 0), NULL, &guc_version);
>>> -		if (MAKE_GUC_VER_STRUCT(guc_version) < MAKE_GUC_VER(1, 23, 0))
>>> -			return vf_disable_migration(xe,
>>> -						    "CCS migration requires GuC ABI >= 1.23 but only %u.%u found",
>>> -						    guc_version.major, guc_version.minor);
>>> -	}
>>> -
>>>  	xe->sriov.vf.migration.enabled = true;
>>>  	xe_sriov_dbg(xe, "migration support enabled\n");
>> this would be non-reliable, as we might still disable migration later on
>>
>> so we should either remove it completely (assuming its "enabled" until explicitly disabled)
>> or reverse the logic and use this flag instead:
>>
>> 	xe->sriov.vf.migration.disabled
> 
> will remove, it doesn't make sense here.
> 
> -Tomasz
> 
>>>  }


  reply	other threads:[~2025-10-17 15:41 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-16 12:05 [PATCH v3 0/5] drm/xe/vf: Minor fixes to post-migration recovery Tomasz Lis
2025-10-16 12:05 ` [PATCH v3 1/5] drm/xe/vf: Helper for telling whether CCS migration BBs are needed Tomasz Lis
2025-10-16 19:57   ` Michal Wajdeczko
2025-10-16 21:51     ` Lis, Tomasz
2025-10-16 12:05 ` [PATCH v3 2/5] drm/xe/vf: Fix GuC FW check for VF migration support Tomasz Lis
2025-10-16 21:55   ` Michal Wajdeczko
2025-10-17 15:31     ` Lis, Tomasz
2025-10-17 15:40       ` Michal Wajdeczko [this message]
2025-10-16 12:05 ` [PATCH v3 3/5] drm/xe/vf: Skip fixups on VF migration before getting GGTT info Tomasz Lis
2025-10-16 22:20   ` Michal Wajdeczko
2025-10-20 19:30     ` Lis, Tomasz
2025-10-16 12:05 ` [PATCH v3 4/5] drm/xe: Assert that VF will never use fixed placement of BOs Tomasz Lis
2025-10-16 12:05 ` [PATCH v3 5/5] drm/xe/vf: Do not disable VF migration on ATS-M Tomasz Lis
2025-10-16 22:25   ` Michal Wajdeczko
2025-10-20 19:22     ` Lis, Tomasz
2025-10-16 12:11 ` ✓ CI.KUnit: success for drm/xe/vf: Minor fixes to post-migration recovery (rev3) Patchwork
2025-10-16 12:57 ` ✓ Xe.CI.BAT: " Patchwork
2025-10-17  8:23 ` ✗ Xe.CI.Full: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c6acc97c-7292-4a84-a4c7-a410350398e6@intel.com \
    --to=michal.wajdeczko@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=matthew.brost@intel.com \
    --cc=michal.winiarski@intel.com \
    --cc=piotr.piorkowski@intel.com \
    --cc=satyanarayana.k.v.p@intel.com \
    --cc=tomasz.lis@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox