Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Matthew Brost <matthew.brost@intel.com>
To: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
	Michal Wajdeczko <michal.wajdeczko@intel.com>,
	Tomasz Lis <tomasz.lis@intel.com>
Subject: Re: [PATCH v2 1/2] drm/xe/vf: Introduce RESFIX start marker support
Date: Fri, 31 Oct 2025 13:10:13 -0700	[thread overview]
Message-ID: <aQUXpd0HkcFg2iZ0@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <aPqSDVJ4QR7kzT07@lstrano-desk.jf.intel.com>

On Thu, Oct 23, 2025 at 01:37:33PM -0700, Matthew Brost wrote:
> On Thu, Oct 23, 2025 at 09:06:11PM +0530, Satyanarayana K V P wrote:
> > Post migration, a marker is sent to the GUC prior to the start of resource
> > fixups to indicate start of resource fixups. The same marker is sent along
> > with RESFIX_DONE notification so that GUC can avoid submitting jobs to HW
> > in case of double migration.
> > 
> > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Tomasz Lis <tomasz.lis@intel.com>
> > 
> > ---
> > V1 -> V2:
> > - Squashed "Enable RESFIX start marker only on supported GUC
> > versions" commit into a single commit. (Matt B)
> > ---
> >  .../gpu/drm/xe/abi/guc_actions_sriov_abi.h    | 38 ++++++++
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf.c           | 87 +++++++++++++++++--
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h     |  5 ++
> >  drivers/gpu/drm/xe/xe_sriov_vf.c              | 42 ++++++++-
> >  drivers/gpu/drm/xe/xe_sriov_vf_types.h        |  5 ++
> >  5 files changed, 169 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> > index 0b28659d94e9..b9141497bfd5 100644
> > --- a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> > +++ b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> > @@ -656,4 +656,42 @@
> >  #define PF2GUC_SAVE_RESTORE_VF_RESPONSE_MSG_LEN		GUC_HXG_RESPONSE_MSG_MIN_LEN
> >  #define PF2GUC_SAVE_RESTORE_VF_RESPONSE_MSG_0_USED	GUC_HXG_RESPONSE_MSG_0_DATA0
> >  
> > +/**
> > + * DOC: VF2GUC_NOTIFY_RESFIX_START
> > + *
> > + * This action is used by VF to notify the GuC that the VF KMD will be starting
> > + * post-migration recovery steps.
> > + *
> > + * This message must be sent as `MMIO HXG Message`_.
> > + *
> > + *  +---+-------+--------------------------------------------------------------+
> > + *  |   | Bits  | Description                                                  |
> > + *  +===+=======+==============================================================+
> > + *  | 0 |    31 | ORIGIN = GUC_HXG_ORIGIN_HOST_                                |
> > + *  |   +-------+--------------------------------------------------------------+
> > + *  |   | 30:28 | TYPE = GUC_HXG_TYPE_REQUEST_                                 |
> > + *  |   +-------+--------------------------------------------------------------+
> > + *  |   | 27:16 | DATA0 = MBZ                                                  |
> > + *  |   +-------+--------------------------------------------------------------+
> > + *  |   |  15:0 | ACTION = _`GUC_ACTION_VF2GUC_NOTIFY_RESFIX_START` = 0x550F   |
> > + *  +---+-------+--------------------------------------------------------------+
> > + *
> > + *  +---+-------+--------------------------------------------------------------+
> > + *  |   | Bits  | Description                                                  |
> > + *  +===+=======+==============================================================+
> > + *  | 0 |    31 | ORIGIN = GUC_HXG_ORIGIN_GUC_                                 |
> > + *  |   +-------+--------------------------------------------------------------+
> > + *  |   | 30:28 | TYPE = GUC_HXG_TYPE_RESPONSE_SUCCESS_                        |
> > + *  |   +-------+--------------------------------------------------------------+
> > + *  |   |  27:0 | DATA0 = MBZ                                                  |
> > + *  +---+-------+--------------------------------------------------------------+
> > + */
> > +#define GUC_ACTION_VF2GUC_NOTIFY_RESFIX_START          0x550Fu
> > +
> > +#define VF2GUC_NOTIFY_RESFIX_START_REQUEST_MSG_LEN     GUC_HXG_REQUEST_MSG_MIN_LEN
> > +#define VF2GUC_NOTIFY_RESFIX_START_REQUEST_MSG_0_MBZ   GUC_HXG_REQUEST_MSG_0_DATA0
> > +
> > +#define VF2GUC_NOTIFY_RESFIX_START_RESPONSE_MSG_LEN    GUC_HXG_RESPONSE_MSG_MIN_LEN
> > +#define VF2GUC_NOTIFY_RESFIX_START_RESPONSE_MSG_0_MBZ  GUC_HXG_RESPONSE_MSG_0_DATA0
> > +
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > index d0b102ab6ce8..8c1448d6c81d 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > @@ -299,12 +299,55 @@ void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
> >  		*found = gt->sriov.vf.guc_version;
> >  }
> >  
> > -static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
> > +static int guc_action_vf_notify_resfix_start(struct xe_guc *guc, u16 marker)
> >  {
> >  	u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = {
> >  		FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) |
> >  		FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> > -		FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION, GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE),
> > +		FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION,
> > +			   GUC_ACTION_VF2GUC_NOTIFY_RESFIX_START) |
> > +		FIELD_PREP(GUC_HXG_REQUEST_MSG_0_DATA0, marker),
> > +	};
> > +	int ret;
> > +
> > +	ret = xe_guc_mmio_send(guc, request, ARRAY_SIZE(request));
> > +
> > +	return ret > 0 ? -EPROTO : ret;
> > +}
> > +
> > +/**
> > + * vf_notify_resfix_start - Notify GuC about start of resource fixups.
> > + * @gt: the &xe_gt struct instance linked to target GuC
> > + * @marker: marker to identify the migration.
> > + *
> > + * Returns: 0 if the operation completed successfully, or a negative error
> > + * code otherwise.
> > + */
> > +static int vf_notify_resfix_start(struct xe_gt *gt, u16 marker)
> > +{
> > +	struct xe_guc *guc = &gt->uc.guc;
> > +	int err;
> > +
> > +	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> > +
> > +	err = guc_action_vf_notify_resfix_start(guc, marker);
> > +	if (unlikely(err))
> > +		xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup start (%pe)\n",
> > +				ERR_PTR(err));
> > +	else
> > +		xe_gt_sriov_dbg_verbose(gt, "sent GuC resource fixup start\n");
> > +
> > +	return err;
> > +}
> > +
> > +static int guc_action_vf_notify_resfix_done(struct xe_guc *guc, u16 marker)
> > +{
> > +	u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = {
> > +		FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) |
> > +		FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> > +		FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION,
> > +			   GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE) |
> > +		FIELD_PREP(GUC_HXG_REQUEST_MSG_0_DATA0, marker),
> >  	};
> >  	int ret;
> >  
> > @@ -316,18 +359,19 @@ static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
> >  /**
> >   * vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
> >   * @gt: the &xe_gt struct instance linked to target GuC
> > + * @marker: marker to identify the migration.
> >   *
> >   * Returns: 0 if the operation completed successfully, or a negative error
> >   * code otherwise.
> >   */
> > -static int vf_notify_resfix_done(struct xe_gt *gt)
> > +static int vf_notify_resfix_done(struct xe_gt *gt, u16 marker)
> >  {
> >  	struct xe_guc *guc = &gt->uc.guc;
> >  	int err;
> >  
> >  	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >  
> > -	err = guc_action_vf_notify_resfix_done(guc);
> > +	err = guc_action_vf_notify_resfix_done(guc, marker);
> >  	if (unlikely(err))
> >  		xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup done (%pe)\n",
> >  				ERR_PTR(err));
> > @@ -1183,7 +1227,7 @@ static void vf_post_migration_abort(struct xe_gt *gt)
> >  	xe_guc_submit_pause_abort(&gt->uc.guc);
> >  }
> >  
> > -static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
> > +static int vf_post_migration_notify_resfix_done(struct xe_gt *gt, u16 marker)
> >  {
> >  	bool skip_resfix = false;
> >  
> > @@ -1206,12 +1250,27 @@ static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
> >  	 */
> >  	xe_irq_resume(gt_to_xe(gt));
> >  
> > -	return vf_notify_resfix_done(gt);
> > +	return vf_notify_resfix_done(gt, marker);
> > +}
> > +
> > +static bool vf_resfix_start_marker_supported(struct xe_gt *gt)
> > +{
> > +	struct xe_device *xe = gt_to_xe(gt);
> > +
> > +	xe_gt_assert(gt, IS_SRIOV_VF(xe));
> > +	return xe->sriov.vf.migration.resfix_marker_enabled;
> > +}
> > +
> > +static u16 vf_post_migration_resfix_start_marker(struct xe_gt *gt)
> > +{
> > +	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> > +	return ++gt->sriov.vf.migration.resfix_marker;
> >  }
> >  
> >  static void vf_post_migration_recovery(struct xe_gt *gt)
> >  {
> >  	struct xe_device *xe = gt_to_xe(gt);
> > +	u16 marker = 0;
> >  	int err;
> >  	bool retry;
> >  
> > @@ -1227,13 +1286,27 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
> >  		goto fail;
> >  	}
> >  
> > +	/*
> > +	 * Increment the startup marker again if it overflows, since GUC
> > +	 * requires a non-zero marker to be set.
> > +	 */
> > +	if (vf_resfix_start_marker_supported(gt)) {
> > +		marker = vf_post_migration_resfix_start_marker(gt);
> > +		if (!marker)
> > +			marker = vf_post_migration_resfix_start_marker(gt);
> 
> Nit: Maybe just don't zero in vf_post_migration_resfix_start_marker.
> 
> > +	}
> > +
> > +	err = vf_notify_resfix_start(gt, marker);
> > +	if (err)
> > +		goto fail;
> > +
> >  	err = vf_post_migration_fixups(gt);
> >  	if (err)
> >  		goto fail;
> >  
> >  	vf_post_migration_rearm(gt);
> >  
> > -	err = vf_post_migration_notify_resfix_done(gt);
> > +	err = vf_post_migration_notify_resfix_done(gt, marker);
> >  	if (err && err != -EAGAIN)
> >  		goto fail;
> >  
> 
> Nit: You missed my comment skipping kickstart if err == -EAGAIN but I
> think that harmless, in existing code, and can be done in a follow up.

Actually the above comment is wrong too. We always need to run the
kickstart again to GuC submission backend state resets itself. Sorry for
the noise.

Matt

> 
> Anyways patch looks functionaly correct to me:
> Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> 
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > index 420b0e6089de..ccd850313328 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > @@ -60,6 +60,11 @@ struct xe_gt_sriov_vf_migration {
> >  	bool recovery_inprogress;
> >  	/** @ggtt_need_fixes: VF GGTT needs fixes */
> >  	bool ggtt_need_fixes;
> > +	/**
> > +	 * @resfix_marker: Marker sent to Guc prior to starting the
> > +	 * post‑migration.
> > +	 */
> > +	u16 resfix_marker;
> >  };
> >  
> >  /**
> > diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> > index 39c829daa97c..10d6e43fffce 100644
> > --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> > @@ -55,7 +55,21 @@
> >   * When the VF driver is ready to continue operation on the newly connected
> >   * hardware, it sends `VF2GUC_NOTIFY_RESFIX_DONE` which causes it to
> >   * enter the long awaited `VF_RUNNING` state, and therefore start handling
> > - * CTB messages and scheduling workloads from the VF::
> > + * CTB messages and scheduling workloads from the VF.
> > + *
> > + * In scenarios involving double migration, the VF KMD may encounter situations
> > + * where it is instructed to re-migrate before having the opportunity to send
> > + * RESFIX_DONE for the initial migration. This can occur when the fix-up for the
> > + * prior migration is still underway, but the VF KMD is migrated again.
> > + * Consequently, this may lead to the possibility of sending two migration
> > + * notifications (i.e., pending fix-up for the first migration and a second
> > + * notification for the new migration). Upon receiving the first RES_FIX
> > + * notification, the GuC will resume VF submission on the GPU, potentially
> > + * resulting in undefined behavior, such as system hangs or crashes.
> > + *
> > + * To avoid these hangs, a new VF2GUC action `VF2GUC_NOTIFY_RESFIX_START` is
> > + * sent along with marker and when GUC receives the same marker with
> > + * `VF2GUC_NOTIFY_RESFIX_DONE`action, it starts scheduling work loads from VF::
> >   *
> >   *      PF                             GuC                              VF
> >   *     [ ]                              |                               |
> > @@ -102,6 +116,11 @@
> >   *      |                              [ ]        new VF provisioning  [ ]
> >   *      |                              [ ]---------------------------> [ ]
> >   *      |                               |                              [ ]
> > + *      |                               |   VF2GUC_NOTIFY_RESFIX_START [ ]
> > + *      |                              [ ] <---------------------------[ ]
> > + *      |                              [ ]                             [ ]
> > + *      |                              [ ]                     success [ ]
> > + *      |                              [ ]---------------------------> [ ]
> >   *      |                               |       VF driver applies post [ ]
> >   *      |                               |      migration fixups -------[ ]
> >   *      |                               |                       |      [ ]
> > @@ -169,6 +188,26 @@ static void vf_migration_init_early(struct xe_device *xe)
> >  
> >  }
> >  
> > +static void vf_resfix_start_marker_init_early(struct xe_device *xe)
> > +{
> > +	struct xe_gt *gt = xe_root_mmio_gt(xe);
> > +	struct xe_uc_fw_version guc_version;
> > +
> > +	if (xe->sriov.vf.migration.disabled)
> > +		return;
> > +
> > +	xe_gt_sriov_vf_guc_versions(gt, NULL, &guc_version);
> > +	if (MAKE_GUC_VER_STRUCT(guc_version) < MAKE_GUC_VER(1, 24, 10)) {
> > +		xe_sriov_notice(xe,
> > +				"Resfix start marker requires GUC ABI >= 1.24.10, but only %u.%u.%u found",
> > +				guc_version.major, guc_version.minor, guc_version.patch);
> > +		return;
> > +	}
> > +
> > +	xe->sriov.vf.migration.resfix_marker_enabled = true;
> > +	xe_sriov_dbg(xe, "migrate: Resfix start marker support is enabled\n");
> > +}
> > +
> >  /**
> >   * xe_sriov_vf_init_early - Initialize SR-IOV VF specific data.
> >   * @xe: the &xe_device to initialize
> > @@ -176,6 +215,7 @@ static void vf_migration_init_early(struct xe_device *xe)
> >  void xe_sriov_vf_init_early(struct xe_device *xe)
> >  {
> >  	vf_migration_init_early(xe);
> > +	vf_resfix_start_marker_init_early(xe);
> >  }
> >  
> >  /**
> > diff --git a/drivers/gpu/drm/xe/xe_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> > index d5f72d667817..626c11a6dd1b 100644
> > --- a/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> > +++ b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> > @@ -38,6 +38,11 @@ struct xe_device_vf {
> >  		 * was turned off due to missing prerequisites
> >  		 */
> >  		bool disabled;
> > +		/**
> > +		 * @migration.resfix_marker_enabled: flag indicating if resfix marker
> > +		 * support was enabled or not due to missing prerequisites.
> > +		 */
> > +		bool resfix_marker_enabled;
> >  	} migration;
> >  
> >  	/** @ccs: VF CCS state data */
> > -- 
> > 2.51.0
> > 

  parent reply	other threads:[~2025-10-31 20:10 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-23 15:36 [PATCH v2 0/2] VF double migration Satyanarayana K V P
2025-10-23 15:36 ` [PATCH v2 1/2] drm/xe/vf: Introduce RESFIX start marker support Satyanarayana K V P
2025-10-23 20:37   ` Matthew Brost
2025-10-23 20:54     ` Matthew Brost
2025-10-31 20:10     ` Matthew Brost [this message]
2025-10-23 23:33   ` Michal Wajdeczko
2025-10-23 15:36 ` [PATCH v2 2/2] drm/xe/vf: Use fault injection for testing VF double migration feature Satyanarayana K V P
2025-10-24 11:52   ` Michal Wajdeczko
2025-10-23 17:18 ` ✓ CI.KUnit: success for VF double migration (rev2) Patchwork
2025-10-23 18:17 ` ✗ Xe.CI.BAT: failure " Patchwork
2025-10-24  5:23 ` ✗ Xe.CI.Full: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aQUXpd0HkcFg2iZ0@lstrano-desk.jf.intel.com \
    --to=matthew.brost@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=michal.wajdeczko@intel.com \
    --cc=satyanarayana.k.v.p@intel.com \
    --cc=tomasz.lis@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox