Re: [PATCH v2 12/34] drm/xe/vf: Make VF recovery run on per-GT worker

From: Matthew Brost <matthew.brost@intel.com>
To: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v2 12/34] drm/xe/vf: Make VF recovery run on per-GT worker
Date: Wed, 24 Sep 2025 12:50:50 -0700
Message-ID: <aNRLmmK5yqXa3xNb@lstrano-desk.jf.intel.com>
In-Reply-To: <55c5e870-6823-4fb3-80ab-0e6914d054d2@intel.com>

On Wed, Sep 24, 2025 at 12:49:25PM +0200, Michal Wajdeczko wrote:
> 
> 
> On 9/24/2025 3:15 AM, Matthew Brost wrote:
> > VF recovery is a per-GT operation, so it makes sense to isolate it to a
> 
> that was also my suggestion to make it per-GT, good to see it happen now
> 

+1

> > per-GT queue. Scheduling this operation on the same worker as the GT
> > reset and TDR not only aligns with this design but also helps avoid race
> > conditions, as those operations can also modify the queue state.
> 
> but while the recovery is per-GT, we should still protect against that
> one GT starts the recovery sooner then other GTs notice about it
> 

Yes. There is shared state in two places:

 - GGTT shifting, which is handled by [1] in my series.
 - CCS restore on iGPU (PTL), which is handled by [2] and [3] in my series.

[1] https://patchwork.freedesktop.org/patch/676394/?series=154627&rev=2
[2] https://patchwork.freedesktop.org/patch/676393/?series=154627&rev=2
[3] https://patchwork.freedesktop.org/patch/676397/?series=154627&rev=2

> > 
> > v2:
> >  - Fix lockdep splat (Adam)
> >  - Use xe_sriov_vf_migration_supported helper
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf.c       | 170 ++++++++++++++-
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf.h       |   3 +-
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h |   7 +
> >  drivers/gpu/drm/xe/xe_sriov_vf.c          | 242 +---------------------
> >  drivers/gpu/drm/xe/xe_sriov_vf.h          |   1 -
> >  drivers/gpu/drm/xe/xe_sriov_vf_types.h    |   4 -
> >  6 files changed, 169 insertions(+), 258 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > index c9d0e32e7a15..cfb71b749e52 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > @@ -25,11 +25,15 @@
> >  #include "xe_guc.h"
> >  #include "xe_guc_hxg_helpers.h"
> >  #include "xe_guc_relay.h"
> > +#include "xe_guc_submit.h"
> > +#include "xe_irq.h"
> >  #include "xe_lrc.h"
> >  #include "xe_memirq.h"
> >  #include "xe_mmio.h"
> > +#include "xe_pm.h"
> >  #include "xe_sriov.h"
> >  #include "xe_sriov_vf.h"
> > +#include "xe_tile_sriov_vf.h"
> >  #include "xe_uc_fw.h"
> >  #include "xe_wopcm.h"
> >  
> > @@ -314,7 +318,7 @@ static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
> >   * Returns: 0 if the operation completed successfully, or a negative error
> >   * code otherwise.
> >   */
> > -int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt)
> > +static int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt)
> >  {
> >  	struct xe_guc *guc = &gt->uc.guc;
> >  	int err;
> > @@ -808,7 +812,7 @@ int xe_gt_sriov_vf_connect(struct xe_gt *gt)
> >   * xe_gt_sriov_vf_default_lrcs_hwsp_rebase - Update GGTT references in HWSP of default LRCs.
> >   * @gt: the &xe_gt struct instance
> >   */
> > -void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt)
> > +static void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt)
> >  {
> >  	struct xe_hw_engine *hwe;
> >  	enum xe_hw_engine_id id;
> > @@ -817,6 +821,26 @@ void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt)
> >  		xe_default_lrc_update_memirq_regs_with_address(hwe);
> >  }
> >  
> > +static void xe_gt_sriov_vf_start_migration_recovery(struct xe_gt *gt)
> 
> nit: if this is static then could be just:
> 
> 	vf_start_migration_recovery(gt)
> 

Sure.

> > +{
> > +	bool started;
> > +
> > +	xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> > +
> > +	spin_lock(&gt->sriov.vf.migration.lock);
> > +
> > +	if (!gt->sriov.vf.migration.recovery_queued) {
> > +		gt->sriov.vf.migration.recovery_queued = true;
> > +		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
> > +
> > +		started = queue_work(gt->ordered_wq, &gt->sriov.vf.migration.worker);
> > +		xe_gt_sriov_info(gt, "VF migration recovery %s\n", started ?
> > +				 "scheduled" : "already in progress");
> 
> with this .recovery_queued flag, can we hit "already in progress" case?
> 

This was existing code, so I kept it, but I'm really unclear how we can get
multiple resfix IRQs before the prior resfix flow has completed.
Maybe Tomasz can clear this one up? Ideally I'd like to remove the code that
handles multiple resfix IRQs if that case is not possible.
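
To make the question concrete, here is a minimal userspace sketch of how the
flag and the queue interact. It assumes queue_work() returns false only while
the work item is still pending, and that the pending state is cleared before
the work function runs. All model_* names are hypothetical stand-ins rather
than driver code, and locking is omitted because the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-GT migration state in the patch. */
struct model_gt {
	bool work_pending;            /* models the workqueue pending state */
	bool recovery_queued;         /* models migration.recovery_queued */
	int scheduled_hits;           /* times the "scheduled" branch ran */
	int already_in_progress_hits; /* times "already in progress" ran */
};

/* Models queue_work(): returns false if the item is already pending. */
static bool model_queue_work(struct model_gt *gt)
{
	if (gt->work_pending)
		return false;
	gt->work_pending = true;
	return true;
}

/* Models the start-recovery path from the patch (lock omitted). */
static void model_start_recovery(struct model_gt *gt)
{
	if (!gt->recovery_queued) {
		gt->recovery_queued = true;
		if (model_queue_work(gt))
			gt->scheduled_hits++;
		else
			gt->already_in_progress_hits++;
	}
}

/* Models the worker starting: pending is cleared first, then the flag. */
static void model_worker_begin(struct model_gt *gt)
{
	assert(gt->work_pending);
	gt->work_pending = false;
	gt->recovery_queued = false;
}

/* Replays three IRQs around one worker start; returns the miss count. */
static int model_run_scenario(struct model_gt *gt)
{
	model_start_recovery(gt); /* first IRQ: queues the work */
	model_start_recovery(gt); /* IRQ before the worker runs: no-op */
	model_worker_begin(gt);   /* worker starts and clears the flag */
	model_start_recovery(gt); /* IRQ while the worker runs: requeues */
	return gt->already_in_progress_hits;
}
```

Under those assumptions model_run_scenario() never takes the "already in
progress" branch, because recovery_queued is only cleared after the pending
state has been cleared.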

> > +	}
> > +
> > +	spin_unlock(&gt->sriov.vf.migration.lock);
> > +}
> > +
> >  /**
> >   * xe_gt_sriov_vf_migrated_event_handler - Start a VF migration recovery,
> >   *   or just mark that a GuC is ready for it.
> > @@ -831,15 +855,8 @@ void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt)
> >  	xe_gt_assert(gt, IS_SRIOV_VF(xe));
> >  	xe_gt_assert(gt, xe_gt_sriov_vf_recovery_inprogress(gt));
> >  
> > -	set_bit(gt->info.id, &xe->sriov.vf.migration.gt_flags);
> > -	/*
> > -	 * We need to be certain that if all flags were set, at least one
> > -	 * thread will notice that and schedule the recovery.
> > -	 */
> > -	smp_mb__after_atomic();
> > -
> >  	xe_gt_sriov_info(gt, "ready for recovery after migration\n");
> > -	xe_sriov_vf_start_migration_recovery(xe);
> > +	xe_gt_sriov_vf_start_migration_recovery(gt);
> >  }
> >  
> >  static bool vf_is_negotiated(struct xe_gt *gt, u16 major, u16 minor)
> > @@ -1175,6 +1192,139 @@ void xe_gt_sriov_vf_print_version(struct xe_gt *gt, struct drm_printer *p)
> >  		   pf_version->major, pf_version->minor);
> >  }
> >  
> > +static void vf_post_migration_shutdown(struct xe_gt *gt)
> > +{
> > +	int ret = 0;
> > +
> > +	spin_lock_irq(&gt->sriov.vf.migration.lock);
> > +	gt->sriov.vf.migration.recovery_queued = false;
> > +	spin_unlock_irq(&gt->sriov.vf.migration.lock);
> > +
> > +	xe_guc_submit_pause(&gt->uc.guc);
> > +	ret |= xe_guc_submit_reset_block(&gt->uc.guc);
> 
> this |= seem unneeded
> 

This is existing code that was copied and pasted, and it is removed in [4].

[4] https://patchwork.freedesktop.org/patch/676382/?series=154627&rev=2

> > +
> > +	if (ret)
> > +		xe_gt_sriov_info(gt, "migration recovery encountered ongoing reset\n");
> 
> is this the only possible reason? maybe worth to add %pe ?
> 

Again, this is existing code and will be removed in [4].

So given the above, I'd prefer to just leave it as is.

> > +}
> > +
> > +static size_t post_migration_scratch_size(struct xe_device *xe)
> > +{
> > +	return max(xe_lrc_reg_size(xe), LRC_WA_BB_SIZE);
> > +}
> > +
> > +static int vf_post_migration_fixups(struct xe_gt *gt)
> > +{
> > +	s64 shift;
> > +	void *buf;
> > +	int err;
> > +
> > +	buf = kmalloc(post_migration_scratch_size(gt_to_xe(gt)), GFP_ATOMIC);
> > +	if (!buf)
> > +		return -ENOMEM;
> > +
> > +	err = xe_gt_sriov_vf_query_config(gt);
> > +	if (err)
> > +		goto out;
> > +
> > +	shift = xe_gt_sriov_vf_ggtt_shift(gt);
> > +	if (shift) {
> > +		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
> > +		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
> > +		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
> > +		if (err)
> > +			goto out;
> > +	}
> > +
> > +out:
> > +	kfree(buf);
> > +	return err;
> > +}
> > +
> > +static void vf_post_migration_kickstart(struct xe_gt *gt)
> > +{
> > +	/*
> > +	 * Make sure interrupts on the new HW are properly set. The GuC IRQ
> > +	 * must be working at this point, since the recovery did started,
> > +	 * but the rest was not enabled using the procedure from spec.
> > +	 */
> > +	xe_irq_resume(gt_to_xe(gt));
> > +
> > +	xe_guc_submit_reset_unblock(&gt->uc.guc);
> > +	xe_guc_submit_unpause(&gt->uc.guc);
> > +}
> > +
> > +static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
> > +{
> > +	bool skip_resfix = false;
> > +
> > +	spin_lock_irq(&gt->sriov.vf.migration.lock);
> > +	if (gt->sriov.vf.migration.recovery_queued) {
> > +		skip_resfix = true;
> > +		xe_gt_sriov_dbg(gt, "another recovery imminent, skipped some notifications\n");
> > +	} else {
> > +		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, false);
> > +	}
> > +	spin_unlock_irq(&gt->sriov.vf.migration.lock);
> > +
> > +	return skip_resfix ? -EAGAIN : xe_gt_sriov_vf_notify_resfix_done(gt);
> 
> nit: this looks cleaner:
> 
> 	if (skip_resfix)
> 		return -EAGAIN;
> 
> 	return xe_gt_sriov_vf_notify_resfix_done(gt);
> 

Sure.
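
For reference, a minimal userspace sketch of how the restructured tail could
read. The model_* names and the resfix_done_sent counter are hypothetical and
only stand in for the driver state and the GuC notification; in the patch the
flag check runs under migration.lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-GT migration state. */
struct model_migration {
	bool recovery_queued;
	bool recovery_inprogress;
	int resfix_done_sent; /* counts modelled RESFIX_DONE notifications */
};

/* Models the GuC notification; always succeeds in this sketch. */
static int model_notify_resfix_done(struct model_migration *m)
{
	m->resfix_done_sent++;
	return 0;
}

static int model_post_migration_notify_resfix_done(struct model_migration *m)
{
	bool skip_resfix = false;

	/* In the patch this section is protected by migration.lock. */
	if (m->recovery_queued)
		skip_resfix = true;
	else
		m->recovery_inprogress = false;

	/* Another recovery is queued: leave the GuC in its current state. */
	if (skip_resfix)
		return -EAGAIN;

	return model_notify_resfix_done(m);
}
```

The caller treats -EAGAIN as "not a failure", matching the
`if (err && err != -EAGAIN)` check in vf_post_migration_recovery().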

> > +}
> > +
> > +static void vf_post_migration_recovery(struct xe_gt *gt)
> > +{
> > +	struct xe_device *xe = gt_to_xe(gt);
> > +	int err;
> > +
> > +	xe_gt_sriov_dbg(gt, "migration recovery in progress\n");
> > +
> > +	xe_pm_runtime_get(xe);
> > +	vf_post_migration_shutdown(gt);
> > +
> > +	if (!xe_sriov_vf_migration_supported(xe)) {
> > +		xe_gt_sriov_err(gt, "migration is not supported\n");
> > +		err = -ENOTRECOVERABLE;
> > +		goto fail;
> > +	}
> > +
> > +	err = vf_post_migration_fixups(gt);
> > +	if (err)
> > +		goto fail;
> > +
> > +	vf_post_migration_kickstart(gt);
> > +	err = vf_post_migration_notify_resfix_done(gt);
> > +	if (err && err != -EAGAIN)
> > +		goto fail;
> > +
> > +	xe_pm_runtime_put(xe);
> > +	xe_gt_sriov_notice(gt, "migration recovery ended\n");
> > +	return;
> > +fail:
> > +	xe_pm_runtime_put(xe);
> > +	xe_gt_sriov_err(gt, "migration recovery failed (%pe)\n", ERR_PTR(err));
> > +	xe_device_declare_wedged(xe);
> > +}
> > +
> > +static void migration_worker_func(struct work_struct *w)
> > +{
> > +	struct xe_gt *gt = container_of(w, struct xe_gt,
> > +					sriov.vf.migration.worker);
> > +
> > +	vf_post_migration_recovery(gt);
> > +}
> > +
> > +/**
> > + * xe_gt_sriov_vf_migration_init_early() - VF post migration init early
> > + * @gt: the &xe_gt
> > + */
> > +void xe_gt_sriov_vf_migration_init_early(struct xe_gt *gt)
> > +{
> > +	init_rwsem(&gt->sriov.vf.self_config.lock);
> > +	spin_lock_init(&gt->sriov.vf.migration.lock);
> > +	INIT_WORK(&gt->sriov.vf.migration.worker, migration_worker_func);
> > +
> > +	if (!xe_sriov_vf_migration_supported(gt_to_xe(gt)))
> > +		xe_gt_sriov_info(gt, "migration not supported by this module version\n");
> 
> we likely don't want to repeat that message on every GT
> 

So move this to xe_sriov_vf_init_early?

> > +}
> > +
> >  /**
> >   * xe_gt_sriov_vf_recovery_inprogress() - VF post migration recovery in progress
> >   * @gt: the &xe_gt
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> > index bb5f8eace19b..2ac6775b52f0 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.h
> > @@ -21,10 +21,9 @@ void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
> >  int xe_gt_sriov_vf_query_config(struct xe_gt *gt);
> >  int xe_gt_sriov_vf_connect(struct xe_gt *gt);
> >  int xe_gt_sriov_vf_query_runtime(struct xe_gt *gt);
> > -void xe_gt_sriov_vf_default_lrcs_hwsp_rebase(struct xe_gt *gt);
> > -int xe_gt_sriov_vf_notify_resfix_done(struct xe_gt *gt);
> >  void xe_gt_sriov_vf_migrated_event_handler(struct xe_gt *gt);
> >  
> > +void xe_gt_sriov_vf_migration_init_early(struct xe_gt *gt);
> >  bool xe_gt_sriov_vf_recovery_inprogress(struct xe_gt *gt);
> >  
> >  u32 xe_gt_sriov_vf_gmdid(struct xe_gt *gt);
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > index 7b10b8e1e10e..53680a2f188a 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > @@ -8,6 +8,7 @@
> >  
> >  #include <linux/rwsem.h>
> >  #include <linux/types.h>
> > +#include <linux/workqueue.h>
> >  #include "xe_uc_fw_types.h"
> >  
> >  /**
> > @@ -53,6 +54,12 @@ struct xe_gt_sriov_vf_runtime {
> >   * xe_gt_sriov_vf_migration - VF migration data.
> >   */
> >  struct xe_gt_sriov_vf_migration {
> > +	/** @migration: VF migration recovery worker */
> > +	struct work_struct worker;
> > +	/** @lock: Protects recovery_queued */
> > +	spinlock_t lock;
> > +	/** @recovery_queued: VF post migration recovery in queued */
> > +	bool recovery_queued;
> >  	/** @recovery_inprogress: VF post migration recovery in progress */
> >  	bool recovery_inprogress;
> >  };
> > diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> > index da064a1e7419..7d91553c4acc 100644
> > --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> > @@ -6,21 +6,12 @@
> >  #include <drm/drm_debugfs.h>
> >  #include <drm/drm_managed.h>
> >  
> > -#include "xe_assert.h"
> > -#include "xe_device.h"
> >  #include "xe_gt.h"
> > -#include "xe_gt_sriov_printk.h"
> >  #include "xe_gt_sriov_vf.h"
> >  #include "xe_guc.h"
> > -#include "xe_guc_submit.h"
> > -#include "xe_irq.h"
> > -#include "xe_lrc.h"
> > -#include "xe_pm.h"
> > -#include "xe_sriov.h"
> >  #include "xe_sriov_printk.h"
> >  #include "xe_sriov_vf.h"
> >  #include "xe_sriov_vf_ccs.h"
> > -#include "xe_tile_sriov_vf.h"
> >  
> >  /**
> >   * DOC: VF restore procedure in PF KMD and VF KMD
> > @@ -158,8 +149,6 @@ static void vf_disable_migration(struct xe_device *xe, const char *fmt, ...)
> >  	xe->sriov.vf.migration.enabled = false;
> >  }
> >  
> > -static void migration_worker_func(struct work_struct *w);
> > -
> >  static void vf_migration_init_early(struct xe_device *xe)
> >  {
> >  	/*
> > @@ -184,8 +173,6 @@ static void vf_migration_init_early(struct xe_device *xe)
> >  						    guc_version.major, guc_version.minor);
> >  	}
> >  
> > -	INIT_WORK(&xe->sriov.vf.migration.worker, migration_worker_func);
> > -
> >  	xe->sriov.vf.migration.enabled = true;
> >  	xe_sriov_dbg(xe, "migration support enabled\n");
> >  }
> > @@ -200,238 +187,11 @@ void xe_sriov_vf_init_early(struct xe_device *xe)
> >  	unsigned int id;
> >  
> >  	for_each_gt(gt, xe, id)
> > -		init_rwsem(&gt->sriov.vf.self_config.lock);
> > +		xe_gt_sriov_vf_migration_init_early(gt);
> 
> still, this should be called from gt_init_early kind of functions
> 

This is kind of a nit, and I'm not convinced it's worthwhile to have
xe_sriov_vf_init_early and then, separately,
xe_gt_sriov_vf_migration_init_early called in gt_init_early...

Matt

> >  
> >  	vf_migration_init_early(xe);
> >  }
> >  
> > -/**
> > - * vf_post_migration_shutdown - Stop the driver activities after VF migration.
> > - * @xe: the &xe_device struct instance
> > - *
> > - * After this VM is migrated and assigned to a new VF, it is running on a new
> > - * hardware, and therefore many hardware-dependent states and related structures
> > - * require fixups. Without fixups, the hardware cannot do any work, and therefore
> > - * all GPU pipelines are stalled.
> > - * Stop some of kernel activities to make the fixup process faster.
> > - */
> > -static void vf_post_migration_shutdown(struct xe_device *xe)
> > -{
> > -	struct xe_gt *gt;
> > -	unsigned int id;
> > -	int ret = 0;
> > -
> > -	for_each_gt(gt, xe, id) {
> > -		xe_guc_submit_pause(&gt->uc.guc);
> > -		ret |= xe_guc_submit_reset_block(&gt->uc.guc);
> > -	}
> > -
> > -	if (ret)
> > -		drm_info(&xe->drm, "migration recovery encountered ongoing reset\n");
> > -}
> > -
> > -/**
> > - * vf_post_migration_kickstart - Re-start the driver activities under new hardware.
> > - * @xe: the &xe_device struct instance
> > - *
> > - * After we have finished with all post-migration fixups, restart the driver
> > - * activities to continue feeding the GPU with workloads.
> > - */
> > -static void vf_post_migration_kickstart(struct xe_device *xe)
> > -{
> > -	struct xe_gt *gt;
> > -	unsigned int id;
> > -
> > -	/*
> > -	 * Make sure interrupts on the new HW are properly set. The GuC IRQ
> > -	 * must be working at this point, since the recovery did started,
> > -	 * but the rest was not enabled using the procedure from spec.
> > -	 */
> > -	xe_irq_resume(xe);
> > -
> > -	for_each_gt(gt, xe, id) {
> > -		xe_guc_submit_reset_unblock(&gt->uc.guc);
> > -		xe_guc_submit_unpause(&gt->uc.guc);
> > -	}
> > -}
> > -
> > -static bool gt_vf_post_migration_needed(struct xe_gt *gt)
> > -{
> > -	return test_bit(gt->info.id, &gt_to_xe(gt)->sriov.vf.migration.gt_flags);
> > -}
> > -
> > -/*
> > - * Notify GuCs marked in flags about resource fixups apply finished.
> > - * @xe: the &xe_device struct instance
> > - * @gt_flags: flags marking to which GTs the notification shall be sent
> > - */
> > -static int vf_post_migration_notify_resfix_done(struct xe_device *xe, unsigned long gt_flags)
> > -{
> > -	struct xe_gt *gt;
> > -	unsigned int id;
> > -	int err = 0;
> > -
> > -	for_each_gt(gt, xe, id) {
> > -		if (!test_bit(id, &gt_flags))
> > -			continue;
> > -		/* skip asking GuC for RESFIX exit if new recovery request arrived */
> > -		if (gt_vf_post_migration_needed(gt))
> > -			continue;
> > -		err = xe_gt_sriov_vf_notify_resfix_done(gt);
> > -		if (err)
> > -			break;
> > -		clear_bit(id, &gt_flags);
> > -	}
> > -
> > -	if (gt_flags && !err)
> > -		drm_dbg(&xe->drm, "another recovery imminent, skipped some notifications\n");
> > -	return err;
> > -}
> > -
> > -static int vf_get_next_migrated_gt_id(struct xe_device *xe)
> > -{
> > -	struct xe_gt *gt;
> > -	unsigned int id;
> > -
> > -	for_each_gt(gt, xe, id) {
> > -		if (test_and_clear_bit(id, &xe->sriov.vf.migration.gt_flags))
> > -			return id;
> > -	}
> > -	return -1;
> > -}
> > -
> > -static size_t post_migration_scratch_size(struct xe_device *xe)
> > -{
> > -	return max(xe_lrc_reg_size(xe), LRC_WA_BB_SIZE);
> > -}
> > -
> > -/**
> > - * Perform post-migration fixups on a single GT.
> > - *
> > - * After migration, GuC needs to be re-queried for VF configuration to check
> > - * if it matches previous provisioning. Most of VF provisioning shall be the
> > - * same, except GGTT range, since GGTT is not virtualized per-VF. If GGTT
> > - * range has changed, we have to perform fixups - shift all GGTT references
> > - * used anywhere within the driver. After the fixups in this function succeed,
> > - * it is allowed to ask the GuC bound to this GT to continue normal operation.
> > - *
> > - * Returns: 0 if the operation completed successfully, or a negative error
> > - * code otherwise.
> > - */
> > -static int gt_vf_post_migration_fixups(struct xe_gt *gt)
> > -{
> > -	s64 shift;
> > -	void *buf;
> > -	int err;
> > -
> > -	buf = kmalloc(post_migration_scratch_size(gt_to_xe(gt)), GFP_KERNEL);
> > -	if (!buf)
> > -		return -ENOMEM;
> > -
> > -	err = xe_gt_sriov_vf_query_config(gt);
> > -	if (err)
> > -		goto out;
> > -
> > -	shift = xe_gt_sriov_vf_ggtt_shift(gt);
> > -	if (shift) {
> > -		xe_tile_sriov_vf_fixup_ggtt_nodes(gt_to_tile(gt), shift);
> > -		xe_gt_sriov_vf_default_lrcs_hwsp_rebase(gt);
> > -		err = xe_guc_contexts_hwsp_rebase(&gt->uc.guc, buf);
> > -		if (err)
> > -			goto out;
> > -	}
> > -
> > -out:
> > -	kfree(buf);
> > -	return err;
> > -}
> > -
> > -static void vf_post_migration_recovery(struct xe_device *xe)
> > -{
> > -	unsigned long fixed_gts = 0;
> > -	int id, err;
> > -
> > -	drm_dbg(&xe->drm, "migration recovery in progress\n");
> > -	xe_pm_runtime_get(xe);
> > -	vf_post_migration_shutdown(xe);
> > -
> > -	if (!xe_sriov_vf_migration_supported(xe)) {
> > -		xe_sriov_err(xe, "migration is not supported\n");
> > -		err = -ENOTRECOVERABLE;
> > -		goto fail;
> > -	}
> > -
> > -	while (id = vf_get_next_migrated_gt_id(xe), id >= 0) {
> > -		struct xe_gt *gt = xe_device_get_gt(xe, id);
> > -
> > -		err = gt_vf_post_migration_fixups(gt);
> > -		if (err)
> > -			goto fail;
> > -
> > -		set_bit(id, &fixed_gts);
> > -	}
> > -
> > -	vf_post_migration_kickstart(xe);
> > -	err = vf_post_migration_notify_resfix_done(xe, fixed_gts);
> > -	if (err)
> > -		goto fail;
> > -
> > -	xe_pm_runtime_put(xe);
> > -	drm_notice(&xe->drm, "migration recovery ended\n");
> > -	return;
> > -fail:
> > -	xe_pm_runtime_put(xe);
> > -	drm_err(&xe->drm, "migration recovery failed (%pe)\n", ERR_PTR(err));
> > -	xe_device_declare_wedged(xe);
> > -}
> > -
> > -static void migration_worker_func(struct work_struct *w)
> > -{
> > -	struct xe_device *xe = container_of(w, struct xe_device,
> > -					    sriov.vf.migration.worker);
> > -
> > -	vf_post_migration_recovery(xe);
> > -}
> > -
> > -/*
> > - * Check if post-restore recovery is coming on any of GTs.
> > - * @xe: the &xe_device struct instance
> > - *
> > - * Return: True if migration recovery worker will soon be running. Any worker currently
> > - * executing does not affect the result.
> > - */
> > -static bool vf_ready_to_recovery_on_any_gts(struct xe_device *xe)
> > -{
> > -	struct xe_gt *gt;
> > -	unsigned int id;
> > -
> > -	for_each_gt(gt, xe, id) {
> > -		if (test_bit(id, &xe->sriov.vf.migration.gt_flags))
> > -			return true;
> > -	}
> > -	return false;
> > -}
> > -
> > -/**
> > - * xe_sriov_vf_start_migration_recovery - Start VF migration recovery.
> > - * @xe: the &xe_device to start recovery on
> > - *
> > - * This function shall be called only by VF.
> > - */
> > -void xe_sriov_vf_start_migration_recovery(struct xe_device *xe)
> > -{
> > -	bool started;
> > -
> > -	xe_assert(xe, IS_SRIOV_VF(xe));
> > -
> > -	if (!vf_ready_to_recovery_on_any_gts(xe))
> > -		return;
> > -
> > -	started = queue_work(xe->sriov.wq, &xe->sriov.vf.migration.worker);
> > -	drm_info(&xe->drm, "VF migration recovery %s\n", started ?
> > -		 "scheduled" : "already in progress");
> > -}
> > -
> >  /**
> >   * xe_sriov_vf_init_late() - SR-IOV VF late initialization functions.
> >   * @xe: the &xe_device to initialize
> > diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.h b/drivers/gpu/drm/xe/xe_sriov_vf.h
> > index 9e752105ec2a..4df95266b261 100644
> > --- a/drivers/gpu/drm/xe/xe_sriov_vf.h
> > +++ b/drivers/gpu/drm/xe/xe_sriov_vf.h
> > @@ -13,7 +13,6 @@ struct xe_device;
> >  
> >  void xe_sriov_vf_init_early(struct xe_device *xe);
> >  int xe_sriov_vf_init_late(struct xe_device *xe);
> > -void xe_sriov_vf_start_migration_recovery(struct xe_device *xe);
> >  bool xe_sriov_vf_migration_supported(struct xe_device *xe);
> >  void xe_sriov_vf_debugfs_register(struct xe_device *xe, struct dentry *root);
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> > index 426cc5841958..6a0fd0f5463e 100644
> > --- a/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> > +++ b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> > @@ -33,10 +33,6 @@ struct xe_device_vf {
> >  
> >  	/** @migration: VF Migration state data */
> >  	struct {
> > -		/** @migration.worker: VF migration recovery worker */
> > -		struct work_struct worker;
> > -		/** @migration.gt_flags: Per-GT request flags for VF migration recovery */
> > -		unsigned long gt_flags;
> >  		/**
> >  		 * @migration.enabled: flag indicating if migration support
> >  		 * was enabled or not due to missing prerequisites
>
