From: Matthew Brost <matthew.brost@intel.com>
To: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
Michal Wajdeczko <michal.wajdeczko@intel.com>,
Tomasz Lis <tomasz.lis@intel.com>
Subject: Re: [PATCH v2 1/2] drm/xe/vf: Introduce RESFIX start marker support
Date: Fri, 31 Oct 2025 13:10:13 -0700 [thread overview]
Message-ID: <aQUXpd0HkcFg2iZ0@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <aPqSDVJ4QR7kzT07@lstrano-desk.jf.intel.com>
On Thu, Oct 23, 2025 at 01:37:33PM -0700, Matthew Brost wrote:
> On Thu, Oct 23, 2025 at 09:06:11PM +0530, Satyanarayana K V P wrote:
> > Post migration, a marker is sent to the GUC prior to the start of resource
> > fixups to indicate start of resource fixups. The same marker is sent along
> > with RESFIX_DONE notification so that GUC can avoid submitting jobs to HW
> > in case of double migration.
> >
> > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Tomasz Lis <tomasz.lis@intel.com>
> >
> > ---
> > V1 -> V2:
> > - Squashed "Enable RESFIX start marker only on supported GUC
> > versions" commit into a single commit. (Matt B)
> > ---
> > .../gpu/drm/xe/abi/guc_actions_sriov_abi.h | 38 ++++++++
> > drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 87 +++++++++++++++++--
> > drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 5 ++
> > drivers/gpu/drm/xe/xe_sriov_vf.c | 42 ++++++++-
> > drivers/gpu/drm/xe/xe_sriov_vf_types.h | 5 ++
> > 5 files changed, 169 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> > index 0b28659d94e9..b9141497bfd5 100644
> > --- a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> > +++ b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> > @@ -656,4 +656,42 @@
> > #define PF2GUC_SAVE_RESTORE_VF_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN
> > #define PF2GUC_SAVE_RESTORE_VF_RESPONSE_MSG_0_USED GUC_HXG_RESPONSE_MSG_0_DATA0
> >
> > +/**
> > + * DOC: VF2GUC_NOTIFY_RESFIX_START
> > + *
> > + * This action is used by VF to notify the GuC that the VF KMD will be starting
> > + * post-migration recovery steps.
> > + *
> > + * This message must be sent as `MMIO HXG Message`_.
> > + *
> > + * +---+-------+--------------------------------------------------------------+
> > + * | | Bits | Description |
> > + * +===+=======+==============================================================+
> > + * | 0 | 31 | ORIGIN = GUC_HXG_ORIGIN_HOST_ |
> > + * | +-------+--------------------------------------------------------------+
> > + * | | 30:28 | TYPE = GUC_HXG_TYPE_REQUEST_ |
> > + * | +-------+--------------------------------------------------------------+
> > + * | | 27:16 | DATA0 = MBZ |
> > + * | +-------+--------------------------------------------------------------+
> > + * | | 15:0 | ACTION = _`GUC_ACTION_VF2GUC_NOTIFY_RESFIX_START` = 0x550F |
> > + * +---+-------+--------------------------------------------------------------+
> > + *
> > + * +---+-------+--------------------------------------------------------------+
> > + * | | Bits | Description |
> > + * +===+=======+==============================================================+
> > + * | 0 | 31 | ORIGIN = GUC_HXG_ORIGIN_GUC_ |
> > + * | +-------+--------------------------------------------------------------+
> > + * | | 30:28 | TYPE = GUC_HXG_TYPE_RESPONSE_SUCCESS_ |
> > + * | +-------+--------------------------------------------------------------+
> > + * | | 27:0 | DATA0 = MBZ |
> > + * +---+-------+--------------------------------------------------------------+
> > + */
> > +#define GUC_ACTION_VF2GUC_NOTIFY_RESFIX_START 0x550Fu
> > +
> > +#define VF2GUC_NOTIFY_RESFIX_START_REQUEST_MSG_LEN GUC_HXG_REQUEST_MSG_MIN_LEN
> > +#define VF2GUC_NOTIFY_RESFIX_START_REQUEST_MSG_0_MBZ GUC_HXG_REQUEST_MSG_0_DATA0
> > +
> > +#define VF2GUC_NOTIFY_RESFIX_START_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN
> > +#define VF2GUC_NOTIFY_RESFIX_START_RESPONSE_MSG_0_MBZ GUC_HXG_RESPONSE_MSG_0_DATA0
> > +
> > #endif
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > index d0b102ab6ce8..8c1448d6c81d 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > @@ -299,12 +299,55 @@ void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
> > *found = gt->sriov.vf.guc_version;
> > }
> >
> > -static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
> > +static int guc_action_vf_notify_resfix_start(struct xe_guc *guc, u16 marker)
> > {
> > u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = {
> > FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) |
> > FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> > - FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION, GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE),
> > + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION,
> > + GUC_ACTION_VF2GUC_NOTIFY_RESFIX_START) |
> > + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_DATA0, marker),
> > + };
> > + int ret;
> > +
> > + ret = xe_guc_mmio_send(guc, request, ARRAY_SIZE(request));
> > +
> > + return ret > 0 ? -EPROTO : ret;
> > +}
> > +
> > +/**
> > + * vf_notify_resfix_start - Notify GuC about start of resource fixups.
> > + * @gt: the &xe_gt struct instance linked to target GuC
> > + * @marker: marker to identify the migration.
> > + *
> > + * Returns: 0 if the operation completed successfully, or a negative error
> > + * code otherwise.
> > + */
> > +static int vf_notify_resfix_start(struct xe_gt *gt, u16 marker)
> > +{
> > + struct xe_guc *guc = >->uc.guc;
> > + int err;
> > +
> > + xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> > +
> > + err = guc_action_vf_notify_resfix_start(guc, marker);
> > + if (unlikely(err))
> > + xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup start (%pe)\n",
> > + ERR_PTR(err));
> > + else
> > + xe_gt_sriov_dbg_verbose(gt, "sent GuC resource fixup start\n");
> > +
> > + return err;
> > +}
> > +
> > +static int guc_action_vf_notify_resfix_done(struct xe_guc *guc, u16 marker)
> > +{
> > + u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = {
> > + FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) |
> > + FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> > + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION,
> > + GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE) |
> > + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_DATA0, marker),
> > };
> > int ret;
> >
> > @@ -316,18 +359,19 @@ static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
> > /**
> > * vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
> > * @gt: the &xe_gt struct instance linked to target GuC
> > + * @marker: marker to identify the migration.
> > *
> > * Returns: 0 if the operation completed successfully, or a negative error
> > * code otherwise.
> > */
> > -static int vf_notify_resfix_done(struct xe_gt *gt)
> > +static int vf_notify_resfix_done(struct xe_gt *gt, u16 marker)
> > {
> > struct xe_guc *guc = >->uc.guc;
> > int err;
> >
> > xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> >
> > - err = guc_action_vf_notify_resfix_done(guc);
> > + err = guc_action_vf_notify_resfix_done(guc, marker);
> > if (unlikely(err))
> > xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup done (%pe)\n",
> > ERR_PTR(err));
> > @@ -1183,7 +1227,7 @@ static void vf_post_migration_abort(struct xe_gt *gt)
> > xe_guc_submit_pause_abort(>->uc.guc);
> > }
> >
> > -static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
> > +static int vf_post_migration_notify_resfix_done(struct xe_gt *gt, u16 marker)
> > {
> > bool skip_resfix = false;
> >
> > @@ -1206,12 +1250,27 @@ static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
> > */
> > xe_irq_resume(gt_to_xe(gt));
> >
> > - return vf_notify_resfix_done(gt);
> > + return vf_notify_resfix_done(gt, marker);
> > +}
> > +
> > +static bool vf_resfix_start_marker_supported(struct xe_gt *gt)
> > +{
> > + struct xe_device *xe = gt_to_xe(gt);
> > +
> > + xe_gt_assert(gt, IS_SRIOV_VF(xe));
> > + return xe->sriov.vf.migration.resfix_marker_enabled;
> > +}
> > +
> > +static u16 vf_post_migration_resfix_start_marker(struct xe_gt *gt)
> > +{
> > + xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> > + return ++gt->sriov.vf.migration.resfix_marker;
> > }
> >
> > static void vf_post_migration_recovery(struct xe_gt *gt)
> > {
> > struct xe_device *xe = gt_to_xe(gt);
> > + u16 marker = 0;
> > int err;
> > bool retry;
> >
> > @@ -1227,13 +1286,27 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
> > goto fail;
> > }
> >
> > + /*
> > + * Increment the startup marker again if it overflows, since GUC
> > + * requires a non-zero marker to be set.
> > + */
> > + if (vf_resfix_start_marker_supported(gt)) {
> > + marker = vf_post_migration_resfix_start_marker(gt);
> > + if (!marker)
> > + marker = vf_post_migration_resfix_start_marker(gt);
>
> Nit: Maybe just don't zero in vf_post_migration_resfix_start_marker.
>
> > + }
> > +
> > + err = vf_notify_resfix_start(gt, marker);
> > + if (err)
> > + goto fail;
> > +
> > err = vf_post_migration_fixups(gt);
> > if (err)
> > goto fail;
> >
> > vf_post_migration_rearm(gt);
> >
> > - err = vf_post_migration_notify_resfix_done(gt);
> > + err = vf_post_migration_notify_resfix_done(gt, marker);
> > if (err && err != -EAGAIN)
> > goto fail;
> >
>
> Nit: You missed my comment skipping kickstart if err == -EAGAIN but I
> think that harmless, in existing code, and can be done in a follow up.
Actually the above comment is wrong too. We always need to run the
kickstart again to GuC submission backend state resets itself. Sorry for
the noise.
Matt
>
> Anyways patch looks functionaly correct to me:
> Reviewed-by: Matthew Brost <matthew.brost@intel.com>
>
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > index 420b0e6089de..ccd850313328 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> > @@ -60,6 +60,11 @@ struct xe_gt_sriov_vf_migration {
> > bool recovery_inprogress;
> > /** @ggtt_need_fixes: VF GGTT needs fixes */
> > bool ggtt_need_fixes;
> > + /**
> > + * @resfix_marker: Marker sent to Guc prior to starting the
> > + * post‑migration.
> > + */
> > + u16 resfix_marker;
> > };
> >
> > /**
> > diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> > index 39c829daa97c..10d6e43fffce 100644
> > --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> > @@ -55,7 +55,21 @@
> > * When the VF driver is ready to continue operation on the newly connected
> > * hardware, it sends `VF2GUC_NOTIFY_RESFIX_DONE` which causes it to
> > * enter the long awaited `VF_RUNNING` state, and therefore start handling
> > - * CTB messages and scheduling workloads from the VF::
> > + * CTB messages and scheduling workloads from the VF.
> > + *
> > + * In scenarios involving double migration, the VF KMD may encounter situations
> > + * where it is instructed to re-migrate before having the opportunity to send
> > + * RESFIX_DONE for the initial migration. This can occur when the fix-up for the
> > + * prior migration is still underway, but the VF KMD is migrated again.
> > + * Consequently, this may lead to the possibility of sending two migration
> > + * notifications (i.e., pending fix-up for the first migration and a second
> > + * notification for the new migration). Upon receiving the first RES_FIX
> > + * notification, the GuC will resume VF submission on the GPU, potentially
> > + * resulting in undefined behavior, such as system hangs or crashes.
> > + *
> > + * To avoid these hangs, a new VF2GUC action `VF2GUC_NOTIFY_RESFIX_START` is
> > + * sent along with marker and when GUC receives the same marker with
> > + * `VF2GUC_NOTIFY_RESFIX_DONE`action, it starts scheduling work loads from VF::
> > *
> > * PF GuC VF
> > * [ ] | |
> > @@ -102,6 +116,11 @@
> > * | [ ] new VF provisioning [ ]
> > * | [ ]---------------------------> [ ]
> > * | | [ ]
> > + * | | VF2GUC_NOTIFY_RESFIX_START [ ]
> > + * | [ ] <---------------------------[ ]
> > + * | [ ] [ ]
> > + * | [ ] success [ ]
> > + * | [ ]---------------------------> [ ]
> > * | | VF driver applies post [ ]
> > * | | migration fixups -------[ ]
> > * | | | [ ]
> > @@ -169,6 +188,26 @@ static void vf_migration_init_early(struct xe_device *xe)
> >
> > }
> >
> > +static void vf_resfix_start_marker_init_early(struct xe_device *xe)
> > +{
> > + struct xe_gt *gt = xe_root_mmio_gt(xe);
> > + struct xe_uc_fw_version guc_version;
> > +
> > + if (xe->sriov.vf.migration.disabled)
> > + return;
> > +
> > + xe_gt_sriov_vf_guc_versions(gt, NULL, &guc_version);
> > + if (MAKE_GUC_VER_STRUCT(guc_version) < MAKE_GUC_VER(1, 24, 10)) {
> > + xe_sriov_notice(xe,
> > + "Resfix start marker requires GUC ABI >= 1.24.10, but only %u.%u.%u found",
> > + guc_version.major, guc_version.minor, guc_version.patch);
> > + return;
> > + }
> > +
> > + xe->sriov.vf.migration.resfix_marker_enabled = true;
> > + xe_sriov_dbg(xe, "migrate: Resfix start marker support is enabled\n");
> > +}
> > +
> > /**
> > * xe_sriov_vf_init_early - Initialize SR-IOV VF specific data.
> > * @xe: the &xe_device to initialize
> > @@ -176,6 +215,7 @@ static void vf_migration_init_early(struct xe_device *xe)
> > void xe_sriov_vf_init_early(struct xe_device *xe)
> > {
> > vf_migration_init_early(xe);
> > + vf_resfix_start_marker_init_early(xe);
> > }
> >
> > /**
> > diff --git a/drivers/gpu/drm/xe/xe_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> > index d5f72d667817..626c11a6dd1b 100644
> > --- a/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> > +++ b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> > @@ -38,6 +38,11 @@ struct xe_device_vf {
> > * was turned off due to missing prerequisites
> > */
> > bool disabled;
> > + /**
> > + * @migration.resfix_marker_enabled: flag indicating if resfix marker
> > + * support was enabled or not due to missing prerequisites.
> > + */
> > + bool resfix_marker_enabled;
> > } migration;
> >
> > /** @ccs: VF CCS state data */
> > --
> > 2.51.0
> >
next prev parent reply other threads:[~2025-10-31 20:10 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-23 15:36 [PATCH v2 0/2] VF double migration Satyanarayana K V P
2025-10-23 15:36 ` [PATCH v2 1/2] drm/xe/vf: Introduce RESFIX start marker support Satyanarayana K V P
2025-10-23 20:37 ` Matthew Brost
2025-10-23 20:54 ` Matthew Brost
2025-10-31 20:10 ` Matthew Brost [this message]
2025-10-23 23:33 ` Michal Wajdeczko
2025-10-23 15:36 ` [PATCH v2 2/2] drm/xe/vf: Use fault injection for testing VF double migration feature Satyanarayana K V P
2025-10-24 11:52 ` Michal Wajdeczko
2025-10-23 17:18 ` ✓ CI.KUnit: success for VF double migration (rev2) Patchwork
2025-10-23 18:17 ` ✗ Xe.CI.BAT: failure " Patchwork
2025-10-24 5:23 ` ✗ Xe.CI.Full: " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aQUXpd0HkcFg2iZ0@lstrano-desk.jf.intel.com \
--to=matthew.brost@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=michal.wajdeczko@intel.com \
--cc=satyanarayana.k.v.p@intel.com \
--cc=tomasz.lis@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox