From: Michal Wajdeczko <michal.wajdeczko@intel.com>
To: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>,
<intel-xe@lists.freedesktop.org>
Cc: Matthew Brost <matthew.brost@intel.com>,
Tomasz Lis <tomasz.lis@intel.com>
Subject: Re: [PATCH v3 1/2] drm/xe/vf: Introduce RESFIX start marker support
Date: Tue, 11 Nov 2025 16:06:55 +0100
Message-ID: <bf4d6547-2d53-4bf6-af94-6ed5cf74f311@intel.com>
In-Reply-To: <20251107141015.29051-5-satyanarayana.k.v.p@intel.com>
On 11/7/2025 3:10 PM, Satyanarayana K V P wrote:
> In scenarios involving double migration, the VF KMD may encounter
> situations where it is instructed to re-migrate before having the
> opportunity to send RESFIX_DONE for the initial migration. This can occur
> when the fix-up for the prior migration is still underway, but the VF KMD
> is migrated again.
>
> Consequently, this may lead to the possibility of sending two migration
> notifications (i.e., pending fix-up for the first migration and a second
> notification for the new migration). Upon receiving the first RES_FIX
> notification, the GuC will resume VF submission on the GPU, potentially
> resulting in undefined behavior, such as system hangs or crashes.
>
> To avoid this, post migration, a marker is sent to the GUC prior to the
> start of resource fixups to indicate start of resource fixups. The same
> marker is sent along with RESFIX_DONE notification so that GUC can avoid
> submitting jobs to HW in case of double migration.
>
> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Tomasz Lis <tomasz.lis@intel.com>
>
> ---
> V2 -> V3:
> - Fixed review comments (Michal W).
> - Updated commit message.
> - Fixed CI.BAT issues.
> - Added helper function to assert on unsupported GUC versions.
>
> V1 -> V2:
> - Squashed "Enable RESFIX start marker only on supported GUC
> versions" commit into a single commit. (Matt B)
> ---
> .../gpu/drm/xe/abi/guc_actions_sriov_abi.h | 40 ++++++++
> drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 98 +++++++++++++++++--
> drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 5 +
> drivers/gpu/drm/xe/xe_sriov_vf.c | 46 ++++++++-
> drivers/gpu/drm/xe/xe_sriov_vf_types.h | 5 +
> 5 files changed, 185 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> index 0b28659d94e9..8bc74cbc1c35 100644
> --- a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> +++ b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> @@ -656,4 +656,44 @@
> #define PF2GUC_SAVE_RESTORE_VF_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN
> #define PF2GUC_SAVE_RESTORE_VF_RESPONSE_MSG_0_USED GUC_HXG_RESPONSE_MSG_0_DATA0
>
> +/**
> + * DOC: VF2GUC_NOTIFY_RESFIX_START
in the GuC spec there is no "NOTIFY", so this should be just:
VF2GUC_RESFIX_START
here and in all the defs below
> + *
> + * This action is used by VF to notify the GuC that the VF KMD will be starting
> + * post-migration recovery steps.
> + *
> + * This message must be sent as `MMIO HXG Message`_.
> + *
> + * Available since GuC version 70.54.0 (VF 1.27.0)
this is a VF-only action, so only the VF ABI version is relevant here
we might mention the FW version in the commit message instead
> + *
> + * +---+-------+--------------------------------------------------------------+
> + * | | Bits | Description |
> + * +===+=======+==============================================================+
> + * | 0 | 31 | ORIGIN = GUC_HXG_ORIGIN_HOST_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 30:28 | TYPE = GUC_HXG_TYPE_REQUEST_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 27:16 | DATA0 = MARKER |
we might want to add "MARKER - can't be zero"
btw, we might want to update (in a separate patch) the VF2GUC_RESFIX_DONE documentation with:
* | +-------+--------------------------------------------------------------+
- * | | 27:16 | DATA0 = MBZ |
+ * | | 27:16 | DATA0 = MBZ (only for ABI < 1.27.0) |
+ * | +-------+--------------------------------------------------------------+
+ * | | 27:16 | DATA0 = MARKER (for ABI >= 1.27.0) see VF2GUC_RESFIX_START_ |
* | +-------+--------------------------------------------------------------+
- * | | 15:0 | ACTION = _`GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE` = 0x5508 |
+ * | | 15:0 | ACTION = _`GUC_ACTION_VF2GUC_RESFIX_DONE` = 0x5508 |
* +---+-------+--------------------------------------------------------------+
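To make the layout concrete, here is a minimal userspace sketch (not driver code) of how the first request dword could be packed for both actions. The bit positions and action codes come from the tables above; the macro and function names, and the zero encodings assumed for ORIGIN = HOST and TYPE = REQUEST, are illustrative assumptions only. The assert reflects the "can't be zero" rule plus the 12-bit width of bits 27:16.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout as shown in the table above; names are hypothetical. */
#define SK_HXG_MSG_0_ORIGIN_SHIFT	31
#define SK_HXG_MSG_0_TYPE_SHIFT		28
#define SK_HXG_MSG_0_DATA0_SHIFT	16
#define SK_HXG_MSG_0_DATA0_MASK		0xfffu	/* bits 27:16 */
#define SK_HXG_MSG_0_ACTION_MASK	0xffffu	/* bits 15:0 */

/* Assumed encodings for a host-originated request. */
#define SK_HXG_ORIGIN_HOST		0u
#define SK_HXG_TYPE_REQUEST		0u

/* Action codes quoted in the documentation above. */
#define SK_ACTION_VF2GUC_RESFIX_DONE	0x5508u
#define SK_ACTION_VF2GUC_RESFIX_START	0x550fu

/* Pack dword 0 of a RESFIX_START or RESFIX_DONE request. */
static uint32_t sk_resfix_pack_request(uint32_t action, uint32_t marker)
{
	/* MARKER can't be zero and must fit into DATA0. */
	assert(marker != 0 && marker <= SK_HXG_MSG_0_DATA0_MASK);

	return (SK_HXG_ORIGIN_HOST << SK_HXG_MSG_0_ORIGIN_SHIFT) |
	       (SK_HXG_TYPE_REQUEST << SK_HXG_MSG_0_TYPE_SHIFT) |
	       (marker << SK_HXG_MSG_0_DATA0_SHIFT) |
	       (action & SK_HXG_MSG_0_ACTION_MASK);
}

/* Extract the MARKER field back out of dword 0. */
static uint32_t sk_resfix_unpack_marker(uint32_t msg0)
{
	return (msg0 >> SK_HXG_MSG_0_DATA0_SHIFT) & SK_HXG_MSG_0_DATA0_MASK;
}
```

With this layout, START and DONE for the same migration differ only in the ACTION bits, which is what would let the receiver match the two by MARKER.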
> + * | +-------+--------------------------------------------------------------+
> + * | | 15:0 | ACTION = _`GUC_ACTION_VF2GUC_NOTIFY_RESFIX_START` = 0x550F |
> + * +---+-------+--------------------------------------------------------------+
> + *
> + * +---+-------+--------------------------------------------------------------+
> + * | | Bits | Description |
> + * +===+=======+==============================================================+
> + * | 0 | 31 | ORIGIN = GUC_HXG_ORIGIN_GUC_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 30:28 | TYPE = GUC_HXG_TYPE_RESPONSE_SUCCESS_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 27:0 | DATA0 = MBZ |
> + * +---+-------+--------------------------------------------------------------+
> + */
> +#define GUC_ACTION_VF2GUC_NOTIFY_RESFIX_START 0x550Fu
> +
> +#define VF2GUC_NOTIFY_RESFIX_START_REQUEST_MSG_LEN GUC_HXG_REQUEST_MSG_MIN_LEN
> +#define VF2GUC_NOTIFY_RESFIX_START_REQUEST_MSG_0_MARKER GUC_HXG_REQUEST_MSG_0_DATA0
> +
> +#define VF2GUC_NOTIFY_RESFIX_START_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN
> +#define VF2GUC_NOTIFY_RESFIX_START_RESPONSE_MSG_0_MBZ GUC_HXG_RESPONSE_MSG_0_DATA0
> +
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index d0b102ab6ce8..17f06cd63527 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -299,15 +299,69 @@ void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
> *found = gt->sriov.vf.guc_version;
> }
>
> -static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
> +/**
> + * When the marker is non-zero, the GUC compatibility version must be >= 1.27.0.
> + * When the marker is zero, the version must be < 1.27.0 — compatible with
> + * older GUCs that support sending RESFIX_DONE.
I'm not sure we need this: the pending PF patch [1] will enforce 70.54.0 as the minimum baseline for save/restore,
and while the PF might be something other than our Xe, we might also want to claim readiness for save/restore only for > 1.27 on the VF side
[1] https://patchwork.freedesktop.org/patch/687116/?series=155785&rev=5
> + */
> +static inline void guc_resfix_marker_assert_not_supported(struct xe_gt *gt, u16 marker)
> +{
> + if (marker)
> + xe_gt_assert(gt, (GUC_SUBMIT_VER(&gt->uc.guc) >=
> + MAKE_GUC_VER(1, 27, 0)));
> + else
> + xe_gt_assert(gt, (GUC_SUBMIT_VER(&gt->uc.guc) <
> + MAKE_GUC_VER(1, 27, 0)));
> +}
> +
> +static int guc_action_vf_notify_resfix_start(struct xe_guc *guc, u16 marker)
> {
> u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = {
> FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) |
> FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> - FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION, GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE),
> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION,
> + GUC_ACTION_VF2GUC_NOTIFY_RESFIX_START) |
> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_DATA0, marker),
use
VF2GUC_RESFIX_START_REQUEST_MSG_0_MARKER
> };
> int ret;
>
> + guc_resfix_marker_assert_not_supported(guc_to_gt(guc), marker);
the START action is only available from 1.27, so it's sufficient to have:
xe_gt_assert(gt, GUC_SUBMIT_VER(&gt->uc.guc) >= MAKE_GUC_VER(1, 27, 0));
but maybe, since the other guc_action() functions are just simple wrappers without any extra enforcement,
move that assert to the caller? or just drop it completely, since we shouldn't be migrated on an older GuC?
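As a rough illustration of the "assert in the caller" option, here is a self-contained userspace sketch. Everything prefixed sk_/SK_ is hypothetical, and the version packing (major << 16 | minor << 8 | patch) is only an assumption about how a MAKE_GUC_VER()-style macro orders versions, not a statement about the driver's definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed packing: compares correctly as long as minor/patch stay < 256. */
#define SK_MAKE_VER(maj, min, pat) \
	(((uint32_t)(maj) << 16) | ((uint32_t)(min) << 8) | (uint32_t)(pat))

/* First VF ABI version that knows the RESFIX_START action. */
#define SK_RESFIX_START_MIN_VER	SK_MAKE_VER(1, 27, 0)

static bool sk_resfix_start_supported(uint32_t vf_abi_ver)
{
	return vf_abi_ver >= SK_RESFIX_START_MIN_VER;
}

/* Stand-in for the plain action wrapper: no checks, just "send". */
static int sk_guc_action_resfix_start(uint16_t marker)
{
	(void)marker;	/* a real wrapper would build and send the request */
	return 0;
}

/* The caller owns the version and marker checks. */
static int sk_vf_notify_resfix_start(uint32_t vf_abi_ver, uint16_t marker)
{
	assert(sk_resfix_start_supported(vf_abi_ver));
	assert(marker != 0);

	return sk_guc_action_resfix_start(marker);
}
```

This keeps the action wrapper as thin as the other guc_action() helpers and places the single version check next to the code that decides whether to send START at all.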
> +
> + ret = xe_guc_mmio_send(guc, request, ARRAY_SIZE(request));
> +
> + return ret > 0 ? -EPROTO : ret;
> +}
> +
> +static int vf_notify_resfix_start(struct xe_gt *gt, u16 marker)
> +{
> + struct xe_guc *guc = &gt->uc.guc;
> + int err;
> +
> + xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> +
> + xe_gt_sriov_dbg(guc_to_gt(guc), "Sending resfix start marker %u\n", marker);
> +
> + err = guc_action_vf_notify_resfix_start(guc, marker);
> + if (unlikely(err))
> + xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup start (%pe)\n",
> + ERR_PTR(err));
> +
> + return err;
> +}
> +
> +static int guc_action_vf_notify_resfix_done(struct xe_guc *guc, u16 marker)
> +{
> + u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = {
> + FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) |
> + FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION,
> + GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE) |
> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_DATA0, marker),
we need to update/add a definition for
VF2GUC_RESFIX_DONE_REQUEST_MSG_0_MARKER
> + };
> + int ret;
> +
> + guc_resfix_marker_assert_not_supported(guc_to_gt(guc), marker);
maybe move asserts to the caller?
> +
> ret = xe_guc_mmio_send(guc, request, ARRAY_SIZE(request));
>
> return ret > 0 ? -EPROTO : ret;
> @@ -316,18 +370,19 @@ static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
> /**
> * vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
> * @gt: the &xe_gt struct instance linked to target GuC
> + * @marker: marker to identify the migration.
> *
> * Returns: 0 if the operation completed successfully, or a negative error
> * code otherwise.
> */
> -static int vf_notify_resfix_done(struct xe_gt *gt)
> +static int vf_notify_resfix_done(struct xe_gt *gt, u16 marker)
> {
> struct xe_guc *guc = &gt->uc.guc;
> int err;
>
> xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>
> - err = guc_action_vf_notify_resfix_done(guc);
> + err = guc_action_vf_notify_resfix_done(guc, marker);
> if (unlikely(err))
> xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup done (%pe)\n",
> ERR_PTR(err));
> @@ -1183,7 +1238,7 @@ static void vf_post_migration_abort(struct xe_gt *gt)
> xe_guc_submit_pause_abort(&gt->uc.guc);
> }
>
> -static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
> +static int vf_post_migration_notify_resfix_done(struct xe_gt *gt, u16 marker)
> {
> bool skip_resfix = false;
>
> @@ -1206,12 +1261,27 @@ static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
> */
> xe_irq_resume(gt_to_xe(gt));
>
> - return vf_notify_resfix_done(gt);
> + return vf_notify_resfix_done(gt, marker);
> +}
> +
> +static bool vf_resfix_start_marker_supported(struct xe_gt *gt)
> +{
> + struct xe_device *xe = gt_to_xe(gt);
> +
> + xe_gt_assert(gt, IS_SRIOV_VF(xe));
> + return xe->sriov.vf.migration.resfix_marker_enabled;
> +}
> +
> +static u16 vf_post_migration_resfix_start_marker(struct xe_gt *gt)
> +{
> + xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> + return ++gt->sriov.vf.migration.resfix_marker;
> }
>
> static void vf_post_migration_recovery(struct xe_gt *gt)
> {
> struct xe_device *xe = gt_to_xe(gt);
> + u16 marker = 0;
> int err;
> bool retry;
>
> @@ -1227,13 +1297,27 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
> goto fail;
> }
>
> + /*
> + * Increment the startup marker again if it overflows, since GUC
> + * requires a non-zero marker to be set.
> + */
> + if (vf_resfix_start_marker_supported(gt)) {
> + marker = vf_post_migration_resfix_start_marker(gt);
> + if (!marker)
> + marker = vf_post_migration_resfix_start_marker(gt);
> +
> + err = vf_notify_resfix_start(gt, marker);
> + if (err)
> + goto fail;
> + }
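For reference, the "skip zero on wrap" step in the hunk above could also be folded into the marker helper itself. The following is a minimal standalone sketch under that assumption; sk_resfix_marker_next() is a hypothetical name, and the 16-bit counter simply mirrors the u16 field used in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Advance the per-GT marker and return the new value.
 * Zero is skipped, since a non-zero marker is required.
 */
static uint16_t sk_resfix_marker_next(uint16_t *counter)
{
	assert(counter != NULL);

	/* The stored value wraps from 0xffff to 0, so bump once more. */
	if (++(*counter) == 0)
		++(*counter);

	return *counter;
}
```

The caller would then need only a single call, without the second "if (!marker)" retry.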
> +
> err = vf_post_migration_fixups(gt);
> if (err)
> goto fail;
>
> vf_post_migration_rearm(gt);
>
> - err = vf_post_migration_notify_resfix_done(gt);
> + err = vf_post_migration_notify_resfix_done(gt, marker);
> if (err && err != -EAGAIN)
> goto fail;
>
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> index 420b0e6089de..5707bb808d80 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> @@ -52,6 +52,11 @@ struct xe_gt_sriov_vf_migration {
> wait_queue_head_t wq;
> /** @scratch: Scratch memory for VF recovery */
> void *scratch;
> + /**
> + * @resfix_marker: Marker sent to Guc prior to starting the
> + * post‑migration.
... sent on start and on end of post-migration steps
> + */
> + u16 resfix_marker;
> /** @recovery_teardown: VF post migration recovery is being torn down */
> bool recovery_teardown;
> /** @recovery_queued: VF post migration recovery in queued */
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> index 39c829daa97c..bdde6867dcd9 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> @@ -55,7 +55,21 @@
> * When the VF driver is ready to continue operation on the newly connected
> * hardware, it sends `VF2GUC_NOTIFY_RESFIX_DONE` which causes it to
> * enter the long awaited `VF_RUNNING` state, and therefore start handling
> - * CTB messages and scheduling workloads from the VF::
> + * CTB messages and scheduling workloads from the VF.
> + *
> + * In scenarios involving double migration, the VF KMD may encounter situations
... In some scenarios, the VF driver ...
> + * where it is instructed to re-migrate before having the opportunity to send
> + * RESFIX_DONE for the initial migration. This can occur when the fix-up for the
> + * prior migration is still underway, but the VF KMD is migrated again.
> + * Consequently, this may lead to the possibility of sending two migration
> + * notifications (i.e., pending fix-up for the first migration and a second
> + * notification for the new migration). Upon receiving the first RES_FIX
> + * notification, the GuC will resume VF submission on the GPU, potentially
> + * resulting in undefined behavior, such as system hangs or crashes.
> + *
> + * To avoid these hangs, a new VF2GUC action `VF2GUC_NOTIFY_RESFIX_START` is
> + * sent along with marker and when GUC receives the same marker with
> + * `VF2GUC_NOTIFY_RESFIX_DONE`action, it starts scheduling work loads from VF::
hmm, I'm not sure we need to keep the discussion/rationale as part of the kernel-doc of the actual flow.
maybe just document the new steps here, and explain the double-migration problem only in the cover letter/commit message
> *
> * PF GuC VF
> * [ ] | |
> @@ -102,6 +116,11 @@
> * | [ ] new VF provisioning [ ]
> * | [ ]---------------------------> [ ]
> * | | [ ]
> + * | | VF2GUC_NOTIFY_RESFIX_START [ ]
> + * | [ ] <---------------------------[ ]
> + * | [ ] [ ]
> + * | [ ] success [ ]
> + * | [ ]---------------------------> [ ]
> * | | VF driver applies post [ ]
> * | | migration fixups -------[ ]
> * | | | [ ]
> @@ -114,7 +133,9 @@
> * | [ ]------- VF_RUNNING [ ]
> * | [ ] | [ ]
> * | [ ] <----- [ ]
> - * | [ ] success [ ]
> + * | [ ] success (on marker match) [ ]
> + * | [ ]---------------------------> [ ]
> + * | [ ] Error (on marker mismatch) [ ]
> * | [ ]---------------------------> [ ]
note that in case of double migration we expect a dedicated VF_MIGRATED state/error
> * | | |
> * | | |
> @@ -169,6 +190,26 @@ static void vf_migration_init_early(struct xe_device *xe)
>
> }
>
> +static void vf_resfix_start_marker_init(struct xe_device *xe)
> +{
> + struct xe_gt *gt = xe_root_mmio_gt(xe);
> + struct xe_uc_fw_version guc_version;
> +
> + if (xe->sriov.vf.migration.disabled)
> + return;
> +
> + xe_gt_sriov_vf_guc_versions(gt, NULL, &guc_version);
> + if (MAKE_GUC_VER_STRUCT(guc_version) < MAKE_GUC_VER(1, 27, 0)) {
> + xe_sriov_notice(xe,
> + "Resfix start marker requires GUC ABI >= 1.27.0, but only %u.%u.%u found",
> + guc_version.major, guc_version.minor, guc_version.patch);
shouldn't we call xe_sriov_vf_migration_disable() instead?
> + return;
> + }
> +
> + xe->sriov.vf.migration.resfix_marker_enabled = true;
> + xe_sriov_dbg(xe, "migrate: Resfix start marker support is enabled\n");
> +}
> +
> /**
> * xe_sriov_vf_init_early - Initialize SR-IOV VF specific data.
> * @xe: the &xe_device to initialize
> @@ -188,6 +229,7 @@ void xe_sriov_vf_init_early(struct xe_device *xe)
> */
> int xe_sriov_vf_init_late(struct xe_device *xe)
> {
> + vf_resfix_start_marker_init(xe);
> return xe_sriov_vf_ccs_init(xe);
> }
>
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> index d5f72d667817..626c11a6dd1b 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> @@ -38,6 +38,11 @@ struct xe_device_vf {
> * was turned off due to missing prerequisites
> */
> bool disabled;
> + /**
> + * @migration.resfix_marker_enabled: flag indicating if resfix marker
> + * support was enabled or not due to missing prerequisites.
> + */
> + bool resfix_marker_enabled;
> } migration;
>
> /** @ccs: VF CCS state data */