From: Michal Wajdeczko <michal.wajdeczko@intel.com>
To: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>,
<intel-xe@lists.freedesktop.org>
Cc: Matthew Brost <matthew.brost@intel.com>,
Tomasz Lis <tomasz.lis@intel.com>
Subject: Re: [PATCH v4 2/3] drm/xe/vf: Introduce RESFIX start marker support
Date: Wed, 19 Nov 2025 18:24:45 +0100
Message-ID: <dfab58cc-2d70-446c-9041-b980e64da10e@intel.com>
In-Reply-To: <20251118114116.3429730-3-satyanarayana.k.v.p@intel.com>
On 11/18/2025 12:41 PM, Satyanarayana K V P wrote:
> In scenarios involving double migration, the VF KMD may encounter
> situations where it is instructed to re-migrate before having the
> opportunity to send RESFIX_DONE for the initial migration. This can occur
> when the fix-up for the prior migration is still underway, but the VF KMD
> is migrated again.
>
> Consequently, this may lead to the possibility of sending two migration
> notifications (i.e., pending fix-up for the first migration and a second
> notification for the new migration). Upon receiving the first RES_FIX
> notification, the GuC will resume VF submission on the GPU, potentially
> resulting in undefined behavior, such as system hangs or crashes.
>
> To avoid this, post migration, a marker is sent to the GUC prior to the
> start of resource fixups to indicate start of resource fixups. The same
> marker is sent along with RESFIX_DONE notification so that GUC can avoid
> submitting jobs to HW in case of double migration.
>
> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Tomasz Lis <tomasz.lis@intel.com>
>
> ---
> V3 -> V4:
> - Updated RESFIX_DONE action name and documentation part. (Michal W)
> - Enable resfix_start marker by default as save/restore is gated on
> GuC version 70.54.0
>
> V2 -> V3:
> - Fixed review comments (Michal W).
> - Updated commit message.
> - Fixed CI.BAT issues.
> - Added helper function to assert on unsupported GUC versions.
> - Updated RESFIX_DONE action name and documentation part.
>
> V1 -> V2:
> - Squashed "Enable RESFIX start marker only on supported GUC
> versions" commit into a single commit. (Matt B)
> ---
> .../gpu/drm/xe/abi/guc_actions_sriov_abi.h | 60 +++++++++++---
> drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 80 ++++++++++++++-----
> drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 5 ++
> drivers/gpu/drm/xe/xe_sriov_vf.c | 16 +++-
> 4 files changed, 131 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> index 0b28659d94e9..1d84ce07b201 100644
> --- a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> +++ b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> @@ -502,13 +502,15 @@
> #define VF2GUC_VF_RESET_RESPONSE_MSG_0_MBZ GUC_HXG_RESPONSE_MSG_0_DATA0
>
> /**
> - * DOC: VF2GUC_NOTIFY_RESFIX_DONE
> + * DOC: VF2GUC_RESFIX_DONE
> *
> - * This action is used by VF to notify the GuC that the VF KMD has completed
> + * This action is used by VF to inform the GuC that the VF KMD has completed
> * post-migration recovery steps.
please mention that from 1.27 it shall only be sent after posting RESFIX_START
and that both @MARKER fields must match
> *
> * This message must be sent as `MMIO HXG Message`_.
> *
> + * Available since GuC VF compatibility 1.27.0.
hmm, actually RESFIX_DONE is also available prior to 1.27,
only the meaning of DATA0 has changed
maybe:
* Updated since GuC VF compatibility 1.27.0.
> + *
> * +---+-------+--------------------------------------------------------------+
> * | | Bits | Description |
> * +===+=======+==============================================================+
> @@ -516,9 +518,9 @@
> * | +-------+--------------------------------------------------------------+
> * | | 30:28 | TYPE = GUC_HXG_TYPE_REQUEST_ |
> * | +-------+--------------------------------------------------------------+
> - * | | 27:16 | DATA0 = MBZ |
> + * | | 27:16 | DATA0 = MARKER - can't be zero |
and then we can keep the legacy definition for the record:
* | +-------+--------------------------------------------------------------+
- * | | 27:16 | DATA0 = MBZ |
+ * | | 27:16 | DATA0 = MARKER = MBZ (only prior to 1.27.0) |
* | +-------+--------------------------------------------------------------+
+ * | | 27:16 | DATA0 = MARKER - can't be zero (1.27.0+) |
+ * | +-------+--------------------------------------------------------------+
> * | +-------+--------------------------------------------------------------+
> - * | | 15:0 | ACTION = _`GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE` = 0x5508 |
> + * | | 15:0 | ACTION = _`GUC_ACTION_VF2GUC_RESFIX_DONE` = 0x5508 |
> * +---+-------+--------------------------------------------------------------+
> *
> * +---+-------+--------------------------------------------------------------+
> @@ -531,13 +533,13 @@
> * | | 27:0 | DATA0 = MBZ |
> * +---+-------+--------------------------------------------------------------+
> */
> -#define GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE 0x5508u
> +#define GUC_ACTION_VF2GUC_RESFIX_DONE 0x5508u
>
> -#define VF2GUC_NOTIFY_RESFIX_DONE_REQUEST_MSG_LEN GUC_HXG_REQUEST_MSG_MIN_LEN
> -#define VF2GUC_NOTIFY_RESFIX_DONE_REQUEST_MSG_0_MBZ GUC_HXG_REQUEST_MSG_0_DATA0
> +#define VF2GUC_RESFIX_DONE_REQUEST_MSG_LEN GUC_HXG_REQUEST_MSG_MIN_LEN
> +#define VF2GUC_RESFIX_DONE_REQUEST_MSG_0_MARKER GUC_HXG_REQUEST_MSG_0_DATA0
>
> -#define VF2GUC_NOTIFY_RESFIX_DONE_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN
> -#define VF2GUC_NOTIFY_RESFIX_DONE_RESPONSE_MSG_0_MBZ GUC_HXG_RESPONSE_MSG_0_DATA0
> +#define VF2GUC_RESFIX_DONE_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN
> +#define VF2GUC_RESFIX_DONE_RESPONSE_MSG_0_MBZ GUC_HXG_RESPONSE_MSG_0_DATA0
>
> /**
> * DOC: VF2GUC_QUERY_SINGLE_KLV
> @@ -656,4 +658,44 @@
> #define PF2GUC_SAVE_RESTORE_VF_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN
> #define PF2GUC_SAVE_RESTORE_VF_RESPONSE_MSG_0_USED GUC_HXG_RESPONSE_MSG_0_DATA0
>
> +/**
> + * DOC: VF2GUC_RESFIX_START
> + *
> + * This action is used by VF to inform the GuC that the VF KMD will be starting
> + * post-migration recovery fixups.
please mention that @MARKER sent here must later match the MARKER posted in the
VF2GUC_RESFIX_DONE_ message
> + *
> + * This message must be sent as `MMIO HXG Message`_.
> + *
> + * Available since GuC VF compatibility 1.27.0.
> + *
> + * +---+-------+--------------------------------------------------------------+
> + * | | Bits | Description |
> + * +===+=======+==============================================================+
> + * | 0 | 31 | ORIGIN = GUC_HXG_ORIGIN_HOST_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 30:28 | TYPE = GUC_HXG_TYPE_REQUEST_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 27:16 | DATA0 = MARKER - can't be zero |
> + * | +-------+--------------------------------------------------------------+
> + * | | 15:0 | ACTION = _`GUC_ACTION_VF2GUC_RESFIX_START` = 0x550F |
> + * +---+-------+--------------------------------------------------------------+
> + *
> + * +---+-------+--------------------------------------------------------------+
> + * | | Bits | Description |
> + * +===+=======+==============================================================+
> + * | 0 | 31 | ORIGIN = GUC_HXG_ORIGIN_GUC_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 30:28 | TYPE = GUC_HXG_TYPE_RESPONSE_SUCCESS_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 27:0 | DATA0 = MBZ |
> + * +---+-------+--------------------------------------------------------------+
> + */
> +#define GUC_ACTION_VF2GUC_RESFIX_START 0x550Fu
> +
> +#define VF2GUC_RESFIX_START_REQUEST_MSG_LEN GUC_HXG_REQUEST_MSG_MIN_LEN
> +#define VF2GUC_RESFIX_START_REQUEST_MSG_0_MARKER GUC_HXG_REQUEST_MSG_0_DATA0
> +
> +#define VF2GUC_RESFIX_START_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN
> +#define VF2GUC_RESFIX_START_RESPONSE_MSG_0_MBZ GUC_HXG_RESPONSE_MSG_0_DATA0
> +
> #endif
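
For reference, the request layout documented in the RESFIX_START and RESFIX_DONE tables can be sketched in plain C. This is only an illustration: the macro and function names below are hypothetical (the driver itself builds the dword with FIELD_PREP and the GUC_HXG_* masks), and the zero values for the HOST origin and REQUEST type are assumed here, they are not stated in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions taken from the tables in the patch. */
#define SKETCH_HXG_ORIGIN_SHIFT	31
#define SKETCH_HXG_TYPE_SHIFT	28
#define SKETCH_HXG_DATA0_SHIFT	16
#define SKETCH_HXG_DATA0_MAX	0xFFFu	/* bits 27:16, 12 bits wide */
#define SKETCH_HXG_ACTION_MASK	0xFFFFu	/* bits 15:0 */

/* Assumed encodings, not given in this patch. */
#define SKETCH_HXG_ORIGIN_HOST	0u
#define SKETCH_HXG_TYPE_REQUEST	0u

/* Action codes from the patch. */
#define SKETCH_ACTION_RESFIX_DONE	0x5508u
#define SKETCH_ACTION_RESFIX_START	0x550Fu

/* Pack dword 0 of a RESFIX_START or RESFIX_DONE request. */
static uint32_t sketch_resfix_request(uint32_t action, uint32_t marker)
{
	/* MARKER can't be zero and has to fit into DATA0. */
	assert(marker != 0 && marker <= SKETCH_HXG_DATA0_MAX);

	return (SKETCH_HXG_ORIGIN_HOST << SKETCH_HXG_ORIGIN_SHIFT) |
	       (SKETCH_HXG_TYPE_REQUEST << SKETCH_HXG_TYPE_SHIFT) |
	       (marker << SKETCH_HXG_DATA0_SHIFT) |
	       (action & SKETCH_HXG_ACTION_MASK);
}
```

Under those assumptions, marker 1 gives 0x0001550F for RESFIX_START and 0x00015508 for the matching RESFIX_DONE.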
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 4c73a077d314..08c00b773a13 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -299,12 +299,13 @@ void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
> *found = gt->sriov.vf.guc_version;
> }
>
> -static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
> +static int guc_action_vf_notify_resfix_start(struct xe_guc *guc, u16 marker)
> {
> u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = {
> FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) |
> FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> - FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION, GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE),
> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION, GUC_ACTION_VF2GUC_RESFIX_START) |
> + FIELD_PREP(VF2GUC_RESFIX_START_REQUEST_MSG_0_MARKER, marker),
> };
> int ret;
>
> @@ -313,30 +314,54 @@ static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
> return ret > 0 ? -EPROTO : ret;
> }
>
> -/**
> - * vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
> - * @gt: the &xe_gt struct instance linked to target GuC
> - *
> - * Returns: 0 if the operation completed successfully, or a negative error
> - * code otherwise.
> - */
> -static int vf_notify_resfix_done(struct xe_gt *gt)
> +static int vf_notify_resfix_start(struct xe_gt *gt, u16 marker)
> {
> struct xe_guc *guc = &gt->uc.guc;
> int err;
>
> xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>
> - err = guc_action_vf_notify_resfix_done(guc);
> + xe_gt_sriov_dbg(guc_to_gt(guc), "Sending resfix start marker %u\n", marker);
shouldn't this be xe_gt_sriov_dbg_verbose() instead?
> +
> + err = guc_action_vf_notify_resfix_start(guc, marker);
> if (unlikely(err))
> - xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup done (%pe)\n",
> + xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup start(%pe)\n",
add space between "start" and "(%pe)"
> ERR_PTR(err));
> - else
> - xe_gt_sriov_dbg_verbose(gt, "sent GuC resource fixup done\n");
>
> return err;
> }
>
> +static int guc_action_vf_notify_resfix_done(struct xe_guc *guc, u16 marker)
> +{
> + u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = {
> + FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) |
> + FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION, GUC_ACTION_VF2GUC_RESFIX_DONE) |
> + FIELD_PREP(VF2GUC_RESFIX_DONE_REQUEST_MSG_0_MARKER, marker),
> + };
> + int ret;
> +
> + ret = xe_guc_mmio_send(guc, request, ARRAY_SIZE(request));
> +
> + return ret > 0 ? -EPROTO : ret;
> +}
> +
> +static int vf_notify_resfix_done(struct xe_gt *gt, u16 marker)
> +{
> + struct xe_guc *guc = &gt->uc.guc;
> + int err;
> +
> + xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> +
> + xe_gt_sriov_dbg(guc_to_gt(guc), "Sending resfix done marker %u\n", marker);
dbg_verbose ?
> +
> + err = guc_action_vf_notify_resfix_done(guc, marker);
> + if (unlikely(err))
> + xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup done (%pe)\n",
hmm, it's not only that _we_ failed to notify, it could also be that _GuC_
encountered some error, as there is ERROR_RESFIX_FAILED, so maybe:
"Recovery failed at GuC FIXUP_DONE step (%pe)"
> + ERR_PTR(err));
> + return err;
> +}
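
The `ret > 0 ? -EPROTO : ret` tail shared by both helpers can be read as follows. A minimal sketch, assuming the MMIO send helper returns a negative errno on failure and otherwise the DATA0 value from the GuC response; the function name below is hypothetical. Since the response DATA0 is documented as MBZ, any positive value is folded into a protocol error.

```c
#include <errno.h>

/*
 * Hypothetical helper: normalize the result of sending a RESFIX message.
 *   ret < 0  - send or GuC error, passed through unchanged
 *   ret == 0 - success, response DATA0 was zero as required (MBZ)
 *   ret > 0  - response carried unexpected data, report -EPROTO
 */
static int sketch_resfix_result(int ret)
{
	return ret > 0 ? -EPROTO : ret;
}
```

This is also why the error message can cover both cases: a negative value may come from our side or from GuC.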
> +
> static int guc_action_query_single_klv(struct xe_guc *guc, u32 key,
> u32 *value, u32 value_len)
> {
> @@ -1183,7 +1208,7 @@ static void vf_post_migration_abort(struct xe_gt *gt)
> xe_guc_submit_pause_abort(&gt->uc.guc);
> }
>
> -static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
> +static int vf_post_migration_notify_resfix_done(struct xe_gt *gt, u16 marker)
> {
> bool skip_resfix = false;
>
> @@ -1206,14 +1231,21 @@ static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
> */
> xe_irq_resume(gt_to_xe(gt));
>
> - return vf_notify_resfix_done(gt);
> + return vf_notify_resfix_done(gt, marker);
> +}
> +
> +static u16 vf_post_migration_resfix_start_marker(struct xe_gt *gt)
> +{
> + xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> + return ++gt->sriov.vf.migration.resfix_marker;
should we protect that with lock?
also see below
> }
>
> static void vf_post_migration_recovery(struct xe_gt *gt)
> {
> struct xe_device *xe = gt_to_xe(gt);
> - int err;
> + u16 marker;
> bool retry;
> + int err;
>
> xe_gt_sriov_dbg(gt, "migration recovery in progress\n");
>
> @@ -1227,13 +1259,25 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
> goto fail;
> }
>
> + /*
> + * Increment the startup marker again if it overflows, since GUC
> + * requires a non-zero marker to be set.
> + */
> + marker = vf_post_migration_resfix_start_marker(gt);
> + if (!marker)
> + marker = vf_post_migration_resfix_start_marker(gt);
this "overflow" handling should live in vf_post_migration_resfix_start_marker()
OTOH, looking at the expected flow, maybe we don't need to track this
marker at all: it should be sufficient to always pass the same constant
non-zero value, as GuC will just compare it with 0 anyway
and since we send RESFIX_START/DONE only from within this worker, we will
never have two parallel recovery sequences which would warrant different
markers
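
A minimal sketch of what folding the wrap handling into the helper could look like, written as a user-space model rather than driver code. The struct and function names are hypothetical. The sketch also keeps the value within 12 bits, on the assumption that MARKER travels in DATA0 bits 27:16 as the ABI table shows; neither the patch nor this review asks for that.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MARKER_MAX	0xFFFu	/* assumed width of DATA0 (bits 27:16) */

/* Hypothetical stand-in for the per-GT migration state from the patch. */
struct sketch_vf_migration {
	uint16_t resfix_marker;
};

/* Return the next marker, never zero, wrapping inside 1..SKETCH_MARKER_MAX. */
static uint16_t sketch_next_resfix_marker(struct sketch_vf_migration *m)
{
	m->resfix_marker = (uint16_t)((m->resfix_marker + 1u) & SKETCH_MARKER_MAX);
	if (!m->resfix_marker)
		m->resfix_marker = 1;	/* skip 0, GuC requires non-zero */

	assert(m->resfix_marker != 0);
	return m->resfix_marker;
}
```

The caller then needs no second call on wrap. With the constant-marker alternative, the helper and the new struct field would go away entirely.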
> +
> + err = vf_notify_resfix_start(gt, marker);
> + if (err)
> + goto fail;
> +
> err = vf_post_migration_fixups(gt);
> if (err)
> goto fail;
>
> vf_post_migration_rearm(gt);
>
> - err = vf_post_migration_notify_resfix_done(gt);
> + err = vf_post_migration_notify_resfix_done(gt, marker);
> if (err && err != -EAGAIN)
> goto fail;
>
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> index 420b0e6089de..66c0062a42c6 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> @@ -52,6 +52,11 @@ struct xe_gt_sriov_vf_migration {
> wait_queue_head_t wq;
> /** @scratch: Scratch memory for VF recovery */
> void *scratch;
> + /**
> + * @resfix_marker: Marker sent on start and on end of post-migration
> + * steps.
> + */
> + u16 resfix_marker;
> /** @recovery_teardown: VF post migration recovery is being torn down */
> bool recovery_teardown;
> /** @recovery_queued: VF post migration recovery in queued */
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> index b73498097df5..64b2ddabd3f9 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> @@ -49,11 +49,13 @@
> *
> * As soon as Virtual GPU of the VM starts, the VF driver within receives
> * the MIGRATED interrupt and schedules post-migration recovery worker.
> - * That worker queries GuC for new provisioning (using MMIO communication),
> + * That worker sends `VF2GUC_NOTIFY_RESFIX_START` action along with non-zero
drop NOTIFY tag and use trailing _ to create a link:
VF2GUC_RESFIX_START_
> + * marker, queries GuC for new provisioning (using MMIO communication),
> * and applies fixups to any non-virtualized resources used by the VF.
> *
> * When the VF driver is ready to continue operation on the newly connected
> - * hardware, it sends `VF2GUC_NOTIFY_RESFIX_DONE` which causes it to
> + * hardware, it sends `VF2GUC_NOTIFY_RESFIX_DONE` action along with the same
> + * marker which was sent with `VF2GUC_NOTIFY_RESFIX_START` which causes it to
ditto
> * enter the long awaited `VF_RUNNING` state, and therefore start handling
> * CTB messages and scheduling workloads from the VF::
> *
> @@ -102,6 +104,11 @@
> * | [ ] new VF provisioning [ ]
> * | [ ]---------------------------> [ ]
> * | | [ ]
> + * | | VF2GUC_NOTIFY_RESFIX_START [ ]
ditto, drop NOTIFY
> + * | [ ] <---------------------------[ ]
> + * | [ ] [ ]
> + * | [ ] success [ ]
> + * | [ ]---------------------------> [ ]
> * | | VF driver applies post [ ]
> * | | migration fixups -------[ ]
> * | | | [ ]
> @@ -114,7 +121,10 @@
> * | [ ]------- VF_RUNNING [ ]
> * | [ ] | [ ]
> * | [ ] <----- [ ]
> - * | [ ] success [ ]
> + * | [ ] success (on marker match) [ ]
> + * | [ ]---------------------------> [ ]
> + * | [ ] error (on marker match) [ ]
> + * | [ ] ERROR_RESFIX_MARKER_MISMATCH[ ]
this error is about bad programming, not worth mentioning here
for the double-migration case, we expect STATUS_VF_MIGRATED instead
and in case of error/double migration, VF will not be moved to RUNNING state
> * | [ ]---------------------------> [ ]
> * | | |
> * | | |
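
To make the expected flow concrete, here is a toy model of the handshake. It is only a reading of the commit message and of the comments above, not a description of the GuC firmware: all names are hypothetical, and the rule that a new migration discards the previously recorded marker is an assumption of the model.

```c
#include <stdbool.h>
#include <stdint.h>

enum sketch_vf_state { SKETCH_VF_MIGRATED, SKETCH_VF_RUNNING };

struct sketch_guc {
	enum sketch_vf_state state;
	uint16_t started_marker;	/* 0: no RESFIX_START since last migration */
};

/* A (re-)migration moves the VF out of RUNNING and forgets any marker. */
static void sketch_migrate(struct sketch_guc *g)
{
	g->state = SKETCH_VF_MIGRATED;
	g->started_marker = 0;
}

/* RESFIX_START: record the non-zero marker. */
static bool sketch_resfix_start(struct sketch_guc *g, uint16_t marker)
{
	if (!marker)
		return false;
	g->started_marker = marker;
	return true;
}

/* RESFIX_DONE: only a matching non-zero marker moves the VF to RUNNING. */
static bool sketch_resfix_done(struct sketch_guc *g, uint16_t marker)
{
	if (!marker || marker != g->started_marker)
		return false;	/* VF stays in MIGRATED, no submission resumes */
	g->state = SKETCH_VF_RUNNING;
	g->started_marker = 0;
	return true;
}
```

In this model a RESFIX_DONE that arrives after a second migration is rejected even when the VF always uses the same constant marker, because no RESFIX_START has been seen since that migration.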