From: Matthew Brost <matthew.brost@intel.com>
To: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
Michal Wajdeczko <michal.wajdeczko@intel.com>,
Tomasz Lis <tomasz.lis@intel.com>
Subject: Re: [PATCH v2 1/2] drm/xe/vf: Introduce RESFIX start marker support
Date: Thu, 23 Oct 2025 13:37:33 -0700 [thread overview]
Message-ID: <aPqSDVJ4QR7kzT07@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <20251023153616.3790-5-satyanarayana.k.v.p@intel.com>
On Thu, Oct 23, 2025 at 09:06:11PM +0530, Satyanarayana K V P wrote:
> Post migration, a marker is sent to the GUC prior to the start of resource
> fixups to indicate start of resource fixups. The same marker is sent along
> with RESFIX_DONE notification so that GUC can avoid submitting jobs to HW
> in case of double migration.
>
> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Tomasz Lis <tomasz.lis@intel.com>
>
> ---
> V1 -> V2:
> - Squashed "Enable RESFIX start marker only on supported GUC
> versions" commit into a single commit. (Matt B)
> ---
> .../gpu/drm/xe/abi/guc_actions_sriov_abi.h | 38 ++++++++
> drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 87 +++++++++++++++++--
> drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 5 ++
> drivers/gpu/drm/xe/xe_sriov_vf.c | 42 ++++++++-
> drivers/gpu/drm/xe/xe_sriov_vf_types.h | 5 ++
> 5 files changed, 169 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> index 0b28659d94e9..b9141497bfd5 100644
> --- a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> +++ b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> @@ -656,4 +656,42 @@
> #define PF2GUC_SAVE_RESTORE_VF_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN
> #define PF2GUC_SAVE_RESTORE_VF_RESPONSE_MSG_0_USED GUC_HXG_RESPONSE_MSG_0_DATA0
>
> +/**
> + * DOC: VF2GUC_NOTIFY_RESFIX_START
> + *
> + * This action is used by VF to notify the GuC that the VF KMD will be starting
> + * post-migration recovery steps.
> + *
> + * This message must be sent as `MMIO HXG Message`_.
> + *
> + * +---+-------+--------------------------------------------------------------+
> + * | | Bits | Description |
> + * +===+=======+==============================================================+
> + * | 0 | 31 | ORIGIN = GUC_HXG_ORIGIN_HOST_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 30:28 | TYPE = GUC_HXG_TYPE_REQUEST_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 27:16 | DATA0 = MBZ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 15:0 | ACTION = _`GUC_ACTION_VF2GUC_NOTIFY_RESFIX_START` = 0x550F |
> + * +---+-------+--------------------------------------------------------------+
> + *
> + * +---+-------+--------------------------------------------------------------+
> + * | | Bits | Description |
> + * +===+=======+==============================================================+
> + * | 0 | 31 | ORIGIN = GUC_HXG_ORIGIN_GUC_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 30:28 | TYPE = GUC_HXG_TYPE_RESPONSE_SUCCESS_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 27:0 | DATA0 = MBZ |
> + * +---+-------+--------------------------------------------------------------+
> + */
> +#define GUC_ACTION_VF2GUC_NOTIFY_RESFIX_START 0x550Fu
> +
> +#define VF2GUC_NOTIFY_RESFIX_START_REQUEST_MSG_LEN GUC_HXG_REQUEST_MSG_MIN_LEN
> +#define VF2GUC_NOTIFY_RESFIX_START_REQUEST_MSG_0_MBZ GUC_HXG_REQUEST_MSG_0_DATA0
> +
> +#define VF2GUC_NOTIFY_RESFIX_START_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN
> +#define VF2GUC_NOTIFY_RESFIX_START_RESPONSE_MSG_0_MBZ GUC_HXG_RESPONSE_MSG_0_DATA0
> +
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index d0b102ab6ce8..8c1448d6c81d 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -299,12 +299,55 @@ void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
> *found = gt->sriov.vf.guc_version;
> }
>
> -static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
> +static int guc_action_vf_notify_resfix_start(struct xe_guc *guc, u16 marker)
> {
> u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = {
> FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) |
> FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> - FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION, GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE),
> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION,
> + GUC_ACTION_VF2GUC_NOTIFY_RESFIX_START) |
> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_DATA0, marker),
> + };
> + int ret;
> +
> + ret = xe_guc_mmio_send(guc, request, ARRAY_SIZE(request));
> +
> + return ret > 0 ? -EPROTO : ret;
> +}
> +
> +/**
> + * vf_notify_resfix_start - Notify GuC about start of resource fixups.
> + * @gt: the &xe_gt struct instance linked to target GuC
> + * @marker: marker to identify the migration.
> + *
> + * Returns: 0 if the operation completed successfully, or a negative error
> + * code otherwise.
> + */
> +static int vf_notify_resfix_start(struct xe_gt *gt, u16 marker)
> +{
> + struct xe_guc *guc = >->uc.guc;
> + int err;
> +
> + xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> +
> + err = guc_action_vf_notify_resfix_start(guc, marker);
> + if (unlikely(err))
> + xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup start (%pe)\n",
> + ERR_PTR(err));
> + else
> + xe_gt_sriov_dbg_verbose(gt, "sent GuC resource fixup start\n");
> +
> + return err;
> +}
> +
> +static int guc_action_vf_notify_resfix_done(struct xe_guc *guc, u16 marker)
> +{
> + u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = {
> + FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) |
> + FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION,
> + GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE) |
> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_DATA0, marker),
> };
> int ret;
>
> @@ -316,18 +359,19 @@ static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
> /**
> * vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
> * @gt: the &xe_gt struct instance linked to target GuC
> + * @marker: marker to identify the migration.
> *
> * Returns: 0 if the operation completed successfully, or a negative error
> * code otherwise.
> */
> -static int vf_notify_resfix_done(struct xe_gt *gt)
> +static int vf_notify_resfix_done(struct xe_gt *gt, u16 marker)
> {
> struct xe_guc *guc = >->uc.guc;
> int err;
>
> xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>
> - err = guc_action_vf_notify_resfix_done(guc);
> + err = guc_action_vf_notify_resfix_done(guc, marker);
> if (unlikely(err))
> xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup done (%pe)\n",
> ERR_PTR(err));
> @@ -1183,7 +1227,7 @@ static void vf_post_migration_abort(struct xe_gt *gt)
> xe_guc_submit_pause_abort(>->uc.guc);
> }
>
> -static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
> +static int vf_post_migration_notify_resfix_done(struct xe_gt *gt, u16 marker)
> {
> bool skip_resfix = false;
>
> @@ -1206,12 +1250,27 @@ static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
> */
> xe_irq_resume(gt_to_xe(gt));
>
> - return vf_notify_resfix_done(gt);
> + return vf_notify_resfix_done(gt, marker);
> +}
> +
> +static bool vf_resfix_start_marker_supported(struct xe_gt *gt)
> +{
> + struct xe_device *xe = gt_to_xe(gt);
> +
> + xe_gt_assert(gt, IS_SRIOV_VF(xe));
> + return xe->sriov.vf.migration.resfix_marker_enabled;
> +}
> +
> +static u16 vf_post_migration_resfix_start_marker(struct xe_gt *gt)
> +{
> + xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> + return ++gt->sriov.vf.migration.resfix_marker;
> }
>
> static void vf_post_migration_recovery(struct xe_gt *gt)
> {
> struct xe_device *xe = gt_to_xe(gt);
> + u16 marker = 0;
> int err;
> bool retry;
>
> @@ -1227,13 +1286,27 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
> goto fail;
> }
>
> + /*
> + * Increment the startup marker again if it overflows, since GUC
> + * requires a non-zero marker to be set.
> + */
> + if (vf_resfix_start_marker_supported(gt)) {
> + marker = vf_post_migration_resfix_start_marker(gt);
> + if (!marker)
> + marker = vf_post_migration_resfix_start_marker(gt);
Nit: Maybe just don't zero in vf_post_migration_resfix_start_marker.
> + }
> +
> + err = vf_notify_resfix_start(gt, marker);
> + if (err)
> + goto fail;
> +
> err = vf_post_migration_fixups(gt);
> if (err)
> goto fail;
>
> vf_post_migration_rearm(gt);
>
> - err = vf_post_migration_notify_resfix_done(gt);
> + err = vf_post_migration_notify_resfix_done(gt, marker);
> if (err && err != -EAGAIN)
> goto fail;
>
Nit: You missed my comment skipping kickstart if err == -EAGAIN but I
think that harmless, in existing code, and can be done in a follow up.
Anyways patch looks functionaly correct to me:
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> index 420b0e6089de..ccd850313328 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> @@ -60,6 +60,11 @@ struct xe_gt_sriov_vf_migration {
> bool recovery_inprogress;
> /** @ggtt_need_fixes: VF GGTT needs fixes */
> bool ggtt_need_fixes;
> + /**
> + * @resfix_marker: Marker sent to Guc prior to starting the
> + * post‑migration.
> + */
> + u16 resfix_marker;
> };
>
> /**
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> index 39c829daa97c..10d6e43fffce 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> @@ -55,7 +55,21 @@
> * When the VF driver is ready to continue operation on the newly connected
> * hardware, it sends `VF2GUC_NOTIFY_RESFIX_DONE` which causes it to
> * enter the long awaited `VF_RUNNING` state, and therefore start handling
> - * CTB messages and scheduling workloads from the VF::
> + * CTB messages and scheduling workloads from the VF.
> + *
> + * In scenarios involving double migration, the VF KMD may encounter situations
> + * where it is instructed to re-migrate before having the opportunity to send
> + * RESFIX_DONE for the initial migration. This can occur when the fix-up for the
> + * prior migration is still underway, but the VF KMD is migrated again.
> + * Consequently, this may lead to the possibility of sending two migration
> + * notifications (i.e., pending fix-up for the first migration and a second
> + * notification for the new migration). Upon receiving the first RES_FIX
> + * notification, the GuC will resume VF submission on the GPU, potentially
> + * resulting in undefined behavior, such as system hangs or crashes.
> + *
> + * To avoid these hangs, a new VF2GUC action `VF2GUC_NOTIFY_RESFIX_START` is
> + * sent along with marker and when GUC receives the same marker with
> + * `VF2GUC_NOTIFY_RESFIX_DONE`action, it starts scheduling work loads from VF::
> *
> * PF GuC VF
> * [ ] | |
> @@ -102,6 +116,11 @@
> * | [ ] new VF provisioning [ ]
> * | [ ]---------------------------> [ ]
> * | | [ ]
> + * | | VF2GUC_NOTIFY_RESFIX_START [ ]
> + * | [ ] <---------------------------[ ]
> + * | [ ] [ ]
> + * | [ ] success [ ]
> + * | [ ]---------------------------> [ ]
> * | | VF driver applies post [ ]
> * | | migration fixups -------[ ]
> * | | | [ ]
> @@ -169,6 +188,26 @@ static void vf_migration_init_early(struct xe_device *xe)
>
> }
>
> +static void vf_resfix_start_marker_init_early(struct xe_device *xe)
> +{
> + struct xe_gt *gt = xe_root_mmio_gt(xe);
> + struct xe_uc_fw_version guc_version;
> +
> + if (xe->sriov.vf.migration.disabled)
> + return;
> +
> + xe_gt_sriov_vf_guc_versions(gt, NULL, &guc_version);
> + if (MAKE_GUC_VER_STRUCT(guc_version) < MAKE_GUC_VER(1, 24, 10)) {
> + xe_sriov_notice(xe,
> + "Resfix start marker requires GUC ABI >= 1.24.10, but only %u.%u.%u found",
> + guc_version.major, guc_version.minor, guc_version.patch);
> + return;
> + }
> +
> + xe->sriov.vf.migration.resfix_marker_enabled = true;
> + xe_sriov_dbg(xe, "migrate: Resfix start marker support is enabled\n");
> +}
> +
> /**
> * xe_sriov_vf_init_early - Initialize SR-IOV VF specific data.
> * @xe: the &xe_device to initialize
> @@ -176,6 +215,7 @@ static void vf_migration_init_early(struct xe_device *xe)
> void xe_sriov_vf_init_early(struct xe_device *xe)
> {
> vf_migration_init_early(xe);
> + vf_resfix_start_marker_init_early(xe);
> }
>
> /**
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> index d5f72d667817..626c11a6dd1b 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> @@ -38,6 +38,11 @@ struct xe_device_vf {
> * was turned off due to missing prerequisites
> */
> bool disabled;
> + /**
> + * @migration.resfix_marker_enabled: flag indicating if resfix marker
> + * support was enabled or not due to missing prerequisites.
> + */
> + bool resfix_marker_enabled;
> } migration;
>
> /** @ccs: VF CCS state data */
> --
> 2.51.0
>
next prev parent reply other threads:[~2025-10-23 20:37 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-23 15:36 [PATCH v2 0/2] VF double migration Satyanarayana K V P
2025-10-23 15:36 ` [PATCH v2 1/2] drm/xe/vf: Introduce RESFIX start marker support Satyanarayana K V P
2025-10-23 20:37 ` Matthew Brost [this message]
2025-10-23 20:54 ` Matthew Brost
2025-10-31 20:10 ` Matthew Brost
2025-10-23 23:33 ` Michal Wajdeczko
2025-10-23 15:36 ` [PATCH v2 2/2] drm/xe/vf: Use fault injection for testing VF double migration feature Satyanarayana K V P
2025-10-24 11:52 ` Michal Wajdeczko
2025-10-23 17:18 ` ✓ CI.KUnit: success for VF double migration (rev2) Patchwork
2025-10-23 18:17 ` ✗ Xe.CI.BAT: failure " Patchwork
2025-10-24 5:23 ` ✗ Xe.CI.Full: " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aPqSDVJ4QR7kzT07@lstrano-desk.jf.intel.com \
--to=matthew.brost@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=michal.wajdeczko@intel.com \
--cc=satyanarayana.k.v.p@intel.com \
--cc=tomasz.lis@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox