From: Michal Wajdeczko <michal.wajdeczko@intel.com>
To: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>,
<intel-xe@lists.freedesktop.org>
Cc: Matthew Brost <matthew.brost@intel.com>,
Tomasz Lis <tomasz.lis@intel.com>
Subject: Re: [PATCH v2 1/2] drm/xe/vf: Introduce RESFIX start marker support
Date: Fri, 24 Oct 2025 01:33:12 +0200 [thread overview]
Message-ID: <cd44e3b5-cd3d-4297-a1a3-f417163e8562@intel.com> (raw)
In-Reply-To: <20251023153616.3790-5-satyanarayana.k.v.p@intel.com>
On 10/23/2025 5:36 PM, Satyanarayana K V P wrote:
> Post migration, a marker is sent to the GUC prior to the start of resource
> fixups to indicate start of resource fixups. The same marker is sent along
> with RESFIX_DONE notification so that GUC can avoid submitting jobs to HW
> in case of double migration.
it might be still unclear "why" currently it is not working
try to better describe initial problem first
(done message could be sent without VF notice 2nd migration and need for new fixups),
not just the potential solution
(use of the start message with marker)
>
> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Tomasz Lis <tomasz.lis@intel.com>
>
> ---
> V1 -> V2:
> - Squashed "Enable RESFIX start marker only on supported GUC
> versions" commit into a single commit. (Matt B)
> ---
> .../gpu/drm/xe/abi/guc_actions_sriov_abi.h | 38 ++++++++
> drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 87 +++++++++++++++++--
> drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 5 ++
> drivers/gpu/drm/xe/xe_sriov_vf.c | 42 ++++++++-
> drivers/gpu/drm/xe/xe_sriov_vf_types.h | 5 ++
> 5 files changed, 169 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> index 0b28659d94e9..b9141497bfd5 100644
> --- a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> +++ b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h
> @@ -656,4 +656,42 @@
> #define PF2GUC_SAVE_RESTORE_VF_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN
> #define PF2GUC_SAVE_RESTORE_VF_RESPONSE_MSG_0_USED GUC_HXG_RESPONSE_MSG_0_DATA0
>
> +/**
> + * DOC: VF2GUC_NOTIFY_RESFIX_START
> + *
> + * This action is used by VF to notify the GuC that the VF KMD will be starting
> + * post-migration recovery steps.
since this is new H2G action, IMO we shall say here from which VF ABI version it is available
* Available since GuC version 70.xx.yy (VF 1.aa.bb)
but I can't find those numbers yet
> + *
> + * This message must be sent as `MMIO HXG Message`_.
> + *
> + * +---+-------+--------------------------------------------------------------+
> + * | | Bits | Description |
> + * +===+=======+==============================================================+
> + * | 0 | 31 | ORIGIN = GUC_HXG_ORIGIN_HOST_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 30:28 | TYPE = GUC_HXG_TYPE_REQUEST_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 27:16 | DATA0 = MBZ |
from the code it looks that this should be
* | | 27:16 | DATA0 = MARKER (....) |
> + * | +-------+--------------------------------------------------------------+
> + * | | 15:0 | ACTION = _`GUC_ACTION_VF2GUC_NOTIFY_RESFIX_START` = 0x550F |
> + * +---+-------+--------------------------------------------------------------+
> + *
> + * +---+-------+--------------------------------------------------------------+
> + * | | Bits | Description |
> + * +===+=======+==============================================================+
> + * | 0 | 31 | ORIGIN = GUC_HXG_ORIGIN_GUC_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 30:28 | TYPE = GUC_HXG_TYPE_RESPONSE_SUCCESS_ |
> + * | +-------+--------------------------------------------------------------+
> + * | | 27:0 | DATA0 = MBZ |
> + * +---+-------+--------------------------------------------------------------+
> + */
> +#define GUC_ACTION_VF2GUC_NOTIFY_RESFIX_START 0x550Fu
> +
> +#define VF2GUC_NOTIFY_RESFIX_START_REQUEST_MSG_LEN GUC_HXG_REQUEST_MSG_MIN_LEN
> +#define VF2GUC_NOTIFY_RESFIX_START_REQUEST_MSG_0_MBZ GUC_HXG_REQUEST_MSG_0_DATA0
and this shall be
VF2GUC_NOTIFY_RESFIX_START_REQUEST_MSG_0_MARKER
> +
> +#define VF2GUC_NOTIFY_RESFIX_START_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN
> +#define VF2GUC_NOTIFY_RESFIX_START_RESPONSE_MSG_0_MBZ GUC_HXG_RESPONSE_MSG_0_DATA0
> +
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index d0b102ab6ce8..8c1448d6c81d 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -299,12 +299,55 @@ void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt,
> *found = gt->sriov.vf.guc_version;
> }
>
> -static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
> +static int guc_action_vf_notify_resfix_start(struct xe_guc *guc, u16 marker)
> {
> u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = {
> FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) |
> FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> - FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION, GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE),
> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION,
> + GUC_ACTION_VF2GUC_NOTIFY_RESFIX_START) |
> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_DATA0, marker),
> + };
> + int ret;
we should assert that negotiated ABI version supports that new action
> +
> + ret = xe_guc_mmio_send(guc, request, ARRAY_SIZE(request));
> +
> + return ret > 0 ? -EPROTO : ret;
> +}
> +
> +/**
> + * vf_notify_resfix_start - Notify GuC about start of resource fixups.
> + * @gt: the &xe_gt struct instance linked to target GuC
> + * @marker: marker to identify the migration.
> + *
> + * Returns: 0 if the operation completed successfully, or a negative error
> + * code otherwise.
> + */
no need to document trivial static functions,
there was a reason to make then small/trivial and thus self-documenting
> +static int vf_notify_resfix_start(struct xe_gt *gt, u16 marker)
> +{
> + struct xe_guc *guc = >->uc.guc;
> + int err;
> +
> + xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> +
put this dbg_verbose() here and include marker value in the log
> + err = guc_action_vf_notify_resfix_start(guc, marker);
> + if (unlikely(err))
> + xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup start (%pe)\n",
> + ERR_PTR(err));
> + else
> + xe_gt_sriov_dbg_verbose(gt, "sent GuC resource fixup start\n");
> +
> + return err;
> +}
> +
> +static int guc_action_vf_notify_resfix_done(struct xe_guc *guc, u16 marker)
> +{
> + u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = {
> + FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) |
> + FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) |
> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION,
> + GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE) |
> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_DATA0, marker),
note that not all GuCs will support non-zero marker
we shall assert that non-zero marker is only used on newer ABI
and zero on old ABI
unless this new ABI is Xe baseline (but AFAIK it's not)
> };
> int ret;
>
> @@ -316,18 +359,19 @@ static int guc_action_vf_notify_resfix_done(struct xe_guc *guc)
> /**
> * vf_notify_resfix_done - Notify GuC about resource fixups apply completed.
> * @gt: the &xe_gt struct instance linked to target GuC
> + * @marker: marker to identify the migration.
> *
> * Returns: 0 if the operation completed successfully, or a negative error
> * code otherwise.
> */
> -static int vf_notify_resfix_done(struct xe_gt *gt)
> +static int vf_notify_resfix_done(struct xe_gt *gt, u16 marker)
> {
> struct xe_guc *guc = >->uc.guc;
> int err;
>
> xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
>
> - err = guc_action_vf_notify_resfix_done(guc);
> + err = guc_action_vf_notify_resfix_done(guc, marker);
> if (unlikely(err))
> xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup done (%pe)\n",
> ERR_PTR(err));
> @@ -1183,7 +1227,7 @@ static void vf_post_migration_abort(struct xe_gt *gt)
> xe_guc_submit_pause_abort(>->uc.guc);
> }
>
> -static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
> +static int vf_post_migration_notify_resfix_done(struct xe_gt *gt, u16 marker)
> {
> bool skip_resfix = false;
>
> @@ -1206,12 +1250,27 @@ static int vf_post_migration_notify_resfix_done(struct xe_gt *gt)
> */
> xe_irq_resume(gt_to_xe(gt));
>
> - return vf_notify_resfix_done(gt);
> + return vf_notify_resfix_done(gt, marker);
> +}
> +
> +static bool vf_resfix_start_marker_supported(struct xe_gt *gt)
> +{
> + struct xe_device *xe = gt_to_xe(gt);
> +
> + xe_gt_assert(gt, IS_SRIOV_VF(xe));
> + return xe->sriov.vf.migration.resfix_marker_enabled;
> +}
> +
> +static u16 vf_post_migration_resfix_start_marker(struct xe_gt *gt)
> +{
> + xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt)));
> + return ++gt->sriov.vf.migration.resfix_marker;
> }
>
> static void vf_post_migration_recovery(struct xe_gt *gt)
> {
> struct xe_device *xe = gt_to_xe(gt);
> + u16 marker = 0;
> int err;
> bool retry;
>
> @@ -1227,13 +1286,27 @@ static void vf_post_migration_recovery(struct xe_gt *gt)
> goto fail;
> }
>
> + /*
> + * Increment the startup marker again if it overflows, since GUC
> + * requires a non-zero marker to be set.
> + */
> + if (vf_resfix_start_marker_supported(gt)) {
> + marker = vf_post_migration_resfix_start_marker(gt);
> + if (!marker)
> + marker = vf_post_migration_resfix_start_marker(gt);
> + }
> +
> + err = vf_notify_resfix_start(gt, marker);
> + if (err)
> + goto fail;
> +
> err = vf_post_migration_fixups(gt);
> if (err)
> goto fail;
>
> vf_post_migration_rearm(gt);
>
> - err = vf_post_migration_notify_resfix_done(gt);
> + err = vf_post_migration_notify_resfix_done(gt, marker);
> if (err && err != -EAGAIN)
> goto fail;
>
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> index 420b0e6089de..ccd850313328 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h
> @@ -60,6 +60,11 @@ struct xe_gt_sriov_vf_migration {
> bool recovery_inprogress;
> /** @ggtt_need_fixes: VF GGTT needs fixes */
> bool ggtt_need_fixes;
> + /**
> + * @resfix_marker: Marker sent to Guc prior to starting the
> + * post‑migration.
> + */
> + u16 resfix_marker;
> };
>
> /**
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> index 39c829daa97c..10d6e43fffce 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> @@ -55,7 +55,21 @@
> * When the VF driver is ready to continue operation on the newly connected
> * hardware, it sends `VF2GUC_NOTIFY_RESFIX_DONE` which causes it to
> * enter the long awaited `VF_RUNNING` state, and therefore start handling
> - * CTB messages and scheduling workloads from the VF::
> + * CTB messages and scheduling workloads from the VF.
> + *
> + * In scenarios involving double migration, the VF KMD may encounter situations
> + * where it is instructed to re-migrate before having the opportunity to send
> + * RESFIX_DONE for the initial migration. This can occur when the fix-up for the
> + * prior migration is still underway, but the VF KMD is migrated again.
> + * Consequently, this may lead to the possibility of sending two migration
> + * notifications (i.e., pending fix-up for the first migration and a second
> + * notification for the new migration). Upon receiving the first RES_FIX
> + * notification, the GuC will resume VF submission on the GPU, potentially
> + * resulting in undefined behavior, such as system hangs or crashes.
> + *
> + * To avoid these hangs, a new VF2GUC action `VF2GUC_NOTIFY_RESFIX_START` is
> + * sent along with marker and when GUC receives the same marker with
> + * `VF2GUC_NOTIFY_RESFIX_DONE`action, it starts scheduling work loads from VF::
> *
> * PF GuC VF
> * [ ] | |
> @@ -102,6 +116,11 @@
> * | [ ] new VF provisioning [ ]
> * | [ ]---------------------------> [ ]
> * | | [ ]
> + * | | VF2GUC_NOTIFY_RESFIX_START [ ]
> + * | [ ] <---------------------------[ ]
> + * | [ ] [ ]
> + * | [ ] success [ ]
> + * | [ ]---------------------------> [ ]
you may also show below the flow when GuC rejects RESFIX_DONE due to a marker mismatch
> * | | VF driver applies post [ ]
> * | | migration fixups -------[ ]
> * | | | [ ]
> @@ -169,6 +188,26 @@ static void vf_migration_init_early(struct xe_device *xe)
>
> }
>
> +static void vf_resfix_start_marker_init_early(struct xe_device *xe)
> +{
> + struct xe_gt *gt = xe_root_mmio_gt(xe);
> + struct xe_uc_fw_version guc_version;
> +
> + if (xe->sriov.vf.migration.disabled)
> + return;
> +
> + xe_gt_sriov_vf_guc_versions(gt, NULL, &guc_version);
as CI already noticed, this could be too early to check ABI
> + if (MAKE_GUC_VER_STRUCT(guc_version) < MAKE_GUC_VER(1, 24, 10)) {
> + xe_sriov_notice(xe,
> + "Resfix start marker requires GUC ABI >= 1.24.10, but only %u.%u.%u found",
> + guc_version.major, guc_version.minor, guc_version.patch);
hmm, are you sure about these versions ? I can't find it in 1.24.11
> + return;
> + }
> +
> + xe->sriov.vf.migration.resfix_marker_enabled = true;
> + xe_sriov_dbg(xe, "migrate: Resfix start marker support is enabled\n");
> +}
> +
> /**
> * xe_sriov_vf_init_early - Initialize SR-IOV VF specific data.
> * @xe: the &xe_device to initialize
> @@ -176,6 +215,7 @@ static void vf_migration_init_early(struct xe_device *xe)
> void xe_sriov_vf_init_early(struct xe_device *xe)
> {
> vf_migration_init_early(xe);
> + vf_resfix_start_marker_init_early(xe);
> }
>
> /**
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> index d5f72d667817..626c11a6dd1b 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf_types.h
> @@ -38,6 +38,11 @@ struct xe_device_vf {
> * was turned off due to missing prerequisites
> */
> bool disabled;
> + /**
> + * @migration.resfix_marker_enabled: flag indicating if resfix marker
> + * support was enabled or not due to missing prerequisites.
> + */
> + bool resfix_marker_enabled;
> } migration;
>
> /** @ccs: VF CCS state data */
next prev parent reply other threads:[~2025-10-23 23:33 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-23 15:36 [PATCH v2 0/2] VF double migration Satyanarayana K V P
2025-10-23 15:36 ` [PATCH v2 1/2] drm/xe/vf: Introduce RESFIX start marker support Satyanarayana K V P
2025-10-23 20:37 ` Matthew Brost
2025-10-23 20:54 ` Matthew Brost
2025-10-31 20:10 ` Matthew Brost
2025-10-23 23:33 ` Michal Wajdeczko [this message]
2025-10-23 15:36 ` [PATCH v2 2/2] drm/xe/vf: Use fault injection for testing VF double migration feature Satyanarayana K V P
2025-10-24 11:52 ` Michal Wajdeczko
2025-10-23 17:18 ` ✓ CI.KUnit: success for VF double migration (rev2) Patchwork
2025-10-23 18:17 ` ✗ Xe.CI.BAT: failure " Patchwork
2025-10-24 5:23 ` ✗ Xe.CI.Full: " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cd44e3b5-cd3d-4297-a1a3-f417163e8562@intel.com \
--to=michal.wajdeczko@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=matthew.brost@intel.com \
--cc=satyanarayana.k.v.p@intel.com \
--cc=tomasz.lis@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox