On 30-Nov-25 1:31 AM, Michal Wajdeczko wrote: > > On 11/28/2025 2:30 PM, Satyanarayana K V P wrote: >> In scenarios involving double migration, the VF KMD may encounter >> situations where it is instructed to re-migrate before having the >> opportunity to send RESFIX_DONE for the initial migration. This can occur >> when the fix-up for the prior migration is still underway, but the VF KMD >> is migrated again. >> >> Consequently, this may lead to the possibility of sending two migration >> notifications (i.e., pending fix-up for the first migration and a second >> notification for the new migration). Upon receiving the first RES_FIX >> notification, the GuC will resume VF submission on the GPU, potentially >> resulting in undefined behavior, such as system hangs or crashes. >> >> To avoid this, post migration, a marker is sent to the GUC prior to the >> start of resource fixups to indicate start of resource fixups. The same >> marker is sent along with RESFIX_DONE notification so that GUC can avoid >> submitting jobs to HW in case of double migration. >> >> Signed-off-by: Satyanarayana K V P >> Cc: Michal Wajdeczko >> Cc: Matthew Brost >> Cc: Tomasz Lis >> >> --- >> V6 -> V7: >> - Fixed review comments (Michal W). >> - Made resfix_start marker width to u8. >> - Removed XE_GUC_RESPONSE_VF_MIGRATED handling in xe_guc_mmio_send_recv() >> function and moved to seperate patch. >> >> V5 -> V6: >> - Fixed review comments (Michal W). >> - Updated resfix_done and res_fix_start function names. >> - Handled XE_GUC_RESPONSE_VF_MIGRATED error case received from GuC. >> - Remove skip_resfix error when another migration is in queue. >> >> V4 -> V5: >> - Fixed review comments (Michal W). >> - Fixed minor debug log levels and documentation part. >> - Moved complete marker logic to vf_post_migration_resfix_start_marker() >> >> V3 -> V4: >> - Updated RESFIX_DONE action name and documenation part. (Michal W) >> - Enable resfxi_start marked by default as sav/restore is gated on >> Guc version 70.54.0 >> >> V2 -> V3: >> - Fixed review comments (Michal W). >> - Updated commit message. >> - Fixed CI.BAT issues. >> - Added helper function to assert on unsupported GUC versions. >> - Updated RESFIX_DONE action name and documenation part. >> >> V1 -> V2: >> - Squashed "Enable RESFIX start marker only on supported GUC >> versions" commit into a single commit. (Matt B) >> --- >> .../gpu/drm/xe/abi/guc_actions_sriov_abi.h | 67 +++++++++++-- >> drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 94 ++++++++++++------- >> drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h | 5 + >> drivers/gpu/drm/xe/xe_sriov_vf.c | 64 ++++++++++++- >> 4 files changed, 184 insertions(+), 46 deletions(-) >> >> diff --git a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h >> index 0b28659d94e9..d9f21202e1a9 100644 >> --- a/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h >> +++ b/drivers/gpu/drm/xe/abi/guc_actions_sriov_abi.h >> @@ -502,13 +502,17 @@ >> #define VF2GUC_VF_RESET_RESPONSE_MSG_0_MBZ GUC_HXG_RESPONSE_MSG_0_DATA0 >> >> /** >> - * DOC: VF2GUC_NOTIFY_RESFIX_DONE >> + * DOC: VF2GUC_RESFIX_DONE >> * >> - * This action is used by VF to notify the GuC that the VF KMD has completed >> - * post-migration recovery steps. >> + * This action is used by VF to inform the GuC that the VF KMD has completed >> + * post-migration recovery steps. From GuC VF compatibility 1.27.0 onwards, it >> + * shall only be sent after posting RESFIX_START and that both @MARKER fields >> + * must match. >> * >> * This message must be sent as `MMIO HXG Message`_. >> * >> + * Updated since GuC VF compatibility 1.27.0. >> + * >> * +---+-------+--------------------------------------------------------------+ >> * | | Bits | Description | >> * +===+=======+==============================================================+ >> @@ -516,9 +520,11 @@ >> * | +-------+--------------------------------------------------------------+ >> * | | 30:28 | TYPE = GUC_HXG_TYPE_REQUEST_ | >> * | +-------+--------------------------------------------------------------+ >> - * | | 27:16 | DATA0 = MBZ | >> + * | | 27:16 | DATA0 = MARKER = MBZ (only prior 1.27.0) | >> * | +-------+--------------------------------------------------------------+ >> - * | | 15:0 | ACTION = _`GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE` = 0x5508 | >> + * | | 27:16 | DATA0 = MARKER - can't be zero (1.27.0+) | >> + * | +-------+--------------------------------------------------------------+ >> + * | | 15:0 | ACTION = _`GUC_ACTION_VF2GUC_RESFIX_DONE` = 0x5508 | >> * +---+-------+--------------------------------------------------------------+ >> * >> * +---+-------+--------------------------------------------------------------+ >> @@ -531,13 +537,13 @@ >> * | | 27:0 | DATA0 = MBZ | >> * +---+-------+--------------------------------------------------------------+ >> */ >> -#define GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE 0x5508u >> +#define GUC_ACTION_VF2GUC_RESFIX_DONE 0x5508u >> >> -#define VF2GUC_NOTIFY_RESFIX_DONE_REQUEST_MSG_LEN GUC_HXG_REQUEST_MSG_MIN_LEN >> -#define VF2GUC_NOTIFY_RESFIX_DONE_REQUEST_MSG_0_MBZ GUC_HXG_REQUEST_MSG_0_DATA0 >> +#define VF2GUC_RESFIX_DONE_REQUEST_MSG_LEN GUC_HXG_REQUEST_MSG_MIN_LEN >> +#define VF2GUC_RESFIX_DONE_REQUEST_MSG_0_MARKER GUC_HXG_REQUEST_MSG_0_DATA0 >> >> -#define VF2GUC_NOTIFY_RESFIX_DONE_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN >> -#define VF2GUC_NOTIFY_RESFIX_DONE_RESPONSE_MSG_0_MBZ GUC_HXG_RESPONSE_MSG_0_DATA0 >> +#define VF2GUC_RESFIX_DONE_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN >> +#define VF2GUC_RESFIX_DONE_RESPONSE_MSG_0_MBZ GUC_HXG_RESPONSE_MSG_0_DATA0 >> >> /** >> * DOC: VF2GUC_QUERY_SINGLE_KLV >> @@ -656,4 +662,45 @@ >> #define PF2GUC_SAVE_RESTORE_VF_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN >> #define PF2GUC_SAVE_RESTORE_VF_RESPONSE_MSG_0_USED GUC_HXG_RESPONSE_MSG_0_DATA0 >> >> +/** >> + * DOC: VF2GUC_RESFIX_START >> + * >> + * This action is used by VF to inform the GuC that the VF KMD will be starting >> + * post-migration recovery fixups. The @MARKER sent with this action must match >> + * with the MARKER posted in the VF2GUC_RESFIX_DONE message. >> + * >> + * This message must be sent as `MMIO HXG Message`_. >> + * >> + * Available since GuC VF compatibility 1.27.0. >> + * >> + * +---+-------+--------------------------------------------------------------+ >> + * | | Bits | Description | >> + * +===+=======+==============================================================+ >> + * | 0 | 31 | ORIGIN = GUC_HXG_ORIGIN_HOST_ | >> + * | +-------+--------------------------------------------------------------+ >> + * | | 30:28 | TYPE = GUC_HXG_TYPE_REQUEST_ | >> + * | +-------+--------------------------------------------------------------+ >> + * | | 27:16 | DATA0 = MARKER - can't be zero | >> + * | +-------+--------------------------------------------------------------+ >> + * | | 15:0 | ACTION = _`GUC_ACTION_VF2GUC_RESFIX_START` = 0x550F | >> + * +---+-------+--------------------------------------------------------------+ >> + * >> + * +---+-------+--------------------------------------------------------------+ >> + * | | Bits | Description | >> + * +===+=======+==============================================================+ >> + * | 0 | 31 | ORIGIN = GUC_HXG_ORIGIN_GUC_ | >> + * | +-------+--------------------------------------------------------------+ >> + * | | 30:28 | TYPE = GUC_HXG_TYPE_RESPONSE_SUCCESS_ | >> + * | +-------+--------------------------------------------------------------+ >> + * | | 27:0 | DATA0 = MBZ | >> + * +---+-------+--------------------------------------------------------------+ >> + */ >> +#define GUC_ACTION_VF2GUC_RESFIX_START 0x550Fu >> + >> +#define VF2GUC_RESFIX_START_REQUEST_MSG_LEN GUC_HXG_REQUEST_MSG_MIN_LEN >> +#define VF2GUC_RESFIX_START_REQUEST_MSG_0_MARKER GUC_HXG_REQUEST_MSG_0_DATA0 >> + >> +#define VF2GUC_RESFIX_START_RESPONSE_MSG_LEN GUC_HXG_RESPONSE_MSG_MIN_LEN >> +#define VF2GUC_RESFIX_START_RESPONSE_MSG_0_MBZ GUC_HXG_RESPONSE_MSG_0_DATA0 >> + >> #endif >> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c >> index 97c29c55f885..fd7dd4a4739d 100644 >> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c >> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c >> @@ -299,12 +299,13 @@ void xe_gt_sriov_vf_guc_versions(struct xe_gt *gt, >> *found = gt->sriov.vf.guc_version; >> } >> >> -static int guc_action_vf_notify_resfix_done(struct xe_guc *guc) >> +static int guc_action_vf_resfix_start(struct xe_guc *guc, u16 marker) >> { >> u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = { >> FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) | >> FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) | >> - FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION, GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE), >> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION, GUC_ACTION_VF2GUC_RESFIX_START) | >> + FIELD_PREP(VF2GUC_RESFIX_START_REQUEST_MSG_0_MARKER, marker), >> }; >> int ret; >> >> @@ -313,28 +314,41 @@ static int guc_action_vf_notify_resfix_done(struct xe_guc *guc) >> return ret > 0 ? -EPROTO : ret; >> } >> >> -/** >> - * vf_notify_resfix_done - Notify GuC about resource fixups apply completed. >> - * @gt: the &xe_gt struct instance linked to target GuC >> - * >> - * Returns: 0 if the operation completed successfully, or a negative error >> - * code otherwise. >> - */ >> -static int vf_notify_resfix_done(struct xe_gt *gt) >> +static int vf_resfix_start(struct xe_gt *gt, u16 marker) >> { >> struct xe_guc *guc = >->uc.guc; >> - int err; >> >> xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt))); >> >> - err = guc_action_vf_notify_resfix_done(guc); >> - if (unlikely(err)) >> - xe_gt_sriov_err(gt, "Failed to notify GuC about resource fixup done (%pe)\n", >> - ERR_PTR(err)); >> - else >> - xe_gt_sriov_dbg_verbose(gt, "sent GuC resource fixup done\n"); >> + xe_gt_sriov_dbg_verbose(gt, "Sending resfix start marker %u\n", marker); >> >> - return err; >> + return guc_action_vf_resfix_start(guc, marker); >> +} >> + >> +static int guc_action_vf_resfix_done(struct xe_guc *guc, u16 marker) >> +{ >> + u32 request[GUC_HXG_REQUEST_MSG_MIN_LEN] = { >> + FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_HOST) | >> + FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_REQUEST) | >> + FIELD_PREP(GUC_HXG_REQUEST_MSG_0_ACTION, GUC_ACTION_VF2GUC_RESFIX_DONE) | >> + FIELD_PREP(VF2GUC_RESFIX_DONE_REQUEST_MSG_0_MARKER, marker), >> + }; >> + int ret; >> + >> + ret = xe_guc_mmio_send(guc, request, ARRAY_SIZE(request)); >> + >> + return ret > 0 ? -EPROTO : ret; >> +} >> + >> +static int vf_resfix_done(struct xe_gt *gt, u16 marker) >> +{ >> + struct xe_guc *guc = >->uc.guc; >> + >> + xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt))); >> + >> + xe_gt_sriov_dbg_verbose(gt, "Sending resfix done marker %u\n", marker); >> + >> + return guc_action_vf_resfix_done(guc, marker); >> } >> >> static int guc_action_query_single_klv(struct xe_guc *guc, u32 key, >> @@ -1183,22 +1197,15 @@ static void vf_post_migration_abort(struct xe_gt *gt) >> xe_guc_submit_pause_abort(>->uc.guc); >> } >> >> -static int vf_post_migration_notify_resfix_done(struct xe_gt *gt) >> +static int vf_post_migration_resfix_done(struct xe_gt *gt, u16 marker) >> { >> - bool skip_resfix = false; >> - >> spin_lock_irq(>->sriov.vf.migration.lock); >> - if (gt->sriov.vf.migration.recovery_queued) { >> - skip_resfix = true; >> - xe_gt_sriov_dbg(gt, "another recovery imminent, resfix skipped\n"); >> - } else { >> + if (gt->sriov.vf.migration.recovery_queued) >> + xe_gt_sriov_dbg(gt, "another recovery imminent\n"); > with this new flow, which includes sending both RESFIX_START/DONE messages, > do we still need to track 'recovery_queued' flag separately and print info > about the 'imminent' recovery? Yes. It is still needed.  GT1 fixups must occur only after GT0 fixups are complete. If GT0 is in |recovery_inprogress|, GT1 fixups are blocked. We use |recovery_queued| to check whether any  additional fixups are still pending for GT0 or not. >> + else >> WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, false); >> - } >> spin_unlock_irq(>->sriov.vf.migration.lock); >> >> - if (skip_resfix) >> - return -EAGAIN; >> - >> /* >> * Make sure interrupts on the new HW are properly set. The GuC IRQ >> * must be working at this point, since the recovery did started, >> @@ -1206,14 +1213,26 @@ static int vf_post_migration_notify_resfix_done(struct xe_gt *gt) >> */ >> xe_irq_resume(gt_to_xe(gt)); > hmm, shouldn't this IRQ re-enabling be part of the kickstart() step called later? > then we will keep them off in case of failing at sending RESFIX_DONE Moved to vf_post_migration_rearm() before we enable restart CTB. >> >> - return vf_notify_resfix_done(gt); >> + return vf_resfix_done(gt, marker); >> +} >> + >> +static u16 vf_post_migration_resfix_start_marker(struct xe_gt *gt) > nit: this is not just a 'start' marker, nor fixed value, so maybe: > > vf_post_migration_next_resfix_marker() ? Fixed in new revision. > >> +{ >> + xe_gt_assert(gt, IS_SRIOV_VF(gt_to_xe(gt))); >> + >> + BUILD_BUG_ON(1 + ((typeof(gt->sriov.vf.migration.resfix_marker))~0) > >> + FIELD_MAX(VF2GUC_RESFIX_START_REQUEST_MSG_0_MARKER)); > is it correctly aligned? Fixed in revision. >> + >> + /* add 1 to avoid zero-marker */ >> + return 1 + gt->sriov.vf.migration.resfix_marker++; >> } >> >> static void vf_post_migration_recovery(struct xe_gt *gt) >> { >> struct xe_device *xe = gt_to_xe(gt); >> - int err; >> + u16 marker; >> bool retry; >> + int err; >> >> xe_gt_sriov_dbg(gt, "migration recovery in progress\n"); >> >> @@ -1227,14 +1246,23 @@ static void vf_post_migration_recovery(struct xe_gt *gt) >> goto fail; >> } >> >> + marker = vf_post_migration_resfix_start_marker(gt); >> + >> + err = vf_resfix_start(gt, marker); > all private helpers called here have vf_post_migration prefix except this one > > so maybe this step should be called vf_post_migration_resfix_start() instead > where you can call lower level helpers if needed Fixed in revision. > >> + if (unlikely(err)) { >> + xe_gt_sriov_err(gt, "Recovery failed at GuC RESFIX_START step (%pe)\n", >> + ERR_PTR(err)); >> + goto fail; >> + } >> + >> err = vf_post_migration_fixups(gt); >> if (err) >> goto fail; >> >> vf_post_migration_rearm(gt); >> >> - err = vf_post_migration_notify_resfix_done(gt); >> - if (err && err != -EAGAIN) >> + err = vf_post_migration_resfix_done(gt, marker); >> + if (err) >> goto fail; >> >> vf_post_migration_kickstart(gt); >> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h >> index 420b0e6089de..db2f8b3ed3e9 100644 >> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h >> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf_types.h >> @@ -52,6 +52,11 @@ struct xe_gt_sriov_vf_migration { >> wait_queue_head_t wq; >> /** @scratch: Scratch memory for VF recovery */ >> void *scratch; >> + /** >> + * @resfix_marker: Marker sent on start and on end of post-migration >> + * steps. >> + */ >> + u8 resfix_marker; >> /** @recovery_teardown: VF post migration recovery is being torn down */ >> bool recovery_teardown; >> /** @recovery_queued: VF post migration recovery in queued */ >> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c >> index d56b8cfea50b..1827d77852a4 100644 >> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c >> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c >> @@ -49,11 +49,13 @@ >> * >> * As soon as Virtual GPU of the VM starts, the VF driver within receives >> * the MIGRATED interrupt and schedules post-migration recovery worker. >> - * That worker queries GuC for new provisioning (using MMIO communication), >> + * That worker sends `VF2GUC_RESFIX_START` action along with non-zero >> + * marker, queries GuC for new provisioning (using MMIO communication), >> * and applies fixups to any non-virtualized resources used by the VF. >> * >> * When the VF driver is ready to continue operation on the newly connected >> - * hardware, it sends `VF2GUC_NOTIFY_RESFIX_DONE` which causes it to >> + * hardware, it sends `VF2GUC_RESFIX_DONE` action along with the same >> + * marker which was sent with `VF2GUC_RESFIX_START` which causes it to >> * enter the long awaited `VF_RUNNING` state, and therefore start handling >> * CTB messages and scheduling workloads from the VF:: >> * >> @@ -102,12 +104,17 @@ >> * | [ ] new VF provisioning [ ] >> * | [ ]---------------------------> [ ] >> * | | [ ] >> + * | | VF2GUC_RESFIX_START [ ] >> + * | [ ] <---------------------------[ ] >> + * | [ ] [ ] >> + * | [ ] success [ ] >> + * | [ ]---------------------------> [ ] >> * | | VF driver applies post [ ] >> * | | migration fixups -------[ ] >> * | | | [ ] >> * | | -----> [ ] >> * | | [ ] >> - * | | VF2GUC_NOTIFY_RESFIX_DONE [ ] >> + * | | VF2GUC_RESFIX_DONE [ ] >> * | [ ] <---------------------------[ ] >> * | [ ] [ ] >> * | [ ] GuC sets new VF state to [ ] >> @@ -118,6 +125,57 @@ >> * | [ ]---------------------------> [ ] >> * | | | >> * | | | >> + * >> + * Handling of VF double migration flow is shown below:: >> + * >> + * GuC1 VF >> + * | | >> + * | [ ]<--- start fixups >> + * | VF2GUC_RESFIX_START(marker) [ ] >> + * [ ] <-------------------------------------------[ ] >> + * [ ] [ ] >> + * [ ]---\ [ ] >> + * [ ] store marker [ ] >> + * [ ]<--/ [ ] >> + * [ ] [ ] >> + * [ ] success [ ] >> + * [ ] ------------------------------------------> [ ] >> + * | [ ] >> + * | [ ]---\ >> + * | [ ] do fixups >> + * | [ ]<--/ >> + * | [ ] >> + * : : >> + * -------------- VF paused / saved ---------------- > from here > >> + * | | > (and lifeline for GuC1 shall end here) > >> + * >> + * GuC2 >> + * | >> + * : : >> + * ----------------- VF restored ------------------ >> + * | | >> + * [ ] | >> + * [ ]---\ | >> + * [ ] reset marker | >> + * [ ]<--/ | >> + * [ ] | >> + * ----------------- VF resumed ------------------ > up to here, there should be no lifeline for the VF Fixed in revision. -Satya. > >> + * | [ ] >> + * | [ ] >> + * | VF2GUC_RESFIX_DONE(marker) [ ] >> + * [ ] <-------------------------------------------[ ] >> + * [ ] [ ] >> + * [ ]---\ [ ] >> + * [ ] check marker [ ] >> + * [ ] (mismatch) [ ] >> + * [ ]<--/ [ ] >> + * [ ] [ ] >> + * [ ] RESPONSE_VF_MIGRATED [ ] >> + * [ ] ------------------------------------------> [ ] >> + * | [ ]---\ >> + * | [ ] reschedule fixups >> + * | [ ]<--/ >> + * | | >> */ >> >> /**