From: "Lis, Tomasz" <tomasz.lis@intel.com>
To: "Michał Winiarski" <michal.winiarski@intel.com>
Cc: intel-xe@lists.freedesktop.org,
"Michał Wajdeczko" <michal.wajdeczko@intel.com>,
"Piotr Piórkowski" <piotr.piorkowski@intel.com>,
"Matthew Brost" <matthew.brost@intel.com>,
"Lucas De Marchi" <lucas.demarchi@intel.com>
Subject: Re: [PATCH v3 4/7] drm/xe: Block reset while recovering from VF migration
Date: Tue, 3 Jun 2025 22:23:49 +0200
Message-ID: <89380a6c-74eb-475b-a856-074f191c1622@intel.com>
In-Reply-To: <75aq2gastxnu4l5c3izbvniqzpwyfk45ux5oedfrwg3ir4kh3h@fw5zpvvek36d>
On 28.05.2025 22:02, Michał Winiarski wrote:
> On Tue, May 20, 2025 at 01:19:22AM +0200, Tomasz Lis wrote:
>> Resetting GuC during recovery could interfere with the recovery
>> process. Such a reset might also be triggered without justification,
>> due to the migration taking time, rather than due to the workload
>> not progressing.
>>
>> Doing GuC reset during the recovery would cause exit of RESFIX state,
>> and therefore continuation of GuC work while fixups are still being
>> applied. To avoid that, reset needs to be blocked during the recovery.
>>
>> This patch blocks the reset during recovery. Reset requests in that
>> time range will be dropped.
>>
>> In case a reset procedure already started while the recovery is
>> triggered, there isn't much we can do - we cannot wait for it to
>> finish as it involves waiting for hardware, and we can't be sure
>> at which exact point of the reset procedure the GPU got switched.
>> Therefore, the rare cases where migration happens while a reset is
>> in progress are still dangerous. Resets are not part of the
>> standard flow, and they cause unfinished workloads - that will happen
>> during a reset interrupted by migration as well, so it doesn't
>> diverge much from what normally happens during such resets.
>>
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_guc_submit.c | 26 ++++++++++++++++++++++++--
>> drivers/gpu/drm/xe/xe_guc_submit.h | 2 ++
>> drivers/gpu/drm/xe/xe_sriov_vf.c | 12 ++++++++++--
>> 3 files changed, 36 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>> index 6f280333de13..69ccfb2e1cff 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>> @@ -1761,7 +1761,11 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
>> }
>> }
>>
>> -int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>> +/**
>> + * xe_guc_submit_reset_block - Disallow reset calls on given GuC.
>> + * @guc: the &xe_guc struct instance
>> + */
>> +int xe_guc_submit_reset_block(struct xe_guc *guc)
>> {
>> int ret;
>>
>> @@ -1774,6 +1778,24 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>> */
>> ret = atomic_fetch_or(1, &guc->submission_state.stopped);
>> smp_wmb();
>> +
>> + return ret;
>> +}
>> +
>> +/**
>> + * xe_guc_submit_reset_unblock - Allow back reset calls on given GuC.
>> + * @guc: the &xe_guc struct instance
>> + */
>> +void xe_guc_submit_reset_unblock(struct xe_guc *guc)
>> +{
>> + atomic_dec(&guc->submission_state.stopped);
>> +}
>> +
>> +int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>> +{
>> + int ret;
>> +
>> + ret = xe_guc_submit_reset_block(guc);
>> wake_up_all(&guc->ct.wq);
>>
>> return ret;
>> @@ -1849,7 +1871,7 @@ int xe_guc_submit_start(struct xe_guc *guc)
>> xe_gt_assert(guc_to_gt(guc), xe_guc_read_stopped(guc) == 1);
>>
>> mutex_lock(&guc->submission_state.lock);
>> - atomic_dec(&guc->submission_state.stopped);
>> + xe_guc_submit_reset_unblock(guc);
>> xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
>> /* Prevent redundant attempts to start parallel queues */
>> if (q->guc->id != index)
>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
>> index f1cf271492ae..2c2d2936440d 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_submit.h
>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.h
>> @@ -20,6 +20,8 @@ void xe_guc_submit_stop(struct xe_guc *guc);
>> int xe_guc_submit_start(struct xe_guc *guc);
>> void xe_guc_submit_pause(struct xe_guc *guc);
>> void xe_guc_submit_unpause(struct xe_guc *guc);
>> +int xe_guc_submit_reset_block(struct xe_guc *guc);
>> +void xe_guc_submit_reset_unblock(struct xe_guc *guc);
>> void xe_guc_submit_wedge(struct xe_guc *guc);
>>
>> int xe_guc_read_stopped(struct xe_guc *guc);
>> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
>> index fcd82a0fda48..82b3dd57de73 100644
>> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
>> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
>> @@ -150,9 +150,15 @@ static void vf_post_migration_shutdown(struct xe_device *xe)
>> {
>> struct xe_gt *gt;
>> unsigned int id;
>> + int ret = 0;
>>
>> - for_each_gt(gt, xe, id)
>> + for_each_gt(gt, xe, id) {
>> xe_guc_submit_pause(&gt->uc.guc);
>> + ret |= xe_guc_submit_reset_block(&gt->uc.guc);
>> + }
>> +
>> + if (ret)
>> + drm_info(&xe->drm, "migration recovery encountered ongoing reset\n");
> I'd suggest debug level, as it doesn't seem all that useful in general.
> But I guess you want to have a trace that this happened and the handling
> of this scenario is somewhat iffy...
Yes, reset and post-migration recovery interfere, leading to dangerous
corner cases. If they do meet, it's better to know that from the log.
Reset is a rare, exceptional procedure, so we don't expect this message
to be logged often.
>> }
>>
>> /**
>> @@ -170,8 +176,10 @@ static void vf_post_migration_kickstart(struct xe_device *xe)
>> /* make sure interrupts on the new HW are properly set */
>> xe_irq_resume(xe);
>>
>> - for_each_gt(gt, xe, id)
>> + for_each_gt(gt, xe, id) {
>> + xe_guc_submit_reset_unblock(&gt->uc.guc);
> If we were doing migration recovery during reset, we'll change the state
> of guc->submission_state.stopped (btw... is it possible to decrement it
> below 0 now?).
Yes, the decrement should be fixed. I'm not sure why
xe_guc_submit_reset_block() doesn't guard against that by itself; maybe
it was just a simplification due to it being used only during reset.
But the further analysis below makes this irrelevant anyway.
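To illustrate the imbalance (a hypothetical user-space model of
guc->submission_state.stopped, not the actual kernel code, and using C11
argument order): xe_guc_submit_reset_block() ORs in bit 0, so repeated
blocks leave the counter at 1, while each unblock decrements it
unconditionally, which can drive the counter below zero:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space model of guc->submission_state.stopped. */
atomic_int stopped = 0;

/* Models xe_guc_submit_reset_block(): OR in bit 0, return the prior
 * value. Note that a second blocker does not raise the count further. */
int block(void)
{
	return atomic_fetch_or(&stopped, 1);
}

/* Models xe_guc_submit_reset_unblock(): unconditional decrement. */
void unblock(void)
{
	atomic_fetch_sub(&stopped, 1);
}
```

Two block() calls followed by two unblock() calls leave the counter at
-1, so the unblock path needs either a matching increment on the block
side or a guard against going negative.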
> There are "some" places where we have asserts that expect a particular
> state, and some places where we use that state as an event (combined
> with a waitqueue). Should we wake up the waiters at some point? And
> perhaps synchronize with the places which depend on being in a "stopped"
> state (if at all possible)?
After getting an understanding of all the wait places, it looks to me
that my implementation is just wrong.

We cannot re-use xe_guc_submit_reset_block() here, regardless of
whether it is async or not, because doing so terminates all waits for
GuC responses, usually with an error. That is not acceptable: in the
migration case, the related responses from GuC will arrive as soon as
the recovery is done.

Even if we don't wake up the waiters within migration recovery, the
check may still be triggered somewhere else, or a new wait call may
happen and bail out of the wait immediately, leaving us with a detached
G2H response, or maybe even freeing an object while the HW still
accesses it (i.e. a preempted queue may be sent back to the HW by GuC
on RESFIX_DONE before GuC reads all pending G2H commands; the context
becomes active, but on the KMD side we bailed out of waiting for it to
be disabled and freed it).

I will figure out a better way to block the reset, one which doesn't
mess with the existing waits.
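To make that failure mode concrete, here is a hypothetical model (plain
user-space C, not the driver code) of a wait_event()-style predicate:
once the stopped counter is raised, every waiter is released even though
no G2H response has arrived:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state a G2H waiter checks. */
struct wait_state {
	bool g2h_response;	/* set by the GuC reply handler */
	int stopped;		/* analogue of guc->submission_state.stopped */
};

/* Models the wait_event() condition: the waiter is released either by
 * the real response or by the stopped counter - and in the latter case
 * it typically returns an error and tears down state prematurely. */
bool wait_condition(const struct wait_state *w)
{
	return w->g2h_response || w->stopped != 0;
}
```

With g2h_response still false, raising stopped makes the condition true,
which is exactly the premature bail-out described above.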
-Tomasz
> Thanks,
> -Michał
>
>> xe_guc_submit_unpause(&gt->uc.guc);
>> + }
>> }
>>
>> /**
>> --
>> 2.25.1
>>