From: "Summers, Stuart" <stuart.summers@intel.com>
To: "Brost, Matthew" <matthew.brost@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH 7/7] drm/xe: Check for GuC responses on disabling scheduling
Date: Fri, 3 Oct 2025 19:42:26 +0000 [thread overview]
Message-ID: <0e4a67db333a000b722e114ec52b8c8000db32e2.camel@intel.com> (raw)
In-Reply-To: <aOAmTW76SjtPpS73@lstrano-desk.jf.intel.com>
On Fri, 2025-10-03 at 12:38 -0700, Matthew Brost wrote:
> On Fri, Oct 03, 2025 at 12:58:37PM -0600, Summers, Stuart wrote:
> > On Fri, 2025-10-03 at 11:54 -0700, Matthew Brost wrote:
> > > On Thu, Oct 02, 2025 at 11:04:44PM +0000, Stuart Summers wrote:
> > > > In the event the GuC becomes unresponsive during a scheduling
> > > > disable event, we still want the driver to be able to recover.
> > > > This patch follows the same methodology we already have in place
> > > > for TLB invalidation requests, where we send a request to GuC
> > > > and wait for that invalidation done response. If the response
> > > > doesn't come back in time we then at least print a message
> > > > indicating the invalidation failed for some reason.
> > > >
> > > > In this case, we send the schedule disable and the expectation
> > > > is that GuC will respond with a schedule done response. The KMD
> > > > then catches that response and in turn sends a context
> > > > deregistration response. So in the event GuC becomes
> > > > unresponsive after we send the schedule disable, we actually
> > > > have two g2h responses that have been reserved but never
> > > > received.
> > > >
> > > > To handle this, make sure the pending disable event in the
> > > > exec queue gets cleared (i.e. we received that response from
> > > > GuC). If it doesn't in a reasonable amount of time, assume
> > > > GuC is dead: ban the exec queue, queue up a GT reset, and
> > > > manually call the schedule done handler. Then in the schedule
> > > > done handler, in turn, check whether the context had been
> > > > banned. If so, manually call the deregistration done handler
> > > > to ensure all resources related to that exec queue get
> > > > cleaned up properly. Without this, if the device becomes
> > > > wedged after an exec queue has been created, the attached
> > > > resources like the LRC will not get freed properly, resulting
> > > > in a memory leak.
> > > >
> > > > Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> > > > ---
> > > > drivers/gpu/drm/xe/xe_guc_submit.c | 23 ++++++++++++++++++++++-
> > > > 1 file changed, 22 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > index 45b72bebfc63..a177d87c8524 100644
> > > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > @@ -939,6 +939,9 @@ int xe_guc_read_stopped(struct xe_guc *guc)
> > > >  		GUC_CONTEXT_##enable_disable,		\
> > > >  }
> > > >
> > > > +static void handle_sched_done(struct xe_guc *guc, struct xe_exec_queue *q,
> > > > +			      u32 runnable_state);
> > > > +
> > > >  static void disable_scheduling_deregister(struct xe_guc *guc,
> > > >  					  struct xe_exec_queue *q)
> > > >  {
> > > > @@ -974,6 +977,17 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
> > > >  	xe_guc_ct_send(&guc->ct, action, ARRAY_SIZE(action),
> > > >  		       G2H_LEN_DW_SCHED_CONTEXT_MODE_SET +
> > > >  		       G2H_LEN_DW_DEREGISTER_CONTEXT, 2);
> > > > +
> > > > +	ret = wait_event_timeout(guc->ct.wq,
> > > > +				 !exec_queue_pending_disable(q) ||
> > > > +				 xe_guc_read_stopped(guc),
> > > > +				 HZ * 5);
> > >
> > > This doesn't look right. Deregister is designed to be fully async.
> > > If this flow stops working for whatever reason, the GuC is dead and
> > > eventually something in the driver will detect this and trigger a
> > > GT reset, which will clean up all lost H2G.
> > >
> > > > +	if (!ret || xe_guc_read_stopped(guc)) {
> > > > +		xe_gt_warn(guc_to_gt(guc), "Schedule disable failed to respond");
> > > > +		set_exec_queue_banned(q);
> > > > +		handle_sched_done(guc, q, 0);
> > > > +		xe_gt_reset_async(q->gt);
> > > > +	}
> > > >  }
> > > >
> > > >  static void xe_guc_exec_queue_trigger_cleanup(struct xe_exec_queue *q)
> > > > @@ -2117,6 +2131,8 @@ g2h_exec_queue_lookup(struct xe_guc *guc, u32 guc_id)
> > > >  	return q;
> > > >  }
> > > >
> > > > +static void handle_deregister_done(struct xe_guc *guc, struct xe_exec_queue *q);
> > > > +
> > > >  static void deregister_exec_queue(struct xe_guc *guc, struct xe_exec_queue *q)
> > > >  {
> > > >  	u32 action[] = {
> > > > @@ -2131,7 +2147,12 @@ static void deregister_exec_queue(struct xe_guc *guc, struct xe_exec_queue *q)
> > > >
> > > >  	trace_xe_exec_queue_deregister(q);
> > > >
> > > > -	xe_guc_ct_send_g2h_handler(&guc->ct, action, ARRAY_SIZE(action));
> > > > +	if (exec_queue_banned(q)) {
> > > > +		handle_deregister_done(guc, q);
> > >
> > > This would leave the GuC with a reference to the guc_id, and
> > > subsequent reuse of the guc_id (i.e., the next register) will fail.
> >
> > But again, in this case the GuC is dead and we should be getting
> > that reset event you had mentioned above. The issue I'm having is
>
> Banned is a per-queue thing and more than likely we won't be doing a
> GT reset, thus we still need to remove references to the queue from
> the GuC.
Yeah this makes sense to me. My use of "banned" here was probably not
ideal.
>
> > specifically around wedge events. Without a GT wedge, we will
> > normally go through the GT reset flow and recover like you
> > mentioned. But in the
>
> No. See above.
>
> > case of a wedge, we don't redo the software part of the reset
> > (i.e. we don't reset contexts, etc) per gt_reset():
> > static int gt_reset(struct xe_gt *gt)
> > {
> > 	unsigned int fw_ref;
> > 	int err;
> >
> > 	if (xe_device_wedged(gt_to_xe(gt)))
> > 		return -ECANCELED;
> >
> > Maybe instead of banned I can check for banned and wedged here? Or
> > maybe we should rethink the software reset flow in the event of a
> > wedge?
>
> The idea with wedged is that we leave all hardware state, including
> the GuC, intact for inspection. So I think an xe_device_wedged check
> here makes sense. This would cover the case where we start a queue
> teardown via a CLEANUP message and mid-flow we wedge the device.
But hardware state doesn't mean software state. Are you saying when the
device is wedged we want the memory to all be intact as well? And how
do we determine when that gets freed? On unbind?
Thanks,
Stuart
>
> Matt
>
> >
> > Thanks,
> > Stuart
> >
> > >
> > > Matt
> > >
> > > > +	} else {
> > > > +		xe_guc_ct_send_g2h_handler(&guc->ct, action,
> > > > +					   ARRAY_SIZE(action));
> > > > +	}
> > > >  }
> > > >
> > > > static void handle_sched_done(struct xe_guc *guc, struct xe_exec_queue *q,
> > > > --
> > > > 2.34.1
> > > >
> >
Thread overview: 28+ messages
2025-10-02 23:04 [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-02 23:04 ` [PATCH 1/7] drm/xe: Add additional trace points for LRCs Stuart Summers
2025-10-02 23:04 ` [PATCH 2/7] drm/xe: Add a trace point for VM close Stuart Summers
2025-10-02 23:04 ` [PATCH 3/7] drm/xe: Add the BO pointer info to the BO trace Stuart Summers
2025-10-02 23:04 ` [PATCH 4/7] drm/xe: Add new exec queue trace points Stuart Summers
2025-10-02 23:04 ` [PATCH 5/7] drm/xe: Handle missing migration VM on VM creation Stuart Summers
2025-10-02 23:34 ` Lin, Shuicheng
2025-10-03 6:56 ` Matthew Brost
2025-10-03 14:33 ` Summers, Stuart
2025-10-02 23:04 ` [PATCH 6/7] drm/xe: Don't send a CLEANUP message on sched pause Stuart Summers
2025-10-03 18:50 ` Matthew Brost
2025-10-03 18:53 ` Summers, Stuart
2025-10-02 23:04 ` [PATCH 7/7] drm/xe: Check for GuC responses on disabling scheduling Stuart Summers
2025-10-03 18:54 ` Matthew Brost
2025-10-03 18:58 ` Summers, Stuart
2025-10-03 19:38 ` Matthew Brost
2025-10-03 19:42 ` Summers, Stuart [this message]
2025-10-03 19:49 ` Matthew Brost
2025-10-03 19:53 ` Summers, Stuart
2025-10-02 23:11 ` ✗ CI.checkpatch: warning for Fix a couple of wedge corner-case memory leaks Patchwork
2025-10-02 23:12 ` ✓ CI.KUnit: success " Patchwork
2025-10-02 23:58 ` ✗ Xe.CI.BAT: failure " Patchwork
2025-10-03 2:16 ` ✗ Xe.CI.Full: " Patchwork
2025-10-03 14:38 ` Summers, Stuart
-- strict thread matches above, loose matches on Subject: below --
2025-10-13 16:24 [PATCH 0/7] " Stuart Summers
2025-10-13 16:25 ` [PATCH 7/7] drm/xe: Check for GuC responses on disabling scheduling Stuart Summers
2025-10-13 22:31 [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-13 22:31 ` [PATCH 7/7] drm/xe: Check for GuC responses on disabling scheduling Stuart Summers
2025-10-14 2:09 ` Matthew Brost
2025-10-14 3:10 ` Summers, Stuart