Re: [PATCH v4] drm/xe: Disable scheduling early on FD close to avoid CAT error cascade

From: Matthew Brost <matthew.brost@intel.com>
To: "Summers, Stuart" <stuart.summers@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v4] drm/xe: Disable scheduling early on FD close to avoid CAT error cascade
Date: Tue, 16 Jun 2026 16:31:37 -0700
Message-ID: <ajHc2aryQQKWChVY@gsse-cloud1.jf.intel.com>
In-Reply-To: <21184f8c769545e6e077e17985f858dfe9b7ea64.camel@intel.com>

On Tue, Jun 16, 2026 at 02:08:00PM -0600, Summers, Stuart wrote:
> On Fri, 2026-06-12 at 18:38 -0700, Matthew Brost wrote:
> > When an FD is closed with many exec queues, teardown relies on the TDR
> > path to clean up scheduling. However, the TDR handling is serialized
> > (i.e., only one exec queue is processed at a time), which can make it
> > too slow compared to GuC scheduling activity.
> > 
> > In this window, GuC may continue to schedule contexts backed by
> > invalid page tables, leading to a cascade of CAT errors and repeated
> > engine resets. This significantly increases recovery time and can
> > degrade system stability.
> > 
> > To mitigate this, eagerly disable scheduling by sending a self-message
> > outside of the TDR path. This prevents further scheduling of invalid
> > contexts and avoids the CAT error/reset cascade.
> > 
> > This change improves robustness and reduces recovery latency in
> > multiple queue teardown scenarios.
> > 
> > Cc: Wang Xin <x.wang@intel.com>
> > Cc: Jia Yao <jia.yao@intel.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Tested-by: Jia Yao <jia.yao@intel.com>
> > 
> > ---
> > v3:
> >  - Make kill message static to make reclaim safe (CI)
> >  - Do not issue kill messages for queues which bypass cleanup messages
> >    (sashiko)
> > v4:
> >  - Add missing message lock
> > ---
> >  drivers/gpu/drm/xe/xe_guc_exec_queue_types.h |  2 +-
> >  drivers/gpu/drm/xe/xe_guc_submit.c           | 44
> > ++++++++++++++++++--
> >  2 files changed, 42 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
> > b/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
> > index e5e53b421f29..247947fd357f 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
> > +++ b/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
> > @@ -31,7 +31,7 @@ struct xe_guc_exec_queue {
> >          * a message needs to sent through the GPU scheduler but
> > memory
> >          * allocations are not allowed.
> >          */
> > -#define MAX_STATIC_MSG_TYPE    3
> > +#define MAX_STATIC_MSG_TYPE    4
> >         struct xe_sched_msg static_msgs[MAX_STATIC_MSG_TYPE];
> >         /** @destroy_async: do final destroy async from this worker
> > */
> >         struct work_struct destroy_async;
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c
> > b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index afe5d99cdd8b..5ec1dca0324c 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -1875,11 +1875,21 @@ static void
> > __guc_exec_queue_process_msg_set_multi_queue_priority(struct xe_sche
> >         kfree(msg);
> >  }
> >  
> > +static void __guc_exec_queue_process_msg_kill(struct xe_sched_msg
> > *msg)
> > +{
> > +       struct xe_exec_queue *q = msg->private_data;
> > +       struct xe_exec_queue *primary =
> > xe_exec_queue_multi_queue_primary(q);
> > +
> > +       if (exec_queue_enabled(primary))
> > +               disable_scheduling(primary, true);
> > +}
> > +
> >  #define CLEANUP                                1       /* Non-zero
> > values to catch uninitialized msg */
> >  #define SET_SCHED_PROPS                        2
> >  #define SUSPEND                                3
> >  #define RESUME                         4
> >  #define SET_MULTI_QUEUE_PRIORITY       5
> > +#define KILL                           6
> >  #define OPCODE_MASK    0xf
> >  #define MSG_LOCKED     BIT(8)
> >  #define MSG_HEAD       BIT(9)
> > @@ -1906,6 +1916,9 @@ static void guc_exec_queue_process_msg(struct
> > xe_sched_msg *msg)
> >         case SET_MULTI_QUEUE_PRIORITY:
> >                 __guc_exec_queue_process_msg_set_multi_queue_priority
> > (msg);
> >                 break;
> > +       case KILL:
> > +               __guc_exec_queue_process_msg_kill(msg);
> > +               break;
> >         default:
> >                 XE_WARN_ON("Unknown message type");
> >         }
> > @@ -2018,11 +2031,39 @@ static int guc_exec_queue_init(struct
> > xe_exec_queue *q)
> >         return err;
> >  }
> >  
> > +static bool guc_exec_queue_try_add_msg(struct xe_exec_queue *q,
> > +                                      struct xe_sched_msg *msg,
> > +                                      u32 opcode);
> > +
> > +#define STATIC_MSG_CLEANUP     0
> > +#define STATIC_MSG_SUSPEND     1
> > +#define STATIC_MSG_RESUME      2
> > +#define STATIC_MSG_KILL                3
> 
> This isn't related to this series... but can you explain these static
> messages a bit? We're adding over and over to the static_msgs linked
> list. I don't see that we're actually doing anything with this after
> adding though, so the list just grows indefinitely? Or maybe I'm
> missing something in the teardown here...
> 

The xe_gpu_scheduler component removes the messages from the list.

The idea behind static messages is that in places where we are in the
path of reclaim (no memory allocations allowed), we use the static
messages embedded in the GuC exec queue object to communicate with
ourselves (i.e., kick the action to the scheduler work queue).
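
As a rough illustration of that pattern, here is a simplified userspace
sketch. It is not the actual xe code: struct msg, struct sched,
struct queue, try_add_msg() and process_one_msg() are made-up stand-ins,
and the "reject if the slot is already pending" behaviour is an
assumption about how a try-add helper would typically work. The point is
only that the producer reuses a preallocated slot instead of allocating,
and the consumer unlinks the message, so the list is bounded by the
number of slots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slot indices, mirroring the STATIC_MSG_* idea in the patch. */
enum { STATIC_MSG_CLEANUP, STATIC_MSG_SUSPEND, STATIC_MSG_RESUME,
       STATIC_MSG_KILL, MAX_STATIC_MSG_TYPE };

struct msg {
	struct msg *next;	/* link in the scheduler's pending list */
	bool queued;		/* true while the message sits on the list */
	int opcode;
};

struct sched {
	struct msg *head, *tail;	/* FIFO of pending messages */
};

struct queue {
	struct sched sched;
	struct msg static_msgs[MAX_STATIC_MSG_TYPE];	/* preallocated slots */
	int handled[MAX_STATIC_MSG_TYPE];	/* processed count per slot */
};

/*
 * Producer side: queue a preallocated slot. Nothing is allocated here,
 * which is what makes this usable where allocations are forbidden.
 * Returns false if the slot is already pending (assumed behaviour).
 */
static bool try_add_msg(struct queue *q, int slot, int opcode)
{
	struct msg *m;

	assert(slot >= 0 && slot < MAX_STATIC_MSG_TYPE);
	m = &q->static_msgs[slot];
	if (m->queued)
		return false;

	m->opcode = opcode;
	m->next = NULL;
	m->queued = true;
	if (q->sched.tail)
		q->sched.tail->next = m;
	else
		q->sched.head = m;
	q->sched.tail = m;
	return true;
}

/*
 * Consumer side: the scheduler pops one message and unlinks it, so the
 * slot becomes reusable and the list never grows past the slot count.
 */
static bool process_one_msg(struct queue *q)
{
	struct msg *m = q->sched.head;

	if (!m)
		return false;

	q->sched.head = m->next;
	if (!q->sched.head)
		q->sched.tail = NULL;
	m->next = NULL;
	m->queued = false;
	q->handled[m - q->static_msgs]++;
	return true;
}
```

In this sketch, adding the same slot twice before the consumer runs is a
no-op on the second call, and once the message has been processed the
slot can be queued again.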

> >  static void guc_exec_queue_kill(struct xe_exec_queue *q)
> >  {
> > +       struct xe_sched_msg *msg = q->guc->static_msgs +
> > STATIC_MSG_KILL;
> > +
> >         trace_xe_exec_queue_kill(q);
> >         set_exec_queue_killed(q);
> >         __suspend_fence_signal(q);
> > +
> > +       /*
> > +        * We eagerly send a message to ourselves to disable scheduling, as the
> > +        * TDR is serialized (i.e., only one exec queue is processed at a time).
> > +        * If an FD is closed with many exec queues, the TDR can be slower than
> > +        * the GuC scheduling contexts with invalid page tables, creating a
> > +        * cascade of CAT errors and engine resets, which is quite slow. Avoid
> > +        * this by immediately disabling scheduling outside of the TDR.
> > +        */
> > +       if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) &&
> > +           kref_read(&q->refcount) && !exec_queue_wedged(q)) {
> 
> Should we check pending disable here too?
> 

No, but this is an old bug I thought I had already fixed. Instead of
calling __suspend_fence_signal we should wait on the suspend fence to
signal; that is the only place a pending disable can be in flight, and
signaling the suspend fence here actually opens a memory corruption
window. I'll fix this in an independent patch in the next rev.

Matt

> Thanks,
> Stuart
> 
> > +               struct xe_gpu_scheduler *sched = &q->guc->sched;
> > +
> > +               xe_sched_msg_lock(sched);
> > +               guc_exec_queue_try_add_msg(q, msg, KILL);
> > +               xe_sched_msg_unlock(sched);
> > +       }
> > +
> >         xe_guc_exec_queue_trigger_cleanup(q);
> >  }
> >  
> > @@ -2066,9 +2107,6 @@ static bool guc_exec_queue_try_add_msg(struct
> > xe_exec_queue *q,
> >         return true;
> >  }
> >  
> > -#define STATIC_MSG_CLEANUP     0
> > -#define STATIC_MSG_SUSPEND     1
> > -#define STATIC_MSG_RESUME      2
> >  static void guc_exec_queue_destroy(struct xe_exec_queue *q)
> >  {
> >         struct xe_sched_msg *msg = q->guc->static_msgs +
> > STATIC_MSG_CLEANUP;
>

Thread overview: 14+ messages
2026-06-13  1:38 [PATCH v4] drm/xe: Disable scheduling early on FD close to avoid CAT error cascade Matthew Brost
2026-06-13  2:22 ` ✓ CI.KUnit: success for drm/xe: Disable scheduling early on FD close to avoid CAT error cascade (rev3) Patchwork
2026-06-13  3:01 ` ✗ Xe.CI.BAT: failure " Patchwork
2026-06-13 20:30 ` ✗ Xe.CI.FULL: " Patchwork
2026-06-14  0:48 ` ✓ CI.KUnit: success for drm/xe: Disable scheduling early on FD close to avoid CAT error cascade (rev4) Patchwork
2026-06-14  1:32 ` ✓ Xe.CI.BAT: " Patchwork
2026-06-14  2:35 ` ✓ Xe.CI.FULL: " Patchwork
2026-06-15 16:17 ` [PATCH v4] drm/xe: Disable scheduling early on FD close to avoid CAT error cascade Cavitt, Jonathan
2026-06-15 16:58   ` Matthew Brost
2026-06-16 18:40     ` Cavitt, Jonathan
2026-06-16 20:08 ` Summers, Stuart
2026-06-16 23:31   ` Matthew Brost [this message]
2026-06-16 20:25 ` Niranjana Vishwanathapura
2026-06-16 20:40   ` Matthew Brost
