From: "Summers, Stuart" <stuart.summers@intel.com>
To: "Brost, Matthew" <matthew.brost@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH 6/7] drm/xe: Don't block messages to the GPU scheduler
Date: Mon, 13 Oct 2025 17:17:58 +0000
Message-ID: <69295444f047934f6f8a711b939bf1306dce0416.camel@intel.com>
In-Reply-To: <aO0vRcrh+vYm8JbN@lstrano-desk.jf.intel.com>
On Mon, 2025-10-13 at 09:56 -0700, Matthew Brost wrote:
> On Mon, Oct 13, 2025 at 04:25:03PM +0000, Stuart Summers wrote:
> > Right now we are using the state of the GPU scheduler
> > to determine whether we send and receive messages. There
> > are some states, however, where we might intentionally
> > pause the scheduler, like a device wedge, and expect that
> > messages are resumed later once the user has taken the
> > hardware state and is attempting to reset, like an unbind.
> >
> > Remove these checks in the XeKMD and let the GPU scheduler
> > handle state checks internally.
> >
>
> We can't do this. The entire queue stop/start mechanism relies on
> getting exclusive access to the queue by ensuring the scheduler is
> fully stopped - this includes messages. This will break job timeouts,
> GT reset flows, and VF migration.
I'm not sure I fully understand here. The scheduler should still be
stopped exactly as it was before; this change just means we keep
queuing messages, right? I can test the job timeout piece to make
sure...
Basically I'm arguing the start/stop mechanics should live inside the
scheduler and not in the calling driver.
>
> What exactly is the problem you are trying to solve? The device is
> wedged and queues are stopped, then an unbind occurs? That is
> probably a bug. IIRC even when wedging a device / tearing down a
> queue we should always start the queue again. We could assert in
> guc_submit_wedged_fini that
I think there's basically a race between sending the cleanup message
and stopping the scheduler. Once we send that message, we don't really
track it on the xe side. So if we artificially pause things on the xe
side (via the checks this patch removes), we can get into a scenario
where the cleanup message is sent *after* the scheduler is paused; that
message then gets dropped, and we never issue the deregistration for
that particular exec queue.
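Roughly, the window I'm worried about looks like the sketch below. To
be clear, this is just a standalone userspace model of the ordering,
not driver code - pause_submit and the "cleanup message" stand in for
sched->base.pause_submit and the exec queue deregistration message, and
everything else is made up for illustration:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Intentionally racy - it models the window, it is not driver code. */
static bool pause_submit;	/* stands in for sched->base.pause_submit */
static bool msg_queued;		/* stands in for queue_work() on the msg */

static void *stop_path(void *arg)
{
	/* Models the submit-stop path: pause first; nothing later
	 * re-checks for a message that raced in around the pause. */
	(void)arg;
	pause_submit = true;
	return NULL;
}

static void *cleanup_path(void *arg)
{
	/* Models sending the exec queue cleanup message: with the check
	 * this patch removes, a message sent after the pause is dropped. */
	(void)arg;
	if (!pause_submit)
		msg_queued = true;
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, stop_path, NULL);
	pthread_create(&b, NULL, cleanup_path, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	if (!msg_queued)
		printf("cleanup message dropped -> exec queue never deregistered\n");
	return 0;
}

Depending on which thread wins, the message either goes out or is
silently dropped - and in the dropped case nothing on the xe side ever
retries it.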
Thanks,
Stuart
> all queues are not paused.
>
> Also if you are having issues on unbind - there is this patch [1]
> which fixes an issue too. I'm going to merge [1] now.
>
> Matt
>
> [1] https://patchwork.freedesktop.org/series/155417/
>
> > Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_gpu_scheduler.c | 6 +-----
> > 1 file changed, 1 insertion(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.c b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> > index f91e06d03511..d9d6fb641188 100644
> > --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> > +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> > @@ -7,8 +7,7 @@
> >
> >  static void xe_sched_process_msg_queue(struct xe_gpu_scheduler *sched)
> >  {
> > -	if (!READ_ONCE(sched->base.pause_submit))
> > -		queue_work(sched->base.submit_wq, &sched->work_process_msg);
> > +	queue_work(sched->base.submit_wq, &sched->work_process_msg);
> >  }
> >
> >  static void xe_sched_process_msg_queue_if_ready(struct xe_gpu_scheduler *sched)
> > @@ -43,9 +42,6 @@ static void xe_sched_process_msg_work(struct work_struct *w)
> >  		container_of(w, struct xe_gpu_scheduler, work_process_msg);
> >  	struct xe_sched_msg *msg;
> >
> > -	if (READ_ONCE(sched->base.pause_submit))
> > -		return;
> > -
> >  	msg = xe_sched_get_msg(sched);
> >  	if (msg) {
> >  		sched->ops->process_msg(msg);
> > --
> > 2.34.1
> >