From: Matthew Brost <matthew.brost@intel.com>
To: Stuart Summers <stuart.summers@intel.com>
Cc: <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH 6/7] drm/xe: Don't block messages to the GPU scheduler
Date: Mon, 13 Oct 2025 09:56:37 -0700 [thread overview]
Message-ID: <aO0vRcrh+vYm8JbN@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <20251013162504.7768-7-stuart.summers@intel.com>
On Mon, Oct 13, 2025 at 04:25:03PM +0000, Stuart Summers wrote:
> Right now we are using the state of the GPU scheduler
> to determine whether we send and receive messages. There
> are some states, however, where we might intentionally
> pause the scheduler, like a device wedge, and expect that
> messages are resumed later once the user has taken the
> hardware state and is attempting to reset, like an unbind.
>
> Remove these checks in the XeKMD and let the GPU scheduler
> handle state checks internally.
>
We can't do this. The entire queue stop/start mechanism relies on
getting exclusive access to the queue by ensuring the scheduler is fully
stopped - and that includes messages. This change will break job
timeouts, GT reset flows, and VF migration.
What exactly is the problem you are trying to solve? The device is
wedged and queues are stopped, then an unbind occurs? That is probably a
bug. IIRC even when wedging a device / tearing down a queue we should
always start the queue again. We could assert in guc_submit_wedged_fini
that no queues are still paused.
Also, if you are having issues on unbind - there is this patch [1] which
fixes an issue there too. I'm going to merge [1] now.
Matt
[1] https://patchwork.freedesktop.org/series/155417/
> Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> ---
> drivers/gpu/drm/xe/xe_gpu_scheduler.c | 6 +-----
> 1 file changed, 1 insertion(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.c b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> index f91e06d03511..d9d6fb641188 100644
> --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> @@ -7,8 +7,7 @@
>
> static void xe_sched_process_msg_queue(struct xe_gpu_scheduler *sched)
> {
> - if (!READ_ONCE(sched->base.pause_submit))
> - queue_work(sched->base.submit_wq, &sched->work_process_msg);
> + queue_work(sched->base.submit_wq, &sched->work_process_msg);
> }
>
> static void xe_sched_process_msg_queue_if_ready(struct xe_gpu_scheduler *sched)
> @@ -43,9 +42,6 @@ static void xe_sched_process_msg_work(struct work_struct *w)
> container_of(w, struct xe_gpu_scheduler, work_process_msg);
> struct xe_sched_msg *msg;
>
> - if (READ_ONCE(sched->base.pause_submit))
> - return;
> -
> msg = xe_sched_get_msg(sched);
> if (msg) {
> sched->ops->process_msg(msg);
> --
> 2.34.1
>
Thread overview: 20+ messages
2025-10-13 16:24 [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-13 16:24 ` [PATCH 1/7] drm/xe: Add additional trace points for LRCs Stuart Summers
2025-10-13 16:24 ` [PATCH 2/7] drm/xe: Add a trace point for VM close Stuart Summers
2025-10-13 16:25 ` [PATCH 3/7] drm/xe: Add the BO pointer info to the BO trace Stuart Summers
2025-10-13 16:25 ` [PATCH 4/7] drm/xe: Add new exec queue trace points Stuart Summers
2025-10-13 16:25 ` [PATCH 5/7] drm/xe: Correct migration VM teardown order Stuart Summers
2025-10-13 16:25 ` [PATCH 6/7] drm/xe: Don't block messages to the GPU scheduler Stuart Summers
2025-10-13 16:56 ` Matthew Brost [this message]
2025-10-13 17:17 ` Summers, Stuart
2025-10-13 17:31 ` Matthew Brost
2025-10-13 17:38 ` Summers, Stuart
2025-10-13 21:49 ` Summers, Stuart
2025-10-13 16:25 ` [PATCH 7/7] drm/xe: Check for GuC responses on disabling scheduling Stuart Summers
2025-10-13 17:04 ` [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Matthew Brost
2025-10-13 17:13 ` Summers, Stuart
2025-10-13 21:48 ` Summers, Stuart
2025-10-13 18:45 ` ✗ CI.checkpatch: warning for Fix a couple of wedge corner-case memory leaks (rev2) Patchwork
2025-10-13 18:46 ` ✓ CI.KUnit: success " Patchwork
2025-10-13 19:31 ` ✗ Xe.CI.BAT: failure " Patchwork
2025-10-13 23:13 ` ✗ Xe.CI.Full: " Patchwork