From: Matthew Brost <matthew.brost@intel.com>
To: "Dong, Zhanjun" <zhanjun.dong@intel.com>
Cc: <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v3] drm/xe/uc: Add stop on hardware initialization error
Date: Tue, 18 Nov 2025 19:17:58 -0800
Message-ID: <aR025izk8Q10bFOl@lstrano-desk.jf.intel.com>
In-Reply-To: <84fa5b89-61e7-4aec-ab17-5057f9c52d74@intel.com>
On Tue, Nov 04, 2025 at 11:33:19AM -0500, Dong, Zhanjun wrote:
>
>
> On 2025-10-28 6:36 p.m., Dong, Zhanjun wrote:
> >
> >
> > On 2025-10-28 3:57 p.m., Matthew Brost wrote:
> > > On Tue, Oct 28, 2025 at 11:38:20AM -0400, Zhanjun Dong wrote:
> > > > On hardware init failure, the hardware might no longer respond;
> > > > add a GuC stop
> > > > to clean up exec_queue items.
> > > >
> > > > Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5466
> > > > Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5530
> > > > Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
> > > > ---
> > > > v3: Switch to xe_guc_stop
> > > > v2: Switch to xe_guc_ct_stop
> > > > ---
> > > > drivers/gpu/drm/xe/xe_uc.c | 2 ++
> > > > 1 file changed, 2 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_uc.c b/drivers/gpu/drm/xe/xe_uc.c
> > > > index 465bda355443..00ca5883e006 100644
> > > > --- a/drivers/gpu/drm/xe/xe_uc.c
> > > > +++ b/drivers/gpu/drm/xe/xe_uc.c
> > > > @@ -173,6 +173,7 @@ static int vf_uc_load_hw(struct xe_uc *uc)
> > > > return 0;
> > > > err_out:
> > > > + xe_guc_stop(&uc->guc);
> > >
> > > If exec queues are destroyed later—after the submission backend has been
> > > stopped—the final put on the queue may be lost, leading to dangling
> > > memory when aborting the driver load or unloading it.
> > >
> > > I think you'll need to call xe_guc_submit_pause_abort somewhere to
> > > ensure the final put cleanup messages are processed by the queues. Maybe
> > > we add this call in guc_submit_fini before wait_event_timeout?
> > >
> > > Matt
> > Thanks for the review.
> > My original thought was that the xe_guc_stop/xe_guc_submit_stop/
> > guc_exec_queue_stop path would do the cleanup, but it might not cover all
> > conditions. Let me try your suggestion.
> I tested calling xe_guc_submit_pause_abort in guc_submit_fini before
> wait_event_timeout. It works in some conditions, but there is one case it
> might not cover: LR (long-running) queues are not cleared. So I'm thinking of:
>
> @@ -2375,7 +2382,9 @@ void xe_guc_submit_pause_abort(struct xe_guc *guc)
> continue;
>
> xe_sched_submission_start(sched);
> - if (exec_queue_killed_or_banned_or_wedged(q))
> + if (exec_queue_killed_or_banned_or_wedged(q) || \
> exec_queue_registered(q))
> xe_guc_exec_queue_trigger_cleanup(q);
> }
> mutex_unlock(&guc->submission_state.lock);
>
> @Matthew Brost <matthew.brost@intel.com>, do you think this change has side
> effects on the migration worker? I can make it a separate function if so.
>
It is probably simplest to change this function to forcefully kill all exec
queues, i.e., call guc_exec_queue_kill. That is likely what I should
have done in VF migration from the start, and likely what you want here.
Matt
> Regards,
> Zhanjun Dong
>
> >
> > Regards,
> > Zhanjun Dong
> >
> > >
> > > > xe_guc_sanitize(&uc->guc);
> > > > return err;
> > > > }
> > > > @@ -228,6 +229,7 @@ int xe_uc_load_hw(struct xe_uc *uc)
> > > > return 0;
> > > > err_out:
> > > > + xe_guc_stop(&uc->guc);
> > > > xe_guc_sanitize(&uc->guc);
> > > > return ret;
> > > > }
> > > > --
> > > > 2.34.1
> > > >
> >
>
Thread overview: 9+ messages
2025-10-28 15:38 [PATCH v3] drm/xe/uc: Add stop on hardware initialization error Zhanjun Dong
2025-10-28 17:29 ` ✓ CI.KUnit: success for drm/xe/uc: Add stop on hardware initialization error (rev2) Patchwork
2025-10-28 18:23 ` ✓ Xe.CI.BAT: " Patchwork
2025-10-28 19:57 ` [PATCH v3] drm/xe/uc: Add stop on hardware initialization error Matthew Brost
2025-10-28 22:36 ` Dong, Zhanjun
2025-11-04 16:33 ` Dong, Zhanjun
2025-11-19 3:17 ` Matthew Brost [this message]
2025-11-20 17:05 ` Dong, Zhanjun
2025-10-29 3:43 ` ✗ Xe.CI.Full: failure for drm/xe/uc: Add stop on hardware initialization error (rev2) Patchwork