Intel-XE Archive on lore.kernel.org
From: Matthew Brost <matthew.brost@intel.com>
To: "Dong, Zhanjun" <zhanjun.dong@intel.com>
Cc: <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v3] drm/xe/uc: Add stop on hardware initialization error
Date: Tue, 18 Nov 2025 19:17:58 -0800
Message-ID: <aR025izk8Q10bFOl@lstrano-desk.jf.intel.com>
In-Reply-To: <84fa5b89-61e7-4aec-ab17-5057f9c52d74@intel.com>

On Tue, Nov 04, 2025 at 11:33:19AM -0500, Dong, Zhanjun wrote:
> 
> 
> On 2025-10-28 6:36 p.m., Dong, Zhanjun wrote:
> > 
> > 
> > On 2025-10-28 3:57 p.m., Matthew Brost wrote:
> > > On Tue, Oct 28, 2025 at 11:38:20AM -0400, Zhanjun Dong wrote:
> > > > > On hardware init failure, the hardware might no longer respond. Add a
> > > > > GuC stop to clean up exec_queue items.
> > > > 
> > > > Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5466
> > > > Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5530
> > > > Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
> > > > ---
> > > > v3: Switch to xe_guc_stop
> > > > v2: Switch to xe_guc_ct_stop
> > > > ---
> > > >   drivers/gpu/drm/xe/xe_uc.c | 2 ++
> > > >   1 file changed, 2 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_uc.c b/drivers/gpu/drm/xe/xe_uc.c
> > > > index 465bda355443..00ca5883e006 100644
> > > > --- a/drivers/gpu/drm/xe/xe_uc.c
> > > > +++ b/drivers/gpu/drm/xe/xe_uc.c
> > > > @@ -173,6 +173,7 @@ static int vf_uc_load_hw(struct xe_uc *uc)
> > > >       return 0;
> > > >   err_out:
> > > > +    xe_guc_stop(&uc->guc);
> > > 
> > > If exec queues are destroyed later—after the submission backend has been
> > > stopped—the final put on the queue may be lost, leading to dangling
> > > memory when aborting the driver load or unloading it.
> > > 
> > > I think you'll need to call xe_guc_submit_pause_abort somewhere to
> > > ensure the final put cleanup messages are processed by the queues. Maybe
> > > we add this call in guc_submit_fini before wait_event_timeout?
> > > 
> > > Matt
> > > Thanks for the review.
> > > My original thought was that the xe_guc_stop -> xe_guc_submit_stop ->
> > > guc_exec_queue_stop path would do the cleanup, but it might not cover
> > > all conditions. Let me try.
> > I tested calling xe_guc_submit_pause_abort in guc_submit_fini before
> > wait_event_timeout. It works in some conditions, but one case might not
> > be covered: LR queues are not cleared. So I'm thinking of:
> 
> @@ -2375,7 +2382,9 @@ void xe_guc_submit_pause_abort(struct xe_guc *guc)
>                         continue;
> 
>                 xe_sched_submission_start(sched);
> -               if (exec_queue_killed_or_banned_or_wedged(q))
> > +               if (exec_queue_killed_or_banned_or_wedged(q) ||
> > +                   exec_queue_registered(q))
>                         xe_guc_exec_queue_trigger_cleanup(q);
>         }
>         mutex_unlock(&guc->submission_state.lock);
> 
> > @Matthew Brost <matthew.brost@intel.com>, do you think this change has side
> > effects on the migration worker? I can make it a separate function if so.
> 

Probably just change this function to forcefully kill all exec queues,
i.e., call guc_exec_queue_kill. That is likely what I should have done in
VF migration from the start, and it is what you want to do here.

Matt 
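
For illustration only, the difference between cleaning up queues
conditionally and forcefully killing all of them can be sketched as a
userspace mock. This is not driver code: struct mock_queue and every
mock_* function below are hypothetical names invented for this sketch,
and the real behavior of xe_guc_submit_pause_abort and
guc_exec_queue_kill is not modeled beyond the idea discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an exec queue; NOT the xe driver's structure. */
struct mock_queue {
	bool killed_or_banned_or_wedged; /* queue already marked for teardown */
	bool cleaned_up;                 /* cleanup has been triggered */
};

/* Trigger cleanup once; returns 1 if this call triggered it, 0 otherwise. */
static size_t mock_trigger_cleanup(struct mock_queue *q)
{
	if (q->cleaned_up)
		return 0;
	q->cleaned_up = true;
	return 1;
}

/* Variant A: only queues already killed/banned/wedged are cleaned up. */
static size_t mock_pause_abort_conditional(struct mock_queue *qs, size_t n)
{
	size_t cleaned = 0;

	for (size_t i = 0; i < n; i++)
		if (qs[i].killed_or_banned_or_wedged)
			cleaned += mock_trigger_cleanup(&qs[i]);
	return cleaned;
}

/* Variant B: mark every queue as killed first, so none is skipped. */
static size_t mock_pause_abort_kill_all(struct mock_queue *qs, size_t n)
{
	size_t cleaned = 0;

	for (size_t i = 0; i < n; i++) {
		qs[i].killed_or_banned_or_wedged = true; /* models a forced kill */
		cleaned += mock_trigger_cleanup(&qs[i]);
	}
	return cleaned;
}

/* Self-check: one already-killed queue plus two live queues. */
static void mock_demo(void)
{
	struct mock_queue qs[3] = {
		{ .killed_or_banned_or_wedged = true },
		{ .killed_or_banned_or_wedged = false },
		{ .killed_or_banned_or_wedged = false },
	};

	/* Variant A skips the two live queues. */
	assert(mock_pause_abort_conditional(qs, 3) == 1);
	/* Variant B picks up the two that were left behind. */
	assert(mock_pause_abort_kill_all(qs, 3) == 2);
	assert(qs[0].cleaned_up && qs[1].cleaned_up && qs[2].cleaned_up);
}
```

Calling mock_demo() from a test program exercises both variants. In the
mock, the conditional variant leaves live queues untouched, which is the
kind of gap reported above for LR queues, while the kill-all variant
leaves no queue behind.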

> Regards,
> Zhanjun Dong
> 
> > 
> > Regards,
> > Zhanjun Dong
> > 
> > > 
> > > >       xe_guc_sanitize(&uc->guc);
> > > >       return err;
> > > >   }
> > > > @@ -228,6 +229,7 @@ int xe_uc_load_hw(struct xe_uc *uc)
> > > >       return 0;
> > > >   err_out:
> > > > +    xe_guc_stop(&uc->guc);
> > > >       xe_guc_sanitize(&uc->guc);
> > > >       return ret;
> > > >   }
> > > > -- 
> > > > 2.34.1
> > > > 
> > 
> 

Thread overview: 9+ messages
2025-10-28 15:38 [PATCH v3] drm/xe/uc: Add stop on hardware initialization error Zhanjun Dong
2025-10-28 17:29 ` ✓ CI.KUnit: success for drm/xe/uc: Add stop on hardware initialization error (rev2) Patchwork
2025-10-28 18:23 ` ✓ Xe.CI.BAT: " Patchwork
2025-10-28 19:57 ` [PATCH v3] drm/xe/uc: Add stop on hardware initialization error Matthew Brost
2025-10-28 22:36   ` Dong, Zhanjun
2025-11-04 16:33     ` Dong, Zhanjun
2025-11-19  3:17       ` Matthew Brost [this message]
2025-11-20 17:05         ` Dong, Zhanjun
2025-10-29  3:43 ` ✗ Xe.CI.Full: failure for drm/xe/uc: Add stop on hardware initialization error (rev2) Patchwork
