From: Matthew Brost <matthew.brost@intel.com>
To: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Zhanjun Dong <zhanjun.dong@intel.com>, <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v10] drm/xe/uc: Add stop on hardware initialization error
Date: Fri, 12 Dec 2025 14:01:29 -0800 [thread overview]
Message-ID: <aTyQubQC/OXInleX@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <aTbLhN9UCZt29bMJ@intel.com>
On Mon, Dec 08, 2025 at 07:58:44AM -0500, Rodrigo Vivi wrote:
> On Fri, Dec 05, 2025 at 01:06:42PM -0500, Zhanjun Dong wrote:
> > On hardware init fail, the hardware might no longer response, add uc stop
> > to clean up. At driver unload, all exec_queue items need to be freeed,
> > change xe_guc_submit_pause_abort to free all contexts.
> >
> > This will fix memory leak issue like:
> > [ 189.997904] [drm:drm_mm_takedown] *ERROR* node [00f0f000 + 00007000]: inserted at
> > drm_mm_insert_node_in_range+0x2c0/0x510
> > __xe_ggtt_insert_bo_at+0x167/0x540 [xe]
> > xe_ggtt_insert_bo+0x1a/0x30 [xe]
> > __xe_bo_create_locked+0x1f3/0x930 [xe]
> > xe_bo_create_pin_map_at_aligned+0x59/0x1f0 [xe]
> > xe_bo_create_pin_map_at_novm+0xae/0x140 [xe]
> > xe_bo_create_pin_map_novm+0x23/0x40 [xe]
> > xe_lrc_create+0x1e4/0x17c0 [xe]
> > xe_exec_queue_create+0x38a/0x6a0 [xe]
> > xe_gt_record_default_lrcs+0x117/0x8b0 [xe]
> > xe_uc_load_hw+0xa2/0x290 [xe]
> > xe_gt_init+0x357/0xab0 [xe]
> > xe_device_probe+0x403/0xa30 [xe]
> > xe_pci_probe+0x39a/0x610 [xe]
> > local_pci_probe+0x47/0xb0
> > pci_device_probe+0xf3/0x260
> > really_probe+0xf1/0x3b0
> > __driver_probe_device+0x8c/0x180
> > device_driver_attach+0x57/0xd0
> > bind_store+0x77/0xd0
> > drv_attr_store+0x24/0x50
> > sysfs_kf_write+0x4d/0x80
> > kernfs_fop_write_iter+0x188/0x240
> > vfs_write+0x280/0x540
> > ksys_write+0x6f/0xf0
> > __x64_sys_write+0x19/0x30
> > x64_sys_call+0x2171/0x25a0
> > do_syscall_64+0x93/0xb80
> > entry_SYSCALL_64_after_hwframe+0x7
> > and:
> > [ 189.973775] xe 0000:00:02.0: [drm] *ERROR* Tile0: GT1: GUC ID manager unclean (1/65535)
> > [ 189.981731] xe 0000:00:02.0: [drm] Tile0: GT1: total 65535
> > [ 189.981733] xe 0000:00:02.0: [drm] Tile0: GT1: used 1
> > [ 189.981734] xe 0000:00:02.0: [drm] Tile0: GT1: range 2..2 (1)
> >
> > Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5466
> > Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5530
> > Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
> > ---
> > v10:Add submit initialized helper function (Matthew)
> > Call xe_uc_reset_prepare rather than set flag directly (Matthew)
> > v9: Rebase and keep xe_guc_submit_pause_abort name unchanged
> > v8: Fix __mutex_lock warning
> > v7: Clear all queue items by guc_submit_fini/xe_guc_submit_pause_abort (Matthew)
> > v6: As huc not involved in vf_uc_load_hw, roll back to guc sanitize
> > v5: Move stop flag set in guc_fini_hw
> > Change to uc_sanitize in uc init path
> > v4: Add memory leak fix
> > Switch to xe_uc_stop
> > v3: Switch to xe_guc_stop
> > v2: Switch to xe_guc_ct_stop
> > ---
> > drivers/gpu/drm/xe/xe_guc.c | 6 ++++++
> > drivers/gpu/drm/xe/xe_guc_submit.c | 12 ++++++++----
> > drivers/gpu/drm/xe/xe_guc_submit.h | 1 +
> > drivers/gpu/drm/xe/xe_uc.c | 8 ++++++--
> > 4 files changed, 21 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> > index f0407bab9a0c..3dcf078e111f 100644
> > --- a/drivers/gpu/drm/xe/xe_guc.c
> > +++ b/drivers/gpu/drm/xe/xe_guc.c
> > @@ -662,6 +662,12 @@ static void guc_fini_hw(void *arg)
> > struct xe_guc *guc = arg;
> > struct xe_gt *gt = guc_to_gt(guc);
> >
> > + if (xe_guc_submit_initialized(guc)) {
> > + xe_guc_reset_prepare(guc);
> > + xe_guc_stop(guc);
> > + xe_guc_submit_pause_abort(guc);
> > + }
This should likely be in guc_submit_fini actually.
> > +
> > xe_with_force_wake(fw_ref, gt_to_fw(gt), XE_FORCEWAKE_ALL)
> > xe_uc_sanitize_reset(&guc_to_gt(guc)->uc);
> >
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index f3f2c8556a66..34c6e8a03013 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -425,6 +425,11 @@ void xe_guc_submit_disable(struct xe_guc *guc)
> > guc->submission_state.enabled = false;
> > }
> >
> > +bool xe_guc_submit_initialized(struct xe_guc *guc)
> > +{
> > + return guc->submission_state.initialized;
> > +}
> > +
> > static void __release_guc_id(struct xe_guc *guc, struct xe_exec_queue *q, u32 xa_count)
> > {
> > int i;
> > @@ -992,7 +997,7 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
> > * If device is being wedged even before submission_state is
> > * initialized, there's nothing to do here.
> > */
> > - if (!guc->submission_state.initialized)
> > + if (!xe_guc_submit_initialized(guc))
> > return;
> >
> > err = devm_add_action_or_reset(guc_to_xe(guc)->drm.dev,
> > @@ -1994,7 +1999,7 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> > if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
> > return 0;
> >
> > - if (!guc->submission_state.initialized)
> > + if (!xe_guc_submit_initialized(guc))
> > return 0;
> >
> > /*
> > @@ -2418,8 +2423,7 @@ void xe_guc_submit_pause_abort(struct xe_guc *guc)
> > continue;
> >
> > xe_sched_submission_start(sched);
> > - if (exec_queue_killed_or_banned_or_wedged(q))
> > - xe_guc_exec_queue_trigger_cleanup(q);
> > + guc_exec_queue_kill(q);
>
> I believe this could deserve some extra explanation in a separate patch
>
Yes, break this into a different patch. Which fixes the original
implementation.
> > }
> > mutex_unlock(&guc->submission_state.lock);
> > }
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
> > index 100a7891b918..9308da2bd104 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.h
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.h
> > @@ -15,6 +15,7 @@ struct xe_guc;
> > int xe_guc_submit_init(struct xe_guc *guc, unsigned int num_ids);
> > int xe_guc_submit_enable(struct xe_guc *guc);
> > void xe_guc_submit_disable(struct xe_guc *guc);
> > +bool xe_guc_submit_initialized(struct xe_guc *guc);
> >
> > int xe_guc_submit_reset_prepare(struct xe_guc *guc);
> > void xe_guc_submit_reset_wait(struct xe_guc *guc);
> > diff --git a/drivers/gpu/drm/xe/xe_uc.c b/drivers/gpu/drm/xe/xe_uc.c
> > index 157520ea1783..60430d56c79c 100644
> > --- a/drivers/gpu/drm/xe/xe_uc.c
> > +++ b/drivers/gpu/drm/xe/xe_uc.c
> > @@ -173,7 +173,9 @@ static int vf_uc_load_hw(struct xe_uc *uc)
> > return 0;
> >
> > err_out:
> > - xe_guc_sanitize(&uc->guc);
> > + xe_uc_reset_prepare(uc);
> > + xe_uc_stop(uc);
> > + xe_uc_sanitize(uc);
>
> Why reset_prepare and not stop_prepare?
> All these guc variant functions are hard to follow nowadays
> and this combination seems strange and make things worse to follow.
>
> Probably some refactor on the current names or a new wrapper function
> is needed here.
>
> And why you use sanitize here, but the pause_abort on the above block...
>
> This patch is doing a lot, in a single shot and without explanation.
> It is probably an indication that a cleaner refactor preparation
> series is needed here.
>
The entire start / stop flows are a mess. A lot of this I copied from
the i915 early in Xe and appartently got a lot of things wrong. The
wedging code is wrong too, again my bad. We are getting bug reports on
this mess. I'm going to audit the entire driver now and try to clean
this is up.
I'll likely pull in code from Zhanjun in this process.
Matt
> Thanks,
> Rodrigo.
>
> > return err;
> > }
> >
> > @@ -231,7 +233,9 @@ int xe_uc_load_hw(struct xe_uc *uc)
> > return 0;
> >
> > err_out:
> > - xe_guc_sanitize(&uc->guc);
> > + xe_uc_reset_prepare(uc);
> > + xe_uc_stop(uc);
> > + xe_uc_sanitize(uc);
> > return ret;
> > }
> >
> > --
> > 2.34.1
> >
prev parent reply other threads:[~2025-12-12 22:03 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-05 18:06 [PATCH v10] drm/xe/uc: Add stop on hardware initialization error Zhanjun Dong
2025-12-05 21:25 ` ✓ CI.KUnit: success for drm/xe/uc: Add stop on hardware initialization error (rev10) Patchwork
2025-12-05 21:59 ` ✓ Xe.CI.BAT: " Patchwork
2025-12-06 8:00 ` ✗ Xe.CI.Full: failure " Patchwork
2025-12-08 12:58 ` [PATCH v10] drm/xe/uc: Add stop on hardware initialization error Rodrigo Vivi
2025-12-12 22:01 ` Matthew Brost [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aTyQubQC/OXInleX@lstrano-desk.jf.intel.com \
--to=matthew.brost@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=rodrigo.vivi@intel.com \
--cc=zhanjun.dong@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox