Intel-XE Archive on lore.kernel.org
From: "Summers, Stuart" <stuart.summers@intel.com>
To: "Dong, Zhanjun" <zhanjun.dong@intel.com>,
	"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>
Cc: "Brost, Matthew" <matthew.brost@intel.com>,
	"Ceraolo Spurio, Daniele" <daniele.ceraolospurio@intel.com>
Subject: Re: [PATCH v4] drm/xe/uc: Add stop on hardware initialization error
Date: Thu, 13 Nov 2025 23:09:54 +0000
Message-ID: <95a3dd458fd57ff11513a1e5682b40c295a42d0c.camel@intel.com>
In-Reply-To: <20251112234453.1871032-1-zhanjun.dong@intel.com>

On Wed, 2025-11-12 at 18:44 -0500, Zhanjun Dong wrote:
> On hardware init failure, the hardware might no longer respond, so add
> a GuC stop to clean up exec_queue items.
> On the driver unload path, add a call to GuC stop to clean up queue
> items. This cleanup fixes memory leaks like:
> [  189.997904] [drm:drm_mm_takedown] *ERROR* node [00f0f000 + 00007000]: inserted at
>                 drm_mm_insert_node_in_range+0x2c0/0x510
>                 __xe_ggtt_insert_bo_at+0x167/0x540 [xe]
>                 xe_ggtt_insert_bo+0x1a/0x30 [xe]
>                 __xe_bo_create_locked+0x1f3/0x930 [xe]
>                 xe_bo_create_pin_map_at_aligned+0x59/0x1f0 [xe]
>                 xe_bo_create_pin_map_at_novm+0xae/0x140 [xe]
>                 xe_bo_create_pin_map_novm+0x23/0x40 [xe]
>                 xe_lrc_create+0x1e4/0x17c0 [xe]
>                 xe_exec_queue_create+0x38a/0x6a0 [xe]
>                 xe_gt_record_default_lrcs+0x117/0x8b0 [xe]
>                 xe_uc_load_hw+0xa2/0x290 [xe]
>                 xe_gt_init+0x357/0xab0 [xe]
>                 xe_device_probe+0x403/0xa30 [xe]
>                 xe_pci_probe+0x39a/0x610 [xe]
>                 local_pci_probe+0x47/0xb0
>                 pci_device_probe+0xf3/0x260
>                 really_probe+0xf1/0x3b0
>                 __driver_probe_device+0x8c/0x180
>                 device_driver_attach+0x57/0xd0
>                 bind_store+0x77/0xd0
>                 drv_attr_store+0x24/0x50
>                 sysfs_kf_write+0x4d/0x80
>                 kernfs_fop_write_iter+0x188/0x240
>                 vfs_write+0x280/0x540
>                 ksys_write+0x6f/0xf0
>                 __x64_sys_write+0x19/0x30
>                 x64_sys_call+0x2171/0x25a0
>                 do_syscall_64+0x93/0xb80
>                 entry_SYSCALL_64_after_hwframe+0x7
> and:
> [  189.973775] xe 0000:00:02.0: [drm] *ERROR* Tile0: GT1: GUC ID manager unclean (1/65535)
> [  189.981731] xe 0000:00:02.0: [drm] Tile0: GT1:       total 65535
> [  189.981733] xe 0000:00:02.0: [drm] Tile0: GT1:       used 1
> [  189.981734] xe 0000:00:02.0: [drm] Tile0: GT1:       range 2..2 (1)
> 
> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5466
> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5530
> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
> ---
> v4: Add memory leak fix
>     Switch to xe_uc_stop
> v3: Switch to xe_guc_stop
> v2: Switch to xe_guc_ct_stop
> ---
>  drivers/gpu/drm/xe/xe_guc.c | 3 ++-
>  drivers/gpu/drm/xe/xe_uc.c  | 2 ++
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> index ecc3e091b89e..b4c9673f84d6 100644
> --- a/drivers/gpu/drm/xe/xe_guc.c
> +++ b/drivers/gpu/drm/xe/xe_guc.c
> @@ -661,6 +661,7 @@ static void guc_fini_hw(void *arg)
>         unsigned int fw_ref;
>  
>         fw_ref = xe_force_wake_get(gt_to_fw(gt), XE_FORCEWAKE_ALL);
> +       xe_guc_stop(guc);
>         xe_uc_sanitize_reset(&guc_to_gt(guc)->uc);

Kind of a flyby comment, but the encapsulation looks inconsistent here
and in the later functions in this patch: this path mixes a GuC-level
call (xe_guc_stop()) with a uC-level one (xe_uc_sanitize_reset()). Why
aren't we calling xe_guc_sanitize() here?

>         xe_force_wake_put(gt_to_fw(gt), fw_ref);
>  
> @@ -1598,7 +1599,7 @@ void xe_guc_stop_prepare(struct xe_guc *guc)
>  void xe_guc_stop(struct xe_guc *guc)
>  {
>         xe_guc_ct_stop(&guc->ct);
> -
> +       xe_guc_submit_reset_prepare(guc);

Shouldn't we just call xe_guc_reset_prepare() from
xe_guc_stop_prepare() instead to keep the ordering?
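
To make the ordering question concrete, here is a minimal standalone
sketch. It is not driver code: every function below is a hypothetical
stub that only records its name, standing in for the prepare, CT stop
and submit stop steps named in the patch. It assumes that "keeping the
ordering" means the prepare step runs before the CT is stopped.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical call log; exists only to make the call order visible. */
static char call_log[128];

static void log_call(const char *name)
{
	size_t room = sizeof(call_log) - strlen(call_log) - 1;

	if (call_log[0] && room) {
		strncat(call_log, ",", room);
		room--;
	}
	strncat(call_log, name, room);
}

/* Stubs standing in for the real steps; they do nothing but log. */
static void stub_reset_prepare(void) { log_call("reset_prepare"); }
static void stub_ct_stop(void)       { log_call("ct_stop"); }
static void stub_submit_stop(void)   { log_call("submit_stop"); }

/* Ordering as posted in v4: prepare runs after the CT is stopped. */
static void stop_as_posted(void)
{
	stub_ct_stop();
	stub_reset_prepare();
	stub_submit_stop();
}

/* Suggested split: prepare lives in the stop_prepare step ... */
static void stop_prepare_suggested(void)
{
	stub_reset_prepare();
}

/* ... and the stop step itself stays as it was before the patch. */
static void stop_suggested(void)
{
	stub_ct_stop();
	stub_submit_stop();
}

/* Helpers that reset the log, run one variant, and return the order. */
static const char *run_as_posted(void)
{
	call_log[0] = '\0';
	stop_as_posted();
	return call_log;
}

static const char *run_suggested(void)
{
	call_log[0] = '\0';
	stop_prepare_suggested();
	stop_suggested();
	return call_log;
}
```

Comparing run_as_posted() with run_suggested() shows the only
difference is whether the prepare step lands before or after the CT
stop. Whether the error paths would then also need to call the prepare
step is not covered by this sketch.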

>         xe_guc_submit_stop(guc);
>  }
>  
> diff --git a/drivers/gpu/drm/xe/xe_uc.c b/drivers/gpu/drm/xe/xe_uc.c
> index 465bda355443..6c72ce305d6c 100644
> --- a/drivers/gpu/drm/xe/xe_uc.c
> +++ b/drivers/gpu/drm/xe/xe_uc.c
> @@ -173,6 +173,7 @@ static int vf_uc_load_hw(struct xe_uc *uc)
>         return 0;
>  
>  err_out:
> +       xe_uc_stop(uc);
>         xe_guc_sanitize(&uc->guc);

And again, why xe_guc_sanitize() instead of xe_uc_sanitize()?

>         return err;
>  }
> @@ -228,6 +229,7 @@ int xe_uc_load_hw(struct xe_uc *uc)
>         return 0;
>  
>  err_out:
> +       xe_uc_stop(uc);
>         xe_guc_sanitize(&uc->guc);

And here... In this function above we have xe_huc_load() as well, so at
a minimum it seems like we should call that here. But IMO we should
just move this to xe_uc_sanitize().
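
Roughly the shape I have in mind, as a standalone sketch rather than
driver code. All functions are hypothetical stubs that only log their
names, and the order of the two sanitize calls inside the uC-level
helper is an assumption made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical call log; exists only to make the cleanup order visible. */
static char cleanup_log[128];

static void note(const char *name)
{
	size_t room = sizeof(cleanup_log) - strlen(cleanup_log) - 1;

	if (cleanup_log[0] && room) {
		strncat(cleanup_log, ",", room);
		room--;
	}
	strncat(cleanup_log, name, room);
}

/* Stubs standing in for the per-component steps; they only log. */
static void stub_uc_stop(void)      { note("uc_stop"); }
static void stub_guc_sanitize(void) { note("guc_sanitize"); }
static void stub_huc_sanitize(void) { note("huc_sanitize"); }

/* One uC-level helper, so callers never pick components by hand. */
static void stub_uc_sanitize(void)
{
	stub_huc_sanitize();
	stub_guc_sanitize();
}

/* Load path stand-in; a nonzero 'fail' simulates a load error. */
static int stub_uc_load_hw(int fail)
{
	int err = fail ? -1 : 0;

	cleanup_log[0] = '\0';
	if (err)
		goto err_out;

	return 0;

err_out:
	stub_uc_stop();
	stub_uc_sanitize();	/* covers GuC and HuC in one call */
	return err;
}
```

With this shape the error label does not need to know which firmware
components were loaded; it stops and then sanitizes at the uC level.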

I realize most of this isn't directly related to these bug fixes, so if
you agree, happy for these to be in a separate commit with the
exception of my comment around xe_guc_submit_reset_prepare() above.

Thanks,
Stuart

>         return ret;
>  }

