Intel-XE Archive on lore.kernel.org
From: "Dong, Zhanjun" <zhanjun.dong@intel.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
	<daniele.ceraolospurio@intel.com>, <stuart.summers@intel.com>
Subject: Re: [PATCH v9] drm/xe/uc: Add stop on hardware initialization error
Date: Fri, 5 Dec 2025 14:58:03 -0500
Message-ID: <2a2e606c-1e2b-44db-a4ae-6a51bf975515@intel.com>
In-Reply-To: <aTJbEulTxUNFySK4@lstrano-desk.jf.intel.com>



On 2025-12-04 11:09 p.m., Matthew Brost wrote:
> On Fri, Nov 28, 2025 at 04:34:11PM -0500, Zhanjun Dong wrote:
>> On hardware init failure, the hardware might no longer respond, so add a
>> uc stop to clean up. At driver unload, all exec_queue items need to be
>> freed, so change xe_guc_submit_pause_abort to free all contexts.
>>
>> This fixes memory leak issues like:
>> [  189.997904] [drm:drm_mm_takedown] *ERROR* node [00f0f000 + 00007000]: inserted at
>>                  drm_mm_insert_node_in_range+0x2c0/0x510
>>                  __xe_ggtt_insert_bo_at+0x167/0x540 [xe]
>>                  xe_ggtt_insert_bo+0x1a/0x30 [xe]
>>                  __xe_bo_create_locked+0x1f3/0x930 [xe]
>>                  xe_bo_create_pin_map_at_aligned+0x59/0x1f0 [xe]
>>                  xe_bo_create_pin_map_at_novm+0xae/0x140 [xe]
>>                  xe_bo_create_pin_map_novm+0x23/0x40 [xe]
>>                  xe_lrc_create+0x1e4/0x17c0 [xe]
>>                  xe_exec_queue_create+0x38a/0x6a0 [xe]
>>                  xe_gt_record_default_lrcs+0x117/0x8b0 [xe]
>>                  xe_uc_load_hw+0xa2/0x290 [xe]
>>                  xe_gt_init+0x357/0xab0 [xe]
>>                  xe_device_probe+0x403/0xa30 [xe]
>>                  xe_pci_probe+0x39a/0x610 [xe]
>>                  local_pci_probe+0x47/0xb0
>>                  pci_device_probe+0xf3/0x260
>>                  really_probe+0xf1/0x3b0
>>                  __driver_probe_device+0x8c/0x180
>>                  device_driver_attach+0x57/0xd0
>>                  bind_store+0x77/0xd0
>>                  drv_attr_store+0x24/0x50
>>                  sysfs_kf_write+0x4d/0x80
>>                  kernfs_fop_write_iter+0x188/0x240
>>                  vfs_write+0x280/0x540
>>                  ksys_write+0x6f/0xf0
>>                  __x64_sys_write+0x19/0x30
>>                  x64_sys_call+0x2171/0x25a0
>>                  do_syscall_64+0x93/0xb80
>>                  entry_SYSCALL_64_after_hwframe+0x7
>> and:
>> [  189.973775] xe 0000:00:02.0: [drm] *ERROR* Tile0: GT1: GUC ID manager unclean (1/65535)
>> [  189.981731] xe 0000:00:02.0: [drm] Tile0: GT1: 	total 65535
>> [  189.981733] xe 0000:00:02.0: [drm] Tile0: GT1: 	used 1
>> [  189.981734] xe 0000:00:02.0: [drm] Tile0: GT1: 	range 2..2 (1)
>>
>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5466
>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5530
>> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
>> ---
>> v9: Rebase and keep xe_guc_submit_pause_abort name unchanged
>> v8: Fix __mutex_lock warning
>> v7: Clear all queue items by guc_submit_fini/xe_guc_submit_pause_abort (Matthew)
>> v6: As huc not involved in vf_uc_load_hw, roll back to guc sanitize
>> v5: Move stop flag set in guc_fini_hw
>>      Change to uc_sanitize in uc init path
>> v4: Add memory leak fix
>>      Switch to xe_uc_stop
>> v3: Switch to xe_guc_stop
>> v2: Switch to xe_guc_ct_stop
>>
>> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_guc.c        | 6 ++++++
>>   drivers/gpu/drm/xe/xe_guc_submit.c | 3 +--
>>   drivers/gpu/drm/xe/xe_uc.c         | 8 +++++++-
>>   3 files changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
>> index 88376bc2a483..64e5959bfb60 100644
>> --- a/drivers/gpu/drm/xe/xe_guc.c
>> +++ b/drivers/gpu/drm/xe/xe_guc.c
>> @@ -662,6 +662,12 @@ static void guc_fini_hw(void *arg)
>>   	struct xe_guc *guc = arg;
>>   	struct xe_gt *gt = guc_to_gt(guc);
>>   
>> +	if (guc->submission_state.initialized) {
> 
> We probably should have a submit layer helper to read this variable.
Will do in next rev.
> 
>> +		xe_guc_reset_prepare(guc);
>> +		xe_guc_stop(guc);
>> +		xe_guc_submit_pause_abort(guc);
>> +	}
>> +
>>   	xe_with_force_wake(fw_ref, gt_to_fw(gt), XE_FORCEWAKE_ALL)
>>   		xe_uc_sanitize_reset(&guc_to_gt(guc)->uc);
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>> index 3ca2558c8c96..a64aa4edc360 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>> @@ -2417,8 +2417,7 @@ void xe_guc_submit_pause_abort(struct xe_guc *guc)
>>   			continue;
>>   
>>   		xe_sched_submission_start(sched);
>> -		if (exec_queue_killed_or_banned_or_wedged(q))
>> -			xe_guc_exec_queue_trigger_cleanup(q);
>> +		guc_exec_queue_kill(q);
>>   	}
>>   	mutex_unlock(&guc->submission_state.lock);
>>   }
>> diff --git a/drivers/gpu/drm/xe/xe_uc.c b/drivers/gpu/drm/xe/xe_uc.c
>> index 157520ea1783..5967b8d9f3cf 100644
>> --- a/drivers/gpu/drm/xe/xe_uc.c
>> +++ b/drivers/gpu/drm/xe/xe_uc.c
>> @@ -173,6 +173,9 @@ static int vf_uc_load_hw(struct xe_uc *uc)
>>   	return 0;
>>   
>>   err_out:
>> +	/* Stop guc submission */
>> +	atomic_fetch_or(1, &uc->guc.submission_state.stopped);
> 
> Can we call xe_uc_reset_prepare here?
> 
>> +	xe_uc_stop(uc);
>>   	xe_guc_sanitize(&uc->guc);
> 
> I know this is existing code, but this should probably be xe_uc_sanitize.
> 
>>   	return err;
>>   }
>> @@ -231,7 +234,10 @@ int xe_uc_load_hw(struct xe_uc *uc)
>>   	return 0;
>>   
>>   err_out:
>> -	xe_guc_sanitize(&uc->guc);
>> +	/* Stop guc submission */
>> +	atomic_fetch_or(1, &uc->guc.submission_state.stopped);
> 
> Can we call xe_uc_reset_prepare here?
Yes, will do that in next rev.

Regards,
Zhanjun Dong
> 
> Matt
> 
>> +	xe_uc_stop(uc);
>> +	xe_uc_sanitize(uc);
>>   	return ret;
>>   }
>>   
>> -- 
>> 2.34.1
>>
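
The review comment on guc_fini_hw() asks for a submit-layer helper to read
submission_state.initialized, and the reply agrees to add one in the next
revision. A minimal sketch of what such an accessor might look like follows.
It is a user-space mock, not driver code: struct xe_guc is reduced to the one
field discussed above, and the name xe_guc_submit_initialized() is
hypothetical, since the thread does not name the helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced stand-in for the driver's struct xe_guc; only the field
 * referenced in the patch is modelled here. */
struct xe_guc {
	struct {
		bool initialized;
	} submission_state;
};

/* Hypothetical helper name (an assumption, not from the thread): lets
 * callers such as guc_fini_hw() query the state without reaching into
 * submission_state directly. */
static inline bool xe_guc_submit_initialized(const struct xe_guc *guc)
{
	return guc->submission_state.initialized;
}

/* Usage check: the guard in guc_fini_hw() would become
 * "if (xe_guc_submit_initialized(guc)) { ... }". */
static void usage_example(void)
{
	struct xe_guc guc = { .submission_state = { .initialized = true } };

	assert(xe_guc_submit_initialized(&guc));
}
```

Keeping the read behind an accessor means xe_guc.c no longer depends on the
layout of the submission state, which appears to be the point of the comment.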

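The other review requests concern the err_out paths: call xe_uc_reset_prepare
instead of setting the stopped flag by hand, and use xe_uc_sanitize rather
than xe_guc_sanitize. The sketch below shows one possible resulting call
order. It is an assumption about how the next revision could look, not the
actual follow-up patch; the three functions are stubs that only record their
names, and their void signatures are simplified stand-ins for the real driver
functions.

```c
#include <assert.h>
#include <string.h>

#define MAX_CALLS 8

/* Records the order in which the stubbed driver calls are made. */
static const char *call_log[MAX_CALLS];
static int call_count;

static void record(const char *name)
{
	if (call_count < MAX_CALLS)
		call_log[call_count++] = name;
}

/* Empty stand-in for the driver's struct xe_uc. */
struct xe_uc {
	int unused;
};

/* Stubs for the functions named in the thread (signatures assumed). */
static void xe_uc_reset_prepare(struct xe_uc *uc) { (void)uc; record("xe_uc_reset_prepare"); }
static void xe_uc_stop(struct xe_uc *uc) { (void)uc; record("xe_uc_stop"); }
static void xe_uc_sanitize(struct xe_uc *uc) { (void)uc; record("xe_uc_sanitize"); }

/* Hypothetical err_out body after the review comments are applied:
 * prepare for reset, stop the uC, then sanitize, and propagate the
 * original error code unchanged. */
static int uc_load_hw_error_path(struct xe_uc *uc, int ret)
{
	xe_uc_reset_prepare(uc);
	xe_uc_stop(uc);
	xe_uc_sanitize(uc);
	return ret;
}
```

The same three-step sequence would apply to both vf_uc_load_hw() and
xe_uc_load_hw() if the suggestions are taken as written.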

Thread overview: 7+ messages
2025-11-28 21:34 [PATCH v9] drm/xe/uc: Add stop on hardware initialization error Zhanjun Dong
2025-11-28 21:41 ` ✓ CI.KUnit: success for drm/xe/uc: Add stop on hardware initialization error (rev9) Patchwork
2025-11-28 22:43 ` ✓ Xe.CI.BAT: " Patchwork
2025-11-28 23:46 ` ✗ Xe.CI.Full: failure " Patchwork
2025-12-02 15:13   ` Dong, Zhanjun
2025-12-05  4:09 ` [PATCH v9] drm/xe/uc: Add stop on hardware initialization error Matthew Brost
2025-12-05 19:58   ` Dong, Zhanjun [this message]
