* [PATCH v5] drm/xe/uc: Add stop on hardware initialization error
@ 2025-11-18 17:01 Zhanjun Dong
2025-11-18 18:41 ` Dong, Zhanjun
2025-11-18 18:52 ` ✗ CI.KUnit: failure for drm/xe/uc: Add stop on hardware initialization error (rev4) Patchwork
0 siblings, 2 replies; 3+ messages in thread
From: Zhanjun Dong @ 2025-11-18 17:01 UTC (permalink / raw)
To: intel-xe; +Cc: daniele.ceraolospurio, matthew.brost, stuart.summers,
Zhanjun Dong
On hardware init failure, the hardware might no longer respond, so add a GuC
stop to clean up exec_queue items.
In the driver unload path, add a call to GuC stop to clean up queue items. This
cleanup fixes memory leaks such as:
[ 189.997904] [drm:drm_mm_takedown] *ERROR* node [00f0f000 + 00007000]: inserted at
drm_mm_insert_node_in_range+0x2c0/0x510
__xe_ggtt_insert_bo_at+0x167/0x540 [xe]
xe_ggtt_insert_bo+0x1a/0x30 [xe]
__xe_bo_create_locked+0x1f3/0x930 [xe]
xe_bo_create_pin_map_at_aligned+0x59/0x1f0 [xe]
xe_bo_create_pin_map_at_novm+0xae/0x140 [xe]
xe_bo_create_pin_map_novm+0x23/0x40 [xe]
xe_lrc_create+0x1e4/0x17c0 [xe]
xe_exec_queue_create+0x38a/0x6a0 [xe]
xe_gt_record_default_lrcs+0x117/0x8b0 [xe]
xe_uc_load_hw+0xa2/0x290 [xe]
xe_gt_init+0x357/0xab0 [xe]
xe_device_probe+0x403/0xa30 [xe]
xe_pci_probe+0x39a/0x610 [xe]
local_pci_probe+0x47/0xb0
pci_device_probe+0xf3/0x260
really_probe+0xf1/0x3b0
__driver_probe_device+0x8c/0x180
device_driver_attach+0x57/0xd0
bind_store+0x77/0xd0
drv_attr_store+0x24/0x50
sysfs_kf_write+0x4d/0x80
kernfs_fop_write_iter+0x188/0x240
vfs_write+0x280/0x540
ksys_write+0x6f/0xf0
__x64_sys_write+0x19/0x30
x64_sys_call+0x2171/0x25a0
do_syscall_64+0x93/0xb80
entry_SYSCALL_64_after_hwframe+0x7
and:
[ 189.973775] xe 0000:00:02.0: [drm] *ERROR* Tile0: GT1: GUC ID manager unclean (1/65535)
[ 189.981731] xe 0000:00:02.0: [drm] Tile0: GT1: total 65535
[ 189.981733] xe 0000:00:02.0: [drm] Tile0: GT1: used 1
[ 189.981734] xe 0000:00:02.0: [drm] Tile0: GT1: range 2..2 (1)
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5466
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5530
Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
---
v5: Move stop flag set in guc_fini_hw
Change to uc_sanitize in uc init path
v4: Add memory leak fix
Switch to xe_uc_stop
v3: Switch to xe_guc_stop
v2: Switch to xe_guc_ct_stop
---
drivers/gpu/drm/xe/xe_guc.c | 3 +++
drivers/gpu/drm/xe/xe_uc.c | 6 ++++--
2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
index a686b04879d6..48b0aece5020 100644
--- a/drivers/gpu/drm/xe/xe_guc.c
+++ b/drivers/gpu/drm/xe/xe_guc.c
@@ -662,6 +662,9 @@ static void guc_fini_hw(void *arg)
struct xe_gt *gt = guc_to_gt(guc);
unsigned int fw_ref;
+ /* Set stop flag, even submission not initialized */
+ atomic_fetch_or(1, &guc->submission_state.stopped);
+ xe_guc_stop(guc);
fw_ref = xe_force_wake_get(gt_to_fw(gt), XE_FORCEWAKE_ALL);
xe_uc_sanitize_reset(&guc_to_gt(guc)->uc);
xe_force_wake_put(gt_to_fw(gt), fw_ref);
diff --git a/drivers/gpu/drm/xe/xe_uc.c b/drivers/gpu/drm/xe/xe_uc.c
index 465bda355443..fa9a9649afe3 100644
--- a/drivers/gpu/drm/xe/xe_uc.c
+++ b/drivers/gpu/drm/xe/xe_uc.c
@@ -173,7 +173,8 @@ static int vf_uc_load_hw(struct xe_uc *uc)
return 0;
err_out:
- xe_guc_sanitize(&uc->guc);
+ xe_uc_stop(uc);
+ xe_uc_sanitize(&uc->guc);
return err;
}
@@ -228,7 +229,8 @@ int xe_uc_load_hw(struct xe_uc *uc)
return 0;
err_out:
- xe_guc_sanitize(&uc->guc);
+ xe_uc_stop(uc);
+ xe_uc_sanitize(uc);
return ret;
}
--
2.34.1
* Re: [PATCH v5] drm/xe/uc: Add stop on hardware initialization error
2025-11-18 17:01 [PATCH v5] drm/xe/uc: Add stop on hardware initialization error Zhanjun Dong
@ 2025-11-18 18:41 ` Dong, Zhanjun
2025-11-18 18:52 ` ✗ CI.KUnit: failure for drm/xe/uc: Add stop on hardware initialization error (rev4) Patchwork
1 sibling, 0 replies; 3+ messages in thread
From: Dong, Zhanjun @ 2025-11-18 18:41 UTC (permalink / raw)
To: intel-xe; +Cc: daniele.ceraolospurio, matthew.brost, stuart.summers
Hi Matthew,
Regarding calling xe_guc_submit_pause_abort in guc_submit_fini before
wait_event_timeout: testing shows that the queues allocated in the
xe_gt_record_default_lrcs call trace (shown in this commit message)
have flags that do not match what xe_guc_submit_pause_abort is designed for
(exec_queue_killed_or_banned_or_wedged).
Expanding the condition might have unexpected side effects, and I want to limit
the change to the GuC scope, so I posted a new revision that adds guc_stop in
guc_fini_hw. Please take a look.
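A minimal user-space sketch of the stop-flag idea, assuming a simplified model: the names mock_submission_state, mock_mark_stopped and mock_fini are hypothetical and are not part of the xe driver. It only illustrates setting a "stopped" flag unconditionally before releasing leftover queues. Note that C11 atomic_fetch_or takes (object, value), while the kernel helper used in the patch takes (value, pointer).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's submission state; not the real xe struct. */
struct mock_submission_state {
	atomic_int stopped;	/* non-zero once submission has been stopped */
	int live_queues;	/* queues that still hold resources */
};

/* Set the stop flag. Returns true only if this call made the transition. */
static bool mock_mark_stopped(struct mock_submission_state *s)
{
	/* atomic_fetch_or returns the previous value, so 0 means "was running". */
	return atomic_fetch_or(&s->stopped, 1) == 0;
}

/* Teardown: set the flag first, then release whatever queues remain. */
int mock_fini(struct mock_submission_state *s)
{
	int released;

	assert(s);
	mock_mark_stopped(s);	/* harmless if the flag was already set */
	released = s->live_queues;
	s->live_queues = 0;
	return released;	/* number of queues cleaned up */
}
```

Because the flag is set with an OR, repeating the call is harmless, which is why the sketch can set it in teardown without first checking whether submission was ever started.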
Regards,
Zhanjun Dong
On 2025-11-18 12:01 p.m., Zhanjun Dong wrote:
> On hardware init failure, the hardware might no longer respond, so add a GuC
> stop to clean up exec_queue items.
> In the driver unload path, add a call to GuC stop to clean up queue items. This
> cleanup fixes memory leaks such as:
> [ 189.997904] [drm:drm_mm_takedown] *ERROR* node [00f0f000 + 00007000]: inserted at
> drm_mm_insert_node_in_range+0x2c0/0x510
> __xe_ggtt_insert_bo_at+0x167/0x540 [xe]
> xe_ggtt_insert_bo+0x1a/0x30 [xe]
> __xe_bo_create_locked+0x1f3/0x930 [xe]
> xe_bo_create_pin_map_at_aligned+0x59/0x1f0 [xe]
> xe_bo_create_pin_map_at_novm+0xae/0x140 [xe]
> xe_bo_create_pin_map_novm+0x23/0x40 [xe]
> xe_lrc_create+0x1e4/0x17c0 [xe]
> xe_exec_queue_create+0x38a/0x6a0 [xe]
> xe_gt_record_default_lrcs+0x117/0x8b0 [xe]
> xe_uc_load_hw+0xa2/0x290 [xe]
> xe_gt_init+0x357/0xab0 [xe]
> xe_device_probe+0x403/0xa30 [xe]
> xe_pci_probe+0x39a/0x610 [xe]
> local_pci_probe+0x47/0xb0
> pci_device_probe+0xf3/0x260
> really_probe+0xf1/0x3b0
> __driver_probe_device+0x8c/0x180
> device_driver_attach+0x57/0xd0
> bind_store+0x77/0xd0
> drv_attr_store+0x24/0x50
> sysfs_kf_write+0x4d/0x80
> kernfs_fop_write_iter+0x188/0x240
> vfs_write+0x280/0x540
> ksys_write+0x6f/0xf0
> __x64_sys_write+0x19/0x30
> x64_sys_call+0x2171/0x25a0
> do_syscall_64+0x93/0xb80
> entry_SYSCALL_64_after_hwframe+0x7
> and:
> [ 189.973775] xe 0000:00:02.0: [drm] *ERROR* Tile0: GT1: GUC ID manager unclean (1/65535)
> [ 189.981731] xe 0000:00:02.0: [drm] Tile0: GT1: total 65535
> [ 189.981733] xe 0000:00:02.0: [drm] Tile0: GT1: used 1
> [ 189.981734] xe 0000:00:02.0: [drm] Tile0: GT1: range 2..2 (1)
>
> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5466
> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5530
> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
> ---
> v5: Move stop flag set in guc_fini_hw
> Change to uc_sanitize in uc init path
> v4: Add memory leak fix
> Switch to xe_uc_stop
> v3: Switch to xe_guc_stop
> v2: Switch to xe_guc_ct_stop
> ---
> drivers/gpu/drm/xe/xe_guc.c | 3 +++
> drivers/gpu/drm/xe/xe_uc.c | 6 ++++--
> 2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> index a686b04879d6..48b0aece5020 100644
> --- a/drivers/gpu/drm/xe/xe_guc.c
> +++ b/drivers/gpu/drm/xe/xe_guc.c
> @@ -662,6 +662,9 @@ static void guc_fini_hw(void *arg)
> struct xe_gt *gt = guc_to_gt(guc);
> unsigned int fw_ref;
>
> + /* Set stop flag, even submission not initialized */
> + atomic_fetch_or(1, &guc->submission_state.stopped);
> + xe_guc_stop(guc);
> fw_ref = xe_force_wake_get(gt_to_fw(gt), XE_FORCEWAKE_ALL);
> xe_uc_sanitize_reset(&guc_to_gt(guc)->uc);
> xe_force_wake_put(gt_to_fw(gt), fw_ref);
> diff --git a/drivers/gpu/drm/xe/xe_uc.c b/drivers/gpu/drm/xe/xe_uc.c
> index 465bda355443..fa9a9649afe3 100644
> --- a/drivers/gpu/drm/xe/xe_uc.c
> +++ b/drivers/gpu/drm/xe/xe_uc.c
> @@ -173,7 +173,8 @@ static int vf_uc_load_hw(struct xe_uc *uc)
> return 0;
>
> err_out:
> - xe_guc_sanitize(&uc->guc);
> + xe_uc_stop(uc);
> + xe_uc_sanitize(&uc->guc);
> return err;
> }
>
> @@ -228,7 +229,8 @@ int xe_uc_load_hw(struct xe_uc *uc)
> return 0;
>
> err_out:
> - xe_guc_sanitize(&uc->guc);
> + xe_uc_stop(uc);
> + xe_uc_sanitize(uc);
> return ret;
> }
>
* ✗ CI.KUnit: failure for drm/xe/uc: Add stop on hardware initialization error (rev4)
2025-11-18 17:01 [PATCH v5] drm/xe/uc: Add stop on hardware initialization error Zhanjun Dong
2025-11-18 18:41 ` Dong, Zhanjun
@ 2025-11-18 18:52 ` Patchwork
1 sibling, 0 replies; 3+ messages in thread
From: Patchwork @ 2025-11-18 18:52 UTC (permalink / raw)
To: Dong, Zhanjun; +Cc: intel-xe
== Series Details ==
Series: drm/xe/uc: Add stop on hardware initialization error (rev4)
URL : https://patchwork.freedesktop.org/series/152100/
State : failure
== Summary ==
+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
ERROR:root:../drivers/gpu/drm/xe/xe_uc.c: In function ‘vf_uc_load_hw’:
../drivers/gpu/drm/xe/xe_uc.c:177:24: error: passing argument 1 of ‘xe_uc_sanitize’ from incompatible pointer type [-Werror=incompatible-pointer-types]
177 | xe_uc_sanitize(&uc->guc);
| ^~~~~~~~
| |
| struct xe_guc *
../drivers/gpu/drm/xe/xe_uc.c:134:42: note: expected ‘struct xe_uc *’ but argument is of type ‘struct xe_guc *’
134 | static void xe_uc_sanitize(struct xe_uc *uc)
| ~~~~~~~~~~~~~~^~
cc1: some warnings being treated as errors
make[7]: *** [../scripts/Makefile.build:287: drivers/gpu/drm/xe/xe_uc.o] Error 1
make[7]: *** Waiting for unfinished jobs....
make[6]: *** [../scripts/Makefile.build:556: drivers/gpu/drm/xe] Error 2
make[5]: *** [../scripts/Makefile.build:556: drivers/gpu/drm] Error 2
make[4]: *** [../scripts/Makefile.build:556: drivers/gpu] Error 2
make[3]: *** [../scripts/Makefile.build:556: drivers] Error 2
make[2]: *** [/kernel/Makefile:2010: .] Error 2
make[1]: *** [/kernel/Makefile:248: __sub-make] Error 2
make: *** [Makefile:248: __sub-make] Error 2
[18:52:09] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[18:52:14] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=25
+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel
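
The build error above comes down to passing the inner GuC pointer to a helper that expects the outer uC container. A minimal stand-alone sketch, assuming simplified mock types (mock_uc, mock_guc and the mock_* helpers are hypothetical, and the delegation inside mock_uc_sanitize is an assumption rather than the driver's actual body), shows the call shape that matches the signature reported by the compiler:

```c
#include <assert.h>

/* Hypothetical, simplified mock-ups; the real xe structs have many more fields. */
struct mock_guc { int sanitized; };
struct mock_uc { struct mock_guc guc; int sanitized; };

/* GuC-level helper: takes the inner object. */
static void mock_guc_sanitize(struct mock_guc *guc)
{
	guc->sanitized = 1;
}

/* uC-level helper: takes the outer container (delegation is assumed here). */
static void mock_uc_sanitize(struct mock_uc *uc)
{
	mock_guc_sanitize(&uc->guc);
	uc->sanitized = 1;
}

/* Error path modelled on the err_out label: sanitize, then return the error. */
int mock_error_path(struct mock_uc *uc, int err)
{
	assert(uc);
	/*
	 * mock_uc_sanitize(&uc->guc) would trigger the same
	 * incompatible-pointer-type diagnostic as in the log;
	 * the container pointer itself is what the helper expects.
	 */
	mock_uc_sanitize(uc);
	return err;
}
```

In this model the vf_uc_load_hw hunk would pass uc rather than &uc->guc, which is the form the xe_uc_load_hw hunk of the same patch already uses.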