* drm: xe: Kernel-submitted job timed out
@ 2026-05-22 18:52 Linus Torvalds
2026-05-22 18:55 ` Maarten Lankhorst
0 siblings, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2026-05-22 18:52 UTC (permalink / raw)
To: Matthew Brost, Thomas Hellström, Rodrigo Vivi
Cc: David Airlie, Simona Vetter, intel-xe, dri-devel
[-- Attachment #1: Type: text/plain, Size: 690 bytes --]
Actually, this doesn't seem to have actually timed out, it seems to
have never been started, and then subsequent operations were confused.
Because I had to reboot my desktop as non-responsive (the cursor was
moving, but no screen updates) after two lines of
xe 0000:4b:00.0: [drm] Tile0: GT0: Check job timeout: seqno=4485322,
lrc_seqno=4485322, guc_id=0, not started
followed a few seconds later by some Xe fault and then an endless
stream of "Kernel-submitted job timed out" reports.
Presumably that job was the thing that was never started in the first place.
Cut-down dmesg with the endless repeats deleted (after rebooting to
get a working system) attached.
Linus
[-- Attachment #2: out --]
[-- Type: application/octet-stream, Size: 5224 bytes --]
May 22 11:09:11 3970x kernel: xe 0000:4b:00.0: [drm] Tile0: GT0: Check job timeout: seqno=4485322, lrc_seqno=4485322, guc_id=0, not started
May 22 11:09:16 3970x kernel: xe 0000:4b:00.0: [drm] Tile0: GT0: Check job timeout: seqno=4485322, lrc_seqno=4485322, guc_id=0, not started
May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Tile0: GT0:
ASID: 0
Faulted Address: 0x00000002fa9fa000
FaultType: 0
AccessType: 0
FaultLevel: 2
EngineClass: 3 bcs
EngineInstance: 8
May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Tile0: GT0: Fault response: Unsuccessful -EINVAL
May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Tile0: GT0: Engine memory CAT error [18]: class=bcs, logical_mask: 0x2, guc_id=0
May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Tile0: GT0: Engine reset: engine_class=bcs, logical_mask: 0x2, guc_id=0, state=0x249
May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Tile0: GT0: Timedout job: seqno=4485322, lrc_seqno=4485322, guc_id=0, flags=0x73 in no process [-1]
May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Xe device coredump has been created
May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Check your /sys/class/drm/card0/device/devcoredump/data
May 22 11:09:19 3970x kernel: ------------[ cut here ]------------
May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Tile0: GT0: Kernel-submitted job timed out
May 22 11:09:19 3970x kernel: WARNING: drivers/gpu/drm/xe/xe_guc_submit.c:1627 at guc_exec_queue_timedout_job+0xe29/0x1000 [xe], CPU#17: kworker/u256:0/2306935
May 22 11:09:19 3970x kernel: Modules linked in: uas usb_storage uinput rfcomm nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables sunrpc bnep vfat fat iwlmvm mac80211 libarc4 snd_hda_codec_intelhdmi snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_usb_audio snd_hda_core btusb snd_hwdep btrtl amd_atl snd_usbmidi_lib iwlwifi snd_seq btintel amd64_edac snd_rawmidi btbcm bluetooth snd_pcm edac_mce_amd snd_seq_device wmi_bmof atlantic cfg80211 pcspkr igb mc macsec dca snd_timer rfkill mxm_wmi snd i2c_piix4 soundcore i2c_smbus k10temp joydev nfnetlink zram dm_crypt xe drm_ttm_helper ttm i2c_algo_bit gpu_sched drm_buddy video drm_client_lib drm_suballoc_helper drm_gpuvm drm_exec drm_gpusvm_helper drm_display_helper drm_kms_helper ccp drm cec nvme sp5100_tco nvme_core wmi i2c_dev fuse
May 22 11:09:19 3970x kernel: CPU: 17 UID: 0 PID: 2306935 Comm: kworker/u256:0 Not tainted 7.1.0-rc3-00073-ga6920214ba75 #46 PREEMPTLAZY
May 22 11:09:19 3970x kernel: Hardware name: Gigabyte Technology Co., Ltd. TRX40 AORUS MASTER/TRX40 AORUS MASTER, BIOS F7 09/07/2022
May 22 11:09:19 3970x kernel: Workqueue: gt-ordered-wq drm_sched_job_timedout [gpu_sched]
May 22 11:09:19 3970x kernel: RIP: 0010:guc_exec_queue_timedout_job+0xf3b/0x1000 [xe]
May 22 11:09:19 3970x kernel: Code: 8b 11 48 85 d2 74 06 48 8b 7a 08 eb 02 31 ff 48 8b 57 50 48 85 d2 75 03 48 8b 17 44 0f b6 46 26 0f b6 49 08 4c 89 f7 48 89 c6 <67> 48 0f b9 3a 48 8b 43 60 4c 8b 7c 24 20 44 8b 74 24 28 49 89 dc
May 22 11:09:19 3970x kernel: RSP: 0018:ffffd0e0aa6c7d88 EFLAGS: 00010246
May 22 11:09:19 3970x kernel: RAX: ffffffffc0c3b7cd RBX: ffff8a4020c46400 RCX: 0000000000000000
May 22 11:09:19 3970x kernel: RDX: ffff8a4004d369c0 RSI: ffffffffc0c3b7cd RDI: ffffffffc091cd30
May 22 11:09:19 3970x kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: ffff8a4f3f0fb240
May 22 11:09:19 3970x kernel: R10: 000000000000bffd R11: 3fffffffffffbfff R12: ffff8a4020c46400
May 22 11:09:19 3970x kernel: R13: ffff8a401c558e00 R14: ffffffffc091cd30 R15: ffff8a40166c0000
May 22 11:09:19 3970x kernel: FS: 0000000000000000(0000) GS:ffff8a4f4e2ed000(0000) knlGS:0000000000000000
May 22 11:09:19 3970x kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 22 11:09:19 3970x kernel: CR2: 0000243c08ff0000 CR3: 00000001994d5000 CR4: 0000000000350ef0
May 22 11:09:19 3970x kernel: Call Trace:
May 22 11:09:19 3970x kernel: <TASK>
May 22 11:09:19 3970x kernel: ? wake_bit_function+0x60/0x60
May 22 11:09:19 3970x kernel: drm_sched_job_timedout+0xb8/0x130 [gpu_sched]
May 22 11:09:19 3970x kernel: process_scheduled_works+0x1ac/0x380
May 22 11:09:19 3970x kernel: worker_thread+0x1f4/0x2d0
May 22 11:09:19 3970x kernel: ? pr_cont_work+0x1b0/0x1b0
May 22 11:09:19 3970x kernel: kthread+0xee/0x120
May 22 11:09:19 3970x kernel: ? kthread_blkcg+0x30/0x30
May 22 11:09:19 3970x kernel: ret_from_fork+0x9d/0x200
May 22 11:09:19 3970x kernel: ? kthread_blkcg+0x30/0x30
May 22 11:09:19 3970x kernel: ret_from_fork_asm+0x11/0x20
May 22 11:09:19 3970x kernel: </TASK>
May 22 11:09:19 3970x kernel: ---[ end trace 0000000000000000 ]---
May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Tile0: GT0: Timedout job: seqno=4485325, lrc_seqno=4485325, guc_id=0, flags=0x73 in no process [-1]
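The guess that the timed-out job is the same one that was never started can be checked mechanically by matching seqno values across the two kinds of message. The following is a minimal sketch in Python, assuming only the two message shapes visible in the excerpt above; the regular expressions and the function name are illustrative and do not come from the xe driver.

```python
import re

# Patterns modeled only on the two message shapes seen in the excerpt above.
NOT_STARTED_RE = re.compile(
    r"Check job timeout: seqno=(\d+), lrc_seqno=\d+, guc_id=\d+, not started"
)
TIMEDOUT_RE = re.compile(
    r"Timedout job: seqno=(\d+), lrc_seqno=\d+, guc_id=\d+, flags=0x[0-9a-f]+"
)


def never_started_timeouts(lines):
    """Return seqnos of timed-out jobs that were earlier logged as 'not started'."""
    not_started = set()
    matches = []
    for line in lines:
        m = NOT_STARTED_RE.search(line)
        if m:
            not_started.add(int(m.group(1)))
            continue
        m = TIMEDOUT_RE.search(line)
        if m and int(m.group(1)) in not_started:
            matches.append(int(m.group(1)))
    return matches


# Three lines copied from the attachment (journal prefix dropped).
SAMPLE = [
    "xe 0000:4b:00.0: [drm] Tile0: GT0: Check job timeout: seqno=4485322, "
    "lrc_seqno=4485322, guc_id=0, not started",
    "xe 0000:4b:00.0: [drm] Tile0: GT0: Timedout job: seqno=4485322, "
    "lrc_seqno=4485322, guc_id=0, flags=0x73 in no process [-1]",
    "xe 0000:4b:00.0: [drm] Tile0: GT0: Timedout job: seqno=4485325, "
    "lrc_seqno=4485325, guc_id=0, flags=0x73 in no process [-1]",
]

if __name__ == "__main__":
    print(never_started_timeouts(SAMPLE))
```

On these three lines only seqno 4485322 appears in both message types. The later 4485325 timeout has no matching "not started" line in the cut-down log.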
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: drm: xe: Kernel-submitted job timed out
2026-05-22 18:52 drm: xe: Kernel-submitted job timed out Linus Torvalds
@ 2026-05-22 18:55 ` Maarten Lankhorst
2026-05-22 19:05 ` Linus Torvalds
0 siblings, 1 reply; 9+ messages in thread
From: Maarten Lankhorst @ 2026-05-22 18:55 UTC (permalink / raw)
To: Linus Torvalds, Matthew Brost, Thomas Hellström,
Rodrigo Vivi
Cc: David Airlie, Simona Vetter, intel-xe, dri-devel
Hey,
Den 2026-05-22 kl. 20:52, skrev Linus Torvalds:
> Actually, this doesn't seem to have actually timed out, it seems to
> have never been started, and then subsequent operations were confused.
>
> Because I had to reboot my desktop as non-responsive (the cursor was
> moving, but no screen updates) after two lines of
>
> xe 0000:4b:00.0: [drm] Tile0: GT0: Check job timeout: seqno=4485322,
> lrc_seqno=4485322, guc_id=0, not started
>
> followed a few seconds later by some Xe fault and then an endless
> stream of "Kernel-submitted job timed out" reports.
>
> Presumably that job was the thing that was never started in the first place.
>
> Cut-down dmesg with the endless repeats deleted (after rebooting to
> get a working system) attached.
>
> Linus
There's a
May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Tile0: GT0: Timedout job: seqno=4485322, lrc_seqno=4485322, guc_id=0, flags=0x73 in no process [-1]
May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Xe device coredump has been created
May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Check your /sys/class/drm/card0/device/devcoredump/data
Do you have this coredump too?
Kind regards,
~Maarten Lankhorst
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: drm: xe: Kernel-submitted job timed out
2026-05-22 18:55 ` Maarten Lankhorst
@ 2026-05-22 19:05 ` Linus Torvalds
2026-05-22 20:44 ` Rodrigo Vivi
0 siblings, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2026-05-22 19:05 UTC (permalink / raw)
To: Maarten Lankhorst
Cc: Matthew Brost, Thomas Hellström, Rodrigo Vivi, David Airlie,
Simona Vetter, intel-xe, dri-devel
On Fri, 22 May 2026 at 11:55, Maarten Lankhorst <dev@lankhorst.se> wrote:
>
> There's a
> May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Tile0: GT0: Timedout job: seqno=4485322, lrc_seqno=4485322, guc_id=0, flags=0x73 in no process [-1]
> May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Xe device coredump has been created
> May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Check your /sys/class/drm/card0/device/devcoredump/data
>
> Do you have this coredump too?
Nope. I was assuming it didn't survive the reboot.
(This machine doesn't allow any remote logins - very much on purpose -
so when the GPU hangs, it's toast).
Linus
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: drm: xe: Kernel-submitted job timed out
2026-05-22 19:05 ` Linus Torvalds
@ 2026-05-22 20:44 ` Rodrigo Vivi
2026-05-22 20:54 ` Linus Torvalds
0 siblings, 1 reply; 9+ messages in thread
From: Rodrigo Vivi @ 2026-05-22 20:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Maarten Lankhorst, Matthew Brost, Thomas Hellström,
David Airlie, Simona Vetter, intel-xe, dri-devel
On Fri, May 22, 2026 at 12:05:35PM -0700, Linus Torvalds wrote:
> On Fri, 22 May 2026 at 11:55, Maarten Lankhorst <dev@lankhorst.se> wrote:
> >
> > There's a
> > May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Tile0: GT0: Timedout job: seqno=4485322, lrc_seqno=4485322, guc_id=0, flags=0x73 in no process [-1]
> > May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Xe device coredump has been created
> > May 22 11:09:19 3970x kernel: xe 0000:4b:00.0: [drm] Check your /sys/class/drm/card0/device/devcoredump/data
> >
> > Do you have this coredump too?
>
> Nope. I was assuming it didn't survive the reboot.
It doesn't. In this kind of setup the best way to deal with devcoredump
is to create a udev rule that copies the data file to a persistent place.
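One way to sketch that idea is a small helper that a udev rule could run when a devcoredump appears. Everything here is an assumption for illustration: the helper name, the destination directory, and the udev rule shown in the comment are hypothetical and untested. Only the /sys/class/drm/card0/device/devcoredump/data path comes from the log above.

```python
import shutil
import sys
import time
from pathlib import Path

# Hypothetical udev rule that might invoke this helper (an untested assumption,
# adjust the match keys and paths for the actual system):
#   SUBSYSTEM=="devcoredump", ACTION=="add", \
#     RUN+="/usr/local/bin/save-devcoredump.py /sys%p/data /var/lib/devcoredump"


def save_devcoredump(data_file, dest_dir):
    """Copy a devcoredump data file into dest_dir under a timestamped name."""
    src = Path(data_file)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / f"devcoredump-{time.strftime('%Y%m%d-%H%M%S')}"
    # Stream the contents instead of trusting the reported file size,
    # since sysfs attributes do not always report a meaningful size.
    with src.open("rb") as fin, target.open("wb") as fout:
        shutil.copyfileobj(fin, fout)
    return target


if __name__ == "__main__" and len(sys.argv) == 3:
    print(save_devcoredump(sys.argv[1], sys.argv[2]))
```

Run by hand, this would be something like `save-devcoredump.py /sys/class/drm/card0/device/devcoredump/data /var/lib/devcoredump`, where the destination directory is again just an example.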
>
> (This machine doesn't allow any remote logins - very much on purpose -
> so when the GPU hangs, it's toast).
Is there any journal that saved the kernel log buffer from previous boots?
Preferably with some drm.debug flags enabled, likely 0xf.
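For reference, drm.debug is a bitmask, and 0xf turns on the four lowest categories. The sketch below decodes a mask in Python. The bit names follow the drm.debug module parameter description as far as I know it; higher bits exist but are left out here, and the helper name is made up for illustration.

```python
# Lowest four drm.debug category bits; higher bits are intentionally omitted.
DRM_DEBUG_BITS = {
    0x01: "core",
    0x02: "driver",
    0x04: "kms",
    0x08: "prime",
}


def decode_drm_debug(mask):
    """Return the names of the known low category bits set in mask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


if __name__ == "__main__":
    print(decode_drm_debug(0xF))
```

Setting it on the kernel command line would be `drm.debug=0xf`. The extra logging is verbose, so it is mostly useful when the journal keeps logs from earlier boots.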
Also:
Any bisect possible in this setup? I imagine it might be painful though...
What was the last drm-fixes pull you got in this 7.1.0-rc3-00073-ga6920214ba75 ?
I believe the quickest path might be to simply drop the xe fixes you might
have recently gotten there until we identify the culprit.
Thanks,
Rodrigo.
>
> Linus
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: drm: xe: Kernel-submitted job timed out
2026-05-22 20:44 ` Rodrigo Vivi
@ 2026-05-22 20:54 ` Linus Torvalds
2026-05-23 8:29 ` Maarten Lankhorst
0 siblings, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2026-05-22 20:54 UTC (permalink / raw)
To: Rodrigo Vivi
Cc: Maarten Lankhorst, Matthew Brost, Thomas Hellström,
David Airlie, Simona Vetter, intel-xe, dri-devel
On Fri, 22 May 2026 at 13:44, Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
>
> Any journal saving the kernel buf log of previous boots? Preferably with
> some drm.debug flags enabled 0xf likely
Note that this is very much not repeatable. I have no idea what
triggered it, and I don't think it was necessarily brought on by
anything recent.
I've seen timeouts before, but looking at my logs, the last time it
caused a complete hang was Feb 3. So a few months ago...
> What was the last drm-fixes pull you got in this 7.1.0-rc3-00073-ga6920214ba75 ?
That's just mainline commit v7.1-rc3-71-g31e62c2ebbfd with two random
small patches on top that change some build flags (this is my "built
by clang" tree)
So the last drm merge would have been 51d24842acb9 Merge tag
'drm-fixes-2026-05-08-1' of https://gitlab.freedesktop.org/drm/kernel
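The two version strings line up arithmetically: the commit count after the tag is 73 in the running kernel and 71 in the mainline describe string, which leaves the two local patches. A minimal Python sketch of that check, assuming the usual `<tag>-<count>-g<hash>` suffix from `git describe` (the helper name is illustrative):

```python
import re

# Matches the "-<count>-g<abbrev-hash>" suffix that git describe appends.
DESCRIBE_RE = re.compile(r"-(\d+)-g([0-9a-f]+)$")


def commits_since_tag(version):
    """Return (commit count since the tag, abbreviated hash) from a describe-style string."""
    m = DESCRIBE_RE.search(version)
    if m is None:
        raise ValueError(f"no describe suffix in {version!r}")
    return int(m.group(1)), m.group(2)


running = commits_since_tag("7.1.0-rc3-00073-ga6920214ba75")
mainline = commits_since_tag("v7.1-rc3-71-g31e62c2ebbfd")
extra_patches = running[0] - mainline[0]

if __name__ == "__main__":
    print(running, mainline, extra_patches)
```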
Linus
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: drm: xe: Kernel-submitted job timed out
2026-05-22 20:54 ` Linus Torvalds
@ 2026-05-23 8:29 ` Maarten Lankhorst
2026-05-23 14:48 ` Linus Torvalds
0 siblings, 1 reply; 9+ messages in thread
From: Maarten Lankhorst @ 2026-05-23 8:29 UTC (permalink / raw)
To: Linus Torvalds, Rodrigo Vivi
Cc: Matthew Brost, Thomas Hellström, David Airlie, Simona Vetter,
intel-xe, dri-devel
Hey,
Den 2026-05-22 kl. 22:54, skrev Linus Torvalds:
> On Fri, 22 May 2026 at 13:44, Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
>>
>> Any journal saving the kernel buf log of previous boots? Preferably with
>> some drm.debug flags enabled 0xf likely
>
> Note that this is very much not repeatable. I have no idea what
> triggered it, and I don't think it was necessarily brought on by
> anything recent.
>
> I've seen timeouts before, but looking at my logs, the last time it
> caused a complete hang was Feb 3. So a few months ago...
>
>> What was the last drm-fixes pull you got in this 7.1.0-rc3-00073-ga6920214ba75 ?
>
> That's just mainline commit v7.1-rc3-71-g31e62c2ebbfd with two random
> small patches on top that change some build flags (this is my "built
> by clang" tree)
>
> So the last drm merge would have been 51d24842acb9 Merge tag
> 'drm-fixes-2026-05-08-1' of https://gitlab.freedesktop.org/drm/kernel
>
> Linus
Just thinking that since the guc_id=0, the most likely culprit is in the
kernel migration code.
There are 3 places you'll most likely interact with it:
- Zeroing VRAM bo's on allocation
- On integrated, it may clear system memory bo's CCS data.
- Moving memory between system and VRAM.
I'm assuming you only have a discrete card, so it's either happening
on allocation or memory movement.
Since it's sporadic, it *might* be more likely the latter.
Does it happen more frequently when loading VRAM intensive programs?
Kind regards,
~Maarten Lankhorst
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: drm: xe: Kernel-submitted job timed out
2026-05-23 8:29 ` Maarten Lankhorst
@ 2026-05-23 14:48 ` Linus Torvalds
2026-06-09 16:30 ` Matthew Brost
0 siblings, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2026-05-23 14:48 UTC (permalink / raw)
To: Maarten Lankhorst
Cc: Rodrigo Vivi, Matthew Brost, Thomas Hellström, David Airlie,
Simona Vetter, intel-xe, dri-devel
On Sat, May 23, 2026 at 1:29 AM Maarten Lankhorst <dev@lankhorst.se> wrote:
>
> Does it happen more frequently when loading VRAM intensive programs?
Well, "more frequently" is hard to say since it's happened twice, but
this time it certainly happened when launching a new program.
This time it was a markdown viewer.
I wouldn't expect that to be particularly VRAM-intensive, but hey,
since I run with two 6k monitors, I suspect *anything* with big
windows will chew up a few hundred megs of VRAM just for the frame
buffer side.
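A rough back-of-the-envelope version of that estimate, as a Python sketch. All the numbers are assumptions for illustration, not taken from this machine: a 6016x3384 "6K" mode, 4 bytes per pixel, and three buffers for one full-screen surface per monitor.

```python
def framebuffer_bytes(width, height, bytes_per_pixel=4):
    """Size in bytes of one linear buffer at the given resolution."""
    return width * height * bytes_per_pixel


# Assumed values, not taken from the thread.
WIDTH, HEIGHT = 6016, 3384   # one common "6K" mode
MONITORS = 2                 # two monitors, as described above
BUFFERS_PER_SURFACE = 3      # assumed triple buffering of one full-screen surface

per_buffer = framebuffer_bytes(WIDTH, HEIGHT)
total = per_buffer * MONITORS * BUFFERS_PER_SURFACE

if __name__ == "__main__":
    print(f"per buffer: {per_buffer / 2**20:.1f} MiB")
    print(f"total:      {total / 2**20:.1f} MiB")
```

Under those assumptions a single full-screen buffer is already tens of MiB, and a few buffered surfaces across two monitors reach the "few hundred megs" range.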
It's a B50 Pro, so it's a discrete card with 16GB on card.
I have no memory of what it might have been back a few months ago. But
I would expect it to be all the usual stuff - ten terminals, a web
browser with a dozen tabs, and whatever gnome and wayland do, and then
the occasional random other thing.
Linus
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: drm: xe: Kernel-submitted job timed out
2026-05-23 14:48 ` Linus Torvalds
@ 2026-06-09 16:30 ` Matthew Brost
2026-06-11 13:46 ` Rodrigo Vivi
0 siblings, 1 reply; 9+ messages in thread
From: Matthew Brost @ 2026-06-09 16:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Maarten Lankhorst, Rodrigo Vivi, Thomas Hellström,
David Airlie, Simona Vetter, intel-xe, dri-devel
On Sat, May 23, 2026 at 07:48:49AM -0700, Linus Torvalds wrote:
> On Sat, May 23, 2026 at 1:29 AM Maarten Lankhorst <dev@lankhorst.se> wrote:
> >
> > Does it happen more frequently when loading VRAM intensive programs?
>
> Well, "more frequently" is hard to say since it's happened twice, but
> this time it certainly happened when launching a new program.
>
> This time it was a markdown viewer.
>
> I wouldn't expect that to be particularly VRAM-intensive, but hey,
> since I run with two 6k monitors, I suspect *anything* with big
> windows will chew up a few hundred megs of VRAM just for the frame
> buffer side.
>
> It's a B50 Pro, so it's a discrete card with 16GB on card.
>
> I have no memory of what it might have been back a few months ago. But
> I would expect it to be all the usual stuff - ten terminals, a web
> browser with a dozen tabs, and whatever gnome and wayland do, and then
> the occasional random other thing.
>
> Linus
I’ve also intermittently seen kernel job timeouts during my development
over the last several months. It truly seems random—on some Linux builds
it happens somewhat frequently when running internal tests, while on
others it disappears, only to show up again in a different build.
I’ve also seen cases where a kernel timed-out job loops indefinitely,
though I haven’t investigated fixing that part. However, Rodrigo just
posted a series that should at least address that issue, allowing us to
focus on root-causing why kernel jobs are timing out in the first place.
Matt
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: drm: xe: Kernel-submitted job timed out
2026-06-09 16:30 ` Matthew Brost
@ 2026-06-11 13:46 ` Rodrigo Vivi
0 siblings, 0 replies; 9+ messages in thread
From: Rodrigo Vivi @ 2026-06-11 13:46 UTC (permalink / raw)
To: Matthew Brost
Cc: Linus Torvalds, Maarten Lankhorst, Thomas Hellström,
David Airlie, Simona Vetter, intel-xe, dri-devel
On Tue, Jun 09, 2026 at 09:30:45AM -0700, Matthew Brost wrote:
> On Sat, May 23, 2026 at 07:48:49AM -0700, Linus Torvalds wrote:
> > On Sat, May 23, 2026 at 1:29 AM Maarten Lankhorst <dev@lankhorst.se> wrote:
> > >
> > > Does it happen more frequently when loading VRAM intensive programs?
> >
> > Well, "more frequently" is hard to say since it's happened twice, but
> > this time it certainly happened when launching a new program.
> >
> > This time it was a markdown viewer.
> >
> > I wouldn't expect that to be particularly VRAM-intensive, but hey,
> > since I run with two 6k monitors, I suspect *anything* with big
> > windows will chew up a few hundred megs of VRAM just for the frame
> > buffer side.
> >
> > It's a B50 Pro, so it's a discrete card with 16GB on card.
> >
> > I have no memory of what it might have been back a few months ago. But
> > I would expect it to be all the usual stuff - ten terminals, a web
> > browser with a dozen tabs, and whatever gnome and wayland do, and then
> > the occasional random other thing.
> >
> > Linus
>
> I’ve also intermittently seen kernel job timeouts during my development
> over the last several months. It truly seems random—on some Linux builds
> it happens somewhat frequently when running internal tests, while on
> others it disappears, only to show up again in a different build.
>
> I’ve also seen cases where a kernel timed-out job loops indefinitely,
> though I haven’t investigated fixing that part. However, Rodrigo just
> posted a series that should at least address that issue, allowing us to
> focus on root-causing why kernel jobs are timing out in the first place.
Yep. The commit "drm/xe: fix job timeout recovery for unstarted jobs and
kernel queues" [1], merged in drm-xe-next and on its way in this week's fixes
PR, won't solve what caused the initial GPU hang, but it should make the reset
more robust and keep the machine from freezing or locking up. So the next time
you get the hang, you should be able to keep using the machine and grab the
devcoredump so we can debug the hang itself.
[1] https://lore.kernel.org/all/20260610152548.404575-3-rodrigo.vivi@intel.com/
>
> Matt
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2026-06-11 13:46 UTC | newest]
Thread overview: 9+ messages
2026-05-22 18:52 drm: xe: Kernel-submitted job timed out Linus Torvalds
2026-05-22 18:55 ` Maarten Lankhorst
2026-05-22 19:05 ` Linus Torvalds
2026-05-22 20:44 ` Rodrigo Vivi
2026-05-22 20:54 ` Linus Torvalds
2026-05-23 8:29 ` Maarten Lankhorst
2026-05-23 14:48 ` Linus Torvalds
2026-06-09 16:30 ` Matthew Brost
2026-06-11 13:46 ` Rodrigo Vivi