* [Bug 112226] [HadesCanyon] GPU hangs don't anymore recover (although kernel still claims that they do)
@ 2019-11-07 13:53 bugzilla-daemon
2019-11-07 13:53 ` bugzilla-daemon
` (6 more replies)
0 siblings, 7 replies; 8+ messages in thread
From: bugzilla-daemon @ 2019-11-07 13:53 UTC (permalink / raw)
To: dri-devel
[-- Attachment #1.1: Type: text/plain, Size: 2291 bytes --]
https://bugs.freedesktop.org/show_bug.cgi?id=112226
Bug ID: 112226
Summary: [HadesCanyon] GPU hangs don't anymore recover
(although kernel still claims that they do)
Product: DRI
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: critical
Priority: not set
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: eero.t.tamminen@intel.com
Setup:
* HW: KBL HadesCanyon (i7-8809G with Radeon RX Vega M GH)
* OS: Ubuntu 18.04 with Unity desktop (compiz)
* SW: Git builds of drm-tip kernel, Mesa and X server
Issue:
* The AMD GPU driver stopped recovering from the KBL HadesCanyon GPU hangs
described in bug 108898. It still claims to recover from them:
-------------------------------------------------------
[ 1057.512690] Iteration 2/3: bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_manhattan
[ 1119.867403] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 1124.987449] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
-------------------------------------------------------
However, all 3D tests run after this error now fail.
This started to happen between the following (drm-tip) kernel commits:
* 2019-10-28 16:01:46: 912b87256c: drm-tip: 2019y-10m-28d-16h-00m-10s UTC integration manifest
* 2019-10-29 17:58:05: a2c9f8ce2a: drm-tip: 2019y-10m-29d-17h-57m-39s UTC integration manifest
And between the following Mesa commits:
* 2019-10-28 17:47:06: d298740a1c: iris: Disallow incomplete resource creation
* 2019-10-29 16:19:34: ff6e148a3d: freedreno/a6xx: add a618 support
Note:
* I do not see the same issue when using a few-months-old Mesa with the latest
drm-tip kernel, so some change in Mesa triggers this kernel issue.
* If the latest Mesa is used with the drm-tip 5.3 kernel, X fails to start 4
times out of 5. This started to happen with a Mesa version from within a couple
of days of the GPU hang recovery issue, so there are potentially more issues in
Mesa's (HadesCanyon) AMD support.
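A test harness could look for the "soft recovered" claim quoted in the log
excerpt above before trusting later results. The following is only a minimal
sketch: the function name and the sample string are hypothetical, and the only
kernel message it relies on is the one quoted in this report.

```python
import re

# Matches a kernel log prefix such as "[ 1124.987449] " and captures
# the timestamp and the rest of the line.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

# Substring taken verbatim from the dmesg excerpt in this report.
SOFT_RECOVERED = "ring gfx timeout, but soft recovered"


def find_soft_recoveries(dmesg_text):
    """Return kernel timestamps (seconds) of lines claiming a soft recovery."""
    hits = []
    for line in dmesg_text.splitlines():
        match = TIMESTAMP.match(line)
        if match and SOFT_RECOVERED in match.group(2):
            hits.append(float(match.group(1)))
    return hits


# Two lines from the report, with the mail line wrapping undone.
SAMPLE = (
    "[ 1119.867403] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] "
    "*ERROR* Waiting for fences timed out!\n"
    "[ 1124.987449] [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* ring gfx timeout, but soft recovered\n"
)

if __name__ == "__main__":
    print(find_soft_recoveries(SAMPLE))
```

A harness that finds such a timestamp could mark every later 3D test result as
suspect instead of reporting it as an independent failure.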
--
You are receiving this mail because:
You are the assignee for the bug.
[-- Attachment #1.2: Type: text/html, Size: 3796 bytes --]
[-- Attachment #2: Type: text/plain, Size: 159 bytes --]
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug 112226] [HadesCanyon] GPU hangs don't anymore recover (although kernel still claims that they do)
From: bugzilla-daemon @ 2019-11-07 13:53 UTC (permalink / raw)
To: dri-devel
https://bugs.freedesktop.org/show_bug.cgi?id=112226
Eero Tamminen <eero.t.tamminen@intel.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
See Also| |https://bugs.freedesktop.or
| |g/show_bug.cgi?id=108898
* [Bug 112226] [HadesCanyon] GPU hangs don't anymore recover (although kernel still claims that they do)
From: bugzilla-daemon @ 2019-11-07 14:04 UTC (permalink / raw)
To: dri-devel
https://bugs.freedesktop.org/show_bug.cgi?id=112226
--- Comment #1 from Alex Deucher <alexdeucher@gmail.com> ---
Please attach your dmesg output, and your Xorg log if using X. Please note that
after a GPU reset, in most cases you need to restart your desktop environment
because no desktop environments properly handle the loss of their contexts at
the moment.
* [Bug 112226] [HadesCanyon] GPU hangs don't anymore recover (although kernel still claims that they do)
From: bugzilla-daemon @ 2019-11-07 14:25 UTC (permalink / raw)
To: dri-devel
https://bugs.freedesktop.org/show_bug.cgi?id=112226
--- Comment #2 from Eero Tamminen <eero.t.tamminen@intel.com> ---
Created attachment 145908
--> https://bugs.freedesktop.org/attachment.cgi?id=145908&action=edit
dmesg
* [Bug 112226] [HadesCanyon/regression] GPU hang causes also X server to die
From: bugzilla-daemon @ 2019-11-07 14:35 UTC (permalink / raw)
To: dri-devel
https://bugs.freedesktop.org/show_bug.cgi?id=112226
Eero Tamminen <eero.t.tamminen@intel.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Summary|[HadesCanyon] GPU hangs |[HadesCanyon/regression]
|don't anymore recover |GPU hang causes also X
|(although kernel still |server to die
|claims that they do) |
--- Comment #3 from Eero Tamminen <eero.t.tamminen@intel.com> ---
(In reply to Alex Deucher from comment #1)
> Please attach your dmesg output, and your Xorg log if using X. Please note that
> after a GPU reset, in most cases you need to restart your desktop
> environment because no desktop environments properly handle the loss of
> their contexts at the moment.
Failed tests complain about an invalid MIT-MAGIC-COOKIE-1, so it seems that the
later failures happen because X went down (and was brought back up by the
display manager).
AFAIK a reset should affect only the context that was running on the GPU when
it was reset, not the others [1], and in this case the problematic client should
be GfxBench (Manhattan test-case, see bug 108898), not the X server.
Btw. why doesn't the AMD kernel module tell which process / context had the
issue, like i915 does?
[1] At least that's the case with i915, as long as the whole system doesn't
hang.
(In reply to Eero Tamminen from comment #0)
> * If the latest Mesa is used with the drm-tip 5.3 kernel, X fails to start
> 4 times out of 5. This started to happen with a Mesa version from within a
> couple of days of the GPU hang recovery issue, so there are potentially
> more issues in Mesa's (HadesCanyon) AMD support.
Correction: that issue happens only when using the latest Mesa with a
few-months-old X server and (5.3) drm-tip kernel. If the latest git versions of
all are used, X starts fine. But since the indicated date, it dies later, when
the Manhattan test-case causes problems.
* [Bug 112226] [HadesCanyon/regression] GPU hang causes also X server to die
From: bugzilla-daemon @ 2019-11-07 14:46 UTC (permalink / raw)
To: dri-devel
https://bugs.freedesktop.org/show_bug.cgi?id=112226
--- Comment #4 from Eero Tamminen <eero.t.tamminen@intel.com> ---
Created attachment 145909
--> https://bugs.freedesktop.org/attachment.cgi?id=145909&action=edit
Xorg log
X dies in ConfigureWindow() -> miResizeWindow() -> miCopyRegion() ->
glamor_create_pixmap() -> radeonsi_dri.so -> abort().
The LightDM log shows the abort to be:
X: src/gallium/winsys/amdgpu/drm/amdgpu_cs.c:1061: amdgpu_cs_check_space:
Assertion `rcs->current.cdw <= rcs->current.max_dw' failed.
This is the same abort that causes the X server to fail at boot with git Mesa
and a bit older X server & drm-tip kernel.
Is the above abort due to something on the kernel side, or is it a Mesa issue?
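To make the failed assertion easier to read, here is a toy model of the
invariant it checks. This is not Mesa's code. The field names cdw and max_dw
come from the assertion text, but reading them as "dwords written so far" and
"dword capacity" is an assumption, and the class and its methods are
hypothetical.

```python
class CommandBuffer:
    """Toy model of a fixed-capacity command buffer counted in dwords."""

    def __init__(self, max_dw):
        self.max_dw = max_dw  # assumed meaning: capacity in dwords
        self.cdw = 0          # assumed meaning: dwords written so far

    def check_space(self, dw):
        """Return True if `dw` more dwords fit in the buffer.

        The assert mirrors the condition from the abort message: the
        buffer must not already hold more than its capacity.
        """
        assert self.cdw <= self.max_dw, "cdw exceeds max_dw"
        return self.cdw + dw <= self.max_dw

    def emit(self, dw):
        """Count `dw` dwords as written without any bounds check."""
        self.cdw += dw


if __name__ == "__main__":
    cs = CommandBuffer(max_dw=8)
    cs.emit(6)
    print(cs.check_space(2))  # fits exactly
    print(cs.check_space(3))  # would overflow, so the caller must flush first
```

In this model the assertion only fires if something has already written past
the capacity before the space check runs. The model does not say which
component is responsible for that in the real failure.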
* [Bug 112226] [HadesCanyon/regression] GPU hang causes also X server to die
From: bugzilla-daemon @ 2019-11-07 17:22 UTC (permalink / raw)
To: dri-devel
https://bugs.freedesktop.org/show_bug.cgi?id=112226
--- Comment #5 from Alex Deucher <alexdeucher@gmail.com> ---
(In reply to Eero Tamminen from comment #3)
>
> AFAIK a reset should affect only the context that was running on the GPU
> when it was reset, not the others [1], and in this case the problematic
> client should be GfxBench (Manhattan test-case, see bug 108898), not the
> X server.
>
> Btw. why doesn't the AMD kernel module tell which process / context had the
> issue, like i915 does?
It does, but in the case of a whole GPU reset, VRAM is lost after the reset, so
the buffers from all processes that use the GPU are lost. Depending on the
nature of the hang, a whole GPU reset may be required rather than just killing
the shader wave.
* [Bug 112226] [HadesCanyon/regression] GPU hang causes also X server to die
From: bugzilla-daemon @ 2019-11-19 10:01 UTC (permalink / raw)
To: dri-devel
https://bugs.freedesktop.org/show_bug.cgi?id=112226
Martin Peres <martin.peres@free.fr> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution|--- |MOVED
--- Comment #6 from Martin Peres <martin.peres@free.fr> ---
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been
closed to further activity.
You can subscribe and participate further in the new issue via this link
to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/951.