* [Bug 94081] New: [radeon 3.18 regression] GPU reset recovery fails
@ 2015-03-01 19:02 bugzilla-daemon
2015-03-01 19:05 ` [Bug 94081] " bugzilla-daemon
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: bugzilla-daemon @ 2015-03-01 19:02 UTC (permalink / raw)
To: dri-devel
https://bugzilla.kernel.org/show_bug.cgi?id=94081
Bug ID: 94081
Summary: [radeon 3.18 regression] GPU reset recovery fails
Product: Drivers
Version: 2.5
Kernel Version: 3.18.x
Hardware: All
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: jan.vesely@rutgers.edu
Regression: No
Starting with kernel 3.18 (Fedora build), the driver fails to recover from an
OpenCL-induced GPU lockup.
Reproducer:
Run the noise-hurl.xml OpenCL test in the gegl library:
[354672.707822] radeon 0000:01:00.0: ring 0 stalled for more than 10020msec
On 3.17 (Fedora again) I observe one or two display flashes, followed by full
recovery. Starting with 3.18 I see the flash, and then the display stays
frozen. The task itself (gegl) stays in uninterruptible state.
Here are the relevant lines from dmesg on 3.18:
[354672.707822] radeon 0000:01:00.0: ring 0 stalled for more than 10020msec
[354672.707828] radeon 0000:01:00.0: GPU lockup (current fence id
0x00000000007778a3 last fence id 0x00000000007778b3 on ring 0)
[354672.828879] radeon 0000:01:00.0: Saved 503 dwords of commands on ring 0.
[354672.828898] radeon 0000:01:00.0: GPU softreset: 0x00000009
[354672.828900] radeon 0000:01:00.0: GRBM_STATUS = 0xA0433828
[354672.828902] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x08000007
[354672.828903] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000007
[354672.828905] radeon 0000:01:00.0: SRBM_STATUS = 0x20000AC0
[354672.828907] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000
[354672.828908] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
[354672.828910] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00018000
[354672.828912] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00010002
[354672.828913] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80038647
[354672.828915] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
[354672.842214] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x00007F6B
[354672.842267] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[354672.843423] radeon 0000:01:00.0: GRBM_STATUS = 0x00003828
[354672.843425] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000007
[354672.843426] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000007
[354672.843428] radeon 0000:01:00.0: SRBM_STATUS = 0x200000C0
[354672.843429] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000
[354672.843431] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
[354672.843432] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000
[354672.843434] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000
[354672.843435] radeon 0000:01:00.0: R_008680_CP_STAT = 0x00000000
[354672.843437] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
[354672.843456] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[354672.865723] [drm] enabling PCIE gen 2 link speeds, disable with
radeon.pcie_gen2=0
[354672.868296] [drm] PCIE GART of 1024M enabled (table at 0x0000000000274000).
[354672.868388] radeon 0000:01:00.0: WB enabled
[354672.868390] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr
0x0000000040000c00 and cpu addr 0xffff880401c54c00
[354672.868391] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr
0x0000000040000c0c and cpu addr 0xffff880401c54c0c
[354672.869865] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr
0x0000000000072118 and cpu addr 0xffffc900062b2118
[354672.886233] [drm] ring test on 0 succeeded in 3 usecs
[354672.886244] [drm] ring test on 3 succeeded in 7 usecs
[354673.063433] [drm] ring test on 5 succeeded in 2 usecs
[354673.063441] [drm] UVD initialized successfully.
[354673.187403] [drm] ib test on ring 0 succeeded in 0 usecs
[354673.187432] [drm] ib test on ring 3 succeeded in 0 usecs
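For readers comparing the two register dumps in the log above, here is a small sketch (not part of the original report) that XORs the pre-reset and post-reset values to show which status bits changed across the soft reset. The values are copied from the 3.18 log; the dictionary layout and the function name are illustrative only, and no meaning is assigned to individual bits.

```python
# Register values copied from the 3.18 dmesg excerpt, before the soft reset.
BEFORE = {
    "GRBM_STATUS": 0xA0433828,
    "GRBM_STATUS_SE0": 0x08000007,
    "GRBM_STATUS_SE1": 0x00000007,
    "SRBM_STATUS": 0x20000AC0,
    "CP_STALLED_STAT2": 0x00018000,
    "CP_BUSY_STAT": 0x00010002,
    "CP_STAT": 0x80038647,
    "DMA_STATUS_REG": 0x44C83D57,
}

# The same registers as printed after the soft reset.
AFTER = {
    "GRBM_STATUS": 0x00003828,
    "GRBM_STATUS_SE0": 0x00000007,
    "GRBM_STATUS_SE1": 0x00000007,
    "SRBM_STATUS": 0x200000C0,
    "CP_STALLED_STAT2": 0x00000000,
    "CP_BUSY_STAT": 0x00000000,
    "CP_STAT": 0x00000000,
    "DMA_STATUS_REG": 0x44C83D57,
}


def changed_bits(before, after):
    """Return {register: XOR mask} for every register whose value changed."""
    return {
        name: before[name] ^ after[name]
        for name in before
        if before[name] != after[name]
    }


if __name__ == "__main__":
    for name, mask in changed_bits(BEFORE, AFTER).items():
        print(f"{name}: changed bits 0x{mask:08X}")
```

In this dump GRBM_STATUS_SE1 and DMA_STATUS_REG read the same before and after the reset, so they do not appear in the result.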
--
You are receiving this mail because:
You are watching the assignee of the bug.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel
* [Bug 94081] [radeon 3.18 regression] GPU reset recovery fails
From: bugzilla-daemon @ 2015-03-01 19:05 UTC (permalink / raw)
To: dri-devel
https://bugzilla.kernel.org/show_bug.cgi?id=94081
--- Comment #1 from Jan Vesely <jan.vesely@rutgers.edu> ---
Here's the dmesg output for the 3.17 kernel
(it looks like the ib test on ring 5 is missing in 3.18):
[ 249.015280] radeon 0000:01:00.0: ring 0 stalled for more than 10000msec
[ 249.015287] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000742
last fence id 0x0000000000000741 on ring 0)
[ 249.027303] radeon 0000:01:00.0: ring 0 stalled for more than 10012msec
[ 249.027309] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000745
last fence id 0x0000000000000741 on ring 0)
[ 249.131987] radeon 0000:01:00.0: Saved 183 dwords of commands on ring 0.
[ 249.132005] radeon 0000:01:00.0: GPU softreset: 0x00000009
[ 249.132007] radeon 0000:01:00.0: GRBM_STATUS = 0xA0433828
[ 249.132009] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x08000007
[ 249.132010] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000007
[ 249.132012] radeon 0000:01:00.0: SRBM_STATUS = 0x20000AC0
[ 249.132014] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000
[ 249.132015] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
[ 249.132017] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00018000
[ 249.132019] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00010002
[ 249.132021] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80038647
[ 249.132023] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
[ 249.142433] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x00007F6B
[ 249.142486] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[ 249.143643] radeon 0000:01:00.0: GRBM_STATUS = 0x00003828
[ 249.143644] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000007
[ 249.143646] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000007
[ 249.143648] radeon 0000:01:00.0: SRBM_STATUS = 0x200000C0
[ 249.143649] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000
[ 249.143651] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
[ 249.143653] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000
[ 249.143654] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000
[ 249.143656] radeon 0000:01:00.0: R_008680_CP_STAT = 0x00000000
[ 249.143658] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
[ 249.143681] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[ 249.165960] [drm] enabling PCIE gen 2 link speeds, disable with
radeon.pcie_gen2=0
[ 249.167181] [drm] PCIE GART of 1024M enabled (table at 0x0000000000273000).
[ 249.167273] radeon 0000:01:00.0: WB enabled
[ 249.167274] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr
0x0000000040000c00 and cpu addr 0xffff880401eeac00
[ 249.167276] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr
0x0000000040000c0c and cpu addr 0xffff880401eeac0c
[ 249.168728] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr
0x0000000000072118 and cpu addr 0xffffc900062b2118
[ 249.185032] [drm] ring test on 0 succeeded in 3 usecs
[ 249.185042] [drm] ring test on 3 succeeded in 6 usecs
[ 249.362220] [drm] ring test on 5 succeeded in 2 usecs
[ 249.362228] [drm] UVD initialized successfully.
[ 249.362280] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
[ 249.362281] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on
GFX ring (-35).
[ 249.362282] radeon 0000:01:00.0: ib ring test failed (-35).
[ 249.370592] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[ 249.377656] [drm] enabling PCIE gen 2 link speeds, disable with
radeon.pcie_gen2=0
[ 249.378860] [drm] PCIE GART of 1024M enabled (table at 0x0000000000273000).
[ 249.378951] radeon 0000:01:00.0: WB enabled
[ 249.378952] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr
0x0000000040000c00 and cpu addr 0xffff880401eeac00
[ 249.378954] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr
0x0000000040000c0c and cpu addr 0xffff880401eeac0c
[ 249.380537] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr
0x0000000000072118 and cpu addr 0xffffc900062b2118
[ 249.396825] [drm] ring test on 0 succeeded in 3 usecs
[ 249.396835] [drm] ring test on 3 succeeded in 7 usecs
[ 249.574006] [drm] ring test on 5 succeeded in 2 usecs
[ 249.574014] [drm] UVD initialized successfully.
[ 249.574061] [drm] ib test on ring 0 succeeded in 0 usecs
[ 249.574105] [drm] ib test on ring 3 succeeded in 0 usecs
[ 249.726529] [drm] ib test on ring 5 succeeded
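The observation about the missing ib test can be checked mechanically. Below is a rough sketch, assuming the two logs are available as text; the regular expression and the function name are illustrative, and the embedded strings hold only the final "ib test" lines of each log from this thread.

```python
import re

# Matches lines such as "[drm] ib test on ring 3 succeeded in 0 usecs".
IB_TEST = re.compile(r"ib test on ring (\d+) succeeded")


def rings_with_ib_test(dmesg_text):
    """Return the set of ring numbers that report a successful ib test."""
    return {int(m.group(1)) for m in IB_TEST.finditer(dmesg_text)}


# Final ib test lines of the 3.17 log (second, successful resume attempt).
LOG_3_17 = """\
[ 249.574061] [drm] ib test on ring 0 succeeded in 0 usecs
[ 249.574105] [drm] ib test on ring 3 succeeded in 0 usecs
[ 249.726529] [drm] ib test on ring 5 succeeded
"""

# Final ib test lines of the 3.18 log from the original report.
LOG_3_18 = """\
[354673.187403] [drm] ib test on ring 0 succeeded in 0 usecs
[354673.187432] [drm] ib test on ring 3 succeeded in 0 usecs
"""

if __name__ == "__main__":
    missing = rings_with_ib_test(LOG_3_17) - rings_with_ib_test(LOG_3_18)
    print("rings without an ib test result in 3.18:", sorted(missing))
```

Note that the "ring test on N" lines are deliberately not matched; only the "ib test on ring N" lines are compared.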
* [Bug 94081] [radeon 3.18 regression] GPU reset recovery fails
From: bugzilla-daemon @ 2015-03-02 6:58 UTC (permalink / raw)
To: dri-devel
https://bugzilla.kernel.org/show_bug.cgi?id=94081
--- Comment #2 from Michel Dänzer <michel@daenzer.net> ---
Can you bisect?
(In reply to Jan Vesely from comment #1)
> here's dmesg output for 3.17 kernel:
[...]
> [ 249.362280] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
> [ 249.362281] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB
> on GFX ring (-35).
Actually, this looks like the reset didn't fully work with 3.17 either
though...
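The (-35) in the quoted lines is a negated errno value. As a small sketch, assuming the generic Linux errno numbering (where 35 is EDEADLK, "Resource deadlock avoided"), the code can be decoded as follows. The table covers only a handful of values and the helper name is hypothetical.

```python
# A few values from the generic Linux errno numbering (assumption: the
# kernel in question uses the asm-generic table, as x86 does).
LINUX_ERRNO = {
    12: "ENOMEM",
    16: "EBUSY",
    22: "EINVAL",
    35: "EDEADLK",
    110: "ETIMEDOUT",
}


def decode_kernel_error(code):
    """Map a kernel-style negative error code to its symbolic name."""
    number = abs(code)
    return LINUX_ERRNO.get(number, f"unknown errno {number}")


if __name__ == "__main__":
    print(decode_kernel_error(-35))
```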
* [Bug 94081] [radeon 3.18 regression] GPU reset recovery fails
From: bugzilla-daemon @ 2015-03-04 21:10 UTC (permalink / raw)
To: dri-devel
https://bugzilla.kernel.org/show_bug.cgi?id=94081
--- Comment #3 from Jan Vesely <jan.vesely@rutgers.edu> ---
(In reply to Michel Dänzer from comment #2)
> Can you bisect?
It took a while (the first bisect found an unrelated i915 display commit).
The failure was introduced in:
commit dd7cfd641228abb2669d8d047d5ec377b1835900
Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Date: Tue Jan 21 13:07:31 2014 +0100
drm/ttm: kill fence_lock
No users are left, kill it off! :D
Conversion to the reservation api is next on the list, after
that the functionality can be restored with rcu.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
The commit moves a call to fence_get below two "goto cleanup" error paths;
however, fence_put is left in the cleanup: error target. Moving the fence_put
call to pflip_cleanup fixes the issue.
I've posted a patch.
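The get/put ordering described here can be modelled outside the kernel. The following is a loose Python sketch, not the driver code: the Fence class, the page_flip function and the dictionary standing in for the work structure are all hypothetical, and the model assumes that a put on an empty (None) pointer is a no-op. Whether the early put matters therefore depends on what the work structure holds when the early error paths are taken.

```python
class Fence:
    """Toy reference-counted object standing in for a GPU fence."""

    def __init__(self):
        self.refcount = 1  # reference held by the original owner


def fence_get(fence):
    """Take an extra reference and return the fence."""
    fence.refcount += 1
    return fence


def fence_put(fence):
    """Drop a reference; a None pointer is ignored (model assumption)."""
    if fence is not None:
        fence.refcount -= 1


def page_flip(shared_fence, fail_early):
    """Model of a function whose error path jumps to a shared cleanup label."""
    work = {"fence": None}  # models a zero-initialised work structure
    try:
        if fail_early:
            # Models an error path that runs "goto cleanup" before the get.
            raise RuntimeError("early error path")
        work["fence"] = fence_get(shared_fence)
        return work
    except RuntimeError:
        # Models the put that was left in the cleanup: target.
        fence_put(work["fence"])
        return None


if __name__ == "__main__":
    fence = Fence()
    page_flip(fence, fail_early=True)
    print("refcount after early failure:", fence.refcount)
```

In this model the early failure leaves the reference count untouched, because the cleanup path only ever sees the empty pointer.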
>
> (In reply to Jan Vesely from comment #1)
> > here's dmesg output for 3.17 kernel:
>
> [...]
>
> > [ 249.362280] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
> > [ 249.362281] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB
> > on GFX ring (-35).
>
> Actually, this looks like the reset didn't fully work with 3.17 either
> though...
I don't remember seeing this during bisection. This log is from the Fedora
3.17.8 kernel. I'll check vanilla 3.17.8 and see whether it's Fedora-specific.
* [Bug 94081] [radeon 3.18 regression] GPU reset recovery fails
From: bugzilla-daemon @ 2015-03-05 1:12 UTC (permalink / raw)
To: dri-devel
https://bugzilla.kernel.org/show_bug.cgi?id=94081
--- Comment #4 from Jan Vesely <jan.vesely@rutgers.edu> ---
This does not make sense: the work structure is zeroed, so the fence put
should be OK.
It looks like the lockup sometimes needs more than one GPU restart to
manifest. I'll replay the bisect without the good entries (at least this
explains the inconsistent bisect results).
Sorry for the noise.
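Since the lockup may need more than one GPU restart to manifest, a single passing run is weak evidence that a commit is good. The following is a sketch of how a bisect step could be classified under that assumption; the function names and the repeat counts are hypothetical and not taken from the report.

```python
def classify_commit(run_reproducer, attempts=5):
    """Return "bad" on the first failing run, "good" only if every run passes.

    run_reproducer is a callable returning True when the GPU recovered.
    """
    for _ in range(attempts):
        if not run_reproducer():
            return "bad"
    return "good"


def make_flaky(results):
    """Build a fake reproducer that replays a fixed list of outcomes."""
    outcomes = iter(results)
    return lambda: next(outcomes)


if __name__ == "__main__":
    # The failure only shows up on the third restart.
    print(classify_commit(make_flaky([True, True, False]), attempts=1))
    print(classify_commit(make_flaky([True, True, False]), attempts=3))
```

With a single attempt the intermittent failure is misclassified as good, which is one way a bisect can end up pointing at an unrelated commit.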
(In reply to Jan Vesely from comment #3)
> (In reply to Michel Dänzer from comment #2)
> > Can you bisect?
>
> It took a while (first bisect found unrelated i915 display commit).
> the failure was introduced in:
>
> commit dd7cfd641228abb2669d8d047d5ec377b1835900
> Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
> Date: Tue Jan 21 13:07:31 2014 +0100
>
> drm/ttm: kill fence_lock
>
> No users are left, kill it off! :D
> Conversion to the reservation api is next on the list, after
> that the functionality can be restored with rcu.
>
> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
>
> the commit moves a call to fence get below two "goto cleanup" in error
> paths, however, fence_put is left in the cleanup: error target. Moving the
> fence_put call to pflip_cleanup fixes the issue.
>
> I've posted a patch.
> >
> > (In reply to Jan Vesely from comment #1)
> > > here's dmesg output for 3.17 kernel:
> >
> > [...]
> >
> > > [ 249.362280] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
> > > [ 249.362281] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB
> > > on GFX ring (-35).
> >
> > Actually, this looks like the reset didn't fully work with 3.17 either
> > though...
>
> I don't remember seeing this during bisection. This log is from fedora
> 3.17.8 kernel. I'll check 3.17.8 vanilla and see whether it's fedora specific