* GPU lockup CP stall for more than 10000msec on latest vanilla git
@ 2012-12-17 18:27 Markus Trippelsdorf
  2012-12-17 21:32 ` Alex Deucher
  0 siblings, 1 reply; 20+ messages in thread
From: Markus Trippelsdorf @ 2012-12-17 18:27 UTC
To: dri-devel

As soon as I open the following website:
http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html

my Radeon RS780 stalls (GPU lockup) leaving the machine unusable:

Dec 17 17:41:39 x4 kernel: [drm] Initialized drm 1.1.0 20060810
Dec 17 17:41:39 x4 kernel: [drm] radeon defaulting to kernel modesetting.
Dec 17 17:41:39 x4 kernel: [drm] radeon kernel modesetting enabled.
Dec 17 17:41:39 x4 kernel: [drm] initializing kernel modesetting (RS780 0x1002:0x9614 0x1043:0x834D).
Dec 17 17:41:39 x4 kernel: [drm] register mmio base: 0xFBEE0000
Dec 17 17:41:39 x4 kernel: [drm] register mmio size: 65536
Dec 17 17:41:39 x4 kernel: ATOM BIOS: 113
Dec 17 17:41:39 x4 kernel: radeon 0000:01:05.0: VRAM: 128M 0x00000000C0000000 - 0x00000000C7FFFFFF (128M used)
Dec 17 17:41:39 x4 kernel: radeon 0000:01:05.0: GTT: 512M 0x00000000A0000000 - 0x00000000BFFFFFFF
Dec 17 17:41:39 x4 kernel: [drm] Detected VRAM RAM=128M, BAR=128M
Dec 17 17:41:39 x4 kernel: [drm] RAM width 32bits DDR
Dec 17 17:41:39 x4 kernel: [TTM] Zone kernel: Available graphics memory: 4083532 kiB
Dec 17 17:41:39 x4 kernel: [TTM] Zone dma32: Available graphics memory: 2097152 kiB
Dec 17 17:41:39 x4 kernel: [TTM] Initializing pool allocator
Dec 17 17:41:39 x4 kernel: [TTM] Initializing DMA pool allocator
Dec 17 17:41:39 x4 kernel: [drm] radeon: 128M of VRAM memory ready
Dec 17 17:41:39 x4 kernel: [drm] radeon: 512M of GTT memory ready.
Dec 17 17:41:39 x4 kernel: [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
Dec 17 17:41:39 x4 kernel: [drm] Driver supports precise vblank timestamp query.
Dec 17 17:41:39 x4 kernel: [drm] radeon: irq initialized.
Dec 17 17:41:39 x4 kernel: [drm] GART: num cpu pages 131072, num gpu pages 131072
Dec 17 17:41:39 x4 kernel: [drm] Loading RS780 Microcode
Dec 17 17:41:39 x4 kernel: [drm] PCIE GART of 512M enabled (table at 0x00000000C0040000).
Dec 17 17:41:39 x4 kernel: radeon 0000:01:05.0: WB enabled
Dec 17 17:41:39 x4 kernel: radeon 0000:01:05.0: fence driver on ring 0 use gpu addr 0x00000000a0000c00 and cpu addr 0xffff8802163acc00
Dec 17 17:41:39 x4 kernel: radeon 0000:01:05.0: fence driver on ring 3 use gpu addr 0x00000000a0000c0c and cpu addr 0xffff8802163acc0c
Dec 17 17:41:39 x4 kernel: radeon 0000:01:05.0: setting latency timer to 64
Dec 17 17:41:39 x4 kernel: [drm] ring test on 0 succeeded in 0 usecs
Dec 17 17:41:39 x4 kernel: [drm] ring test on 3 succeeded in 1 usecs
Dec 17 17:41:39 x4 kernel: [drm] ib test on ring 0 succeeded in 0 usecs
Dec 17 17:41:39 x4 kernel: [drm] ib test on ring 3 succeeded in 0 usecs
Dec 17 17:41:39 x4 kernel: [drm] Radeon Display Connectors
Dec 17 17:41:39 x4 kernel: [drm] Connector 0:
Dec 17 17:41:39 x4 kernel: [drm] VGA-1
Dec 17 17:41:39 x4 kernel: [drm] DDC: 0x7e40 0x7e40 0x7e44 0x7e44 0x7e48 0x7e48 0x7e4c 0x7e4c
Dec 17 17:41:39 x4 kernel: [drm] Encoders:
Dec 17 17:41:39 x4 kernel: [drm] CRT1: INTERNAL_KLDSCP_DAC1
Dec 17 17:41:39 x4 kernel: [drm] Connector 1:
Dec 17 17:41:39 x4 kernel: [drm] DVI-D-1
Dec 17 17:41:39 x4 kernel: [drm] HPD3
Dec 17 17:41:39 x4 kernel: [drm] DDC: 0x7e50 0x7e50 0x7e54 0x7e54 0x7e58 0x7e58 0x7e5c 0x7e5c
Dec 17 17:41:39 x4 kernel: [drm] Encoders:
Dec 17 17:41:39 x4 kernel: [drm] DFP3: INTERNAL_KLDSCP_LVTMA
Dec 17 17:41:39 x4 kernel: [drm] radeon: power management initialized
Dec 17 17:41:39 x4 kernel: [drm] fb mappable at 0xF0142000
Dec 17 17:41:39 x4 kernel: [drm] vram apper at 0xF0000000
Dec 17 17:41:39 x4 kernel: [drm] size 7299072
Dec 17 17:41:39 x4 kernel: [drm] fb depth is 24
Dec 17 17:41:39 x4 kernel: [drm] pitch is 6912
Dec 17 17:41:39 x4 kernel: fbcon: radeondrmfb (fb0) is primary device
Dec 17 17:41:39 x4 kernel: Console: switching to colour frame buffer device 131x105
Dec 17 17:41:39 x4 kernel: radeon 0000:01:05.0: fb0: radeondrmfb frame buffer device
Dec 17 17:41:39 x4 kernel: radeon 0000:01:05.0: registered panic notifier
Dec 17 17:41:39 x4 kernel: [drm] Initialized radeon 2.27.0 20080528 for 0000:01:05.0 on minor 0
...
Dec 17 19:12:33 x4 kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
Dec 17 19:12:33 x4 kernel: radeon 0000:01:05.0: GPU lockup (waiting for 0x0000000000022777 last fence id 0x0000000000022774)

after reboot:

Dec 17 19:14:32 x4 kernel: Adding 4194300k swap on /var/cache/swapfile.img. Priority:-1 extents:9 across:629080060k
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: GPU lockup (waiting for 0x0000000000000954 last fence id 0x0000000000000952)
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: Saved 89 dwords of commands on ring 0.
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: GPU softreset
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_008010_GRBM_STATUS=0xA000B030
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_008014_GRBM_STATUS2=0x00000003
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_000E50_SRBM_STATUS=0x20005040
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_008674_CP_STALLED_STAT1 = 0x00000000
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_008678_CP_STALLED_STAT2 = 0x00000002
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_00867C_CP_BUSY_STAT = 0x0000D084
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_008680_CP_STAT = 0x80098645
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00000001
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_008010_GRBM_STATUS=0xA000B030
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_008014_GRBM_STATUS2=0x00000003
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_000E50_SRBM_STATUS=0x2000C040
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_008674_CP_STALLED_STAT1 = 0x00000000
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_008678_CP_STALLED_STAT2 = 0x00000000
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_00867C_CP_BUSY_STAT = 0x00000000
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: R_008680_CP_STAT = 0x80100000
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: GPU reset succeeded, trying to resume
Dec 17 19:16:44 x4 kernel: [drm] PCIE GART of 512M enabled (table at 0x00000000C0040000).
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: WB enabled
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: fence driver on ring 0 use gpu addr 0x00000000a0000c00 and cpu addr 0xffff8802163acc00
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: fence driver on ring 3 use gpu addr 0x00000000a0000c0c and cpu addr 0xffff8802163acc0c
Dec 17 19:16:44 x4 kernel: radeon 0000:01:05.0: setting latency timer to 64
Dec 17 19:16:44 x4 kernel: [drm] ring test on 0 succeeded in 1 usecs
Dec 17 19:16:44 x4 kernel: [drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
Dec 17 19:16:44 x4 kernel: [drm:r600_resume] *ERROR* r600 startup failed on resume
Dec 17 19:17:03 x4 kernel: SysRq : Emergency Sync

--
Markus
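Editorial sketch: each "GPU lockup (waiting for ... last fence id ...)" line in the log above carries two hexadecimal fence sequence numbers. Assuming "last fence id" is the last sequence number the GPU signaled and "waiting for" is the one the driver was blocked on, the difference gives a rough count of fences still outstanding at the time of the stall. The helper name `fence_gap` is hypothetical and only illustrates how to read the numbers; it is not part of any driver or tool.

```python
import re

# Matches the radeon lockup line as it appears in the log above.
LOCKUP_RE = re.compile(
    r"GPU lockup \(waiting for (0x[0-9a-fA-F]+) last fence id (0x[0-9a-fA-F]+)\)"
)


def fence_gap(line):
    """Return (waiting_for, last_signaled, gap), or None if the line does not match."""
    match = LOCKUP_RE.search(line)
    if match is None:
        return None
    waiting_for = int(match.group(1), 16)
    last_signaled = int(match.group(2), 16)
    return waiting_for, last_signaled, waiting_for - last_signaled


# The two lockup lines quoted in the report (timestamps stripped).
lines = [
    "radeon 0000:01:05.0: GPU lockup (waiting for 0x0000000000022777 last fence id 0x0000000000022774)",
    "radeon 0000:01:05.0: GPU lockup (waiting for 0x0000000000000954 last fence id 0x0000000000000952)",
]

for line in lines:
    print(fence_gap(line))
```

Under that reading, both stalls happened with only a handful of fences pending (three before the reboot, two after it), which fits a command processor that stopped making progress rather than a long backlog.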
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git
  2012-12-17 18:27 GPU lockup CP stall for more than 10000msec on latest vanilla git Markus Trippelsdorf
@ 2012-12-17 21:32 ` Alex Deucher
  2012-12-17 21:48 ` Markus Trippelsdorf
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Deucher @ 2012-12-17 21:32 UTC
To: Markus Trippelsdorf; +Cc: dri-devel

On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
> As soon as I open the following website:
> http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html
>
> my Radeon RS780 stalls (GPU lockup) leaving the machine unusable:

Is this a regression? Most likely a 3D driver bug unless you are only
seeing it with specific kernels. What browser are you using and do
you have hw accelerated webgl, etc. enabled? If so, what version of
mesa are you using?

Alex
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git
  2012-12-17 21:32 ` Alex Deucher
@ 2012-12-17 21:48 ` Markus Trippelsdorf
  2012-12-17 21:58 ` Markus Trippelsdorf
  2012-12-17 22:00 ` Alex Deucher
  0 siblings, 2 replies; 20+ messages in thread
From: Markus Trippelsdorf @ 2012-12-17 21:48 UTC
To: Alex Deucher; +Cc: dri-devel

On 2012.12.17 at 16:32 -0500, Alex Deucher wrote:
> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf
> <markus@trippelsdorf.de> wrote:
> > As soon as I open the following website:
> > http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html
> >
> > my Radeon RS780 stalls (GPU lockup) leaving the machine unusable:
>
> Is this a regression? Most likely a 3D driver bug unless you are only
> seeing it with specific kernels. What browser are you using and do
> you have hw accelerated webgl, etc. enabled? If so, what version of
> mesa are you using?

This is a regression, because it is caused by yesterday's merge of
drm-next by Linus. IOW I only see this bug when running a
v3.7-9432-g9360b53 kernel.

--
Markus
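Editorial sketch: the kernel identifier v3.7-9432-g9360b53 in the message above follows the `git describe` naming scheme, where the string is the nearest tag, the number of commits on top of that tag, and an abbreviated commit hash prefixed with "g". The parser below is a minimal illustration of that layout; the function name `parse_describe` is hypothetical.

```python
import re

# <tag>-<commits since tag>-g<abbreviated hash>, as produced by `git describe`.
DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(version):
    """Split a git-describe style string into (tag, commits_since_tag, short_sha)."""
    match = DESCRIBE_RE.match(version)
    if match is None:
        raise ValueError("not a git-describe style string: %r" % version)
    return match.group("tag"), int(match.group("count")), match.group("sha")


tag, count, sha = parse_describe("v3.7-9432-g9360b53")
print(tag, count, sha)
```

Read this way, the affected kernel sits 9432 commits past the v3.7 tag, at commit 9360b53, which is consistent with a build taken in the middle of the merge window.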
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git
  2012-12-17 21:48 ` Markus Trippelsdorf
@ 2012-12-17 21:58 ` Markus Trippelsdorf
  2012-12-17 22:00 ` Alex Deucher
  1 sibling, 0 replies; 20+ messages in thread
From: Markus Trippelsdorf @ 2012-12-17 21:58 UTC
To: Alex Deucher; +Cc: dri-devel

On 2012.12.17 at 22:48 +0100, Markus Trippelsdorf wrote:
> On 2012.12.17 at 16:32 -0500, Alex Deucher wrote:
> > On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf
> > <markus@trippelsdorf.de> wrote:
> > > As soon as I open the following website:
> > > http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html
> > >
> > > my Radeon RS780 stalls (GPU lockup) leaving the machine unusable:
> >
> > Is this a regression? Most likely a 3D driver bug unless you are only
> > seeing it with specific kernels. What browser are you using and do
> > you have hw accelerated webgl, etc. enabled? If so, what version of
> > mesa are you using?
>
> This is a regression, because it is caused by yesterday's merge of
> drm-next by Linus. IOW I only see this bug when running a
> v3.7-9432-g9360b53 kernel.

Forgot to mention that I don't use webgl. Browser is Firefox.
And I use my screen in portrait mode:

DVI-0 connected 1050x1680+0+0 left (normal left inverted right x axis y axis) 434mm x 270mm

--
Markus
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git
  2012-12-17 21:48 ` Markus Trippelsdorf
  2012-12-17 21:58 ` Markus Trippelsdorf
@ 2012-12-17 22:00 ` Alex Deucher
  2012-12-17 22:25 ` Markus Trippelsdorf
  1 sibling, 1 reply; 20+ messages in thread
From: Alex Deucher @ 2012-12-17 22:00 UTC
To: Markus Trippelsdorf; +Cc: dri-devel

On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
> On 2012.12.17 at 16:32 -0500, Alex Deucher wrote:
>> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf
>> <markus@trippelsdorf.de> wrote:
>> > As soon as I open the following website:
>> > http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html
>> >
>> > my Radeon RS780 stalls (GPU lockup) leaving the machine unusable:
>>
>> Is this a regression? Most likely a 3D driver bug unless you are only
>> seeing it with specific kernels. What browser are you using and do
>> you have hw accelerated webgl, etc. enabled? If so, what version of
>> mesa are you using?
>
> This is a regression, because it is caused by yesterday's merge of
> drm-next by Linus. IOW I only see this bug when running a
> v3.7-9432-g9360b53 kernel.

Can you bisect? I'm guessing it may be related to the new DMA rings. Possibly:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2
or
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=4d75658bffea78f0c6f82fd46df1ec983ccacdf0

Alex
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git
  2012-12-17 22:00 ` Alex Deucher
@ 2012-12-17 22:25 ` Markus Trippelsdorf
  2012-12-17 22:55 ` Markus Trippelsdorf
  2012-12-23 1:46 ` Alex Deucher
  0 siblings, 2 replies; 20+ messages in thread
From: Markus Trippelsdorf @ 2012-12-17 22:25 UTC
To: Alex Deucher; +Cc: dri-devel

On 2012.12.17 at 17:00 -0500, Alex Deucher wrote:
> On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf
> <markus@trippelsdorf.de> wrote:
> > On 2012.12.17 at 16:32 -0500, Alex Deucher wrote:
> >> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf
> >> <markus@trippelsdorf.de> wrote:
> >> > As soon as I open the following website:
> >> > http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html
> >> >
> >> > my Radeon RS780 stalls (GPU lockup) leaving the machine unusable:
> >>
> >> Is this a regression? Most likely a 3D driver bug unless you are only
> >> seeing it with specific kernels. What browser are you using and do
> >> you have hw accelerated webgl, etc. enabled? If so, what version of
> >> mesa are you using?
> >
> > This is a regression, because it is caused by yesterday's merge of
> > drm-next by Linus. IOW I only see this bug when running a
> > v3.7-9432-g9360b53 kernel.
>
> Can you bisect? I'm guessing it may be related to the new DMA rings. Possibly:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2

Yes, the commit above causes the issue.

2d6cc72 GPU lockups
009ee7a runs fine

--
Markus
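Editorial sketch: the good/bad pair reported above (009ee7a runs fine, 2d6cc72 locks up) is the kind of result a bisection produces. The snippet below illustrates the underlying idea as a plain binary search over a linear list of commits. It is a simplification: real `git bisect` also copes with merges and non-linear history, and the commit names and the `is_bad` predicate here are hypothetical stand-ins for building and testing a kernel.

```python
def first_bad(commits, is_bad):
    """Find the first bad commit in a list ordered oldest to newest.

    Assumes commits[0] is known good and commits[-1] is known bad.
    Returns (first_bad_commit, number_of_test_runs).
    """
    lo, hi = 0, len(commits) - 1  # lo: newest known good, hi: oldest known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], tests


# Hypothetical linear history of 16 commits; "c11" introduced the bug.
history = ["c%d" % i for i in range(16)]
result, tests = first_bad(history, lambda commit: int(commit[1:]) >= 11)
print(result, tests)
```

Each test run halves the remaining range, so even a merge-window range of several thousand commits needs only on the order of a dozen builds when the failure reproduces reliably.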
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git
  2012-12-17 22:25 ` Markus Trippelsdorf
@ 2012-12-17 22:55 ` Markus Trippelsdorf
  2012-12-18 11:20 ` Michel Dänzer
  2012-12-23 1:46 ` Alex Deucher
  1 sibling, 1 reply; 20+ messages in thread
From: Markus Trippelsdorf @ 2012-12-17 22:55 UTC
To: Alex Deucher; +Cc: dri-devel

On 2012.12.17 at 23:25 +0100, Markus Trippelsdorf wrote:
> On 2012.12.17 at 17:00 -0500, Alex Deucher wrote:
> > On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf
> > <markus@trippelsdorf.de> wrote:
> > > On 2012.12.17 at 16:32 -0500, Alex Deucher wrote:
> > >> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf
> > >> <markus@trippelsdorf.de> wrote:
> > >> > As soon as I open the following website:
> > >> > http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html
> > >> >
> > >> > my Radeon RS780 stalls (GPU lockup) leaving the machine unusable:
> > >>
> > >> Is this a regression? Most likely a 3D driver bug unless you are only
> > >> seeing it with specific kernels. What browser are you using and do
> > >> you have hw accelerated webgl, etc. enabled? If so, what version of
> > >> mesa are you using?
> > >
> > > This is a regression, because it is caused by yesterday's merge of
> > > drm-next by Linus. IOW I only see this bug when running a
> > > v3.7-9432-g9360b53 kernel.
> >
> > Can you bisect? I'm guessing it may be related to the new DMA rings. Possibly:
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2
>
> Yes, the commit above causes the issue.
>
> 2d6cc72 GPU lockups

With 2d6cc72 reverted I get:

Dec 17 23:09:35 x4 kernel: ------------[ cut here ]------------
Dec 17 23:09:35 x4 kernel: WARNING: at include/linux/kref.h:42 radeon_fence_ref+0x2c/0x40()
Dec 17 23:09:35 x4 kernel: Hardware name: System Product Name
Dec 17 23:09:35 x4 kernel: Pid: 182, comm: X Not tainted 3.7.0-09433-ge033059 #155
Dec 17 23:09:35 x4 kernel: Call Trace:
Dec 17 23:09:35 x4 kernel: [<ffffffff81059c94>] ? warn_slowpath_common+0x74/0xb0
Dec 17 23:09:35 x4 kernel: [<ffffffff8129de0c>] ? radeon_fence_ref+0x2c/0x40
Dec 17 23:09:35 x4 kernel: [<ffffffff8126a02c>] ? ttm_bo_cleanup_refs_and_unlock+0x17c/0x2c0
Dec 17 23:09:35 x4 kernel: [<ffffffff8126a6f4>] ? ttm_mem_evict_first+0x94/0x1d0
Dec 17 23:09:35 x4 kernel: [<ffffffff8126f9c2>] ? ttm_bo_man_get_node+0x62/0xb0
Dec 17 23:09:35 x4 kernel: [<ffffffff8126aaa1>] ? ttm_bo_mem_space+0x271/0x320
Dec 17 23:09:35 x4 kernel: [<ffffffff8126b0bd>] ? ttm_bo_move_buffer+0xdd/0x150
Dec 17 23:09:35 x4 kernel: [<ffffffff8126b1b9>] ? ttm_bo_validate+0x89/0xf0
Dec 17 23:09:35 x4 kernel: [<ffffffff8126b509>] ? ttm_bo_init+0x2e9/0x3a0
Dec 17 23:09:35 x4 kernel: [<ffffffff8129f84a>] ? radeon_bo_create+0x18a/0x200
Dec 17 23:09:35 x4 kernel: [<ffffffff8129f510>] ? radeon_bo_clear_va+0x40/0x40
Dec 17 23:09:35 x4 kernel: [<ffffffff812b0d42>] ? radeon_gem_object_create+0x92/0x160
Dec 17 23:09:35 x4 kernel: [<ffffffff812b113c>] ? radeon_gem_create_ioctl+0x6c/0x150
Dec 17 23:09:35 x4 kernel: [<ffffffff81252250>] ? drm_ioctl+0x420/0x4f0
Dec 17 23:09:35 x4 kernel: [<ffffffff812b10d0>] ? radeon_gem_pwrite_ioctl+0x20/0x20
Dec 17 23:09:35 x4 kernel: [<ffffffff810521a9>] ? __do_page_fault+0x1a9/0x490
Dec 17 23:09:35 x4 kernel: [<ffffffff810d1ac9>] ? mmap_region+0x169/0x560
Dec 17 23:09:35 x4 kernel: [<ffffffff810f7f84>] ? do_vfs_ioctl+0x2e4/0x4e0
Dec 17 23:09:35 x4 kernel: [<ffffffff810c0e19>] ? vm_mmap_pgoff+0x69/0x80
Dec 17 23:09:35 x4 kernel: [<ffffffff810f81cc>] ? sys_ioctl+0x4c/0xa0
Dec 17 23:09:35 x4 kernel: [<ffffffff814c2a12>] ? system_call_fastpath+0x16/0x1b
Dec 17 23:09:35 x4 kernel: ---[ end trace eb6036661a77c177 ]---
Dec 17 23:09:35 x4 kernel: BUG: unable to handle kernel paging request at ffff8803d9ee4bd8
Dec 17 23:09:35 x4 kernel: IP: [<ffffffff8129d395>] radeon_fence_wait_seq+0x85/0x440
Dec 17 23:09:35 x4 kernel: PGD 180c063 PUD 0
Dec 17 23:09:35 x4 kernel: Oops: 0000 [#1] SMP
Dec 17 23:09:35 x4 kernel: CPU 3
Dec 17 23:09:35 x4 kernel: Pid: 182, comm: X Tainted: G W 3.7.0-09433-ge033059 #155 System manufacturer System Product Name/M4A78T-E
Dec 17 23:09:35 x4 kernel: RIP: 0010:[<ffffffff8129d395>] [<ffffffff8129d395>] radeon_fence_wait_seq+0x85/0x440
Dec 17 23:09:35 x4 kernel: RSP: 0018:ffff880210cc7a38 EFLAGS: 00010282
Dec 17 23:09:35 x4 kernel: RAX: ffff880210cc7a90 RBX: ffff88020674c970 RCX: 0000000000000001
Dec 17 23:09:35 x4 kernel: RDX: 000000000605b580 RSI: 0000000000000058 RDI: ffff8801c7f7dc80
Dec 17 23:09:35 x4 kernel: RBP: ffff8803d9ee4bd8 R08: 0000000000000001 R09: 00000000000002a9
Dec 17 23:09:35 x4 kernel: R10: 00000000000002a8 R11: 0000000000000006 R12: ffff880210ee6981
Dec 17 23:09:35 x4 kernel: R13: 000000000605b580 R14: ffff8801c7f7dc80 R15: ffff8802161864f8
Dec 17 23:09:35 x4 kernel: FS: 00007f5ee88f4880(0000) GS:ffff88021fd80000(0000) knlGS:0000000000000000
Dec 17 23:09:35 x4 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 17 23:09:35 x4 kernel: CR2: ffff8803d9ee4bd8 CR3: 0000000210c63000 CR4: 00000000000007e0
Dec 17 23:09:35 x4 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 17 23:09:35 x4 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 17 23:09:35 x4 kernel: Process X (pid: 182, threadinfo ffff880210cc6000, task ffff880215f45730)
Dec 17 23:09:35 x4 kernel: Stack:
Dec 17 23:09:35 x4 kernel: ffffffff8129de0c 000000000605b580 ffff8803d9ee4080 0000000000000010
Dec 17 23:09:35 x4 kernel: ffff880210cc7aa8 ffff880201cc7a68 ffff880210cc7a90 000000010177c177
Dec 17 23:09:35 x4 kernel: 00000000000000c7 0000000000000001 ffff88020674c890 0000000000000286
Dec 17 23:09:35 x4 kernel: Call Trace:
Dec 17 23:09:35 x4 kernel: [<ffffffff8129de0c>] ? radeon_fence_ref+0x2c/0x40
Dec 17 23:09:35 x4 kernel: [<ffffffff8129dc32>] ? radeon_fence_wait+0x22/0x60
Dec 17 23:09:35 x4 kernel: [<ffffffff8126a06d>] ? ttm_bo_cleanup_refs_and_unlock+0x1bd/0x2c0
Dec 17 23:09:35 x4 kernel: [<ffffffff8126a6f4>] ? ttm_mem_evict_first+0x94/0x1d0
Dec 17 23:09:35 x4 kernel: [<ffffffff8126f9c2>] ? ttm_bo_man_get_node+0x62/0xb0
Dec 17 23:09:35 x4 kernel: [<ffffffff8126aaa1>] ? ttm_bo_mem_space+0x271/0x320
Dec 17 23:09:35 x4 kernel: [<ffffffff8126b0bd>] ? ttm_bo_move_buffer+0xdd/0x150
Dec 17 23:09:35 x4 kernel: [<ffffffff8126b1b9>] ? ttm_bo_validate+0x89/0xf0
Dec 17 23:09:35 x4 kernel: [<ffffffff8126b509>] ? ttm_bo_init+0x2e9/0x3a0
Dec 17 23:09:35 x4 kernel: [<ffffffff8129f84a>] ? radeon_bo_create+0x18a/0x200
Dec 17 23:09:35 x4 kernel: [<ffffffff8129f510>] ? radeon_bo_clear_va+0x40/0x40
Dec 17 23:09:35 x4 kernel: [<ffffffff812b0d42>] ? radeon_gem_object_create+0x92/0x160
Dec 17 23:09:35 x4 kernel: [<ffffffff812b113c>] ? radeon_gem_create_ioctl+0x6c/0x150
Dec 17 23:09:35 x4 kernel: [<ffffffff81252250>] ? drm_ioctl+0x420/0x4f0
Dec 17 23:09:35 x4 kernel: [<ffffffff812b10d0>] ? radeon_gem_pwrite_ioctl+0x20/0x20
Dec 17 23:09:35 x4 kernel: [<ffffffff810521a9>] ? __do_page_fault+0x1a9/0x490
Dec 17 23:09:35 x4 kernel: [<ffffffff810d1ac9>] ? mmap_region+0x169/0x560
Dec 17 23:09:35 x4 kernel: [<ffffffff810f7f84>] ? do_vfs_ioctl+0x2e4/0x4e0
Dec 17 23:09:35 x4 kernel: [<ffffffff810c0e19>] ? vm_mmap_pgoff+0x69/0x80
Dec 17 23:09:35 x4 kernel: [<ffffffff810f81cc>] ? sys_ioctl+0x4c/0xa0
Dec 17 23:09:35 x4 kernel: [<ffffffff814c2a12>] ? system_call_fastpath+0x16/0x1b
Dec 17 23:09:35 x4 kernel: Code: c4 0f 87 77 01 00 00 41 89 df bb 01 00 00 00 44 89 ee 4c 89 f7 e8 ec 5a 01 00 45 85 ff 0f 88 43 03 00 00 84 db 0f 84 57 02 00 00 <48> 8b 45 00 4c 39 e0 0f 83 19 02 00 00 48 8b 44 24 08 48 c1 e0
Dec 17 23:09:35 x4 kernel: RIP [<ffffffff8129d395>] radeon_fence_wait_seq+0x85/0x440
Dec 17 23:09:35 x4 kernel: RSP <ffff880210cc7a38>
Dec 17 23:09:35 x4 kernel: CR2: ffff8803d9ee4bd8
Dec 17 23:09:35 x4 kernel: ---[ end trace eb6036661a77c178 ]---
Dec 17 23:09:35 x4 kernel: [drm:drm_release] *ERROR* Device busy: 1

--
Markus
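Editorial sketch: in the oops above, CR2 holds the address whose access faulted, and comparing it against the general-purpose registers shows which register supplied the bad pointer. The bytes marked in the Code: line (`<48> 8b 45 00`) decode on x86-64 to a 64-bit load through %rbp, which matches RBP being equal to CR2. The helper below automates that comparison on an excerpt of the register dump; the function name `registers_matching_fault` is hypothetical, and the regular expressions are assumptions tuned to the dump format shown in this thread.

```python
import re

# General-purpose registers printed as "RAX: <16 hex digits>"; the leading \b
# keeps CR2/CR3/DR0 and friends from matching, and RSP/RIP use another format.
REG_RE = re.compile(r"\b(R[A-Z0-9]{1,2}): ([0-9a-f]{16})\b")
CR2_RE = re.compile(r"\bCR2: ([0-9a-f]{16})\b")


def registers_matching_fault(oops_text):
    """Return the sorted names of registers whose value equals CR2."""
    cr2 = CR2_RE.search(oops_text)
    if cr2 is None:
        return []
    fault_address = cr2.group(1)
    return sorted(
        name for name, value in REG_RE.findall(oops_text) if value == fault_address
    )


# Excerpt of the first oops in the message above (timestamps stripped).
oops = """
RSP: 0018:ffff880210cc7a38 EFLAGS: 00010282
RAX: ffff880210cc7a90 RBX: ffff88020674c970 RCX: 0000000000000001
RDX: 000000000605b580 RSI: 0000000000000058 RDI: ffff8801c7f7dc80
RBP: ffff8803d9ee4bd8 R08: 0000000000000001 R09: 00000000000002a9
R10: 00000000000002a8 R11: 0000000000000006 R12: ffff880210ee6981
R13: 000000000605b580 R14: ffff8801c7f7dc80 R15: ffff8802161864f8
CR2: ffff8803d9ee4bd8 CR3: 0000000210c63000 CR4: 00000000000007e0
"""

print(registers_matching_fault(oops))
```

A match like this only says which register held the faulting address; working out which pointer in radeon_fence_wait_seq() that register carried still requires the disassembly of the exact kernel build.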
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git
  2012-12-17 22:55 ` Markus Trippelsdorf
@ 2012-12-18 11:20 ` Michel Dänzer
  2012-12-18 13:38 ` Markus Trippelsdorf
  0 siblings, 1 reply; 20+ messages in thread
From: Michel Dänzer @ 2012-12-18 11:20 UTC
To: Markus Trippelsdorf; +Cc: dri-devel

On Mon, 2012-12-17 at 23:55 +0100, Markus Trippelsdorf wrote:
> On 2012.12.17 at 23:25 +0100, Markus Trippelsdorf wrote:
> > On 2012.12.17 at 17:00 -0500, Alex Deucher wrote:
> > > On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf
> > > <markus@trippelsdorf.de> wrote:
> > > > On 2012.12.17 at 16:32 -0500, Alex Deucher wrote:
> > > >> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf
> > > >> <markus@trippelsdorf.de> wrote:
> > > >> > As soon as I open the following website:
> > > >> > http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html
> > > >> >
> > > >> > my Radeon RS780 stalls (GPU lockup) leaving the machine unusable:
> > > >>
> > > >> Is this a regression? Most likely a 3D driver bug unless you are only
> > > >> seeing it with specific kernels. What browser are you using and do
> > > >> you have hw accelerated webgl, etc. enabled? If so, what version of
> > > >> mesa are you using?
> > > >
> > > > This is a regression, because it is caused by yesterday's merge of
> > > > drm-next by Linus. IOW I only see this bug when running a
> > > > v3.7-9432-g9360b53 kernel.
> > >
> > > Can you bisect? I'm guessing it may be related to the new DMA rings. Possibly:
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2
> >
> > Yes, the commit above causes the issue.
> >
> > 2d6cc72 GPU lockups
>
> With 2d6cc72 reverted I get:
>
> Dec 17 23:09:35 x4 kernel: ------------[ cut here ]------------

Probably a separate issue, can you bisect this one as well?

--
Earthling Michel Dänzer | http://www.amd.com
Libre software enthusiast | Debian, X and DRI developer
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git
  2012-12-18 11:20 ` Michel Dänzer
@ 2012-12-18 13:38 ` Markus Trippelsdorf
  2012-12-18 13:51 ` Markus Trippelsdorf
  2012-12-18 15:24 ` Maarten Lankhorst
  0 siblings, 2 replies; 20+ messages in thread
From: Markus Trippelsdorf @ 2012-12-18 13:38 UTC
To: Michel Dänzer; +Cc: dri-devel

On 2012.12.18 at 12:20 +0100, Michel Dänzer wrote:
> On Mon, 2012-12-17 at 23:55 +0100, Markus Trippelsdorf wrote:
> > On 2012.12.17 at 23:25 +0100, Markus Trippelsdorf wrote:
> > > On 2012.12.17 at 17:00 -0500, Alex Deucher wrote:
> > > > On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf
> > > > <markus@trippelsdorf.de> wrote:
> > > > > On 2012.12.17 at 16:32 -0500, Alex Deucher wrote:
> > > > >> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf
> > > > >> <markus@trippelsdorf.de> wrote:
> > > > >> > As soon as I open the following website:
> > > > >> > http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html
> > > > >> >
> > > > >> > my Radeon RS780 stalls (GPU lockup) leaving the machine unusable:
> > > > >>
> > > > >> Is this a regression? Most likely a 3D driver bug unless you are only
> > > > >> seeing it with specific kernels. What browser are you using and do
> > > > >> you have hw accelerated webgl, etc. enabled? If so, what version of
> > > > >> mesa are you using?
> > > > >
> > > > > This is a regression, because it is caused by yesterday's merge of
> > > > > drm-next by Linus. IOW I only see this bug when running a
> > > > > v3.7-9432-g9360b53 kernel.
> > > >
> > > > Can you bisect? I'm guessing it may be related to the new DMA rings. Possibly:
> > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2
> > >
> > > Yes, the commit above causes the issue.
> > >
> > > 2d6cc72 GPU lockups
> >
> > With 2d6cc72 reverted I get:
> >
> > Dec 17 23:09:35 x4 kernel: ------------[ cut here ]------------
>
> Probably a separate issue, can you bisect this one as well?

Yes. Git-bisect points to:

85b144f860176ec18db927d6d9ecdfb24d9c6483 is the first bad commit
commit 85b144f860176ec18db927d6d9ecdfb24d9c6483
Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Date:   Thu Nov 29 11:36:54 2012 +0000

    drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock held, v3

(Please note that this bug is a little bit harder to reproduce. But
when you scroll up and down for ~10 seconds on the webpage mentioned
above it will trigger the oops.
So while I'm not 100% sure that the issue is caused by exactly this
commit, the vicinity should be right)

Dec 18 14:29:07 x4 kernel: ------------[ cut here ]------------
Dec 18 14:29:07 x4 kernel: WARNING: at include/linux/kref.h:42 radeon_fence_ref+0x2c/0x40()
Dec 18 14:29:07 x4 kernel: Hardware name: System Product Name
Dec 18 14:29:07 x4 kernel: Pid: 161, comm: X Not tainted 3.7.0-rc7-00520-g85b144f #168
Dec 18 14:29:07 x4 kernel: Call Trace:
Dec 18 14:29:07 x4 kernel: [<ffffffff81058c84>] ? warn_slowpath_common+0x74/0xb0
Dec 18 14:29:07 x4 kernel: [<ffffffff812926fc>] ? radeon_fence_ref+0x2c/0x40
Dec 18 14:29:07 x4 kernel: [<ffffffff8125e91c>] ? ttm_bo_cleanup_refs_and_unlock+0x17c/0x2c0
Dec 18 14:29:07 x4 kernel: [<ffffffff8125f13c>] ? ttm_mem_evict_first+0x1dc/0x2a0
Dec 18 14:29:07 x4 kernel: [<ffffffff81264412>] ? ttm_bo_man_get_node+0x62/0xb0
Dec 18 14:29:07 x4 kernel: [<ffffffff8125f48e>] ? ttm_bo_mem_space+0x28e/0x340
Dec 18 14:29:07 x4 kernel: [<ffffffff8125facc>] ? ttm_bo_move_buffer+0xfc/0x170
Dec 18 14:29:07 x4 kernel: [<ffffffff810de172>] ? kmem_cache_alloc+0xb2/0xc0
Dec 18 14:29:07 x4 kernel: [<ffffffff8125fbd5>] ? ttm_bo_validate+0x95/0x110
Dec 18 14:29:07 x4 kernel: [<ffffffff8125ff3c>] ? ttm_bo_init+0x2ec/0x3b0
Dec 18 14:29:07 x4 kernel: [<ffffffff8129415a>] ? radeon_bo_create+0x18a/0x200
Dec 18 14:29:07 x4 kernel: [<ffffffff81293e40>] ? radeon_bo_clear_va+0x40/0x40
Dec 18 14:29:07 x4 kernel: [<ffffffff812a5302>] ? radeon_gem_object_create+0x92/0x160
Dec 18 14:29:07 x4 kernel: [<ffffffff812a571c>] ? radeon_gem_create_ioctl+0x6c/0x150
Dec 18 14:29:07 x4 kernel: [<ffffffff81246b60>] ? drm_ioctl+0x420/0x4f0
Dec 18 14:29:07 x4 kernel: [<ffffffff812a56b0>] ? radeon_gem_pwrite_ioctl+0x20/0x20
Dec 18 14:29:07 x4 kernel: [<ffffffff810f53a4>] ? do_vfs_ioctl+0x2e4/0x4e0
Dec 18 14:29:07 x4 kernel: [<ffffffff810e5588>] ? vfs_read+0x118/0x160
Dec 18 14:29:07 x4 kernel: [<ffffffff810f55ec>] ? sys_ioctl+0x4c/0xa0
Dec 18 14:29:07 x4 kernel: [<ffffffff810e5851>] ? sys_read+0x51/0xa0
Dec 18 14:29:07 x4 kernel: [<ffffffff814b05d2>] ? system_call_fastpath+0x16/0x1b
Dec 18 14:29:07 x4 kernel: ---[ end trace c5e6f68fefd3a70b ]---
Dec 18 14:29:07 x4 kernel: BUG: unable to handle kernel paging request at 0000000100000077
Dec 18 14:29:07 x4 kernel: IP: [<ffffffff814afa15>] _raw_spin_lock+0x5/0x30
Dec 18 14:29:07 x4 kernel: PGD 2156c4067 PUD 0
Dec 18 14:29:07 x4 kernel: Oops: 0002 [#1] SMP
Dec 18 14:29:07 x4 kernel: CPU 1
Dec 18 14:29:07 x4 kernel: Pid: 161, comm: X Tainted: G W 3.7.0-rc7-00520-g85b144f #168 System manufacturer System Product Name/M4A78T-E
Dec 18 14:29:07 x4 kernel: RIP: 0010:[<ffffffff814afa15>] [<ffffffff814afa15>] _raw_spin_lock+0x5/0x30
Dec 18 14:29:07 x4 kernel: RSP: 0018:ffff880211645d58 EFLAGS: 00010286
Dec 18 14:29:07 x4 kernel: RAX: 0000000000000100 RBX: ffff8801c0e29448 RCX: 0000000000000000
Dec 18 14:29:07 x4 kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000100000077
Dec 18 14:29:07 x4 kernel: RBP: 00000000ffffffff R08: 0000000000000000 R09: ffffffff81838370
Dec 18 14:29:07 x4 kernel: R10: ffffffff812a5960 R11: 0000000000000246 R12: 0000000000000001
Dec 18 14:29:07 x4 kernel: R13: 0000000000000001 R14: 0000000000000000 R15: 00007fff0723dba0
Dec 18 14:29:07 x4 kernel: FS: 00007f958542f880(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
Dec 18 14:29:07 x4 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 18 14:29:07 x4 kernel: CR2: 0000000100000077 CR3: 000000021161a000 CR4: 00000000000007e0
Dec 18 14:29:07 x4 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 18 14:29:07 x4 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 18 14:29:07 x4 kernel: Process X (pid: 161, threadinfo ffff880211644000, task ffff880215ab85d0)
Dec 18 14:29:07 x4 kernel: Stack:
Dec 18 14:29:07 x4 kernel: ffffffff8125d9ba 0000000015c83600 ffff8801c0e29400 ffff880211645e30
Dec 18 14:29:07 x4 kernel: ffff8801c0e29448 ffff880211645dcc 0000000000000001 ffffffff81294bff
Dec 18 14:29:07 x4 kernel: ffff8801c0e29608 ffff880211645e30 ffff880216a76000 ffff880211645e30
Dec 18 14:29:07 x4 kernel: Call Trace:
Dec 18 14:29:07 x4 kernel: [<ffffffff8125d9ba>] ? ttm_bo_reserve+0x3a/0x110
Dec 18 14:29:07 x4 kernel: [<ffffffff81294bff>] ? radeon_bo_wait+0x3f/0xc0
Dec 18 14:29:07 x4 kernel: [<ffffffff812a59b7>] ? radeon_gem_busy_ioctl+0x57/0x100
Dec 18 14:29:07 x4 kernel: [<ffffffff81246b60>] ? drm_ioctl+0x420/0x4f0
Dec 18 14:29:07 x4 kernel: [<ffffffff812a5960>] ? radeon_gem_mmap_ioctl+0x20/0x20
Dec 18 14:29:07 x4 kernel: [<ffffffff810f53a4>] ? do_vfs_ioctl+0x2e4/0x4e0
Dec 18 14:29:07 x4 kernel: [<ffffffff810e55ad>] ? vfs_read+0x13d/0x160
Dec 18 14:29:07 x4 kernel: [<ffffffff810f55ec>] ? sys_ioctl+0x4c/0xa0
Dec 18 14:29:07 x4 kernel: [<ffffffff810e5851>] ? sys_read+0x51/0xa0
Dec 18 14:29:07 x4 kernel: [<ffffffff814b05d2>] ? system_call_fastpath+0x16/0x1b
Dec 18 14:29:07 x4 kernel: Code: 31 c0 5b c3 66 90 8d 8a 00 01 00 00 89 d0 f0 66 0f b1 0b 66 39 d0 75 de b8 01 00 00 00 5b c3 0f 1f 80 00 00 00 00 b8 00 01 00 00 <f0> 66 0f c1 07 0f b6 d4 38 c2 74 10 0f 1f 80 00 00 00 00 f3 90
Dec 18 14:29:07 x4 kernel: RIP [<ffffffff814afa15>] _raw_spin_lock+0x5/0x30
Dec 18 14:29:07 x4 kernel: RSP <ffff880211645d58>
Dec 18 14:29:07 x4 kernel: CR2: 0000000100000077
Dec 18 14:29:07 x4 kernel: ---[ end trace c5e6f68fefd3a70c ]---
Dec 18 14:29:28 x4 kernel: BUG: unable to handle kernel paging request at 0000000100000023
Dec 18 14:29:28 x4 kernel: IP: [<ffffffff81296448>] radeon_vm_bo_invalidate+0x18/0x30
Dec 18 14:29:28 x4 kernel: PGD 205289067 PUD 0
Dec 18 14:29:28 x4 kernel: Oops: 0002 [#2] SMP
Dec 18 14:29:28 x4 kernel: CPU 1
Dec 18 14:29:28 x4 kernel: Pid: 13, comm: kworker/1:0 Tainted: G D W 3.7.0-rc7-00520-g85b144f #168 System manufacturer System Product Name/M4A78T-E
Dec 18 14:29:28 x4 kernel: RIP: 0010:[<ffffffff81296448>] [<ffffffff81296448>] radeon_vm_bo_invalidate+0x18/0x30
Dec 18 14:29:28 x4 kernel: RSP: 0018:ffff8802168b3d78 EFLAGS: 00010207
Dec 18 14:29:28 x4 kernel: RAX: 00000000ffffffff RBX: ffff8801c0e29048 RCX: ffff8801c0e2b928
Dec 18 14:29:28 x4 kernel: RDX: 0000000000000001 RSI: ffff8801c0e291f0 RDI: 00000000ffffffff
Dec 18 14:29:28 x4 kernel: RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
Dec 18 14:29:28 x4 kernel: R10: ffffea0007038a00 R11: dead000000100100 R12: ffff880216a76590
Dec 18 14:29:28 x4 kernel: R13: ffffffff818383e0 R14: 0000000000000000 R15: ffff880215c83678
Dec 18 14:29:28 x4 kernel: FS: 00007f4bb2b64740(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
Dec 18 14:29:28 x4 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 18 14:29:28 x4 kernel: CR2: 0000000100000023 CR3: 000000020698f000 CR4: 00000000000007e0
Dec 18 14:29:28 x4 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 18 14:29:28 x4 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 18 14:29:28 x4 kernel: Process kworker/1:0 (pid: 13, threadinfo ffff8802168b2000, task ffff88021687d730)
Dec 18 14:29:28 x4 kernel: Stack:
Dec 18 14:29:28 x4 kernel: ffffffff8125d2e9 ffff8801c0e29048 ffffffff8125e8cb ffff880216a769b8
Dec 18 14:29:28 x4 kernel: ffffffff810de82f ffff8801c0e2b848 ffff880215c83678 ffff8801c0e2b900
Dec 18 14:29:28 x4 kernel: 0000000000000001 ffff880216a76a80 ffff8801c0e29048 ffffffff8125eb7d
Dec 18 14:29:28 x4 kernel: Call Trace:
Dec 18 14:29:28 x4 kernel: [<ffffffff8125d2e9>] ? ttm_bo_cleanup_memtype_use+0x19/0x90
Dec 18 14:29:28 x4 kernel: [<ffffffff8125e8cb>] ? ttm_bo_cleanup_refs_and_unlock+0x12b/0x2c0
Dec 18 14:29:28 x4 kernel: [<ffffffff810de82f>] ? kfree+0xf/0xb0
Dec 18 14:29:28 x4 kernel: [<ffffffff8125eb7d>] ? ttm_bo_delayed_delete+0x11d/0x1a0
Dec 18 14:29:28 x4 kernel: [<ffffffff8125ec12>] ? ttm_bo_delayed_workqueue+0x12/0x30
Dec 18 14:29:28 x4 kernel: [<ffffffff8106e5f9>] ? process_one_work+0x179/0x480
Dec 18 14:29:28 x4 kernel: [<ffffffff8125ec00>] ? ttm_bo_delayed_delete+0x1a0/0x1a0
Dec 18 14:29:28 x4 kernel: [<ffffffff8106f5b1>] ? worker_thread+0x1b1/0x540
Dec 18 14:29:28 x4 kernel: [<ffffffff8106f400>] ? busy_worker_rebind_fn+0x100/0x100
Dec 18 14:29:28 x4 kernel: [<ffffffff810741cf>] ? kthread+0xaf/0xc0
Dec 18 14:29:28 x4 kernel: [<ffffffff81074120>] ? __kthread_bind+0x30/0x30
Dec 18 14:29:28 x4 kernel: [<ffffffff814b052c>] ? ret_from_fork+0x7c/0xb0
Dec 18 14:29:28 x4 kernel: [<ffffffff81074120>] ? __kthread_bind+0x30/0x30
Dec 18 14:29:28 x4 kernel: Code: 8b 44 24 04 48 83 c4 08 5b 5d 41 5c c3 66 0f 1f 44 00 00 48 8b 86 f0 01 00 00 48 81 c6 f0 01 00 00 48 39 f0 74 11 0f 1f 44 00 00 <c6> 40 24 00 48 8b 00 48 39 f0 75 f4 f3 c3 66 2e 0f 1f 84 00 00
Dec 18 14:29:28 x4 kernel: RIP [<ffffffff81296448>] radeon_vm_bo_invalidate+0x18/0x30
Dec 18 14:29:28 x4 kernel: RSP <ffff8802168b3d78>
Dec 18 14:29:28 x4 kernel: CR2: 0000000100000023
Dec 18 14:29:28 x4 kernel: ---[ end trace c5e6f68fefd3a70d ]---
Dec 18 14:29:28 x4 kernel: BUG: unable to handle kernel paging request at ffffffffffffffd8
Dec 18 14:29:28 x4 kernel: IP: [<ffffffff81074257>] kthread_data+0x7/0x10
Dec 18 14:29:28 x4 kernel: PGD 180d067 PUD 180e067 PMD 0
Dec 18 14:29:28 x4 kernel: Oops: 0000 [#3] SMP
Dec 18 14:29:28 x4 kernel: CPU 1
Dec 18 14:29:28 x4 kernel: Pid: 13, comm: kworker/1:0 Tainted: G D W 3.7.0-rc7-00520-g85b144f #168 System manufacturer System Product Name/M4A78T-E
Dec 18 14:29:28 x4 kernel: RIP: 0010:[<ffffffff81074257>] [<ffffffff81074257>] kthread_data+0x7/0x10
Dec 18 14:29:28 x4 kernel: RSP: 0018:ffff8802168b3aa0 EFLAGS: 00010002
Dec 18 14:29:28 x4 kernel: RAX: 0000000000000000 RBX: 0000000000000001 RCX: 000000015c7992d1
Dec 18 14:29:28 x4 kernel: RDX: ffffffffff8a8b63 RSI: 0000000000000001 RDI: ffff88021687d730
Dec 18 14:29:28 x4 kernel: RBP: ffff88021687d730 R08: 0000000000000000 R09: 0000000000000000
Dec 18 14:29:28 x4 kernel: R10: ffff880216887980 R11: 0000000000000000 R12: ffff88021fc912c0
Dec 18 14:29:28 x4 kernel: R13: 0000000000000001 R14: ffff88021687d720 R15: ffff88021687d730
Dec 18 14:29:28 x4 kernel: FS: 00007f4bb2b64740(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
Dec 18 14:29:28 x4 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 18 14:29:28 x4 kernel: CR2: ffffffffffffffd8 CR3: 000000020698f000 CR4: 00000000000007e0
Dec 18 14:29:28 x4 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 18 14:29:28 x4 kernel: DR3:
0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Dec 18 14:29:28 x4 kernel: Process kworker/1:0 (pid: 13, threadinfo ffff8802168b2000, task ffff88021687d730) Dec 18 14:29:28 x4 kernel: Stack: Dec 18 14:29:28 x4 kernel: ffffffff8106fb98 ffff88021687d9d0 ffffffff814ae8b5 00000000000112c0 Dec 18 14:29:28 x4 kernel: ffff8802168b3fd8 00000000000112c0 ffff8802168b3fd8 0000000000000001 Dec 18 14:29:28 x4 kernel: ffff88021687d8d8 ffff88021687d720 ffff880216878000 ffff88021687d720 Dec 18 14:29:28 x4 kernel: Call Trace: Dec 18 14:29:28 x4 kernel: [<ffffffff8106fb98>] ? wq_worker_sleeping+0x8/0xb0 Dec 18 14:29:28 x4 kernel: [<ffffffff814ae8b5>] ? __schedule+0x3a5/0x5f0 Dec 18 14:29:28 x4 kernel: [<ffffffff8105dbba>] ? do_exit+0x52a/0x830 Dec 18 14:29:28 x4 kernel: [<ffffffff8103785e>] ? oops_end+0x8e/0xd0 Dec 18 14:29:28 x4 kernel: [<ffffffff814a94c8>] ? no_context+0x251/0x25d Dec 18 14:29:28 x4 kernel: [<ffffffff810512ce>] ? __do_page_fault+0x2ee/0x490 Dec 18 14:29:28 x4 kernel: [<ffffffff81083e18>] ? find_busiest_group+0x28/0x480 Dec 18 14:29:28 x4 kernel: [<ffffffff814b00af>] ? page_fault+0x1f/0x30 Dec 18 14:29:28 x4 kernel: [<ffffffff81296448>] ? radeon_vm_bo_invalidate+0x18/0x30 Dec 18 14:29:28 x4 kernel: [<ffffffff8125d2e9>] ? ttm_bo_cleanup_memtype_use+0x19/0x90 Dec 18 14:29:28 x4 kernel: [<ffffffff8125e8cb>] ? ttm_bo_cleanup_refs_and_unlock+0x12b/0x2c0 Dec 18 14:29:28 x4 kernel: [<ffffffff810de82f>] ? kfree+0xf/0xb0 Dec 18 14:29:28 x4 kernel: [<ffffffff8125eb7d>] ? ttm_bo_delayed_delete+0x11d/0x1a0 Dec 18 14:29:28 x4 kernel: [<ffffffff8125ec12>] ? ttm_bo_delayed_workqueue+0x12/0x30 Dec 18 14:29:28 x4 kernel: [<ffffffff8106e5f9>] ? process_one_work+0x179/0x480 Dec 18 14:29:28 x4 kernel: [<ffffffff8125ec00>] ? ttm_bo_delayed_delete+0x1a0/0x1a0 Dec 18 14:29:28 x4 kernel: [<ffffffff8106f5b1>] ? worker_thread+0x1b1/0x540 Dec 18 14:29:28 x4 kernel: [<ffffffff8106f400>] ? busy_worker_rebind_fn+0x100/0x100 Dec 18 14:29:28 x4 kernel: [<ffffffff810741cf>] ? 
kthread+0xaf/0xc0 Dec 18 14:29:28 x4 kernel: [<ffffffff81074120>] ? __kthread_bind+0x30/0x30 Dec 18 14:29:28 x4 kernel: [<ffffffff814b052c>] ? ret_from_fork+0x7c/0xb0 Dec 18 14:29:28 x4 kernel: [<ffffffff81074120>] ? __kthread_bind+0x30/0x30 Dec 18 14:29:28 x4 kernel: Code: 74 03 c6 03 00 65 48 8b 04 25 c0 b9 00 00 48 8b 80 48 02 00 00 5b 48 8b 40 c8 48 d1 e8 83 e0 01 c3 0f 1f 00 48 8b 87 48 02 00 00 <48> 8b 40 d8 c3 0f 1f 40 00 65 48 8b 04 25 c0 b9 00 00 48 8b b8 Dec 18 14:29:28 x4 kernel: RIP [<ffffffff81074257>] kthread_data+0x7/0x10 Dec 18 14:29:28 x4 kernel: RSP <ffff8802168b3aa0> Dec 18 14:29:28 x4 kernel: CR2: ffffffffffffffd8 Dec 18 14:29:28 x4 kernel: ---[ end trace c5e6f68fefd3a70e ]--- Dec 18 14:29:28 x4 kernel: Fixing recursive fault but reboot is needed! Dec 18 14:29:28 x4 kernel: SysRq : Emergency Sync -- Markus ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git 2012-12-18 13:38 ` Markus Trippelsdorf @ 2012-12-18 13:51 ` Markus Trippelsdorf 2012-12-18 15:24 ` Maarten Lankhorst 1 sibling, 0 replies; 20+ messages in thread From: Markus Trippelsdorf @ 2012-12-18 13:51 UTC (permalink / raw) To: Michel Dänzer; +Cc: dri-devel On 2012.12.18 at 14:38 +0100, Markus Trippelsdorf wrote: > On 2012.12.18 at 12:20 +0100, Michel Dänzer wrote: > > On Mon, 2012-12-17 at 23:55 +0100, Markus Trippelsdorf wrote: > > > On 2012.12.17 at 23:25 +0100, Markus Trippelsdorf wrote: > > > > On 2012.12.17 at 17:00 -0500, Alex Deucher wrote: > > > > > On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf > > > > > <markus@trippelsdorf.de> wrote: > > > > > > On 2012.12.17 at 16:32 -0500, Alex Deucher wrote: > > > > > >> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf > > > > > >> <markus@trippelsdorf.de> wrote: > > > > > >> > As soon as I open the following website: > > > > > >> > http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html > > > > > >> > > > > > > >> > my Radeon RS780 stalls (GPU lockup) leaving the machine unusable: > > > > > >> > > > > > >> Is this a regression? Most likely a 3D driver bug unless you are only > > > > > >> seeing it with specific kernels. What browser are you using and do > > > > > >> you have hw accelerated webgl, etc. enabled? If so, what version of > > > > > >> mesa are you using? > > > > > > > > > > > > This is a regression, because it is caused by yesterdays merge of > > > > > > drm-next by Linus. IOW I only see this bug when running a > > > > > > v3.7-9432-g9360b53 kernel. > > > > > > > > > > Can you bisect? I'm guessing it may be related to the new DMA rings. Possibly: > > > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2 > > > > > > > > Yes, the commit above causes the issue. 
> > > > > > > > 2d6cc72 GPU lockups > > > > > > With 2d6cc72 reverted I get: > > > > > > Dec 17 23:09:35 x4 kernel: ------------[ cut here ]------------ > > > > Probably a separate issue, can you bisect this one as well? > > Yes. Git-bisect points to: > > 85b144f860176ec18db927d6d9ecdfb24d9c6483 is the first bad commit > commit 85b144f860176ec18db927d6d9ecdfb24d9c6483 > Author: Maarten Lankhorst <maarten.lankhorst@canonical.com> > Date: Thu Nov 29 11:36:54 2012 +0000 > > drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock > held, v3 > > (Please note that this bug is a little bit harder to reproduce. But > when you scroll up and down for ~10 seconds on the webpage mentioned > above it will trigger the oops. > So while I'm not 100% sure that the issue is caused by exactly this > commit, the vicinity should be right) > > [...] CCing Maarten -- Markus ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git 2012-12-18 13:38 ` Markus Trippelsdorf 2012-12-18 13:51 ` Markus Trippelsdorf @ 2012-12-18 15:24 ` Maarten Lankhorst 2012-12-18 16:12 ` Markus Trippelsdorf 1 sibling, 1 reply; 20+ messages in thread From: Maarten Lankhorst @ 2012-12-18 15:24 UTC (permalink / raw) To: Markus Trippelsdorf; +Cc: Michel Dänzer, dri-devel Op 18-12-12 14:38, Markus Trippelsdorf schreef: > On 2012.12.18 at 12:20 +0100, Michel Dänzer wrote: >> On Mon, 2012-12-17 at 23:55 +0100, Markus Trippelsdorf wrote: >>> On 2012.12.17 at 23:25 +0100, Markus Trippelsdorf wrote: >>>> On 2012.12.17 at 17:00 -0500, Alex Deucher wrote: >>>>> On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf >>>>> <markus@trippelsdorf.de> wrote: >>>>>> On 2012.12.17 at 16:32 -0500, Alex Deucher wrote: >>>>>>> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf >>>>>>> <markus@trippelsdorf.de> wrote: >>>>>>>> As soon as I open the following website: >>>>>>>> http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html >>>>>>>> >>>>>>>> my Radeon RS780 stalls (GPU lockup) leaving the machine unusable: >>>>>>> Is this a regression? Most likely a 3D driver bug unless you are only >>>>>>> seeing it with specific kernels. What browser are you using and do >>>>>>> you have hw accelerated webgl, etc. enabled? If so, what version of >>>>>>> mesa are you using? >>>>>> This is a regression, because it is caused by yesterdays merge of >>>>>> drm-next by Linus. IOW I only see this bug when running a >>>>>> v3.7-9432-g9360b53 kernel. >>>>> Can you bisect? I'm guessing it may be related to the new DMA rings. Possibly: >>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2 >>>> Yes, the commit above causes the issue. 
>>>> >>>> 2d6cc72 GPU lockups >>> With 2d6cc72 reverted I get: >>> >>> Dec 17 23:09:35 x4 kernel: ------------[ cut here ]------------ >> Probably a separate issue, can you bisect this one as well? > Yes. Git-bisect points to: > > 85b144f860176ec18db927d6d9ecdfb24d9c6483 is the first bad commit > commit 85b144f860176ec18db927d6d9ecdfb24d9c6483 > Author: Maarten Lankhorst <maarten.lankhorst@canonical.com> > Date: Thu Nov 29 11:36:54 2012 +0000 > > drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock > held, v3 > > (Please note that this bug is a little bit harder to reproduce. But > when you scroll up and down for ~10 seconds on the webpage mentioned > above it will trigger the oops. > So while I'm not 100% sure that the issue is caused by exactly this > commit, the vicinity should be right) > Those dmesg warnings sound suspicious; it looks like something is going very wrong there. Can you revert the commit before it, "drm/radeon: allow move_notify to be called without reservation"? The reservation should be held at this point; that commit got in accidentally. I doubt that not holding a reservation is causing it, though, and I don't really see how that commit could cause it either, so can you please double check that it never happened before that point and only started at that commit? Also slap in a BUG_ON(!ttm_bo_is_reserved(bo)) in ttm_bo_cleanup_refs_and_unlock for good measure, and add a BUG_ON(spin_trylock(&bdev->fence_lock)); to ttm_bo_wait. I really don't see how that specific commit can be wrong, though, so I'm awaiting your results before I try to dig more into it. ~Maarten ^ permalink raw reply [flat|nested] 20+ messages in thread
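For readers unfamiliar with the idiom, the second assertion requested above, BUG_ON(spin_trylock(&bdev->fence_lock)), checks that a lock is already held: if the trylock succeeds, nobody held the lock and the assertion fires. The following is only a rough userspace sketch of that idea, not kernel code. It assumes a normal (non-recursive) pthread mutex as a stand-in for the spinlock and assert() as a stand-in for BUG_ON(); every demo_* name is hypothetical and does not exist in TTM or radeon.

```c
/*
 * Userspace analogy only: a pthread mutex stands in for the kernel
 * spinlock, and assert() stands in for BUG_ON(). All demo_* names are
 * hypothetical and do not exist in TTM or radeon.
 */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

struct demo_bo {
	pthread_mutex_t fence_lock;	/* plays the role of bdev->fence_lock */
	int fence_signaled;		/* 1 once the pretend GPU work is done */
};

/*
 * Return 1 if the lock is currently held, 0 otherwise. If trylock
 * succeeds, nobody held the lock, so release it again right away.
 * Assumes a normal (non-recursive) mutex.
 */
static int demo_lock_is_held(pthread_mutex_t *lock)
{
	int ret = pthread_mutex_trylock(lock);

	if (ret == 0) {
		pthread_mutex_unlock(lock);
		return 0;
	}
	return ret == EBUSY;
}

/*
 * Must be called with fence_lock held. The assert mirrors
 * BUG_ON(spin_trylock(&bdev->fence_lock)): it fires when the lock
 * could be taken, i.e. when the caller forgot to take it.
 */
static int demo_bo_wait(struct demo_bo *bo)
{
	assert(demo_lock_is_held(&bo->fence_lock));
	return bo->fence_signaled ? 0 : -EBUSY;
}
```

A caller would take fence_lock, call demo_bo_wait(), and then drop the lock. Note that a check of this kind can only show that somebody holds the lock, not that the calling thread is the holder, so it catches a missing lock but not every locking mistake.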
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git 2012-12-18 15:24 ` Maarten Lankhorst @ 2012-12-18 16:12 ` Markus Trippelsdorf 2012-12-18 18:10 ` Maarten Lankhorst 2012-12-19 13:57 ` Maarten Lankhorst 0 siblings, 2 replies; 20+ messages in thread From: Markus Trippelsdorf @ 2012-12-18 16:12 UTC (permalink / raw) To: Maarten Lankhorst; +Cc: Michel Dänzer, dri-devel On 2012.12.18 at 16:24 +0100, Maarten Lankhorst wrote: > Op 18-12-12 14:38, Markus Trippelsdorf schreef: > > On 2012.12.18 at 12:20 +0100, Michel Dänzer wrote: > >> On Mon, 2012-12-17 at 23:55 +0100, Markus Trippelsdorf wrote: > >>> On 2012.12.17 at 23:25 +0100, Markus Trippelsdorf wrote: > >>>> On 2012.12.17 at 17:00 -0500, Alex Deucher wrote: > >>>>> On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf > >>>>> <markus@trippelsdorf.de> wrote: > >>>>>> On 2012.12.17 at 16:32 -0500, Alex Deucher wrote: > >>>>>>> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf > >>>>>>> <markus@trippelsdorf.de> wrote: > >>>>>>>> As soon as I open the following website: > >>>>>>>> http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html > >>>>>>>> > >>>>>>>> my Radeon RS780 stalls (GPU lockup) leaving the machine unusable: > >>>>>>> Is this a regression? Most likely a 3D driver bug unless you are only > >>>>>>> seeing it with specific kernels. What browser are you using and do > >>>>>>> you have hw accelerated webgl, etc. enabled? If so, what version of > >>>>>>> mesa are you using? > >>>>>> This is a regression, because it is caused by yesterdays merge of > >>>>>> drm-next by Linus. IOW I only see this bug when running a > >>>>>> v3.7-9432-g9360b53 kernel. > >>>>> Can you bisect? I'm guessing it may be related to the new DMA rings. Possibly: > >>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2 > >>>> Yes, the commit above causes the issue. 
> >>>> > >>>> 2d6cc72 GPU lockups > >>> With 2d6cc72 reverted I get: > >>> > >>> Dec 17 23:09:35 x4 kernel: ------------[ cut here ]------------ > >> Probably a separate issue, can you bisect this one as well? > > Yes. Git-bisect points to: > > > > 85b144f860176ec18db927d6d9ecdfb24d9c6483 is the first bad commit > > commit 85b144f860176ec18db927d6d9ecdfb24d9c6483 > > Author: Maarten Lankhorst <maarten.lankhorst@canonical.com> > > Date: Thu Nov 29 11:36:54 2012 +0000 > > > > drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock > > held, v3 > > > > (Please note that this bug is a little bit harder to reproduce. But > > when you scroll up and down for ~10 seconds on the webpage mentioned > > above it will trigger the oops. > > So while I'm not 100% sure that the issue is caused by exactly this > > commit, the vicinity should be right) > > > Those dmesg warnings sound suspicious, looks like something is going > very wrong there. > > Can you revert the one before it? "drm/radeon: allow move_notify to be > called without reservation" Reservation should be held at this point, > that commit got in accidentally. > > I doubt not holding a reservation is causing it though, I don't really > see how that commit could cause it however, so can you please double > check it never happened before that point, and only started at that > commit? > > also slap in a BUG_ON(!ttm_bo_is_reserved(bo)) in > ttm_bo_cleanup_refs_and_unlock for good measure, and a > BUG_ON(spin_trylock(&bdev->fence_lock)); to ttm_bo_wait. > > I really don't see how that specific commit can be wrong though, so > awaiting your results first before I try to dig more into it. I just reran git-bisect on your commits only (from 1a1494def to 97a875cbd) and I landed on the same commit as above: commit 85b144f86 (drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock held, v3). So now I'm pretty sure it's specifically this commit that started the issue.
With your supposed debugging BUG_ONs added I still get: Dec 18 17:01:15 x4 kernel: ------------[ cut here ]------------ Dec 18 17:01:15 x4 kernel: WARNING: at include/linux/kref.h:42 radeon_fence_ref+0x2c/0x40() Dec 18 17:01:15 x4 kernel: Hardware name: System Product Name Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Not tainted 3.7.0-rc7-00520-g85b144f-dirty #174 Dec 18 17:01:15 x4 kernel: Call Trace: Dec 18 17:01:15 x4 kernel: [<ffffffff81058c84>] ? warn_slowpath_common+0x74/0xb0 Dec 18 17:01:15 x4 kernel: [<ffffffff8129273c>] ? radeon_fence_ref+0x2c/0x40 Dec 18 17:01:15 x4 kernel: [<ffffffff8125e95c>] ? ttm_bo_cleanup_refs_and_unlock+0x18c/0x2d0 Dec 18 17:01:15 x4 kernel: [<ffffffff8125f17c>] ? ttm_mem_evict_first+0x1dc/0x2a0 Dec 18 17:01:15 x4 kernel: [<ffffffff81264452>] ? ttm_bo_man_get_node+0x62/0xb0 Dec 18 17:01:15 x4 kernel: [<ffffffff8125f4ce>] ? ttm_bo_mem_space+0x28e/0x340 Dec 18 17:01:15 x4 kernel: [<ffffffff8125fb0c>] ? ttm_bo_move_buffer+0xfc/0x170 Dec 18 17:01:15 x4 kernel: [<ffffffff810de172>] ? kmem_cache_alloc+0xb2/0xc0 Dec 18 17:01:15 x4 kernel: [<ffffffff8125fc15>] ? ttm_bo_validate+0x95/0x110 Dec 18 17:01:15 x4 kernel: [<ffffffff8125ff7c>] ? ttm_bo_init+0x2ec/0x3b0 Dec 18 17:01:15 x4 kernel: [<ffffffff8129419a>] ? radeon_bo_create+0x18a/0x200 Dec 18 17:01:15 x4 kernel: [<ffffffff81293e80>] ? radeon_bo_clear_va+0x40/0x40 Dec 18 17:01:15 x4 kernel: [<ffffffff812a5342>] ? radeon_gem_object_create+0x92/0x160 Dec 18 17:01:15 x4 kernel: [<ffffffff812a575c>] ? radeon_gem_create_ioctl+0x6c/0x150 Dec 18 17:01:15 x4 kernel: [<ffffffff812a529f>] ? radeon_gem_object_free+0x2f/0x40 Dec 18 17:01:15 x4 kernel: [<ffffffff81246b60>] ? drm_ioctl+0x420/0x4f0 Dec 18 17:01:15 x4 kernel: [<ffffffff812a56f0>] ? radeon_gem_pwrite_ioctl+0x20/0x20 Dec 18 17:01:15 x4 kernel: [<ffffffff810f53a4>] ? do_vfs_ioctl+0x2e4/0x4e0 Dec 18 17:01:15 x4 kernel: [<ffffffff810e5588>] ? vfs_read+0x118/0x160 Dec 18 17:01:15 x4 kernel: [<ffffffff810f55ec>] ? 
sys_ioctl+0x4c/0xa0 Dec 18 17:01:15 x4 kernel: [<ffffffff810e5851>] ? sys_read+0x51/0xa0 Dec 18 17:01:15 x4 kernel: [<ffffffff814b0612>] ? system_call_fastpath+0x16/0x1b Dec 18 17:01:15 x4 kernel: ---[ end trace 485a2dd5755db51e ]--- Dec 18 17:01:15 x4 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000024 Dec 18 17:01:15 x4 kernel: IP: [<ffffffff81296488>] radeon_vm_bo_invalidate+0x18/0x30 Dec 18 17:01:15 x4 kernel: PGD 211d09067 PUD 211d52067 PMD 0 Dec 18 17:01:15 x4 kernel: Oops: 0002 [#1] SMP Dec 18 17:01:15 x4 kernel: CPU 1 Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Tainted: G W 3.7.0-rc7-00520-g85b144f-dirty #174 System manufacturer System Product Name/M4A78T-E Dec 18 17:01:15 x4 kernel: RIP: 0010:[<ffffffff81296488>] [<ffffffff81296488>] radeon_vm_bo_invalidate+0x18/0x30 Dec 18 17:01:15 x4 kernel: RSP: 0018:ffff880211ddfaa8 EFLAGS: 00010203 Dec 18 17:01:15 x4 kernel: RAX: 0000000000000000 RBX: ffff8801f94e1c48 RCX: ffff880205de3128 Dec 18 17:01:15 x4 kernel: RDX: 0000000000000001 RSI: ffff8801f94e1df0 RDI: ffff8801f94e1df8 Dec 18 17:01:15 x4 kernel: RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 Dec 18 17:01:15 x4 kernel: R10: 0000000000000000 R11: ffff880216a766b8 R12: ffff880216a76590 Dec 18 17:01:15 x4 kernel: R13: ffffffff818383e0 R14: 0000000000000001 R15: ffff880215c83678 Dec 18 17:01:15 x4 kernel: FS: 00007fbcabc8c880(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000 Dec 18 17:01:15 x4 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Dec 18 17:01:15 x4 kernel: CR2: 0000000000000024 CR3: 0000000211d07000 CR4: 00000000000007e0 Dec 18 17:01:15 x4 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Dec 18 17:01:15 x4 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Dec 18 17:01:15 x4 kernel: Process X (pid: 157, threadinfo ffff880211dde000, task ffff880211dc0ba0) Dec 18 17:01:15 x4 kernel: Stack: Dec 18 17:01:15 x4 kernel: ffffffff8125d2e9 
ffff8801f94e1c48 ffffffff8125e909 ffff880216a769b8 Dec 18 17:01:15 x4 kernel: 01ff880200000001 ffff8801f94e1c84 0000000000000001 ffff880216a766b8 Dec 18 17:01:15 x4 kernel: 0000000000000000 ffff880215c83678 ffff8801f94e1c48 ffffffff8125f17c Dec 18 17:01:15 x4 kernel: Call Trace: Dec 18 17:01:15 x4 kernel: [<ffffffff8125d2e9>] ? ttm_bo_cleanup_memtype_use+0x19/0x90 Dec 18 17:01:15 x4 kernel: [<ffffffff8125e909>] ? ttm_bo_cleanup_refs_and_unlock+0x139/0x2d0 Dec 18 17:01:15 x4 kernel: [<ffffffff8125f17c>] ? ttm_mem_evict_first+0x1dc/0x2a0 Dec 18 17:01:15 x4 kernel: [<ffffffff81264452>] ? ttm_bo_man_get_node+0x62/0xb0 Dec 18 17:01:15 x4 kernel: [<ffffffff8125f4ce>] ? ttm_bo_mem_space+0x28e/0x340 Dec 18 17:01:15 x4 kernel: [<ffffffff8125fb0c>] ? ttm_bo_move_buffer+0xfc/0x170 Dec 18 17:01:15 x4 kernel: [<ffffffff810de172>] ? kmem_cache_alloc+0xb2/0xc0 Dec 18 17:01:15 x4 kernel: [<ffffffff8125fc15>] ? ttm_bo_validate+0x95/0x110 Dec 18 17:01:15 x4 kernel: [<ffffffff8125ff7c>] ? ttm_bo_init+0x2ec/0x3b0 Dec 18 17:01:15 x4 kernel: [<ffffffff8129419a>] ? radeon_bo_create+0x18a/0x200 Dec 18 17:01:15 x4 kernel: [<ffffffff81293e80>] ? radeon_bo_clear_va+0x40/0x40 Dec 18 17:01:15 x4 kernel: [<ffffffff812a5342>] ? radeon_gem_object_create+0x92/0x160 Dec 18 17:01:15 x4 kernel: [<ffffffff812a575c>] ? radeon_gem_create_ioctl+0x6c/0x150 Dec 18 17:01:15 x4 kernel: [<ffffffff81246b60>] ? drm_ioctl+0x420/0x4f0 Dec 18 17:01:15 x4 kernel: [<ffffffff812a56f0>] ? radeon_gem_pwrite_ioctl+0x20/0x20 Dec 18 17:01:15 x4 kernel: [<ffffffff8111c310>] ? fsnotify_clear_marks_by_inode+0x20/0xd0 Dec 18 17:01:15 x4 kernel: [<ffffffff810fbc35>] ? __destroy_inode+0x15/0x60 Dec 18 17:01:15 x4 kernel: [<ffffffff810de220>] ? kmem_cache_free+0x10/0x90 Dec 18 17:01:15 x4 kernel: [<ffffffff810f8eaf>] ? dput+0x2f/0x300 Dec 18 17:01:15 x4 kernel: [<ffffffff810f53a4>] ? do_vfs_ioctl+0x2e4/0x4e0 Dec 18 17:01:15 x4 kernel: [<ffffffff811005fb>] ? 
mntput_no_expire+0x7b/0x170 Dec 18 17:01:15 x4 kernel: [<ffffffff8107bb6b>] ? lg_global_unlock+0x3b/0x50 Dec 18 17:01:15 x4 kernel: [<ffffffff81071b9c>] ? task_work_run+0x8c/0xc0 Dec 18 17:01:15 x4 kernel: [<ffffffff810f55ec>] ? sys_ioctl+0x4c/0xa0 Dec 18 17:01:15 x4 kernel: [<ffffffff814b0612>] ? system_call_fastpath+0x16/0x1b Dec 18 17:01:15 x4 kernel: Code: 8b 44 24 04 48 83 c4 08 5b 5d 41 5c c3 66 0f 1f 44 00 00 48 8b 86 f0 01 00 00 48 81 c6 f0 01 00 00 48 39 f0 74 11 0f 1f 44 00 00 <c6> 40 24 00 48 8b 00 48 39 f0 75 f4 f3 c3 66 2e 0f 1f 84 00 00 Dec 18 17:01:15 x4 kernel: RIP [<ffffffff81296488>] radeon_vm_bo_invalidate+0x18/0x30 Dec 18 17:01:15 x4 kernel: RSP <ffff880211ddfaa8> Dec 18 17:01:15 x4 kernel: CR2: 0000000000000024 Dec 18 17:01:15 x4 kernel: ---[ end trace 485a2dd5755db51f ]--- Dec 18 17:01:15 x4 kernel: [drm:drm_release] *ERROR* Device busy: 1 -- Markus ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git 2012-12-18 16:12 ` Markus Trippelsdorf @ 2012-12-18 18:10 ` Maarten Lankhorst 2012-12-19 13:57 ` Maarten Lankhorst 1 sibling, 0 replies; 20+ messages in thread From: Maarten Lankhorst @ 2012-12-18 18:10 UTC (permalink / raw) To: Markus Trippelsdorf; +Cc: Michel Dänzer, dri-devel Op 18-12-12 17:12, Markus Trippelsdorf schreef: > On 2012.12.18 at 16:24 +0100, Maarten Lankhorst wrote: >> Op 18-12-12 14:38, Markus Trippelsdorf schreef: >>> On 2012.12.18 at 12:20 +0100, Michel Dänzer wrote: >>>> On Mon, 2012-12-17 at 23:55 +0100, Markus Trippelsdorf wrote: >>>>> On 2012.12.17 at 23:25 +0100, Markus Trippelsdorf wrote: >>>>>> On 2012.12.17 at 17:00 -0500, Alex Deucher wrote: >>>>>>> On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf >>>>>>> <markus@trippelsdorf.de> wrote: >>>>>>>> On 2012.12.17 at 16:32 -0500, Alex Deucher wrote: >>>>>>>>> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf >>>>>>>>> <markus@trippelsdorf.de> wrote: >>>>>>>>>> As soon as I open the following website: >>>>>>>>>> http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html >>>>>>>>>> >>>>>>>>>> my Radeon RS780 stalls (GPU lockup) leaving the machine unusable: >>>>>>>>> Is this a regression? Most likely a 3D driver bug unless you are only >>>>>>>>> seeing it with specific kernels. What browser are you using and do >>>>>>>>> you have hw accelerated webgl, etc. enabled? If so, what version of >>>>>>>>> mesa are you using? >>>>>>>> This is a regression, because it is caused by yesterdays merge of >>>>>>>> drm-next by Linus. IOW I only see this bug when running a >>>>>>>> v3.7-9432-g9360b53 kernel. >>>>>>> Can you bisect? I'm guessing it may be related to the new DMA rings. Possibly: >>>>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2 >>>>>> Yes, the commit above causes the issue. 
>>>>>> >>>>>> 2d6cc72 GPU lockups >>>>> With 2d6cc72 reverted I get: >>>>> >>>>> Dec 17 23:09:35 x4 kernel: ------------[ cut here ]------------ >>>> Probably a separate issue, can you bisect this one as well? >>> Yes. Git-bisect points to: >>> >>> 85b144f860176ec18db927d6d9ecdfb24d9c6483 is the first bad commit >>> commit 85b144f860176ec18db927d6d9ecdfb24d9c6483 >>> Author: Maarten Lankhorst <maarten.lankhorst@canonical.com> >>> Date: Thu Nov 29 11:36:54 2012 +0000 >>> >>> drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock >>> held, v3 >>> >>> (Please note that this bug is a little bit harder to reproduce. But >>> when you scroll up and down for ~10 seconds on the webpage mentioned >>> above it will trigger the oops. >>> So while I'm not 100% sure that the issue is caused by exactly this >>> commit, the vicinity should be right) >>> >> Those dmesg warnings sound suspicious, looks like something is going >> very wrong there. >> >> Can you revert the one before it? "drm/radeon: allow move_notify to be >> called without reservation" Reservation should be held at this point, >> that commit got in accidentally. >> >> I doubt not holding a reservation is causing it though, I don't really >> see how that commit could cause it however, so can you please double >> check it never happened before that point, and only started at that >> commit? >> >> also slap in a BUG_ON(!ttm_bo_is_reserved(bo)) in >> ttm_bo_cleanup_refs_and_unlock for good measure, and a >> BUG_ON(spin_trylock(&bdev->fence_lock)); to ttm_bo_wait. >> >> I really don't see how that specific commit can be wrong though, so >> awaiting your results first before I try to dig more into it. > I just reran git-bisect just on your commits (from 1a1494def to 97a875cbd) > and I landed on the same commit as above: > > commit 85b144f86 (drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock held, v3) > > So now I'm pretty sure it's specifically this commit that started the > issue. 
> > With your supposed debugging BUG_ONs added I still get: > > Dec 18 17:01:15 x4 kernel: ------------[ cut here ]------------ > Dec 18 17:01:15 x4 kernel: WARNING: at include/linux/kref.h:42 radeon_fence_ref+0x2c/0x40() > Dec 18 17:01:15 x4 kernel: Hardware name: System Product Name > Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Not tainted 3.7.0-rc7-00520-g85b144f-dirty #174 > Dec 18 17:01:15 x4 kernel: Call Trace: > Dec 18 17:01:15 x4 kernel: [<ffffffff81058c84>] ? warn_slowpath_common+0x74/0xb0 > Dec 18 17:01:15 x4 kernel: [<ffffffff8129273c>] ? radeon_fence_ref+0x2c/0x40 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125e95c>] ? ttm_bo_cleanup_refs_and_unlock+0x18c/0x2d0 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125f17c>] ? ttm_mem_evict_first+0x1dc/0x2a0 > Dec 18 17:01:15 x4 kernel: [<ffffffff81264452>] ? ttm_bo_man_get_node+0x62/0xb0 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125f4ce>] ? ttm_bo_mem_space+0x28e/0x340 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125fb0c>] ? ttm_bo_move_buffer+0xfc/0x170 > Dec 18 17:01:15 x4 kernel: [<ffffffff810de172>] ? kmem_cache_alloc+0xb2/0xc0 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125fc15>] ? ttm_bo_validate+0x95/0x110 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125ff7c>] ? ttm_bo_init+0x2ec/0x3b0 > Dec 18 17:01:15 x4 kernel: [<ffffffff8129419a>] ? radeon_bo_create+0x18a/0x200 > Dec 18 17:01:15 x4 kernel: [<ffffffff81293e80>] ? radeon_bo_clear_va+0x40/0x40 > Dec 18 17:01:15 x4 kernel: [<ffffffff812a5342>] ? radeon_gem_object_create+0x92/0x160 > Dec 18 17:01:15 x4 kernel: [<ffffffff812a575c>] ? radeon_gem_create_ioctl+0x6c/0x150 > Dec 18 17:01:15 x4 kernel: [<ffffffff812a529f>] ? radeon_gem_object_free+0x2f/0x40 > Dec 18 17:01:15 x4 kernel: [<ffffffff81246b60>] ? drm_ioctl+0x420/0x4f0 > Dec 18 17:01:15 x4 kernel: [<ffffffff812a56f0>] ? radeon_gem_pwrite_ioctl+0x20/0x20 > Dec 18 17:01:15 x4 kernel: [<ffffffff810f53a4>] ? do_vfs_ioctl+0x2e4/0x4e0 > Dec 18 17:01:15 x4 kernel: [<ffffffff810e5588>] ? 
vfs_read+0x118/0x160
> Dec 18 17:01:15 x4 kernel: [<ffffffff810f55ec>] ? sys_ioctl+0x4c/0xa0
> Dec 18 17:01:15 x4 kernel: [<ffffffff810e5851>] ? sys_read+0x51/0xa0
> Dec 18 17:01:15 x4 kernel: [<ffffffff814b0612>] ? system_call_fastpath+0x16/0x1b

So nothing changed.. did you revert the drm/radeon patch before it yet?

And wtf is going on here? That patch shouldn't cause such issues by itself, and I don't see how the refcount on bo->sync_obj can be zero, with bo->sync_obj non-null. Refcounting seems to be messed up on the fence somewhere, but I don't think it's caused by this patch..

~Maarten

^ permalink raw reply [flat|nested] 20+ messages in thread
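The condition described in the message above (a non-NULL pointer to an object whose refcount is already zero) can be illustrated with a small userspace sketch. This is only a toy model: `struct toy_fence`, `toy_fence_ref`, `toy_fence_unref` and `toy_stale_pointer_warns` are hypothetical names invented for illustration and are not the kernel's kref or radeon fence code. It assumes the warning in the log corresponds to taking a reference on an object whose count has already dropped to zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a refcounted fence; NOT the kernel's struct kref. */
struct toy_fence {
    int refcount;   /* 0 means the last reference was already dropped */
    bool warned;    /* set when a reference is taken on a dead object */
};

/* Take a reference and record the "get on zero refcount" condition. */
static struct toy_fence *toy_fence_ref(struct toy_fence *f)
{
    if (f->refcount == 0)
        f->warned = true;   /* assumed analogue of the WARNING in the log */
    f->refcount++;
    return f;
}

/* Drop a reference; returns true when the last reference went away. */
static bool toy_fence_unref(struct toy_fence *f)
{
    assert(f->refcount > 0);
    f->refcount--;
    return f->refcount == 0;
}

/*
 * A holder keeps a raw pointer without owning a reference, the last real
 * reference is dropped, and the holder then tries to take a reference.
 */
static bool toy_stale_pointer_warns(void)
{
    struct toy_fence fence = { .refcount = 1, .warned = false };
    struct toy_fence *stale = &fence;   /* borrowed, no reference taken */

    toy_fence_unref(&fence);            /* count reaches zero */
    toy_fence_ref(stale);               /* pointer non-NULL, object dead */
    return fence.warned;
}
```

In this model the warning can only fire when some pointer outlives the references that were supposed to protect it, which is why the message above treats it as a refcounting bug rather than an ordinary NULL check failure.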
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git 2012-12-18 16:12 ` Markus Trippelsdorf 2012-12-18 18:10 ` Maarten Lankhorst @ 2012-12-19 13:57 ` Maarten Lankhorst 2012-12-19 14:20 ` Markus Trippelsdorf 1 sibling, 1 reply; 20+ messages in thread From: Maarten Lankhorst @ 2012-12-19 13:57 UTC (permalink / raw) To: Markus Trippelsdorf; +Cc: Michel Dänzer, dri-devel Op 18-12-12 17:12, Markus Trippelsdorf schreef: > On 2012.12.18 at 16:24 +0100, Maarten Lankhorst wrote: >> Op 18-12-12 14:38, Markus Trippelsdorf schreef: >>> On 2012.12.18 at 12:20 +0100, Michel Dänzer wrote: >>>> On Mon, 2012-12-17 at 23:55 +0100, Markus Trippelsdorf wrote: >>>>> On 2012.12.17 at 23:25 +0100, Markus Trippelsdorf wrote: >>>>>> On 2012.12.17 at 17:00 -0500, Alex Deucher wrote: >>>>>>> On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf >>>>>>> <markus@trippelsdorf.de> wrote: >>>>>>>> On 2012.12.17 at 16:32 -0500, Alex Deucher wrote: >>>>>>>>> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf >>>>>>>>> <markus@trippelsdorf.de> wrote: >>>>>>>>>> As soon as I open the following website: >>>>>>>>>> http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html >>>>>>>>>> >>>>>>>>>> my Radeon RS780 stalls (GPU lockup) leaving the machine unusable: >>>>>>>>> Is this a regression? Most likely a 3D driver bug unless you are only >>>>>>>>> seeing it with specific kernels. What browser are you using and do >>>>>>>>> you have hw accelerated webgl, etc. enabled? If so, what version of >>>>>>>>> mesa are you using? >>>>>>>> This is a regression, because it is caused by yesterdays merge of >>>>>>>> drm-next by Linus. IOW I only see this bug when running a >>>>>>>> v3.7-9432-g9360b53 kernel. >>>>>>> Can you bisect? I'm guessing it may be related to the new DMA rings. Possibly: >>>>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2 >>>>>> Yes, the commit above causes the issue. 
>>>>>> >>>>>> 2d6cc72 GPU lockups >>>>> With 2d6cc72 reverted I get: >>>>> >>>>> Dec 17 23:09:35 x4 kernel: ------------[ cut here ]------------ >>>> Probably a separate issue, can you bisect this one as well? >>> Yes. Git-bisect points to: >>> >>> 85b144f860176ec18db927d6d9ecdfb24d9c6483 is the first bad commit >>> commit 85b144f860176ec18db927d6d9ecdfb24d9c6483 >>> Author: Maarten Lankhorst <maarten.lankhorst@canonical.com> >>> Date: Thu Nov 29 11:36:54 2012 +0000 >>> >>> drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock >>> held, v3 >>> >>> (Please note that this bug is a little bit harder to reproduce. But >>> when you scroll up and down for ~10 seconds on the webpage mentioned >>> above it will trigger the oops. >>> So while I'm not 100% sure that the issue is caused by exactly this >>> commit, the vicinity should be right) >>> >> Those dmesg warnings sound suspicious, looks like something is going >> very wrong there. >> >> Can you revert the one before it? "drm/radeon: allow move_notify to be >> called without reservation" Reservation should be held at this point, >> that commit got in accidentally. >> >> I doubt not holding a reservation is causing it though, I don't really >> see how that commit could cause it however, so can you please double >> check it never happened before that point, and only started at that >> commit? >> >> also slap in a BUG_ON(!ttm_bo_is_reserved(bo)) in >> ttm_bo_cleanup_refs_and_unlock for good measure, and a >> BUG_ON(spin_trylock(&bdev->fence_lock)); to ttm_bo_wait. >> >> I really don't see how that specific commit can be wrong though, so >> awaiting your results first before I try to dig more into it. > I just reran git-bisect just on your commits (from 1a1494def to 97a875cbd) > and I landed on the same commit as above: > > commit 85b144f86 (drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock held, v3) > > So now I'm pretty sure it's specifically this commit that started the > issue. 
> > With your supposed debugging BUG_ONs added I still get: > > Dec 18 17:01:15 x4 kernel: ------------[ cut here ]------------ > Dec 18 17:01:15 x4 kernel: WARNING: at include/linux/kref.h:42 radeon_fence_ref+0x2c/0x40() > Dec 18 17:01:15 x4 kernel: Hardware name: System Product Name > Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Not tainted 3.7.0-rc7-00520-g85b144f-dirty #174 > Dec 18 17:01:15 x4 kernel: Call Trace: > Dec 18 17:01:15 x4 kernel: [<ffffffff81058c84>] ? warn_slowpath_common+0x74/0xb0 > Dec 18 17:01:15 x4 kernel: [<ffffffff8129273c>] ? radeon_fence_ref+0x2c/0x40 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125e95c>] ? ttm_bo_cleanup_refs_and_unlock+0x18c/0x2d0 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125f17c>] ? ttm_mem_evict_first+0x1dc/0x2a0 > Dec 18 17:01:15 x4 kernel: [<ffffffff81264452>] ? ttm_bo_man_get_node+0x62/0xb0 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125f4ce>] ? ttm_bo_mem_space+0x28e/0x340 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125fb0c>] ? ttm_bo_move_buffer+0xfc/0x170 > Dec 18 17:01:15 x4 kernel: [<ffffffff810de172>] ? kmem_cache_alloc+0xb2/0xc0 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125fc15>] ? ttm_bo_validate+0x95/0x110 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125ff7c>] ? ttm_bo_init+0x2ec/0x3b0 > Dec 18 17:01:15 x4 kernel: [<ffffffff8129419a>] ? radeon_bo_create+0x18a/0x200 > Dec 18 17:01:15 x4 kernel: [<ffffffff81293e80>] ? radeon_bo_clear_va+0x40/0x40 > Dec 18 17:01:15 x4 kernel: [<ffffffff812a5342>] ? radeon_gem_object_create+0x92/0x160 > Dec 18 17:01:15 x4 kernel: [<ffffffff812a575c>] ? radeon_gem_create_ioctl+0x6c/0x150 > Dec 18 17:01:15 x4 kernel: [<ffffffff812a529f>] ? radeon_gem_object_free+0x2f/0x40 > Dec 18 17:01:15 x4 kernel: [<ffffffff81246b60>] ? drm_ioctl+0x420/0x4f0 > Dec 18 17:01:15 x4 kernel: [<ffffffff812a56f0>] ? radeon_gem_pwrite_ioctl+0x20/0x20 > Dec 18 17:01:15 x4 kernel: [<ffffffff810f53a4>] ? do_vfs_ioctl+0x2e4/0x4e0 > Dec 18 17:01:15 x4 kernel: [<ffffffff810e5588>] ? 
vfs_read+0x118/0x160
> Dec 18 17:01:15 x4 kernel: [<ffffffff810f55ec>] ? sys_ioctl+0x4c/0xa0
> Dec 18 17:01:15 x4 kernel: [<ffffffff810e5851>] ? sys_read+0x51/0xa0
> Dec 18 17:01:15 x4 kernel: [<ffffffff814b0612>] ? system_call_fastpath+0x16/0x1b

so the kref to fence is null here. This should be impossible and indicates a bug in refcounting somewhere, or possibly memory corruption.

Let's first look where things could go wrong..

sync_obj member requires fence_lock to be taken, but radeon code in general doesn't do that, hm..

I think radeon_cs_sync_rings needs to take fence_lock during the iteration, then taking on a refcount to the fence, and radeon_crtc_page_flip and radeon_move_blit are lacking refcount on fence_lock as well.

But that would probably still not explain why it crashes in radeon_vm_bo_invalidate shortly after, so it seems just as likely that it's operating on freed memory there or something.

But none of the code touches refcounting for that bo, and I really don't see how I messed up anything there.

I seem to be able to reproduce it if I add a hack though, can you test if you get the exact same issues if you apply this patch? I call it "aggressively evict MRU buffer, and never call ddestroy", and for me it triggers by merely starting X.
:-) diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 0bf66f9..9a8f0d8 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -512,6 +512,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo) spin_lock(&glob->lru_lock); ret = ttm_bo_reserve_locked(bo, false, true, false, 0); + goto skip; spin_lock(&bdev->fence_lock); (void) ttm_bo_wait(bo, false, false, true); if (!ret && !bo->sync_obj && 0) { @@ -529,6 +530,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo) sync_obj = driver->sync_obj_ref(bo->sync_obj); spin_unlock(&bdev->fence_lock); +skip: if (!ret) { atomic_set(&bo->reserved, 0); wake_up_all(&bo->event_queue); @@ -542,8 +544,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo) driver->sync_obj_flush(sync_obj); driver->sync_obj_unref(&sync_obj); } - schedule_delayed_work(&bdev->wq, - ((HZ / 100) < 1) ? 1 : HZ / 100); + schedule_delayed_work(&bdev->wq, HZ * 100); } /** @@ -699,8 +700,7 @@ static void ttm_bo_delayed_workqueue(struct work_struct *work) container_of(work, struct ttm_bo_device, wq.work); if (ttm_bo_delayed_delete(bdev, false)) { - schedule_delayed_work(&bdev->wq, - ((HZ / 100) < 1) ? 1 : HZ / 100); + schedule_delayed_work(&bdev->wq, HZ * 100); } } @@ -743,8 +743,7 @@ EXPORT_SYMBOL(ttm_bo_lock_delayed_workqueue); void ttm_bo_unlock_delayed_workqueue(struct ttm_bo_device *bdev, int resched) { if (resched) - schedule_delayed_work(&bdev->wq, - ((HZ / 100) < 1) ? 
1 : HZ / 100); + schedule_delayed_work(&bdev->wq, HZ * 100); } EXPORT_SYMBOL(ttm_bo_unlock_delayed_workqueue); @@ -815,12 +814,15 @@ static int ttm_mem_evict_first(struct ttm_bo_device *bdev, retry: spin_lock(&glob->lru_lock); - if (list_empty(&man->lru)) { - spin_unlock(&glob->lru_lock); - return -EBUSY; - } + if (list_empty(&bdev->ddestroy)) { + if (list_empty(&man->lru)) { + spin_unlock(&glob->lru_lock); + return -EBUSY; + } + bo = list_entry(man->lru.prev, struct ttm_buffer_object, lru); + } else + bo = list_entry(bdev->ddestroy.prev, struct ttm_buffer_object, ddestroy); - bo = list_first_entry(&man->lru, struct ttm_buffer_object, lru); kref_get(&bo->list_kref); if (!list_empty(&bo->ddestroy)) { ^ permalink raw reply related [flat|nested] 20+ messages in thread
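The locking rule stated in the message above (read `sync_obj` only with `fence_lock` held, and take a reference before dropping the lock) can be sketched in plain C. This is a hedged illustration, not radeon or TTM code: `struct sketch_fence`, `struct sketch_bo`, `sketch_lock`, `sketch_unlock`, `sketch_bo_get_fence` and `sketch_fence_put` are hypothetical names, and the "lock" is a simple flag with assertions standing in for a real spinlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted fence for illustration only. */
struct sketch_fence {
    int refcount;
};

/* Hypothetical buffer object with a lock-protected fence pointer. */
struct sketch_bo {
    bool fence_lock_held;           /* toy lock: a flag, not a spinlock */
    struct sketch_fence *sync_obj;  /* may only be read with lock held */
};

static void sketch_lock(struct sketch_bo *bo)
{
    assert(!bo->fence_lock_held);
    bo->fence_lock_held = true;
}

static void sketch_unlock(struct sketch_bo *bo)
{
    assert(bo->fence_lock_held);
    bo->fence_lock_held = false;
}

/*
 * Return a counted reference to the current fence, or NULL if there is
 * none. The reference is taken while the lock is still held, so the
 * fence cannot be released between reading the pointer and using it.
 */
static struct sketch_fence *sketch_bo_get_fence(struct sketch_bo *bo)
{
    struct sketch_fence *f;

    sketch_lock(bo);
    f = bo->sync_obj;
    if (f)
        f->refcount++;
    sketch_unlock(bo);
    return f;
}

/* Drop a reference obtained from sketch_bo_get_fence(). */
static void sketch_fence_put(struct sketch_fence *f)
{
    assert(f->refcount > 0);
    f->refcount--;
}
```

A caller would use the returned pointer freely after the lock is released and call `sketch_fence_put` when done; reading `sync_obj` without the lock, or using it without a reference, is the pattern the message suspects in the radeon code paths it names.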
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git 2012-12-19 13:57 ` Maarten Lankhorst @ 2012-12-19 14:20 ` Markus Trippelsdorf 2012-12-19 14:31 ` Maarten Lankhorst 0 siblings, 1 reply; 20+ messages in thread From: Markus Trippelsdorf @ 2012-12-19 14:20 UTC (permalink / raw) To: Maarten Lankhorst; +Cc: Michel Dänzer, dri-devel On 2012.12.19 at 14:57 +0100, Maarten Lankhorst wrote: > Op 18-12-12 17:12, Markus Trippelsdorf schreef: > > With your supposed debugging BUG_ONs added I still get: > > > > Dec 18 17:01:15 x4 kernel: ------------[ cut here ]------------ > > Dec 18 17:01:15 x4 kernel: WARNING: at include/linux/kref.h:42 radeon_fence_ref+0x2c/0x40() > > Dec 18 17:01:15 x4 kernel: Hardware name: System Product Name > > Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Not tainted 3.7.0-rc7-00520-g85b144f-dirty #174 > > Dec 18 17:01:15 x4 kernel: Call Trace: > > Dec 18 17:01:15 x4 kernel: [<ffffffff81058c84>] ? warn_slowpath_common+0x74/0xb0 > > Dec 18 17:01:15 x4 kernel: [<ffffffff8129273c>] ? radeon_fence_ref+0x2c/0x40 > > Dec 18 17:01:15 x4 kernel: [<ffffffff8125e95c>] ? ttm_bo_cleanup_refs_and_unlock+0x18c/0x2d0 > > Dec 18 17:01:15 x4 kernel: [<ffffffff8125f17c>] ? ttm_mem_evict_first+0x1dc/0x2a0 > > Dec 18 17:01:15 x4 kernel: [<ffffffff81264452>] ? ttm_bo_man_get_node+0x62/0xb0 > > Dec 18 17:01:15 x4 kernel: [<ffffffff8125f4ce>] ? ttm_bo_mem_space+0x28e/0x340 > > Dec 18 17:01:15 x4 kernel: [<ffffffff8125fb0c>] ? ttm_bo_move_buffer+0xfc/0x170 > > Dec 18 17:01:15 x4 kernel: [<ffffffff810de172>] ? kmem_cache_alloc+0xb2/0xc0 > > Dec 18 17:01:15 x4 kernel: [<ffffffff8125fc15>] ? ttm_bo_validate+0x95/0x110 > > Dec 18 17:01:15 x4 kernel: [<ffffffff8125ff7c>] ? ttm_bo_init+0x2ec/0x3b0 > > Dec 18 17:01:15 x4 kernel: [<ffffffff8129419a>] ? radeon_bo_create+0x18a/0x200 > > Dec 18 17:01:15 x4 kernel: [<ffffffff81293e80>] ? radeon_bo_clear_va+0x40/0x40 > > Dec 18 17:01:15 x4 kernel: [<ffffffff812a5342>] ? 
radeon_gem_object_create+0x92/0x160 > > Dec 18 17:01:15 x4 kernel: [<ffffffff812a575c>] ? radeon_gem_create_ioctl+0x6c/0x150 > > Dec 18 17:01:15 x4 kernel: [<ffffffff812a529f>] ? radeon_gem_object_free+0x2f/0x40 > > Dec 18 17:01:15 x4 kernel: [<ffffffff81246b60>] ? drm_ioctl+0x420/0x4f0 > > Dec 18 17:01:15 x4 kernel: [<ffffffff812a56f0>] ? radeon_gem_pwrite_ioctl+0x20/0x20 > > Dec 18 17:01:15 x4 kernel: [<ffffffff810f53a4>] ? do_vfs_ioctl+0x2e4/0x4e0 > > Dec 18 17:01:15 x4 kernel: [<ffffffff810e5588>] ? vfs_read+0x118/0x160 > > Dec 18 17:01:15 x4 kernel: [<ffffffff810f55ec>] ? sys_ioctl+0x4c/0xa0 > > Dec 18 17:01:15 x4 kernel: [<ffffffff810e5851>] ? sys_read+0x51/0xa0 > > Dec 18 17:01:15 x4 kernel: [<ffffffff814b0612>] ? system_call_fastpath+0x16/0x1b > so the kref to fence is null here. This should be impossible and > indicates a bug in refcounting somewhere, or possibly memory > corruption. > > Lets first look where things could go wrong.. > > sync_obj member requires fence_lock to be taken, but radeon code in > general doesn't do that, hm.. > > I think radeon_cs_sync_rings needs to take fence_lock during the > iteration, then taking on a refcount to the fence, and > radeon_crtc_page_flip and radeon_move_blit are lacking refcount on > fence_lock as well. > > But that would probably still not explain why it crashes in > radeon_vm_bo_invalidate shortly after, so it seems just as likely that > it's operating on freed memory there or something. > > But none of the code touches refcounting for that bo, and I really > don't see how I messed up anything there. > > I seem to be able to reproduce it if I add a hack though, can you test > if you get the exact same issues if you apply this patch? Your patch doesn't apply unfortunately: markus@x4 linux % patch -p1 --dry-run < ~/maarten.patch checking file drivers/gpu/drm/ttm/ttm_bo.c Hunk #1 succeeded at 512 with fuzz 1. Hunk #6 FAILED at 814. 
1 out of 6 hunks FAILED
markus@x4 linux % git describe
v3.7-10833-g752451f
markus@x4 linux %

--
Markus

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git 2012-12-19 14:20 ` Markus Trippelsdorf @ 2012-12-19 14:31 ` Maarten Lankhorst 0 siblings, 0 replies; 20+ messages in thread From: Maarten Lankhorst @ 2012-12-19 14:31 UTC (permalink / raw) To: Markus Trippelsdorf; +Cc: Michel Dänzer, dri-devel Op 19-12-12 15:20, Markus Trippelsdorf schreef: > On 2012.12.19 at 14:57 +0100, Maarten Lankhorst wrote: >> Op 18-12-12 17:12, Markus Trippelsdorf schreef: >>> With your supposed debugging BUG_ONs added I still get: >>> >>> Dec 18 17:01:15 x4 kernel: ------------[ cut here ]------------ >>> Dec 18 17:01:15 x4 kernel: WARNING: at include/linux/kref.h:42 radeon_fence_ref+0x2c/0x40() >>> Dec 18 17:01:15 x4 kernel: Hardware name: System Product Name >>> Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Not tainted 3.7.0-rc7-00520-g85b144f-dirty #174 >>> Dec 18 17:01:15 x4 kernel: Call Trace: >>> Dec 18 17:01:15 x4 kernel: [<ffffffff81058c84>] ? warn_slowpath_common+0x74/0xb0 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff8129273c>] ? radeon_fence_ref+0x2c/0x40 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff8125e95c>] ? ttm_bo_cleanup_refs_and_unlock+0x18c/0x2d0 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff8125f17c>] ? ttm_mem_evict_first+0x1dc/0x2a0 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff81264452>] ? ttm_bo_man_get_node+0x62/0xb0 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff8125f4ce>] ? ttm_bo_mem_space+0x28e/0x340 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff8125fb0c>] ? ttm_bo_move_buffer+0xfc/0x170 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff810de172>] ? kmem_cache_alloc+0xb2/0xc0 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff8125fc15>] ? ttm_bo_validate+0x95/0x110 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff8125ff7c>] ? ttm_bo_init+0x2ec/0x3b0 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff8129419a>] ? radeon_bo_create+0x18a/0x200 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff81293e80>] ? radeon_bo_clear_va+0x40/0x40 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff812a5342>] ? 
radeon_gem_object_create+0x92/0x160 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff812a575c>] ? radeon_gem_create_ioctl+0x6c/0x150 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff812a529f>] ? radeon_gem_object_free+0x2f/0x40 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff81246b60>] ? drm_ioctl+0x420/0x4f0 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff812a56f0>] ? radeon_gem_pwrite_ioctl+0x20/0x20 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff810f53a4>] ? do_vfs_ioctl+0x2e4/0x4e0 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff810e5588>] ? vfs_read+0x118/0x160 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff810f55ec>] ? sys_ioctl+0x4c/0xa0 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff810e5851>] ? sys_read+0x51/0xa0 >>> Dec 18 17:01:15 x4 kernel: [<ffffffff814b0612>] ? system_call_fastpath+0x16/0x1b >> so the kref to fence is null here. This should be impossible and >> indicates a bug in refcounting somewhere, or possibly memory >> corruption. >> >> Lets first look where things could go wrong.. >> >> sync_obj member requires fence_lock to be taken, but radeon code in >> general doesn't do that, hm.. >> >> I think radeon_cs_sync_rings needs to take fence_lock during the >> iteration, then taking on a refcount to the fence, and >> radeon_crtc_page_flip and radeon_move_blit are lacking refcount on >> fence_lock as well. >> >> But that would probably still not explain why it crashes in >> radeon_vm_bo_invalidate shortly after, so it seems just as likely that >> it's operating on freed memory there or something. >> >> But none of the code touches refcounting for that bo, and I really >> don't see how I messed up anything there. >> >> I seem to be able to reproduce it if I add a hack though, can you test >> if you get the exact same issues if you apply this patch? > Your patch doesn't apply unfortunately: > > markus@x4 linux % patch -p1 --dry-run < ~/maarten.patch > checking file drivers/gpu/drm/ttm/ttm_bo.c > Hunk #1 succeeded at 512 with fuzz 1. > Hunk #6 FAILED at 814. 
> 1 out of 6 hunks FAILED
> markus@x4 linux % git describe
> v3.7-10833-g752451f
> markus@x4 linux %

It applies on top of the regressed commit. It should probably not be too hard to make it apply manually on whatever you're using. But the real fix will be "drm/ttm: fix delayed ttm_bo_cleanup_refs_and_unlock delayed handling", which I cc'd you on.

The patch I posted earlier in this thread will just aggressively stress test the codepath.

~Maarten

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git 2012-12-17 22:25 ` Markus Trippelsdorf 2012-12-17 22:55 ` Markus Trippelsdorf @ 2012-12-23 1:46 ` Alex Deucher 2012-12-23 8:43 ` Markus Trippelsdorf 1 sibling, 1 reply; 20+ messages in thread From: Alex Deucher @ 2012-12-23 1:46 UTC (permalink / raw) To: Markus Trippelsdorf; +Cc: dri-devel On Mon, Dec 17, 2012 at 5:25 PM, Markus Trippelsdorf <markus@trippelsdorf.de> wrote: > On 2012.12.17 at 17:00 -0500, Alex Deucher wrote: >> On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf >> <markus@trippelsdorf.de> wrote: >> > On 2012.12.17 at 16:32 -0500, Alex Deucher wrote: >> >> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf >> >> <markus@trippelsdorf.de> wrote: >> >> > As soon as I open the following website: >> >> > http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html >> >> > >> >> > my Radeon RS780 stalls (GPU lockup) leaving the machine unusable: >> >> >> >> Is this a regression? Most likely a 3D driver bug unless you are only >> >> seeing it with specific kernels. What browser are you using and do >> >> you have hw accelerated webgl, etc. enabled? If so, what version of >> >> mesa are you using? >> > >> > This is a regression, because it is caused by yesterdays merge of >> > drm-next by Linus. IOW I only see this bug when running a >> > v3.7-9432-g9360b53 kernel. >> >> Can you bisect? I'm guessing it may be related to the new DMA rings. Possibly: >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2 > > Yes, the commit above causes the issue. > Does booting with radeon.wb=0 fix the issue? Please make sure your kernel has this patch: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=86a1881d08f65a42c17071a59c0088dbe2870246 Alex > 2d6cc72 GPU lockups > 009ee7a runs fine > > -- > Markus ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git
From: Markus Trippelsdorf @ 2012-12-23 8:43 UTC
To: Alex Deucher; +Cc: dri-devel

On 2012.12.22 at 20:46 -0500, Alex Deucher wrote:
> On Mon, Dec 17, 2012 at 5:25 PM, Markus Trippelsdorf
> <markus@trippelsdorf.de> wrote:
> > On 2012.12.17 at 17:00 -0500, Alex Deucher wrote:
> >> On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf
> >> <markus@trippelsdorf.de> wrote:
> >> > On 2012.12.17 at 16:32 -0500, Alex Deucher wrote:
> >> >> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf
> >> >> <markus@trippelsdorf.de> wrote:
> >> >> > As soon as I open the following website:
> >> >> > http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html
> >> >> >
> >> >> > my Radeon RS780 stalls (GPU lockup) leaving the machine unusable:
> >> >>
> >> >> Is this a regression? Most likely a 3D driver bug unless you are only
> >> >> seeing it with specific kernels. What browser are you using and do
> >> >> you have hw accelerated webgl, etc. enabled? If so, what version of
> >> >> mesa are you using?
> >> >
> >> > This is a regression, because it is caused by yesterdays merge of
> >> > drm-next by Linus. IOW I only see this bug when running a
> >> > v3.7-9432-g9360b53 kernel.
> >>
> >> Can you bisect? I'm guessing it may be related to the new DMA rings. Possibly:
> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2
> >
> > Yes, the commit above causes the issue.
> >
>
> Does booting with radeon.wb=0 fix the issue? Please make sure your
> kernel has this patch:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=86a1881d08f65a42c17071a59c0088dbe2870246

My kernel has this patch and radeon.wb=0 doesn't help.
It still freezes the machine as soon as you scroll on a website with
many big images.

--
Markus
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git
From: Andy Furniss @ 2012-12-23 10:09 UTC
To: Markus Trippelsdorf; +Cc: dri-devel

Markus Trippelsdorf wrote:

>> Does booting with radeon.wb=0 fix the issue? Please make sure your
>> kernel has this patch:
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=86a1881d08f65a42c17071a59c0088dbe2870246
>
> My kernel has this patch and radeon.wb=0 doesn't help.

I think that should be no_wb=1
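
A mistyped parameter name such as radeon.wb=0 is easy to miss. One way to check which name the loaded module actually exposes is to look under /sys/module/<module>/parameters/. The following is a minimal sketch, assuming the radeon module is loaded and exports its parameters as readable sysfs files; the helper name read_module_param is made up for illustration and is not part of any kernel or distribution tool.

```python
from pathlib import Path
from typing import Optional


def read_module_param(module: str, param: str,
                      sysfs_root: str = "/sys/module") -> Optional[str]:
    """Return the current value of a kernel module parameter.

    Returns None when the file does not exist or cannot be read, which is
    what one would expect for a name the module does not define.
    (Hypothetical helper, for illustration only.)
    """
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # Compare the mistyped name with the one suggested in the thread.
    for name in ("wb", "no_wb"):
        print(f"radeon.{name} ->", read_module_param("radeon", name))
```

A result of None for "wb" together with a value for "no_wb" would be consistent with the correction above. The sysfs_root argument exists only so the helper can be pointed at a test directory.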
* Re: GPU lockup CP stall for more than 10000msec on latest vanilla git
From: Markus Trippelsdorf @ 2012-12-23 10:21 UTC
To: Andy Furniss; +Cc: dri-devel

On 2012.12.23 at 10:09 +0000, Andy Furniss wrote:
> Markus Trippelsdorf wrote:
>
> >> Does booting with radeon.wb=0 fix the issue? Please make sure your
> >> kernel has this patch:
> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=86a1881d08f65a42c17071a59c0088dbe2870246
> >
> > My kernel has this patch and radeon.wb=0 doesn't help.
>
> I think that should be no_wb=1

Yes, you're right. But even with radeon.no_wb=1 it still hangs:

...
Dec 23 11:15:02 x4 kernel: radeon 0000:01:05.0: WB disabled
Dec 23 11:15:02 x4 kernel: radeon 0000:01:05.0: fence driver on ring 0 use gpu addr 0x00000000a0000004 and cpu addr 0xffff8802163ad004
Dec 23 11:15:02 x4 kernel: radeon 0000:01:05.0: fence driver on ring 3 use gpu addr 0x00000000a0000c0c and cpu addr 0xffff8802163adc0c
Dec 23 11:15:02 x4 kernel: radeon 0000:01:05.0: setting latency timer to 64
...
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: GPU lockup (waiting for 0x000000000000089c last fence id 0x000000000000089b)
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: Saved 217 dwords of commands on ring 0.
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: GPU softreset
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_008010_GRBM_STATUS=0xA000B030
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_008014_GRBM_STATUS2=0x00000003
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_000E50_SRBM_STATUS=0x20005040
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_008674_CP_STALLED_STAT1 = 0x00000000
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_008678_CP_STALLED_STAT2 = 0x00000002
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_00867C_CP_BUSY_STAT = 0x0000D086
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_008680_CP_STAT = 0x80098645
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00000001
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_008010_GRBM_STATUS=0xA000B030
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_008014_GRBM_STATUS2=0x00000003
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_000E50_SRBM_STATUS=0x2000C040
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_008674_CP_STALLED_STAT1 = 0x00000000
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_008678_CP_STALLED_STAT2 = 0x00000000
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_00867C_CP_BUSY_STAT = 0x00000000
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: R_008680_CP_STAT = 0x80100000
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: GPU reset succeeded, trying to resume
Dec 23 11:16:04 x4 kernel: [drm] PCIE GART of 512M enabled (table at 0x00000000C0040000).
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: WB disabled
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: fence driver on ring 0 use gpu addr 0x00000000a0000004 and cpu addr 0xffff8802163ad004
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: fence driver on ring 3 use gpu addr 0x00000000a0000c0c and cpu addr 0xffff8802163adc0c
Dec 23 11:16:04 x4 kernel: radeon 0000:01:05.0: setting latency timer to 64
Dec 23 11:16:04 x4 kernel: [drm] ring test on 0 succeeded in 1 usecs
Dec 23 11:16:05 x4 kernel: [drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
Dec 23 11:16:05 x4 kernel: [drm:r600_resume] *ERROR* r600 startup failed on resume
Dec 23 11:16:09 x4 kernel: SysRq : Emergency Sync
Dec 23 11:16:09 x4 kernel: Emergency Sync complete
Dec 23 11:16:15 x4 kernel: SysRq : Emergency Remount R/O
Dec 23 11:16:15 x4 kernel: EXT4-fs (sdb2): re-mounted. Opts: (null)

--
Markus
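
When comparing lockup reports like the one above, it can help to pull the fence ids and the register dumps out of the log mechanically. The following is a rough sketch, not a radeon tool: the function name summarize_lockup and both regular expressions are assumptions based only on the line formats visible in this log, and the sample lines are copied from it.

```python
import re

# Matches e.g. "GPU lockup (waiting for 0x...089c last fence id 0x...089b)".
LOCKUP_RE = re.compile(
    r"GPU lockup \(waiting for (0x[0-9a-fA-F]+) last fence id (0x[0-9a-fA-F]+)\)"
)
# Matches e.g. "R_008010_GRBM_STATUS=0xA000B030", with or without spaces
# around "=" (both spellings appear in the log).
REGISTER_RE = re.compile(r"(R_[0-9A-F]{6}_[A-Z0-9_]+)\s*=\s*(0x[0-9A-Fa-f]+)")


def summarize_lockup(lines):
    """Return ((waiting, last_signaled), {register: [values in log order]}).

    Hypothetical helper for illustration. Registers dumped both before and
    after the soft reset end up with two entries in their list.
    """
    fences = None
    registers = {}
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            fences = (int(m.group(1), 16), int(m.group(2), 16))
            continue
        m = REGISTER_RE.search(line)
        if m:
            registers.setdefault(m.group(1), []).append(int(m.group(2), 16))
    return fences, registers


SAMPLE = [
    "radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec",
    "radeon 0000:01:05.0: GPU lockup (waiting for 0x000000000000089c "
    "last fence id 0x000000000000089b)",
    "radeon 0000:01:05.0: R_008678_CP_STALLED_STAT2 = 0x00000002",
    "radeon 0000:01:05.0: R_008680_CP_STAT = 0x80098645",
    "radeon 0000:01:05.0: R_008678_CP_STALLED_STAT2 = 0x00000000",
    "radeon 0000:01:05.0: R_008680_CP_STAT = 0x80100000",
]

if __name__ == "__main__":
    fences, registers = summarize_lockup(SAMPLE)
    if fences:
        waiting, last = fences
        print(f"waiting for fence {waiting:#x}, last signaled {last:#x}")
    for name, values in registers.items():
        print(name, [f"{v:#010x}" for v in values])
```

The sketch only extracts values; it does not interpret the register bits, which would need the hardware register documentation.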
Thread overview: 20+ messages
2012-12-17 18:27 GPU lockup CP stall for more than 10000msec on latest vanilla git Markus Trippelsdorf
2012-12-17 21:32 ` Alex Deucher
2012-12-17 21:48 ` Markus Trippelsdorf
2012-12-17 21:58 ` Markus Trippelsdorf
2012-12-17 22:00 ` Alex Deucher
2012-12-17 22:25 ` Markus Trippelsdorf
2012-12-17 22:55 ` Markus Trippelsdorf
2012-12-18 11:20 ` Michel Dänzer
2012-12-18 13:38 ` Markus Trippelsdorf
2012-12-18 13:51 ` Markus Trippelsdorf
2012-12-18 15:24 ` Maarten Lankhorst
2012-12-18 16:12 ` Markus Trippelsdorf
2012-12-18 18:10 ` Maarten Lankhorst
2012-12-19 13:57 ` Maarten Lankhorst
2012-12-19 14:20 ` Markus Trippelsdorf
2012-12-19 14:31 ` Maarten Lankhorst
2012-12-23 1:46 ` Alex Deucher
2012-12-23 8:43 ` Markus Trippelsdorf
2012-12-23 10:09 ` Andy Furniss
2012-12-23 10:21 ` Markus Trippelsdorf