From mboxrd@z Thu Jan 1 00:00:00 1970
From: Markus Trippelsdorf
Subject: Re: GPU lockup CP stall for more than 10000msec on latest vanilla git
Date: Tue, 18 Dec 2012 17:12:38 +0100
Message-ID: <20121218161238.GA213@x4>
References: <20121217182752.GA351@x4> <20121217214819.GA228@x4> <20121217222519.GA229@x4> <20121217225534.GA219@x4> <1355829632.17142.59.camel@thor.local> <20121218133831.GA218@x4> <50D08ACB.4090605@canonical.com>
In-Reply-To: <50D08ACB.4090605@canonical.com>
To: Maarten Lankhorst
Cc: Michel Dänzer, dri-devel@lists.freedesktop.org
List-Id: dri-devel@lists.freedesktop.org

On 2012.12.18 at 16:24 +0100, Maarten Lankhorst wrote:
> On 18-12-12 14:38, Markus Trippelsdorf wrote:
> > On 2012.12.18 at 12:20 +0100, Michel Dänzer wrote:
> >> On Mon, 2012-12-17 at 23:55 +0100, Markus Trippelsdorf wrote:
> >>> On 2012.12.17 at 23:25 +0100, Markus Trippelsdorf wrote:
> >>>> On 2012.12.17 at 17:00 -0500, Alex Deucher wrote:
> >>>>> On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf
> >>>>> wrote:
> >>>>>> On 2012.12.17 at 16:32 -0500, Alex Deucher wrote:
> >>>>>>> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf
> >>>>>>> wrote:
> >>>>>>>> As soon as I open the following website:
> >>>>>>>> http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html
> >>>>>>>>
> >>>>>>>> my Radeon RS780 stalls (GPU lockup) leaving the machine unusable:
> >>>>>>> Is this a regression?
Most likely a 3D driver bug unless you are only
> >>>>>>> seeing it with specific kernels. What browser are you using and do
> >>>>>>> you have hw accelerated webgl, etc. enabled? If so, what version of
> >>>>>>> mesa are you using?
> >>>>>> This is a regression, because it is caused by yesterday's merge of
> >>>>>> drm-next by Linus. IOW I only see this bug when running a
> >>>>>> v3.7-9432-g9360b53 kernel.
> >>>>> Can you bisect? I'm guessing it may be related to the new DMA rings. Possibly:
> >>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2
> >>>> Yes, the commit above causes the issue.
> >>>>
> >>>> 2d6cc72 GPU lockups
> >>> With 2d6cc72 reverted I get:
> >>>
> >>> Dec 17 23:09:35 x4 kernel: ------------[ cut here ]------------
> >> Probably a separate issue, can you bisect this one as well?
> > Yes. Git-bisect points to:
> >
> > 85b144f860176ec18db927d6d9ecdfb24d9c6483 is the first bad commit
> > commit 85b144f860176ec18db927d6d9ecdfb24d9c6483
> > Author: Maarten Lankhorst
> > Date: Thu Nov 29 11:36:54 2012 +0000
> >
> > drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock
> > held, v3
> >
> > (Please note that this bug is a little bit harder to reproduce. But
> > when you scroll up and down for ~10 seconds on the webpage mentioned
> > above it will trigger the oops.
> > So while I'm not 100% sure that the issue is caused by exactly this
> > commit, the vicinity should be right)
> >
> Those dmesg warnings sound suspicious, looks like something is going
> very wrong there.
>
> Can you revert the one before it? "drm/radeon: allow move_notify to be
> called without reservation" Reservation should be held at this point,
> that commit got in accidentally.
>
> I doubt not holding a reservation is causing it though, I don't really
> see how that commit could cause it however, so can you please double
> check it never happened before that point, and only started at that
> commit?
>
> also slap in a BUG_ON(!ttm_bo_is_reserved(bo)) in
> ttm_bo_cleanup_refs_and_unlock for good measure, and a
> BUG_ON(spin_trylock(&bdev->fence_lock)); to ttm_bo_wait.
>
> I really don't see how that specific commit can be wrong though, so
> awaiting your results first before I try to dig more into it.

I reran git-bisect on just your commits (from 1a1494def to 97a875cbd)
and landed on the same commit as above:

commit 85b144f86 (drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock held, v3)

So now I'm pretty sure it's specifically this commit that started the
issue.

With your suggested debugging BUG_ONs added I still get:

Dec 18 17:01:15 x4 kernel: ------------[ cut here ]------------
Dec 18 17:01:15 x4 kernel: WARNING: at include/linux/kref.h:42 radeon_fence_ref+0x2c/0x40()
Dec 18 17:01:15 x4 kernel: Hardware name: System Product Name
Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Not tainted 3.7.0-rc7-00520-g85b144f-dirty #174
Dec 18 17:01:15 x4 kernel: Call Trace:
Dec 18 17:01:15 x4 kernel: [] ? warn_slowpath_common+0x74/0xb0
Dec 18 17:01:15 x4 kernel: [] ? radeon_fence_ref+0x2c/0x40
Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_cleanup_refs_and_unlock+0x18c/0x2d0
Dec 18 17:01:15 x4 kernel: [] ? ttm_mem_evict_first+0x1dc/0x2a0
Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_man_get_node+0x62/0xb0
Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_mem_space+0x28e/0x340
Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_move_buffer+0xfc/0x170
Dec 18 17:01:15 x4 kernel: [] ? kmem_cache_alloc+0xb2/0xc0
Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_validate+0x95/0x110
Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_init+0x2ec/0x3b0
Dec 18 17:01:15 x4 kernel: [] ? radeon_bo_create+0x18a/0x200
Dec 18 17:01:15 x4 kernel: [] ?
radeon_bo_clear_va+0x40/0x40
Dec 18 17:01:15 x4 kernel: [] ? radeon_gem_object_create+0x92/0x160
Dec 18 17:01:15 x4 kernel: [] ? radeon_gem_create_ioctl+0x6c/0x150
Dec 18 17:01:15 x4 kernel: [] ? radeon_gem_object_free+0x2f/0x40
Dec 18 17:01:15 x4 kernel: [] ? drm_ioctl+0x420/0x4f0
Dec 18 17:01:15 x4 kernel: [] ? radeon_gem_pwrite_ioctl+0x20/0x20
Dec 18 17:01:15 x4 kernel: [] ? do_vfs_ioctl+0x2e4/0x4e0
Dec 18 17:01:15 x4 kernel: [] ? vfs_read+0x118/0x160
Dec 18 17:01:15 x4 kernel: [] ? sys_ioctl+0x4c/0xa0
Dec 18 17:01:15 x4 kernel: [] ? sys_read+0x51/0xa0
Dec 18 17:01:15 x4 kernel: [] ? system_call_fastpath+0x16/0x1b
Dec 18 17:01:15 x4 kernel: ---[ end trace 485a2dd5755db51e ]---
Dec 18 17:01:15 x4 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000024
Dec 18 17:01:15 x4 kernel: IP: [] radeon_vm_bo_invalidate+0x18/0x30
Dec 18 17:01:15 x4 kernel: PGD 211d09067 PUD 211d52067 PMD 0
Dec 18 17:01:15 x4 kernel: Oops: 0002 [#1] SMP
Dec 18 17:01:15 x4 kernel: CPU 1
Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Tainted: G W 3.7.0-rc7-00520-g85b144f-dirty #174 System manufacturer System Product Name/M4A78T-E
Dec 18 17:01:15 x4 kernel: RIP: 0010:[] [] radeon_vm_bo_invalidate+0x18/0x30
Dec 18 17:01:15 x4 kernel: RSP: 0018:ffff880211ddfaa8 EFLAGS: 00010203
Dec 18 17:01:15 x4 kernel: RAX: 0000000000000000 RBX: ffff8801f94e1c48 RCX: ffff880205de3128
Dec 18 17:01:15 x4 kernel: RDX: 0000000000000001 RSI: ffff8801f94e1df0 RDI: ffff8801f94e1df8
Dec 18 17:01:15 x4 kernel: RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
Dec 18 17:01:15 x4 kernel: R10: 0000000000000000 R11: ffff880216a766b8 R12: ffff880216a76590
Dec 18 17:01:15 x4 kernel: R13: ffffffff818383e0 R14: 0000000000000001 R15: ffff880215c83678
Dec 18 17:01:15 x4 kernel: FS: 00007fbcabc8c880(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
Dec 18 17:01:15 x4 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 18 17:01:15 x4 kernel: CR2:
0000000000000024 CR3: 0000000211d07000 CR4: 00000000000007e0
Dec 18 17:01:15 x4 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 18 17:01:15 x4 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 18 17:01:15 x4 kernel: Process X (pid: 157, threadinfo ffff880211dde000, task ffff880211dc0ba0)
Dec 18 17:01:15 x4 kernel: Stack:
Dec 18 17:01:15 x4 kernel: ffffffff8125d2e9 ffff8801f94e1c48 ffffffff8125e909 ffff880216a769b8
Dec 18 17:01:15 x4 kernel: 01ff880200000001 ffff8801f94e1c84 0000000000000001 ffff880216a766b8
Dec 18 17:01:15 x4 kernel: 0000000000000000 ffff880215c83678 ffff8801f94e1c48 ffffffff8125f17c
Dec 18 17:01:15 x4 kernel: Call Trace:
Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_cleanup_memtype_use+0x19/0x90
Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_cleanup_refs_and_unlock+0x139/0x2d0
Dec 18 17:01:15 x4 kernel: [] ? ttm_mem_evict_first+0x1dc/0x2a0
Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_man_get_node+0x62/0xb0
Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_mem_space+0x28e/0x340
Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_move_buffer+0xfc/0x170
Dec 18 17:01:15 x4 kernel: [] ? kmem_cache_alloc+0xb2/0xc0
Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_validate+0x95/0x110
Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_init+0x2ec/0x3b0
Dec 18 17:01:15 x4 kernel: [] ? radeon_bo_create+0x18a/0x200
Dec 18 17:01:15 x4 kernel: [] ? radeon_bo_clear_va+0x40/0x40
Dec 18 17:01:15 x4 kernel: [] ? radeon_gem_object_create+0x92/0x160
Dec 18 17:01:15 x4 kernel: [] ? radeon_gem_create_ioctl+0x6c/0x150
Dec 18 17:01:15 x4 kernel: [] ? drm_ioctl+0x420/0x4f0
Dec 18 17:01:15 x4 kernel: [] ? radeon_gem_pwrite_ioctl+0x20/0x20
Dec 18 17:01:15 x4 kernel: [] ? fsnotify_clear_marks_by_inode+0x20/0xd0
Dec 18 17:01:15 x4 kernel: [] ? __destroy_inode+0x15/0x60
Dec 18 17:01:15 x4 kernel: [] ? kmem_cache_free+0x10/0x90
Dec 18 17:01:15 x4 kernel: [] ? dput+0x2f/0x300
Dec 18 17:01:15 x4 kernel: [] ?
do_vfs_ioctl+0x2e4/0x4e0
Dec 18 17:01:15 x4 kernel: [] ? mntput_no_expire+0x7b/0x170
Dec 18 17:01:15 x4 kernel: [] ? lg_global_unlock+0x3b/0x50
Dec 18 17:01:15 x4 kernel: [] ? task_work_run+0x8c/0xc0
Dec 18 17:01:15 x4 kernel: [] ? sys_ioctl+0x4c/0xa0
Dec 18 17:01:15 x4 kernel: [] ? system_call_fastpath+0x16/0x1b
Dec 18 17:01:15 x4 kernel: Code: 8b 44 24 04 48 83 c4 08 5b 5d 41 5c c3 66 0f 1f 44 00 00 48 8b 86 f0 01 00 00 48 81 c6 f0 01 00 00 48 39 f0 74 11 0f 1f 44 00 00 40 24 00 48 8b 00 48 39 f0 75 f4 f3 c3 66 2e 0f 1f 84 00 00
Dec 18 17:01:15 x4 kernel: RIP [] radeon_vm_bo_invalidate+0x18/0x30
Dec 18 17:01:15 x4 kernel: RSP
Dec 18 17:01:15 x4 kernel: CR2: 0000000000000024
Dec 18 17:01:15 x4 kernel: ---[ end trace 485a2dd5755db51f ]---
Dec 18 17:01:15 x4 kernel: [drm:drm_release] *ERROR* Device busy: 1

-- 
Markus