From mboxrd@z Thu Jan 1 00:00:00 1970
From: Maarten Lankhorst
Subject: Re: GPU lockup CP stall for more than 10000msec on latest vanilla git
Date: Wed, 19 Dec 2012 15:31:50 +0100
Message-ID: <50D1CFD6.8050005@canonical.com>
References: <20121217214819.GA228@x4> <20121217222519.GA229@x4> <20121217225534.GA219@x4> <1355829632.17142.59.camel@thor.local> <20121218133831.GA218@x4> <50D08ACB.4090605@canonical.com> <20121218161238.GA213@x4> <50D1C7E4.1060701@canonical.com> <20121219142019.GA24579@x4>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path:
Received: from youngberry.canonical.com (youngberry.canonical.com [91.189.89.112]) by gabe.freedesktop.org (Postfix) with ESMTP id 263E8E5D2F for ; Wed, 19 Dec 2012 06:31:52 -0800 (PST)
In-Reply-To: <20121219142019.GA24579@x4>
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: dri-devel-bounces+sf-dri-devel=m.gmane.org@lists.freedesktop.org
Errors-To: dri-devel-bounces+sf-dri-devel=m.gmane.org@lists.freedesktop.org
To: Markus Trippelsdorf
Cc: Michel Dänzer, dri-devel@lists.freedesktop.org
List-Id: dri-devel@lists.freedesktop.org

On 19-12-12 15:20, Markus Trippelsdorf wrote:
> On 2012.12.19 at 14:57 +0100, Maarten Lankhorst wrote:
>> On 18-12-12 17:12, Markus Trippelsdorf wrote:
>>> With your supposed debugging BUG_ONs added I still get:
>>>
>>> Dec 18 17:01:15 x4 kernel: ------------[ cut here ]------------
>>> Dec 18 17:01:15 x4 kernel: WARNING: at include/linux/kref.h:42 radeon_fence_ref+0x2c/0x40()
>>> Dec 18 17:01:15 x4 kernel: Hardware name: System Product Name
>>> Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Not tainted 3.7.0-rc7-00520-g85b144f-dirty #174
>>> Dec 18 17:01:15 x4 kernel: Call Trace:
>>> Dec 18 17:01:15 x4 kernel: [] ? warn_slowpath_common+0x74/0xb0
>>> Dec 18 17:01:15 x4 kernel: [] ? radeon_fence_ref+0x2c/0x40
>>> Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_cleanup_refs_and_unlock+0x18c/0x2d0
>>> Dec 18 17:01:15 x4 kernel: [] ? ttm_mem_evict_first+0x1dc/0x2a0
>>> Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_man_get_node+0x62/0xb0
>>> Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_mem_space+0x28e/0x340
>>> Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_move_buffer+0xfc/0x170
>>> Dec 18 17:01:15 x4 kernel: [] ? kmem_cache_alloc+0xb2/0xc0
>>> Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_validate+0x95/0x110
>>> Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_init+0x2ec/0x3b0
>>> Dec 18 17:01:15 x4 kernel: [] ? radeon_bo_create+0x18a/0x200
>>> Dec 18 17:01:15 x4 kernel: [] ? radeon_bo_clear_va+0x40/0x40
>>> Dec 18 17:01:15 x4 kernel: [] ? radeon_gem_object_create+0x92/0x160
>>> Dec 18 17:01:15 x4 kernel: [] ? radeon_gem_create_ioctl+0x6c/0x150
>>> Dec 18 17:01:15 x4 kernel: [] ? radeon_gem_object_free+0x2f/0x40
>>> Dec 18 17:01:15 x4 kernel: [] ? drm_ioctl+0x420/0x4f0
>>> Dec 18 17:01:15 x4 kernel: [] ? radeon_gem_pwrite_ioctl+0x20/0x20
>>> Dec 18 17:01:15 x4 kernel: [] ? do_vfs_ioctl+0x2e4/0x4e0
>>> Dec 18 17:01:15 x4 kernel: [] ? vfs_read+0x118/0x160
>>> Dec 18 17:01:15 x4 kernel: [] ? sys_ioctl+0x4c/0xa0
>>> Dec 18 17:01:15 x4 kernel: [] ? sys_read+0x51/0xa0
>>> Dec 18 17:01:15 x4 kernel: [] ? system_call_fastpath+0x16/0x1b
>> so the kref to fence is null here. This should be impossible and
>> indicates a bug in refcounting somewhere, or possibly memory
>> corruption.
>>
>> Let's first look where things could go wrong..
>>
>> sync_obj member requires fence_lock to be taken, but radeon code in
>> general doesn't do that, hm..
>>
>> I think radeon_cs_sync_rings needs to take fence_lock during the
>> iteration, then taking on a refcount to the fence, and
>> radeon_crtc_page_flip and radeon_move_blit are lacking refcount on
>> fence_lock as well.
>>
>> But that would probably still not explain why it crashes in
>> radeon_vm_bo_invalidate shortly after, so it seems just as likely that
>> it's operating on freed memory there or something.
>>
>> But none of the code touches refcounting for that bo, and I really
>> don't see how I messed up anything there.
>>
>> I seem to be able to reproduce it if I add a hack though, can you test
>> if you get the exact same issues if you apply this patch?
> Your patch doesn't apply unfortunately:
>
> markus@x4 linux % patch -p1 --dry-run < ~/maarten.patch
> checking file drivers/gpu/drm/ttm/ttm_bo.c
> Hunk #1 succeeded at 512 with fuzz 1.
> Hunk #6 FAILED at 814.
> 1 out of 6 hunks FAILED
> markus@x4 linux % git describe
> v3.7-10833-g752451f
> markus@x4 linux %

It applies on top of the regressed commit. It should probably not be too hard
to make it apply manually on whatever you're using.

But the real fix will be "drm/ttm: fix delayed ttm_bo_cleanup_refs_and_unlock
delayed handling", which I cc'd you on. The patch I posted earlier in this
thread will just aggressively stress test the codepath.

~Maarten
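
For readers following the fence_lock discussion quoted above, here is a
minimal user-space sketch of the general pattern: read a shared fence pointer
and take a reference on it while holding the lock, so the fence cannot be
freed between the read and the use. This is not the radeon or TTM code.
struct fence, struct buffer and every function name below are hypothetical
stand-ins, and a pthread mutex plus a C11 atomic counter are assumed in place
of the kernel's spinlock and kref.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted fence (not the radeon struct). */
struct fence {
    atomic_int refcount;
};

/* Hypothetical stand-in for a buffer object with a lock-protected fence. */
struct buffer {
    pthread_mutex_t fence_lock;  /* protects sync_obj */
    struct fence *sync_obj;      /* may be swapped or cleared concurrently */
};

static struct fence *fence_create(void)
{
    struct fence *f = malloc(sizeof(*f));
    if (f)
        atomic_init(&f->refcount, 1);  /* the creator holds one reference */
    return f;
}

static struct fence *fence_get(struct fence *f)
{
    int old = atomic_fetch_add(&f->refcount, 1);
    /* Sketch-only sanity check: taking a reference on an object whose
     * count has already dropped to zero is a refcounting bug. */
    assert(old > 0);
    (void)old;
    return f;
}

static void fence_put(struct fence *f)
{
    /* Free the fence when the last reference goes away. */
    if (atomic_fetch_sub(&f->refcount, 1) == 1)
        free(f);
}

static void buffer_init(struct buffer *bo)
{
    pthread_mutex_init(&bo->fence_lock, NULL);
    bo->sync_obj = NULL;
}

/* Attach a fence (or NULL); the buffer holds its own reference. */
static void buffer_set_fence(struct buffer *bo, struct fence *f)
{
    struct fence *old;

    if (f)
        fence_get(f);
    pthread_mutex_lock(&bo->fence_lock);
    old = bo->sync_obj;
    bo->sync_obj = f;
    pthread_mutex_unlock(&bo->fence_lock);
    if (old)
        fence_put(old);  /* drop the buffer's reference to the old fence */
}

/* Read the fence pointer and take a reference while the lock is held. */
static struct fence *buffer_get_fence(struct buffer *bo)
{
    struct fence *f;

    pthread_mutex_lock(&bo->fence_lock);
    f = bo->sync_obj;
    if (f)
        fence_get(f);
    pthread_mutex_unlock(&bo->fence_lock);
    return f;  /* the caller must fence_put() a non-NULL result */
}
```

A caller would use buffer_get_fence(), wait on or inspect the returned fence,
then call fence_put(). Reading bo->sync_obj without the lock, or using it
without taking a reference, leaves a window in which another thread can drop
the last reference and free the fence.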