From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756821Ab3BRCwG (ORCPT ); Sun, 17 Feb 2013 21:52:06 -0500
Received: from aserp1040.oracle.com ([141.146.126.69]:49155 "EHLO
	aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754439Ab3BRCwE (ORCPT ); Sun, 17 Feb 2013 21:52:04 -0500
Message-ID: <51219742.1000301@oracle.com>
Date: Sun, 17 Feb 2013 21:51:46 -0500
From: Sasha Levin
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130113 Thunderbird/17.0.2
MIME-Version: 1.0
To: ebiederm@xmission.com
CC: Andrew Morton, serge.hallyn@canonical.com, Dave Jones,
	"linux-kernel@vger.kernel.org", Oleg Nesterov
Subject: Re: BUG in find_pid_ns
References: <512117D5.3050602@oracle.com> <87ppzyqxu6.fsf@xmission.com>
In-Reply-To: <87ppzyqxu6.fsf@xmission.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Source-IP: ucsinet22.oracle.com [156.151.31.94]
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 02/17/2013 07:17 PM, ebiederm@xmission.com wrote:
> The bad pointer value is 0xfffffffffffffff0.  Hmm.
>
> If you have the failure location correct it looks like a corrupted hash
> entry was found while following the hash chain.
>
> It looks like the memory has been set to -16 -EBUSY?  Weird.
>
> It smells like something is stomping on the memory of a struct pid, with
> the same hash value and thus in the same hash chain as the current pid.
>
> Can you reproduce this?

I've just reproduced it again:

[ 2404.518957] BUG: unable to handle kernel paging request at fffffffffffffff0
[ 2404.520024] IP: [] find_pid_ns+0x110/0x1f0
[ 2404.520024] PGD 5429067 PUD 542b067 PMD 0
[ 2404.520024] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 2404.520024] Dumping ftrace buffer:
[ 2404.520024] (ftrace buffer empty)
[ 2404.520024] Modules linked in:
[ 2404.520024] CPU 3
[ 2404.520024] Pid: 6890, comm: trinity Tainted: G W 3.8.0-rc7-next-20130215-sasha-00027-gb399f44-dirty #288
[ 2404.520024] RIP: 0010:[] [] find_pid_ns+0x110/0x1f0
[ 2404.520024] RSP: 0018:ffff8800af1dfe18 EFLAGS: 00010286
[ 2404.520024] RAX: 0000000000000001 RBX: 0000000000004b72 RCX: 0000000000000000
[ 2404.520024] RDX: 0000000000000001 RSI: ffffffff85466e40 RDI: 0000000000000286
[ 2404.520024] RBP: ffff8800af1dfe48 R08: 0000000000000001 R09: 0000000000000001
[ 2404.520024] R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff85466460
[ 2404.520024] R13: ffff8800bf8d3ef8 R14: fffffffffffffff0 R15: ffff8800a43d9a40
[ 2404.520024] FS: 00007f8300f79700(0000) GS:ffff8800bbc00000(0000) knlGS:0000000000000000
[ 2404.520024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2404.520024] CR2: fffffffffffffff0 CR3: 00000000af0b7000 CR4: 00000000000406e0
[ 2404.520024] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2404.520024] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2404.520024] Process trinity (pid: 6890, threadinfo ffff8800af1de000, task ffff8800b060b000)
[ 2404.520024] Stack:
[ 2404.520024] ffffffff85466e40 0000000000004b72 ffff8800af1dfed8 0000000000000000
[ 2404.520024] 0000000000000003 20c49ba5e353f7cf ffff8800af1dfe58 ffffffff81131e5c
[ 2404.520024] ffff8800af1dfec8 ffffffff8112400f ffffffff81123f9c 0000000000000000
[ 2404.520024] Call Trace:
[ 2404.520024] [] find_vpid+0x2c/0x30
[ 2404.520024] [] kill_something_info+0x9f/0x270
[ 2404.673395] [] ? kill_something_info+0x2c/0x270
[ 2404.673395] [] sys_kill+0x88/0xa0
[ 2404.673395] [] ? syscall_trace_enter+0x24/0x2e0
[ 2404.694324] [] ? trace_hardirqs_on_caller+0x128/0x160
[ 2404.694324] [] ? tracesys+0x7e/0xe6
[ 2404.694324] [] tracesys+0xe1/0xe6
[ 2404.694324] Code: 4d 8b 75 00 e8 b2 0e 00 00 85 c0 0f 84 d2 00 00 00 80 3d fa 17 d5 04 00 0f 85 c5 00 00 00 e9 93 00 00 00 0f 1f 84 00 00 00 00 00 <41> 39 1e 75 2b 4d 39 66 08 75 25 41 8b 84 24 20 08 00 00 48 c1
[ 2404.733487] RIP [] find_pid_ns+0x110/0x1f0
[ 2404.740299] RSP
[ 2404.740299] CR2: fffffffffffffff0
[ 2404.740299] ---[ end trace 9f8bc22bbe4fe990 ]---

I'm not sure what debug info I could add that would be helpful. Dump the
entire chain or table if 'pnr' happens to look odd?

> Memory corruption is hard to trace down with just a single data point.
>
> Looking a little closer Sasha you have rewritten
> hlist_for_each_entry_rcu, and that seems to be the most recent patch
> dealing with pids, and we are failing in hlist_for_each_entry_rcu.
>
> I haven't looked at your patch in enough detail to know if you have
> missed something or not, but a brand new patch and a brand new failure
> certainly look suspicious at first glance.

Agreed, I also took a second look at it when this BUG popped up. What
surprises me is that if the new iteration were broken, I'd expect the
kernel to break spectacularly in a bunch of places instead of failing in
the exact same place twice.

I'm not ruling out the possibility that it's broken, though.


Thanks,
Sasha
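
For reference, the two readings of the bad pointer can be sanity-checked from user space. The script below is only a sketch under stated assumptions: it reinterprets the CR2/R14 value as a signed 64-bit number (the "-16" observation), looks that number up as an errno, and computes the address a container_of()-style subtraction would produce from a NULL list node. The `Upid` and `HlistNode` classes are hypothetical stand-ins modelled on struct upid and struct hlist_node; their layout is assumed, not read from the crashing kernel, and the offset only means anything on a 64-bit build. It does not show which explanation is the right one.

```python
import ctypes
import errno

FAULT_ADDR = 0xFFFFFFFFFFFFFFF0  # CR2 and R14 as reported in the oops


def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as two's-complement signed."""
    return value - (1 << 64) if value & (1 << 63) else value


class HlistNode(ctypes.Structure):
    # Hypothetical stand-in for struct hlist_node: two pointers.
    _fields_ = [("next", ctypes.c_void_p), ("pprev", ctypes.c_void_p)]


class Upid(ctypes.Structure):
    # ASSUMED layout modelled on struct upid; not taken from the failing kernel.
    _fields_ = [
        ("nr", ctypes.c_int),
        ("ns", ctypes.c_void_p),
        ("pid_chain", HlistNode),
    ]


signed = as_signed64(FAULT_ADDR)
print("signed value:", signed)
print("errno name, if any:", errno.errorcode.get(-signed))

# Address produced by subtracting the member offset from a NULL node pointer,
# wrapped to 64 bits the way pointer arithmetic would wrap.
null_entry = (0 - Upid.pid_chain.offset) % (1 << 64)
print("entry computed from a NULL node:", hex(null_entry))
```

If the last value printed equals the faulting address, then an iterator computing its entry pointer from a NULL node would be one candidate explanation, alongside the memory corruption theory.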