public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* 2.6.27-rc7-sha1: EIP at proc_sys_compare+0x36/0x50
@ 2008-09-26 15:20 Alexey Dobriyan
  2008-09-26 15:47 ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Alexey Dobriyan @ 2008-09-26 15:20 UTC (permalink / raw)
  To: viro, ebiederm; +Cc: akpm, torvalds, linux-kernel

Gentlemen, this happened while script was slowly rebuilding 300+ configs
sequentially. Very little recompling activity itself, much seeking.

This is first time I see this. No debugging was on, no preemption.

Version: 2.6.27-rc7-c0f4d6d4b14a75a341d972ff73fb9740e1ceb634 +
	atl1 fixlet + "notes" kobject fixlet, but they don't matter.


ffffffff802bc690 <proc_sys_compare>:
ffffffff802bc690:	48 83 ec 08          	sub    $0x8,%rsp
ffffffff802bc694:	8b 46 04             	mov    0x4(%rsi),%eax
ffffffff802bc697:	3b 42 04             	cmp    0x4(%rdx),%eax
ffffffff802bc69a:	49 89 f0             	mov    %rsi,%r8
ffffffff802bc69d:	74 11                	je     ffffffff802bc6b0 <proc_sys_compare+0x20>
ffffffff802bc69f:	b8 01 00 00 00       	mov    $0x1,%eax
ffffffff802bc6a4:	48 83 c4 08          	add    $0x8,%rsp
ffffffff802bc6a8:	c3                   	retq   
ffffffff802bc6a9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
ffffffff802bc6b0:	48 8b 76 08          	mov    0x8(%rsi),%rsi
ffffffff802bc6b4:	48 8b 7a 08          	mov    0x8(%rdx),%rdi
ffffffff802bc6b8:	89 c1                	mov    %eax,%ecx
ffffffff802bc6ba:	fc                   	cld    
ffffffff802bc6bb:	48 39 c9             	cmp    %rcx,%rcx
ffffffff802bc6be:	f3 a6                	repz cmpsb %es:(%rdi),%ds:(%rsi)
ffffffff802bc6c0:	75 dd                	jne    ffffffff802bc69f <proc_sys_compare+0xf>
ffffffff802bc6c2:	49 8b 40 e0          	mov    -0x20(%r8),%rax
ffffffff802bc6c6: ===>	48 8b 78 f0          	mov    -0x10(%rax),%rdi		<===
ffffffff802bc6ca:	e8 71 96 f7 ff       	callq  ffffffff80235d40 <sysctl_is_seen>
ffffffff802bc6cf:	85 c0                	test   %eax,%eax
ffffffff802bc6d1:	0f 94 c0             	sete   %al
ffffffff802bc6d4:	48 83 c4 08          	add    $0x8,%rsp
ffffffff802bc6d8:	0f b6 c0             	movzbl %al,%eax
ffffffff802bc6db:	c3                   	retq   
ffffffff802bc6dc:	0f 1f 40 00          	nopl   0x0(%rax)


[16526.029537] BUG: unable to handle kernel paging request at fffffffffffffff0
[16526.029643] IP: [<ffffffff802bc6c6>] proc_sys_compare+0x36/0x50
[16526.029709] PGD 203067 PUD 204067 PMD 0 
[16526.029769] Oops: 0000 [1] SMP 
[16526.029824] CPU 1 
[16526.029872] Modules linked in: af_packet iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables ext2 nls_utf8 ntfs fuse sr_mod cdrom
[16526.030079] Pid: 20385, comm: sh Not tainted 2.6.27-rc7 #1
[16526.030140] RIP: 0010:[<ffffffff802bc6c6>]  [<ffffffff802bc6c6>] proc_sys_compare+0x36/0x50
[16526.030246] RSP: 0018:ffff880108929c58  EFLAGS: 00010246
[16526.030306] RAX: 0000000000000000 RBX: ffff88005246e000 RCX: 0000000000000000
[16526.030374] RDX: ffff880108929d28 RSI: ffff88005246e0af RDI: ffff88017b4fb01c
[16526.030442] RBP: ffff88005246e018 R08: ffff88005246e030 R09: ffff88017b4fb011
[16526.030509] R10: 0000000000000000 R11: 0000000000000005 R12: 0000000032fdb346
[16526.030577] R13: ffff88005246e0c8 R14: ffff880108929d28 R15: 000000000000000b
[16526.030645] FS:  00002b014ad9af40(0000) GS:ffff88017f809680(0000) knlGS:0000000000000000
[16526.030748] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[16526.030811] CR2: fffffffffffffff0 CR3: 000000006a2d6000 CR4: 00000000000006e0
[16526.030879] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[16526.030946] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[16526.031014] Process sh (pid: 20385, threadinfo ffff880108928000, task ffff88017f8bab80)
[16526.031118] Stack:  0000000000000000 ffffffff80293242 0000000080555ec0 0000000b7f429320
[16526.031233]  ffff88017b4fb011 ffff880108929d28 ffff88017b4fb01c ffff880108929e68
[16526.031345]  ffff880108929d28 ffff880108929d38 ffff88017b4fb000 ffffffff8028900c
[16526.031419] Call Trace:
[16526.031506]  [<ffffffff80293242>] ? __d_lookup+0xe2/0x120
[16526.031568]  [<ffffffff8028900c>] ? do_lookup+0x3c/0x220
[16526.031629]  [<ffffffff8028a895>] ? __link_path_walk+0x815/0xd90
[16526.031693]  [<ffffffff80266ce7>] ? __do_fault+0x1b7/0x400
[16526.031755]  [<ffffffff8028ae64>] ? path_walk+0x54/0xb0
[16526.033334]  [<ffffffff8028af8a>] ? do_path_lookup+0x7a/0x150
[16526.033397]  [<ffffffff8028bdca>] ? __path_lookup_intent_open+0x6a/0xd0
[16526.033464]  [<ffffffff8028c082>] ? do_filp_open+0xc2/0x820
[16526.033526]  [<ffffffff80296bbf>] ? alloc_fd+0x3f/0x130
[16526.033587]  [<ffffffff8027ed88>] ? do_sys_open+0x58/0xb0
[16526.033648]  [<ffffffff8020b65b>] ? system_call_fastpath+0x16/0x1b
[16526.033711] 
[16526.033754] 
[16526.033797] Code: 89 f0 74 11 b8 01 00 00 00 48 83 c4 08 c3 0f 1f 80 00 00 00 00 48 8b 76 08 48 8b 7a 08 89 c1 fc 48 39 c9 f3 a6 75 dd 49 8b 40 e0 <48> 8b 78 f0 e8 71 96 f7 ff 85 c0 0f 94 c0 48 83 c4 08 0f b6 c0 
[16526.034186] RIP  [<ffffffff802bc6c6>] proc_sys_compare+0x36/0x50
[16526.034251]  RSP <ffff880108929c58>
[16526.034301] CR2: fffffffffffffff0
[16526.034622] ---[ end trace ca38d5cc981185f5 ]---

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.27-rc7-sha1: EIP at proc_sys_compare+0x36/0x50
  2008-09-26 15:20 2.6.27-rc7-sha1: EIP at proc_sys_compare+0x36/0x50 Alexey Dobriyan
@ 2008-09-26 15:47 ` Linus Torvalds
  2008-09-27  8:44   ` Eric W. Biederman
  2008-09-28 14:18   ` Al Viro
  0 siblings, 2 replies; 12+ messages in thread
From: Linus Torvalds @ 2008-09-26 15:47 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: viro, ebiederm, akpm, linux-kernel



On Fri, 26 Sep 2008, Alexey Dobriyan wrote:
>
> Gentlemen, this happened while script was slowly rebuilding 300+ configs
> sequentially. Very little recompling activity itself, much seeking.
> 
> This is first time I see this. No debugging was on, no preemption.
> 
> Version: 2.6.27-rc7-c0f4d6d4b14a75a341d972ff73fb9740e1ceb634 +
> 	atl1 fixlet + "notes" kobject fixlet, but they don't matter.
> 
> ffffffff802bc690 <proc_sys_compare>:
....
> ffffffff802bc6c0:	75 dd                	jne    ffffffff802bc69f <proc_sys_compare+0xf>
> ffffffff802bc6c2:	49 8b 40 e0          	mov    -0x20(%r8),%rax
> ffffffff802bc6c6: ===>	48 8b 78 f0          	mov    -0x10(%rax),%rdi		<===
> ffffffff802bc6ca:	e8 71 96 f7 ff       	callq  ffffffff80235d40 <sysctl_is_seen>

That would be the

	sysctl_is_seen(PROC_I(dentry->d_inode)->sysctl)

call, and it really looks like 'dentry->d_inode' is NULL:

> [16526.029537] BUG: unable to handle kernel paging request at fffffffffffffff0

The whole PROC_I() thing just offsets from the inode:

	container_of(inode, struct proc_inode, vfs_inode);

and 'sysctl' is indeed 16 bytes below the vfs inode on x86-64:

	struct proc_inode {
		...
	        struct ctl_table_header *sysctl;
	        struct ctl_table *sysctl_entry;
	        struct inode vfs_inode;
	};

and as far as I can tell, there is nothing to say that a /proc inode 
cannot be a negative dentry. Sure, we try to get rid of them, but during a 
parallel lookup, we will have added the dentry with a NULL inode in the 
other lookup.

So assuming that you have an inode at that point seems to be utter crap.

Now, the whole _function_ is utter crap and should probably be dropped, 
but whatever. That's just another sysctl insanity. In the meantime, 
something like this does look appropriate, no?

Al, did I miss something?

		Linus
---
 fs/proc/proc_sysctl.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index f9a8b89..9435fd0 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -386,6 +386,8 @@ static int proc_sys_compare(struct dentry *dir, struct qstr *qstr,
 		return 1;
 	if (memcmp(qstr->name, name->name, name->len))
 		return 1;
+	if (!dentry->d_inode)
+		return 1;
 	return !sysctl_is_seen(PROC_I(dentry->d_inode)->sysctl);
 }
 

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: 2.6.27-rc7-sha1: EIP at proc_sys_compare+0x36/0x50
  2008-09-26 15:47 ` Linus Torvalds
@ 2008-09-27  8:44   ` Eric W. Biederman
  2008-09-28 20:38     ` Linus Torvalds
  2008-09-28 14:18   ` Al Viro
  1 sibling, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2008-09-27  8:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alexey Dobriyan, viro, akpm, linux-kernel

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, 26 Sep 2008, Alexey Dobriyan wrote:
>>
>> Gentlemen, this happened while script was slowly rebuilding 300+ configs
>> sequentially. Very little recompling activity itself, much seeking.
>> 
>> This is first time I see this. No debugging was on, no preemption.

Consistent with the fact that is new code.

>> Version: 2.6.27-rc7-c0f4d6d4b14a75a341d972ff73fb9740e1ceb634 +
>> 	atl1 fixlet + "notes" kobject fixlet, but they don't matter.
>> 
>> ffffffff802bc690 <proc_sys_compare>:
> ....
>> ffffffff802bc6c0: 75 dd jne ffffffff802bc69f <proc_sys_compare+0xf>
>> ffffffff802bc6c2:	49 8b 40 e0          	mov    -0x20(%r8),%rax
>> ffffffff802bc6c6: ===> 48 8b 78 f0 mov -0x10(%rax),%rdi <===
>> ffffffff802bc6ca: e8 71 96 f7 ff callq ffffffff80235d40 <sysctl_is_seen>
>
> That would be the
>
> 	sysctl_is_seen(PROC_I(dentry->d_inode)->sysctl)
>
> call, and it really looks like 'dentry->d_inode' is NULL:

Agreed.

>> [16526.029537] BUG: unable to handle kernel paging request at fffffffffffffff0
>
> The whole PROC_I() thing just offsets from the inode:
>
> 	container_of(inode, struct proc_inode, vfs_inode);
>
> and 'sysctl' is indeed 16 bytes below the vfs inode on x86-64:
>
> 	struct proc_inode {
> 		...
> 	        struct ctl_table_header *sysctl;
> 	        struct ctl_table *sysctl_entry;
> 	        struct inode vfs_inode;
> 	};
>
> and as far as I can tell, there is nothing to say that a /proc inode 
> cannot be a negative dentry. Sure, we try to get rid of them, but during a 
> parallel lookup, we will have added the dentry with a NULL inode in the 
> other lookup.

Where is that? 

This is an issue because if we can get negative dentries the other
dentry operations are wrong as well.n

We hold the directory mutex in real_lookup
before calling i_op->lookup.  So lookups should be serialized.
proc_sys_lookup always either succeeds and calls d_add with
a valid inode or fails and returns an error code in which case
real_lookup puts the dentry before it is hashed.

We hold the directory mutex in readdir before calling
file->f_op->readdir.  Deep inside of proc_sys_readdir
proc_sys_fill_cache only adds the dentry if it has an inode.

do_revalidate doesn't look like it could get us there.
Nor does any of the paths that call ->d_delete.

It looks like don't free the inode from an non-negative
dentry until after it is unhashed.

So I'm totally stumped as to how we can get there.

> So assuming that you have an inode at that point seems to be utter crap.

Clearly we don't have an inode there but I don't see we can end up with
a negative inode.

> Now, the whole _function_ is utter crap and should probably be dropped, 
> but whatever. That's just another sysctl insanity. In the meantime, 
> something like this does look appropriate, no?

If we can't avoid negative dentries that looks appropriate.

But won't .d_revalidate and .d_delete be susceptible to the same
problem if we do have negative dentries?  

Eric

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.27-rc7-sha1: EIP at proc_sys_compare+0x36/0x50
  2008-09-26 15:47 ` Linus Torvalds
  2008-09-27  8:44   ` Eric W. Biederman
@ 2008-09-28 14:18   ` Al Viro
  2008-09-28 19:28     ` Hugh Dickins
  2008-09-28 20:46     ` Linus Torvalds
  1 sibling, 2 replies; 12+ messages in thread
From: Al Viro @ 2008-09-28 14:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alexey Dobriyan, ebiederm, akpm, linux-kernel

On Fri, Sep 26, 2008 at 08:47:51AM -0700, Linus Torvalds wrote:

> and as far as I can tell, there is nothing to say that a /proc inode 
> cannot be a negative dentry. Sure, we try to get rid of them, but during a 
> parallel lookup, we will have added the dentry with a NULL inode in the 
> other lookup.
> 
> So assuming that you have an inode at that point seems to be utter crap.
> 
> Now, the whole _function_ is utter crap and should probably be dropped, 
> but whatever. That's just another sysctl insanity. In the meantime, 
> something like this does look appropriate, no?
> 
> Al, did I miss something?

The real underlying bug, whatever it is.  If this sucker ever becomes
negative, we have a big problem.  Where _could_ that happen?  Remember,
we do not allow ->rmdir() and ->unlink() to succeed there.  So d_delete()
callers in namei.c are out of question.  We also never do d_add() with
NULL inode in there.  We _might_ be doing a bogus d_rehash() on a negative
/prooc/sys/<something> dentry that had never been hashed to start with
somewhere in generic code, but... I don't see where that could happen.
vfs_rename_dir() with negative new_dentry would have to get it from
something and that would have to be ->lookup().  And that sucker returns
ERR_PTR() or a positive dentry in all cases here.  d_splice_alias() is not
used there at all; d_move_locked() would scream bloody murder if dentry
it's rehashing is negative.  d_materialize_unique() and d_add_unique()
are not used.  So just WTF is creating this sucker?

IOW, your patch will probably be enough to stop the visible problem, but
I would dearly like to understand what's really causing it.  It appears to
be a refcounting breakage somewhere and we have *another* bug report that
smells like that - it seems like we sometimes end up with negative dentry
on alias list of an inode (outside of /proc/sys, AFAICT).  Something really
fishy is going on...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.27-rc7-sha1: EIP at proc_sys_compare+0x36/0x50
  2008-09-28 14:18   ` Al Viro
@ 2008-09-28 19:28     ` Hugh Dickins
  2008-09-28 20:55       ` Linus Torvalds
  2008-09-28 20:46     ` Linus Torvalds
  1 sibling, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2008-09-28 19:28 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Alexey Dobriyan, ebiederm, akpm, linux-kernel

On Sun, 28 Sep 2008, Al Viro wrote:
> On Fri, Sep 26, 2008 at 08:47:51AM -0700, Linus Torvalds wrote:
> 
> > and as far as I can tell, there is nothing to say that a /proc inode 
> > cannot be a negative dentry. Sure, we try to get rid of them, but during a 
> > parallel lookup, we will have added the dentry with a NULL inode in the 
> > other lookup.
> > 
> > So assuming that you have an inode at that point seems to be utter crap.
> > 
> > Now, the whole _function_ is utter crap and should probably be dropped, 
> > but whatever. That's just another sysctl insanity. In the meantime, 
> > something like this does look appropriate, no?
> > 
> > Al, did I miss something?
> 
> The real underlying bug, whatever it is.  If this sucker ever becomes
> negative, we have a big problem.  Where _could_ that happen?  Remember,
> we do not allow ->rmdir() and ->unlink() to succeed there.  So d_delete()
> callers in namei.c are out of question.  We also never do d_add() with
> NULL inode in there.  We _might_ be doing a bogus d_rehash() on a negative
> /prooc/sys/<something> dentry that had never been hashed to start with
> somewhere in generic code, but... I don't see where that could happen.
> vfs_rename_dir() with negative new_dentry would have to get it from
> something and that would have to be ->lookup().  And that sucker returns
> ERR_PTR() or a positive dentry in all cases here.  d_splice_alias() is not
> used there at all; d_move_locked() would scream bloody murder if dentry
> it's rehashing is negative.  d_materialize_unique() and d_add_unique()
> are not used.  So just WTF is creating this sucker?
> 
> IOW, your patch will probably be enough to stop the visible problem, but
> I would dearly like to understand what's really causing it.  It appears to
> be a refcounting breakage somewhere and we have *another* bug report that
> smells like that - it seems like we sometimes end up with negative dentry
> on alias list of an inode (outside of /proc/sys, AFAICT).  Something really
> fishy is going on...

I got a couple of earlier instances of this on powerpc
http://lkml.org/lkml/2008/8/14/289
but saw nothing more of it, so asked Al to forget about it.

But today I've got it again, this time on x86_64, with kdb in
(but not serial console), similar kernel builds with swapping
loads as before.  Though with Andrew's latest mmotm, so some
details different from 2.6.27-rc, and could be an mmotm bug.

The dentry in question (it's for /proc/sys/kernel/ngroups_max)
looks as if the __d_drop and d_kill of prune_one_dentry() came
in on one cpu just after __d_lookup() had found the entry on
parent's hashlist, just before it acquired dentry->d_lock.

That's plausible, isn't it, and would account for the rarity,
and would say Linus's patch is good?

Do ask me for any details you'd like out of the dentry.

Hugh

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.27-rc7-sha1: EIP at proc_sys_compare+0x36/0x50
  2008-09-27  8:44   ` Eric W. Biederman
@ 2008-09-28 20:38     ` Linus Torvalds
  0 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2008-09-28 20:38 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Alexey Dobriyan, viro, akpm, linux-kernel



On Sat, 27 Sep 2008, Eric W. Biederman wrote:
> 
> We hold the directory mutex in real_lookup
> before calling i_op->lookup.

Irrelevant.

The d_compare function si called _before_ we get the directory mutex. It's 
done purely under dentry->dlock (and the RCU read lock).

So we have absolutely no directory-level locking here.

> So lookups should be serialized.

lookups are serialized before calling into the filesystem with ->lookup, 
but _not_ at the d_compare level. If we serialized d_compare, the dentry 
cache would be no use at all, we'd serialize all lookups, cached or not.

(Of course, sane filesystems will not have d_compare at all, just the 
memcmp, but we're talking /proc here - although I forget why it wants to 
do that insane d_compare thing)

So forget the directory mutex. d_compare is much more low-level than that. 
It can hit a dentry that is being unhashed at the same time for whatever 
reason (memory pressure, whatever).

The fact is, the "->lookup()" function is meant to be easy for filesystems 
to use, but d_compare i really low-level dentry magic and is meant to look 
at just the *name*. It's meant to be a replacement for memcpy() for things 
like case-independent comparisons or strange utf-8 rules (ie "equivalent" 
characters). It has no real locking, since the names are supposed to be 
"stable" anyway (the dentry->d_lock should guarantee that the dentry isn't 
being renamed etc).

I may be missing something, of course, but the dentry eas actually found 
_before_ takign even the dentry->d_lock, so the dentry we call compare on 
may not even be *valid* any more, because something might have unhashed it 
_after_ we found it, but _before_ we got d_lock.

d_compare() really is pretty special (d_revalidate is too, for that 
matter, although at least _slightly_ less so: by that time we have at 
least tested that the UNHASHED bit isn't set because we raced with 
removal etc).

		Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.27-rc7-sha1: EIP at proc_sys_compare+0x36/0x50
  2008-09-28 14:18   ` Al Viro
  2008-09-28 19:28     ` Hugh Dickins
@ 2008-09-28 20:46     ` Linus Torvalds
  2008-09-28 20:50       ` Linus Torvalds
  1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2008-09-28 20:46 UTC (permalink / raw)
  To: Al Viro; +Cc: Alexey Dobriyan, ebiederm, akpm, linux-kernel



On Sun, 28 Sep 2008, Al Viro wrote:
>
> > Al, did I miss something?
> 
> The real underlying bug, whatever it is.  If this sucker ever becomes
> negative, we have a big problem.  Where _could_ that happen?  Remember,
> we do not allow ->rmdir() and ->unlink() to succeed there.

What about pure memory pressure? We're holding only the RCU read-side lock 
when looking up dentries, and if there is any memory pressure, the 
dentries may be unhashed and the inodes removed in parallel. Yes, yes, we 
end up not actually _releasing_ the dentry, since it's all RCU, but it 
will set D_UNHASHED and be scheduled for releasing later under RCU.

And d_compare() is called before we have done any validation that the name 
is still active, including checking whether it even got released already! 

I dunno. Do we want to move the D_UNHASHED check up earlier? Or am I still 
missing something?

		Linus

----
 fs/dcache.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 80e9395..e7a1a99 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1395,6 +1395,10 @@ struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
 		if (dentry->d_parent != parent)
 			goto next;
 
+		/* non-existing due to RCU? */
+		if (d_unhashed(dentry))
+			goto next;
+
 		/*
 		 * It is safe to compare names since d_move() cannot
 		 * change the qstr (protected by d_lock).
@@ -1410,10 +1414,8 @@ struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
 				goto next;
 		}
 
-		if (!d_unhashed(dentry)) {
-			atomic_inc(&dentry->d_count);
-			found = dentry;
-		}
+		atomic_inc(&dentry->d_count);
+		found = dentry;
 		spin_unlock(&dentry->d_lock);
 		break;
 next:

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: 2.6.27-rc7-sha1: EIP at proc_sys_compare+0x36/0x50
  2008-09-28 20:46     ` Linus Torvalds
@ 2008-09-28 20:50       ` Linus Torvalds
  0 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2008-09-28 20:50 UTC (permalink / raw)
  To: Al Viro; +Cc: Alexey Dobriyan, ebiederm, akpm, linux-kernel



On Sun, 28 Sep 2008, Linus Torvalds wrote:
> 
> I dunno. Do we want to move the D_UNHASHED check up earlier? Or am I still 
> missing something?

Oh, it's not D_UNHASHED, it's DCACHE_UNHASHED. Whatever. The patch still 
looks fine, I had just forgotten the naming of the d_flags fields.

		Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.27-rc7-sha1: EIP at proc_sys_compare+0x36/0x50
  2008-09-28 19:28     ` Hugh Dickins
@ 2008-09-28 20:55       ` Linus Torvalds
  2008-09-28 20:59         ` Linus Torvalds
  2008-09-29  3:05         ` Eric W. Biederman
  0 siblings, 2 replies; 12+ messages in thread
From: Linus Torvalds @ 2008-09-28 20:55 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Al Viro, Alexey Dobriyan, ebiederm, akpm, linux-kernel



On Sun, 28 Sep 2008, Hugh Dickins wrote:
> 
> I got a couple of earlier instances of this on powerpc
> http://lkml.org/lkml/2008/8/14/289
> but saw nothing more of it, so asked Al to forget about it.
> 
> But today I've got it again, this time on x86_64, with kdb in
> (but not serial console), similar kernel builds with swapping
> loads as before.  Though with Andrew's latest mmotm, so some
> details different from 2.6.27-rc, and could be an mmotm bug.

Ok, you were definitely under memory pressure, and yes, it looks like the 
exact same bug on ppc64 - access to a pointer that is two poointers offset 
down from NULL.

> The dentry in question (it's for /proc/sys/kernel/ngroups_max)
> looks as if the __d_drop and d_kill of prune_one_dentry() came
> in on one cpu just after __d_lookup() had found the entry on
> parent's hashlist, just before it acquired dentry->d_lock.

Yes. 

> That's plausible, isn't it, and would account for the rarity,
> and would say Linus's patch is good?
> 
> Do ask me for any details you'd like out of the dentry.

I actually like my second patch better - it looks simpler, and it means 
that the rules for filesystems using d_compare() are a bit clearer: at 
least we'll only pass them dentries to look at that haven't gone through 
d_drop (and we do hold dentry->d_lock that serializes all of that).

So here it is again (I sent it out just minutes ago, but you weren't on 
that cc, you must have picked this up off the kernel list)

NOTE! Totally untested patch! It looks sane and really obvious, but maybe 
it has some insane and non-obvious bug.

		Linus

---
 fs/dcache.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 80e9395..e7a1a99 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1395,6 +1395,10 @@ struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
 		if (dentry->d_parent != parent)
 			goto next;
 
+		/* non-existing due to RCU? */
+		if (d_unhashed(dentry))
+			goto next;
+
 		/*
 		 * It is safe to compare names since d_move() cannot
 		 * change the qstr (protected by d_lock).
@@ -1410,10 +1414,8 @@ struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
 				goto next;
 		}
 
-		if (!d_unhashed(dentry)) {
-			atomic_inc(&dentry->d_count);
-			found = dentry;
-		}
+		atomic_inc(&dentry->d_count);
+		found = dentry;
 		spin_unlock(&dentry->d_lock);
 		break;
 next:

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: 2.6.27-rc7-sha1: EIP at proc_sys_compare+0x36/0x50
  2008-09-28 20:55       ` Linus Torvalds
@ 2008-09-28 20:59         ` Linus Torvalds
  2008-09-28 22:07           ` Hugh Dickins
  2008-09-29  3:05         ` Eric W. Biederman
  1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2008-09-28 20:59 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Al Viro, Alexey Dobriyan, ebiederm, akpm, linux-kernel



On Sun, 28 Sep 2008, Linus Torvalds wrote:
> 
> NOTE! Totally untested patch! It looks sane and really obvious, but maybe 
> it has some insane and non-obvious bug.

Oh. I think I see at least a _potential_ insane and non-obvious bug: if 
somebody actually is going to do a __d_drop() _inside_ their d_compare(), 
this would fail horribly because we now assume that the dentry is still 
fine, since we held d_lock.

Of course, I think that would be very very buggy of a filesystem to do (we 
don't even pass in the dentry as an argument - you have to figure it out 
from the qstr, and a filesystem really should not do that!), but /proc 
_does_ look up the dentry in question, maybe some other insane filesystem 
does too and then does the __d_drop.

I'm not seeing it, though. So I still think the patch is sane and good, 
but somebody really needs to double- or triple-check me on it.

		Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.27-rc7-sha1: EIP at proc_sys_compare+0x36/0x50
  2008-09-28 20:59         ` Linus Torvalds
@ 2008-09-28 22:07           ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2008-09-28 22:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Alexey Dobriyan, ebiederm, akpm, linux-kernel

On Sun, 28 Sep 2008, Linus Torvalds wrote:
> On Sun, 28 Sep 2008, Linus Torvalds wrote:
> > 
> > NOTE! Totally untested patch! It looks sane and really obvious, but maybe 
> > it has some insane and non-obvious bug.

Looks good to me, nicer than the first, and would have prevented my
oops today (if I'm interpreting it correctly: certainly I do have
DCACHE_UNHASHED set).

> 
> Oh. I think I see at least a _potential_ insane and non-obvious bug: if 
> somebody actually is going to do a __d_drop() _inside_ their d_compare(), 
> this would fail horribly because we now assume that the dentry is still 
> fine, since we held d_lock.
> 
> Of course, I think that would be very very buggy of a filesystem to do (we 
> don't even pass in the dentry as an argument - you have to figure it out 
> from the qstr, and a filesystem really should not do that!), but /proc 
> _does_ look up the dentry in question, maybe some other insane filesystem 
> does too and then does the __d_drop.

I agree that would be insane.  There's no end to the weird things
a filesystem _could_ do in its d_compare, but it is supposed to be
about comparison, and every filesystem I can see in the tree treats
it as such.

> 
> I'm not seeing it, though. So I still think the patch is sane and good, 
> but somebody really needs to double- or triple-check me on it.

Hugh

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.27-rc7-sha1: EIP at proc_sys_compare+0x36/0x50
  2008-09-28 20:55       ` Linus Torvalds
  2008-09-28 20:59         ` Linus Torvalds
@ 2008-09-29  3:05         ` Eric W. Biederman
  1 sibling, 0 replies; 12+ messages in thread
From: Eric W. Biederman @ 2008-09-29  3:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Hugh Dickins, Al Viro, Alexey Dobriyan, akpm, linux-kernel

Linus Torvalds <torvalds@linux-foundation.org> writes:

> I actually like my second patch better - it looks simpler, and it means 
> that the rules for filesystems using d_compare() are a bit clearer: at 
> least we'll only pass them dentries to look at that haven't gone through 
> d_drop (and we do hold dentry->d_lock that serializes all of that).
>
> So here it is again (I sent it out just minutes ago, but you weren't on 
> that cc, you must have picked this up off the kernel list)
>
> NOTE! Totally untested patch! It looks sane and really obvious, but maybe 
> it has some insane and non-obvious bug.

We definitely have a race between d_kill setting dentry->d_inode = NULL
and proc_sys_compare reading d_inode. 

We don't generate negative dentries for /proc/sys.

In dput atomic_dec_and_lock takes the lock before setting the count to 0.
So there is no race there.

Testing for d_unhashed and getting us out of rcu limbo before calling
into the filesystem methods makes the reasoning a lot clearer.


Looks good to me.

Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2008-09-29  3:05 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-09-26 15:20 2.6.27-rc7-sha1: EIP at proc_sys_compare+0x36/0x50 Alexey Dobriyan
2008-09-26 15:47 ` Linus Torvalds
2008-09-27  8:44   ` Eric W. Biederman
2008-09-28 20:38     ` Linus Torvalds
2008-09-28 14:18   ` Al Viro
2008-09-28 19:28     ` Hugh Dickins
2008-09-28 20:55       ` Linus Torvalds
2008-09-28 20:59         ` Linus Torvalds
2008-09-28 22:07           ` Hugh Dickins
2008-09-29  3:05         ` Eric W. Biederman
2008-09-28 20:46     ` Linus Torvalds
2008-09-28 20:50       ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox