* fs: lockup on rename_mutex in fs/dcache.c:1035
From: Sasha Levin @ 2014-10-26  1:39 UTC
To: Al Viro
Cc: Eric W. Biederman, linux-fsdevel, LKML, Dave Jones

Hi all,

While fuzzing with trinity inside a KVM tools guest running the latest -next
kernel, I've stumbled on the following spew:

[ 6172.870045] =============================================
[ 6172.870045] [ INFO: possible recursive locking detected ]
[ 6172.870045] 3.18.0-rc1-next-20141023-sasha-00036-g4dcabd5 #1415 Not tainted
[ 6172.870045] ---------------------------------------------
[ 6172.870045] trinity-c55/12752 is trying to acquire lock:
[ 6172.870045] (rename_lock){+.+...}, at: d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)
[ 6172.870045]
[ 6172.870045] but task is already holding lock:
[ 6172.870045] (rename_lock){+.+...}, at: d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)
[ 6172.870045]
[ 6172.870045] other info that might help us debug this:
[ 6172.900904] FAULT_INJECTION: forcing a failure
[ 6172.870045]  Possible unsafe locking scenario:
[ 6172.870045]
[ 6172.870045]        CPU0
[ 6172.870045]        ----
[ 6172.870045]   lock(rename_lock);
[ 6172.870045]   lock(rename_lock);
[ 6172.870045]
[ 6172.870045]  *** DEADLOCK ***
[ 6172.870045]
[ 6172.870045]  May be due to missing lock nesting notation
[ 6172.870045]
[ 6172.870045] 1 lock held by trinity-c55/12752:
[ 6172.870045] #0: (rename_lock){+.+...}, at: d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)
[ 6172.870045]
[ 6172.870045] stack backtrace:
[ 6172.870045] CPU: 1 PID: 12752 Comm: trinity-c55 Not tainted 3.18.0-rc1-next-20141023-sasha-00036-g4dcabd5 #1415
[ 6172.870045]  ffff88070945b000 0000000000000000 ffffffffb6312fd0 ffff8806d69ff728
[ 6172.870045]  ffffffffa8ff75ca 0000000000000011 ffffffffb6312fd0 ffff8806d69ff828
[ 6172.870045]  ffffffff9f3a012b ffff880107fd76c0 ffff880107fd76c0 0000000000000001
[ 6172.870045] Call Trace:
[ 6172.870045] dump_stack (lib/dump_stack.c:52)
[ 6172.870045] validate_chain.isra.10 (kernel/locking/lockdep.c:2134)
[ 6172.870045] ? kvm_clock_read (./arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
[ 6172.870045] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 6172.870045] __lock_acquire (kernel/locking/lockdep.c:3184)
[ 6172.870045] lock_acquire (./arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3604)
[ 6172.870045] ? d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)
[ 6172.870045] ? put_lock_stats.isra.7 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[ 6172.870045] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
[ 6172.870045] ? d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)
[ 6172.870045] d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)
[ 6172.870045] ? d_walk (fs/dcache.c:1087)
[ 6172.870045] ? d_drop (fs/dcache.c:1336)
[ 6172.870045] ? select_collect (fs/dcache.c:1323)
[ 6172.870045] d_invalidate (fs/dcache.c:1381)
[ 6172.870045] lookup_fast (fs/namei.c:1440)
[ 6172.870045] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[ 6172.870045] ? lockref_put_or_lock (lib/lockref.c:135)
[ 6172.870045] ? dput (fs/dcache.c:626)
[ 6172.870045] walk_component (fs/namei.c:1545)
[ 6172.870045] ? security_inode_permission (security/security.c:573)
[ 6172.870045] ? __inode_permission (fs/namei.c:418)
[ 6172.870045] link_path_walk (fs/namei.c:1805)
[ 6172.870045] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 6172.870045] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2559 kernel/locking/lockdep.c:2601)
[ 6172.870045] ? getname_flags (fs/namei.c:145)
[ 6172.870045] path_lookupat (fs/namei.c:1956)
[ 6172.870045] ? kmem_cache_alloc (include/linux/rcupdate.h:479 include/trace/events/kmem.h:53 mm/slub.c:2463)
[ 6172.870045] ? getname_flags (fs/namei.c:145)
[ 6172.870045] filename_lookup (fs/namei.c:2000)
[ 6172.870045] user_path_at_empty (fs/namei.c:2151)
[ 6172.870045] ? preempt_count_sub (kernel/sched/core.c:2662)
[ 6172.870045] user_path_at (fs/namei.c:2162)
[ 6172.870045] vfs_fstatat (fs/stat.c:106)
[ 6172.870045] SYSC_newlstat (fs/stat.c:284)
[ 6172.870045] ? syscall_trace_enter_phase2 (arch/x86/kernel/ptrace.c:1598)
[ 6172.870045] ? tracesys_phase2 (arch/x86/kernel/entry_64.S:518)
[ 6172.870045] SyS_newlstat (fs/stat.c:277)
[ 6172.870045] tracesys_phase2 (arch/x86/kernel/entry_64.S:529)

The machine proceeded to actually lock up, so this isn't a false positive.

Thanks,
Sasha

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Al Viro @ 2014-10-26  2:56 UTC
To: Sasha Levin
Cc: Eric W. Biederman, linux-fsdevel, LKML, Dave Jones

On Sat, Oct 25, 2014 at 09:39:23PM -0400, Sasha Levin wrote:
> [ 6172.870045] trinity-c55/12752 is trying to acquire lock:
> [ 6172.870045] (rename_lock){+.+...}, at: d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)
> [ 6172.870045]
> [ 6172.870045] but task is already holding lock:
> [ 6172.870045] (rename_lock){+.+...}, at: d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)

Umm...  So we either have left d_walk() without dropping rename_lock, or we
have called d_walk() from something called from d_walk()?  And the trace
would seem to point towards the former...  Ouch.

For that to happen, we would need to
	* get to rename_retry with retry being true
	* after that get D_WALK_NORETRY from enter()
	* somehow get to rename_retry *again*

Moreover, we couldn't get there via
		if (need_seqretry(&rename_lock, seq)) {
			spin_unlock(&this_parent->d_lock);
			goto rename_retry;
- seq is 1 by that point, and need_seqretry() returns 0 in that case.  IOW,
it must have been this:
		/*
		 * might go back up the wrong parent if we have had a rename
		 * or deletion
		 */
		if (this_parent != child->d_parent ||
		    (child->d_flags & DCACHE_DENTRY_KILLED) ||
		    need_seqretry(&rename_lock, seq)) {
			spin_unlock(&this_parent->d_lock);
			rcu_read_unlock();
			goto rename_retry;
		}

And we had been holding rename_lock in that walk, so d_move() should've been
excluded...  Which leaves us with (child->d_flags & DCACHE_DENTRY_KILLED)
being true...  Hrm.  AFAICS, it *is* possible to hit that one - just have
the last reference to child dropped between spin_unlock(&child->d_lock);
and spin_lock(&this_parent->d_lock); a few lines above.

And yes, if that happens we are in shit - rename_retry will see retry being
false and return, without noticing that on this pass we had been holding
rename_lock.  Easily fixed, fortunately - delta below ought to take care of
that...  Comments?  AFAICS, it's -stable fodder, the bug going all the way
back to at least 3.7...

diff --git a/fs/dcache.c b/fs/dcache.c
index 3ffef7f..65f4aff 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1114,12 +1114,13 @@ resume:
 
 out_unlock:
 	spin_unlock(&this_parent->d_lock);
+out:
 	done_seqretry(&rename_lock, seq);
 	return;
 
 rename_retry:
 	if (!retry)
-		return;
+		goto out;
 	seq = 1;
 	goto again;
 }

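[Editorial note: the invariant Al's patch restores (every exit from d_walk()
passes through done_seqretry()) can be sketched with a toy model of the
read_seqbegin_or_lock() pattern.  Names and behaviour below are simplified
stand-ins of my own, not the kernel's actual seqlock implementation: an even
seq means "lockless optimistic pass", an odd seq means "rename_lock held
exclusive".]

```c
#include <stdbool.h>

/* Toy model of the seqlock pattern d_walk() uses.  seq even = lockless
 * optimistic pass, seq odd = rename_lock held exclusive.  Kept only to
 * show why skipping done_seqretry() on the locked pass leaks the lock. */

static bool rename_lock_held;   /* stands in for the real rename_lock */

static void read_seqbegin_or_lock_toy(int *seq)
{
        if (!(*seq & 1))
                *seq = 0;                /* lockless: sample the seqcount */
        else
                rename_lock_held = true; /* odd seq: take the lock */
}

static int need_seqretry_toy(int seq, bool writer_ran)
{
        /* A locked (odd-seq) pass never needs a seqcount retry... */
        return !(seq & 1) && writer_ran;
}

static void done_seqretry_toy(int seq)
{
        if (seq & 1)                      /* ...but it does hold the lock, */
                rename_lock_held = false; /* which must be dropped here */
}
```

The bug described above is exactly an exit path (rename_retry with retry
false) that returned without reaching done_seqretry() on an odd-seq pass,
leaving the equivalent of rename_lock_held true forever.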
* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Eric W. Biederman @ 2014-10-26  3:01 UTC
To: Sasha Levin
Cc: Al Viro, linux-fsdevel, LKML, Dave Jones

Sasha Levin <sasha.levin@oracle.com> writes:

> Hi all,
>
> While fuzzing with trinity inside a KVM tools guest running the latest -next
> kernel, I've stumbled on the following spew:

Weird.  I took a quick look and I don't see any changes in d_walk in Al's
tree or in Linus's kernel for years.

Has read_seqbegin_or_lock changed somewhere?

From a quick reading of the code it simply isn't possible for d_walk to
take the lock twice short of memory corruption.  And the fact that the
code has not changed in years seems to suggest it isn't the obvious
cause of d_walk taking the rename_lock twice.

Eric

> [ 6172.870045] =============================================
> [ 6172.870045] [ INFO: possible recursive locking detected ]
> [ ... rest of the lockdep report and backtrace quoted verbatim;
>   trimmed here - identical to the original report above ... ]
>
> The machine proceeded to actually lock up, so this isn't a false positive.
>
> Thanks,
> Sasha

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Al Viro @ 2014-10-26  3:06 UTC
To: Eric W. Biederman
Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones

On Sat, Oct 25, 2014 at 08:01:41PM -0700, Eric W. Biederman wrote:
> Weird.  I took a quick look and I don't see any changes in d_walk in Al's
> tree or in Linus's kernel for years.
>
> Has read_seqbegin_or_lock changed somewhere?
>
> From a quick reading of the code it simply isn't possible for d_walk to
> take the lock twice short of memory corruption.  And the fact that the
> code has not changed in years seems to suggest it isn't the obvious
> cause of d_walk taking the rename_lock twice.

It is a fairly obvious case of d_walk() forgetting to drop rename_lock.
See upthread for analysis and (hopefully) a fix.

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Al Viro @ 2014-10-26  3:51 UTC
To: Eric W. Biederman
Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones

On Sun, Oct 26, 2014 at 03:06:08AM +0000, Al Viro wrote:
> > From a quick reading of the code it simply isn't possible for d_walk to
> > take the lock twice short of memory corruption.  And the fact that the
> > code has not changed in years seems to suggest it isn't the obvious
> > cause of d_walk taking the rename_lock twice.
>
> It is a fairly obvious case of d_walk() forgetting to drop rename_lock.
> See upthread for analysis and (hopefully) a fix.

... except that it's not a full fix.  If we get there that way with retry
being true, we will immediately deadlock at again:...  Which might very well
have happened in this case - i.e. it might be a single call of d_walk()
taking the sucker twice that way.

Hmm...  Actually, the comment in there is simply wrong - if the child got
killed between unlocking the child and locking the parent, it's not
ascending to the wrong parent, it's having no way to find the next sibling.

OK, so basically it came from Nick's "fs: dcache avoid starvation in dcache
multi-step operations" and the mistake was in the assumption that once we
hold rename_lock, nothing is going to need rename_retry.  Which isn't true -
dentry_kill() on a child while we are trying to get ->d_lock on the parent
requires a restart and that isn't excluded by rename_lock at all.

Well, the brute-force fix would be this, but I wonder if it's going to
create livelocks...

diff --git a/fs/dcache.c b/fs/dcache.c
index 3ffef7f..e3d8499 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1118,6 +1118,7 @@ out_unlock:
 	return;
 
 rename_retry:
+	done_seqretry(&rename_lock, seq);
 	if (!retry)
 		return;
 	seq = 1;

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Al Viro @ 2014-10-26  3:57 UTC
To: Linus Torvalds
Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones, Eric W. Biederman

[context for Linus]

Fuzzer has triggered a deadlock in d_walk() with rename_lock taken twice.
AFAICS, the plausible scenario is
	(child->d_flags & DCACHE_DENTRY_KILLED) ||
triggering while ascending to the parent, on a pass with rename_lock already
held exclusive.  In that case we go to rename_retry and either return
without unlocking rename_lock, or try to take it exclusive one more time,
again without unlocking it first.

> Hmm...  Actually, the comment in there is simply wrong - if the child
> got killed between unlocking the child and locking the parent, it's
> not ascending to the wrong parent, it's having no way to find the next
> sibling.
>
> OK, so basically it came from Nick's "fs: dcache avoid starvation in dcache
> multi-step operations" and the mistake was in the assumption that once we
> hold rename_lock, nothing is going to need rename_retry.  Which isn't
> true - dentry_kill() on a child while we are trying to get ->d_lock on
> the parent requires a restart and that isn't excluded by rename_lock at
> all.
>
> Well, the brute-force fix would be this, but I wonder if it's going to
> create livelocks...
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 3ffef7f..e3d8499 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1118,6 +1118,7 @@ out_unlock:
>  	return;
>  
>  rename_retry:
> +	done_seqretry(&rename_lock, seq);
>  	if (!retry)
>  		return;
>  	seq = 1;

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Linus Torvalds @ 2014-10-26 18:56 UTC
To: Al Viro
Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones, Eric W. Biederman

On Sat, Oct 25, 2014 at 8:57 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> [context for Linus]
>
> Fuzzer has triggered a deadlock in d_walk() with rename_lock taken twice.
> AFAICS, the plausible scenario is
> 	(child->d_flags & DCACHE_DENTRY_KILLED) ||
> triggering while ascending to the parent, on a pass with rename_lock
> already held exclusive.  In that case we go to rename_retry and either
> return without unlocking rename_lock, or try to take it exclusive one
> more time, again without unlocking it first.

Your patch looks fine, and I don't think we can livelock - because we
always set 'seq' to 1 if we retry, and that causes us to get the exclusive
lock, so we'd better not then enter the retry loop again.

Although I guess not only renames cause it - looks like we get to that
mis-named "rename_retry" for a deletion too according to the comment.
Although I'm not sure that comment is correct - I don't think we change
d_parent for deletes, do we?

So at a quick glance, the one-liner patch looks fine.

That said, I really detest how that code is written.  Especially if the
rename_retry really can only happen once, I get the feeling that we would
be *much* better off doing this explicitly with a wrapper function, and do
something like

	void d_walk(,..)
	{
		/* Try non-exclusive first */
		seq = read_seqbegin(lock);
		retry = __d_walk(.. seq);
		if (retry) {
			read_seqlock_excl(lock);
			// lock->seqcount is now guaranteed stable
			retry = __d_walk(.. lock->seqcount);
			read_sequnlock_excl(lock);
			WARN_ON_ONCE(retry, "Really?");
		}
	}

or whatever.  But maybe I'm missing some reason why the above is just
crazy, and I'm a moron.  Entirely possible.

		Linus

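[Editorial note: Linus's sketch above can be made runnable as a toy.  The
function names mirror his pseudocode, but the walk body is faked by a flag
saying whether the lockless pass raced with a writer; none of this is real
dcache code, it only demonstrates the "one lockless pass, then at most one
locked pass" shape.]

```c
#include <stdbool.h>

/* Toy of the proposed restructuring: one optimistic lockless walk, then
 * at most one walk under the exclusive lock.  __d_walk_toy() fakes the
 * tree walk; "raced" says whether a concurrent rename/kill would have
 * invalidated the lockless pass.  Illustrative only. */

static int walk_passes;         /* counts how often the walk body runs */

static bool __d_walk_toy(bool exclusive, bool raced)
{
        walk_passes++;
        /* Under the exclusive lock nothing can change underneath us,
         * so the locked walk can never report a retry. */
        return !exclusive && raced;
}

static bool d_walk_toy(bool raced)
{
        bool retry;

        /* Try non-exclusive first */
        retry = __d_walk_toy(false, raced);
        if (retry) {
                /* read_seqlock_excl(&rename_lock) in the real thing */
                retry = __d_walk_toy(true, raced);
                /* read_sequnlock_excl(&rename_lock) */
        }
        return retry;   /* always false: at most two passes, no loop */
}
```

This makes Linus's no-livelock argument concrete: the retry state cannot
recur because the second pass runs with the writer excluded.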
* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Al Viro @ 2014-10-26 19:13 UTC
To: Linus Torvalds
Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones, Eric W. Biederman

On Sun, Oct 26, 2014 at 11:56:08AM -0700, Linus Torvalds wrote:
> Your patch looks fine, and I don't think we can livelock - because we
> always set 'seq' to 1 if we retry, and that causes us to get the
> exclusive lock, so we'd better not then enter the retry loop again.
> Although I guess not only renames cause it - looks like we get to that
> mis-named "rename_retry" for a deletion too according to the comment.
> Although I'm not sure that comment is correct - I don't think we
> change d_parent for deletes, do we?

The comment is not correct.  dentry_kill() won't screw the pointer to the
parent; it will, however, screw the pointer to the next sibling.

It used to screw the pointer to the parent (which is what the first part of
the condition was about).  After Nick's series back in January 2011 that
stopped being true.  However, dentry_kill() does
list_del(&dentry->d_u.d_child).  Which means that we can't continue past
that point if it has happened.  Trond noticed the breakage a bit later and
added an explicit check for ->d_flags, but the damage was more extensive -
Nick had missed the restart-on-killed logic hidden in the check for a
changed ->d_parent and assumed that it's all about renames, meaning that
once rename_lock has been taken exclusive we won't have restarts at all.
With restart-on-killed restored that wasn't true anymore, invalidating the
assumption that we only get to rename_retry without rename_lock held
exclusive.  With deadlocks happening if we ever get there on such a pass.

We used to have several copies of that logic, before it all got
consolidated into d_walk(), which probably contributed to the mess...

BTW, the first part of the condition (->d_parent hasn't changed since
before the "unlock child, lock parent" sequence) is really pointless these
days - the check for rename_lock will fail if ->d_parent has changed...

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Al Viro @ 2014-10-26 21:57 UTC
To: Linus Torvalds
Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones, Eric W. Biederman

On Sun, Oct 26, 2014 at 07:13:32PM +0000, Al Viro wrote:
> The comment is not correct.  dentry_kill() won't screw the pointer to the
> parent; it will, however, screw the pointer to the next sibling.
> [...]

Actually, it's even worse than just list_del() possibly screwing .next -
that could be worked around by use of __list_del() (and skipping entries
until we come to one that isn't marked DCACHE_DENTRY_KILLED).  Note that
d_child shares memory with d_rcu, and _that_ is really nasty - if
__dentry_kill() has progressed to dentry_free(), we have
->d_u.d_child.next hopelessly trashed.  OTOH, we could make d_rcu share
memory with d_alias instead.  Hrm...

OK, then we'd have

	rcu_read_lock();
ascend:
	if (this_parent != parent) {
		struct dentry *child = this_parent;
		this_parent = child->d_parent;

		spin_unlock(&child->d_lock);
		spin_lock(&this_parent->d_lock);

		if (need_seqretry(&rename_lock, seq)) {
			spin_unlock(&this_parent->d_lock);
			rcu_read_unlock();
			goto rename_retry;
		}
		next = child->d_child.next;
		while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED)) {
			if (next == &this_parent->d_subdirs)
				goto ascend;
			child = list_entry(next, struct dentry, d_child);
			next = next->next;
		}
		rcu_read_unlock();
		goto resume;
	}
	rcu_read_unlock();
	if (need_seqretry(&rename_lock, seq)) {
		spin_unlock(&this_parent->d_lock);
		goto rename_retry;
	}

in d_walk(), __list_del() instead of list_del() in __dentry_kill(),
d_u.d_child turning into d_child everywhere, while d_alias turns into
d_u.d_alias...

It looks like that way we would get no retries on the second pass.
Comments?

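[Editorial note: the key trick above - __list_del() unlinks a node from its
neighbours but leaves the node's own next/prev pointers intact, so a walker
that cached a pointer to it can still step forward past it - can be shown
with a minimal list model.  The structs and helpers below are my simplified
stand-ins, not fs/dcache.c.]

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy of the "skip killed siblings instead of restarting" idea. */

struct node {
        struct node *next, *prev;
        bool killed;            /* stands in for DCACHE_DENTRY_KILLED */
};

/* like the kernel's __list_del(): unlink without poisoning the entry,
 * so the removed node still points into the list */
static void __list_del_toy(struct node *n)
{
        n->prev->next = n->next;
        n->next->prev = n->prev;
}

/* advance past killed siblings; returning head means "ascend" */
static struct node *next_live(struct node *head, struct node *n)
{
        while (n != head && n->killed)
                n = n->next;
        return n;
}
```

A walker that saved a pointer to a sibling just before it was killed and
__list_del()'d can still reach the next live entry through the dead node,
which is exactly why no restart is needed on the locked pass.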
* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Linus Torvalds @ 2014-10-26 23:33 UTC
To: Al Viro
Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones, Eric W. Biederman

On Sun, Oct 26, 2014 at 2:57 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> .. snip ..
> in d_walk(), __list_del() instead of list_del() in __dentry_kill(),
> d_u.d_child turning into d_child everywhere, while d_alias turns into
> d_u.d_alias...
>
> It looks like that way we would get no retries on the second pass.
> Comments?

Since I missed the whole issue with d_child.next, I'm not sure any comments
from me would be helpful.

It does sound like trying to be more careful with d_child and using d_alias
instead is a good idea.  d_alias is only used under the dentry lock, and
not in any horribly subtle situations, right?  So that sounds like a good
change regardless of this particular fix - making the union happen in a
less nasty place.

		Linus

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Al Viro @ 2014-10-26 23:42 UTC
To: Linus Torvalds
Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones, Eric W. Biederman

On Sun, Oct 26, 2014 at 04:33:11PM -0700, Linus Torvalds wrote:
> It does sound like trying to be more careful with d_child and using
> d_alias instead is a good idea.  d_alias is only used under the dentry
> lock, and not in any horribly subtle situations, right?  So that sounds
> like a good change regardless of this particular fix - making the
> union happen in a less nasty place.

OK...  See vfs.git#experimental-d_walk - the first commit there moves d_rcu
from d_child to d_alias, the second is basically "skip the killed siblings
rather than restarting the tree walk".

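[Editorial note: the layout change agreed on above can be illustrated with
mock structs.  These are drastically simplified stand-ins (the real struct
dentry has many more fields); the point is only which member the RCU head
overlays in each layout.]

```c
#include <stddef.h>

/* Mock of the union move: before, d_rcu overlaid d_child, so reusing
 * d_rcu at free time trashed the sibling linkage a walker might still
 * follow; after, d_rcu overlays d_alias, which is dead once the dentry
 * is killed, while d_child survives until the dentry is freed. */

struct list_head_toy { struct list_head_toy *next, *prev; };
struct rcu_head_toy {
        struct rcu_head_toy *next;
        void (*func)(struct rcu_head_toy *);
};

struct dentry_old {                     /* layout before the change */
        union {
                struct list_head_toy d_child;   /* sibling linkage */
                struct rcu_head_toy d_rcu;      /* reused on free! */
        } d_u;
        struct list_head_toy d_alias;
};

struct dentry_new {                     /* layout after the change */
        struct list_head_toy d_child;   /* never overwritten early */
        union {
                struct list_head_toy d_alias;
                struct rcu_head_toy d_rcu;      /* now harmless */
        } d_u;
};
```

In the old layout the RCU head and the sibling links occupy the same bytes;
in the new one they cannot clash, which is what makes "skip the killed
siblings" safe to implement.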