linux-fsdevel.vger.kernel.org archive mirror
* fs: lockup on rename_mutex in fs/dcache.c:1035
From: Sasha Levin @ 2014-10-26  1:39 UTC
  To: Al Viro; +Cc: Eric W. Biederman, linux-fsdevel, LKML, Dave Jones

Hi all,

While fuzzing with trinity inside a KVM tools guest running the latest -next
kernel, I've stumbled on the following spew:

[ 6172.870045] =============================================
[ 6172.870045] [ INFO: possible recursive locking detected ]
[ 6172.870045] 3.18.0-rc1-next-20141023-sasha-00036-g4dcabd5 #1415 Not tainted
[ 6172.870045] ---------------------------------------------
[ 6172.870045] trinity-c55/12752 is trying to acquire lock:
[ 6172.870045] (rename_lock){+.+...}, at: d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)
[ 6172.870045]
[ 6172.870045] but task is already holding lock:
[ 6172.870045] (rename_lock){+.+...}, at: d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)
[ 6172.870045]
[ 6172.870045] other info that might help us debug this:
[ 6172.900904] FAULT_INJECTION: forcing a failure
[ 6172.870045]  Possible unsafe locking scenario:
[ 6172.870045]
[ 6172.870045]        CPU0
[ 6172.870045]        ----
[ 6172.870045]   lock(rename_lock);
[ 6172.870045]   lock(rename_lock);
[ 6172.870045]
[ 6172.870045]  *** DEADLOCK ***
[ 6172.870045]
[ 6172.870045]  May be due to missing lock nesting notation
[ 6172.870045]
[ 6172.870045] 1 lock held by trinity-c55/12752:
[ 6172.870045] #0: (rename_lock){+.+...}, at: d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)
[ 6172.870045]
[ 6172.870045] stack backtrace:
[ 6172.870045] CPU: 1 PID: 12752 Comm: trinity-c55 Not tainted 3.18.0-rc1-next-20141023-sasha-00036-g4dcabd5 #1415
[ 6172.870045]  ffff88070945b000 0000000000000000 ffffffffb6312fd0 ffff8806d69ff728
[ 6172.870045]  ffffffffa8ff75ca 0000000000000011 ffffffffb6312fd0 ffff8806d69ff828
[ 6172.870045]  ffffffff9f3a012b ffff880107fd76c0 ffff880107fd76c0 0000000000000001
[ 6172.870045] Call Trace:
[ 6172.870045] dump_stack (lib/dump_stack.c:52)
[ 6172.870045] validate_chain.isra.10 (kernel/locking/lockdep.c:2134)
[ 6172.870045] ? kvm_clock_read (./arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
[ 6172.870045] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 6172.870045] __lock_acquire (kernel/locking/lockdep.c:3184)
[ 6172.870045] lock_acquire (./arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3604)
[ 6172.870045] ? d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)
[ 6172.870045] ? put_lock_stats.isra.7 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[ 6172.870045] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
[ 6172.870045] ? d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)
[ 6172.870045] d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)
[ 6172.870045] ? d_walk (fs/dcache.c:1087)
[ 6172.870045] ? d_drop (fs/dcache.c:1336)
[ 6172.870045] ? select_collect (fs/dcache.c:1323)
[ 6172.870045] d_invalidate (fs/dcache.c:1381)
[ 6172.870045] lookup_fast (fs/namei.c:1440)
[ 6172.870045] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[ 6172.870045] ? lockref_put_or_lock (lib/lockref.c:135)
[ 6172.870045] ? dput (fs/dcache.c:626)
[ 6172.870045] walk_component (fs/namei.c:1545)
[ 6172.870045] ? security_inode_permission (security/security.c:573)
[ 6172.870045] ? __inode_permission (fs/namei.c:418)
[ 6172.870045] link_path_walk (fs/namei.c:1805)
[ 6172.870045] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 6172.870045] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2559 kernel/locking/lockdep.c:2601)
[ 6172.870045] ? getname_flags (fs/namei.c:145)
[ 6172.870045] path_lookupat (fs/namei.c:1956)
[ 6172.870045] ? kmem_cache_alloc (include/linux/rcupdate.h:479 include/trace/events/kmem.h:53 mm/slub.c:2463)
[ 6172.870045] ? getname_flags (fs/namei.c:145)
[ 6172.870045] filename_lookup (fs/namei.c:2000)
[ 6172.870045] user_path_at_empty (fs/namei.c:2151)
[ 6172.870045] ? preempt_count_sub (kernel/sched/core.c:2662)
[ 6172.870045] user_path_at (fs/namei.c:2162)
[ 6172.870045] vfs_fstatat (fs/stat.c:106)
[ 6172.870045] SYSC_newlstat (fs/stat.c:284)
[ 6172.870045] ? syscall_trace_enter_phase2 (arch/x86/kernel/ptrace.c:1598)
[ 6172.870045] ? tracesys_phase2 (arch/x86/kernel/entry_64.S:518)
[ 6172.870045] SyS_newlstat (fs/stat.c:277)
[ 6172.870045] tracesys_phase2 (arch/x86/kernel/entry_64.S:529)

The machine proceeded to actually lock up, so this isn't a false positive.


Thanks,
Sasha

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Al Viro @ 2014-10-26  2:56 UTC
  To: Sasha Levin; +Cc: Eric W. Biederman, linux-fsdevel, LKML, Dave Jones

On Sat, Oct 25, 2014 at 09:39:23PM -0400, Sasha Levin wrote:
> [ 6172.870045] trinity-c55/12752 is trying to acquire lock:
> [ 6172.870045] (rename_lock){+.+...}, at: d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)
> [ 6172.870045]
> [ 6172.870045] but task is already holding lock:
> [ 6172.870045] (rename_lock){+.+...}, at: d_walk (include/linux/spinlock.h:309 fs/dcache.c:1035)

Umm...  So we either have left d_walk() without dropping rename_lock, or
we have called d_walk() from something called from d_walk()?  And the
trace would seem to point towards the former...

Ouch.  For that to happen, we would need to
	* get to rename_retry with retry being true
	* after that get D_WALK_NORETRY from enter()
	* somehow get to rename_retry *again*

Moreover, we couldn't get there via
        if (need_seqretry(&rename_lock, seq)) {
                spin_unlock(&this_parent->d_lock);
                goto rename_retry;
- seq is 1 by that point, and need_seqretry() returns 0 in that case.  IOW,
it must have been this:
                /*
                 * might go back up the wrong parent if we have had a rename
                 * or deletion
                 */
                if (this_parent != child->d_parent ||
                         (child->d_flags & DCACHE_DENTRY_KILLED) ||
                         need_seqretry(&rename_lock, seq)) {
                        spin_unlock(&this_parent->d_lock);
                        rcu_read_unlock();
                        goto rename_retry;
                }
And we had been holding rename_lock in that walk, so d_move() should've
been excluded...  Which leaves us with
                         (child->d_flags & DCACHE_DENTRY_KILLED)
being true...  Hrm.  AFAICS, it *is* possible to hit that one - just have
the last reference to child dropped between
                spin_unlock(&child->d_lock);
and
                spin_lock(&this_parent->d_lock);
a few lines above.  And yes, if that happens we are in shit - rename_retry
will see retry being false and return, without noticing that on this pass
we had been holding rename_lock.  Easily fixed, fortunately - delta below
ought to take care of that...

Comments?  AFAICS, it's -stable fodder, the bug going all the way back to
at least 3.7...

diff --git a/fs/dcache.c b/fs/dcache.c
index 3ffef7f..65f4aff 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1114,12 +1114,13 @@ resume:
 
 out_unlock:
 	spin_unlock(&this_parent->d_lock);
+out:
 	done_seqretry(&rename_lock, seq);
 	return;
 
 rename_retry:
 	if (!retry)
-		return;
+		goto out;
 	seq = 1;
 	goto again;
 }
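
For reference, a sketch of the seqretry helpers this relies on, as they
look in include/linux/seqlock.h of that vintage (quoting from memory,
so treat the exact bodies as approximate):

	static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq)
	{
		if (!(*seq & 1))	/* even: lockless pass */
			*seq = read_seqbegin(lock);
		else			/* odd: take the lock exclusive */
			read_seqlock_excl(lock);
	}

	static inline int need_seqretry(seqlock_t *lock, int seq)
	{
		return !(seq & 1) && read_seqretry(lock, seq);
	}

	static inline void done_seqretry(seqlock_t *lock, int seq)
	{
		if (seq & 1)
			read_sequnlock_excl(lock);
	}

IOW, once seq is odd need_seqretry() always returns 0 and done_seqretry()
is the only thing that drops the exclusive lock - which is exactly what
the early return was skipping.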

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Eric W. Biederman @ 2014-10-26  3:01 UTC
  To: Sasha Levin; +Cc: Al Viro, linux-fsdevel, LKML, Dave Jones

Sasha Levin <sasha.levin@oracle.com> writes:

> Hi all,
>
> While fuzzing with trinity inside a KVM tools guest running the latest -next
> kernel, I've stumbled on the following spew:

Weird.  I took a quick look and I don't see any changes to d_walk in
Al's tree or in Linus's kernel for years.

Has read_seqbegin_or_lock changed somewhere?

From a quick reading of the code it simply isn't possible for d_walk to
take the lock twice short of memory corruption.  And the fact that the
code has not changed in years suggests it isn't the obvious
cause of d_walk taking the rename_lock twice.

Eric

> [lockdep splat snipped]

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Al Viro @ 2014-10-26  3:06 UTC
  To: Eric W. Biederman; +Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones

On Sat, Oct 25, 2014 at 08:01:41PM -0700, Eric W. Biederman wrote:
> Sasha Levin <sasha.levin@oracle.com> writes:
> 
> > Hi all,
> >
> > While fuzzing with trinity inside a KVM tools guest running the latest -next
> > kernel, I've stumbled on the following spew:
> 
> Weird.  I took a quick look and I don't see any changes to d_walk in
> Al's tree or in Linus's kernel for years.
> 
> Has read_seqbegin_or_lock changed somewhere?
> 
> From a quick reading of the code it simply isn't possible for d_walk to
> take the lock twice short of memory corruption.  And the fact that the
> code has not changed in years suggests it isn't the obvious
> cause of d_walk taking the rename_lock twice.

It is a fairly obvious case of d_walk() forgetting to drop rename_lock.
See upthread for analysis and (hopefully) a fix.

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Al Viro @ 2014-10-26  3:51 UTC
  To: Eric W. Biederman; +Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones

On Sun, Oct 26, 2014 at 03:06:08AM +0000, Al Viro wrote:
> > From a quick reading of the code it simply isn't possible for d_walk to
> > take the lock twice short of memory corruption.  And the fact that the
> > code has not changed in years suggests it isn't the obvious
> > cause of d_walk taking the rename_lock twice.
> 
> It is a fairly obvious case of d_walk() forgetting to drop rename_lock.
> See upthread for analysis and (hopefully) a fix.

... except that it's not a full fix.  If we get there that way with
retry being true, we will immediately deadlock at again:...

Which might very well have happened in this case - i.e. it might be
a single call of d_walk() taking the sucker twice that way.

Hmm...  Actually, the comment in there is simply wrong - if the child
got killed between unlocking the child and locking the parent, the
problem is not that we ascend to the wrong parent, it's that we have no
way to find the next sibling.

OK, so basically it came from Nick's "fs: dcache avoid starvation in dcache
multi-step operations", and the mistake was in the assumption that once we
hold rename_lock, nothing is going to need rename_retry.  Which isn't
true - dentry_kill() on the child while we are trying to get ->d_lock on
the parent requires a restart, and that isn't excluded by rename_lock at
all.
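
To make the failure mode concrete, the retry structure of d_walk() boils
down to something like this (a heavily stripped sketch, not verbatim
fs/dcache.c):

	static void d_walk(struct dentry *parent, ...)
	{
		unsigned seq = 0;
		bool retry = true;
	again:
		/* lockless if seq is even, rename_lock held excl if odd */
		read_seqbegin_or_lock(&rename_lock, &seq);
		...
		/* enter() returning D_WALK_NORETRY clears retry */
		...
	out_unlock:
		spin_unlock(&this_parent->d_lock);
		done_seqretry(&rename_lock, seq);
		return;

	rename_retry:
		if (!retry)
			return;		/* leaks rename_lock if seq is odd */
		seq = 1;
		goto again;		/* and this re-locks it if already held */
	}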

Well, brute-force fix would be this, but I wonder if it's going to
create livelocks...

diff --git a/fs/dcache.c b/fs/dcache.c
index 3ffef7f..e3d8499 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1118,6 +1118,7 @@ out_unlock:
 	return;
 
 rename_retry:
+	done_seqretry(&rename_lock, seq);
 	if (!retry)
 		return;
 	seq = 1;

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Al Viro @ 2014-10-26  3:57 UTC
  To: Linus Torvalds
  Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones, Eric W. Biederman

[context for Linus]

The fuzzer has triggered a deadlock in d_walk() with rename_lock taken twice.
AFAICS, the plausible scenario is
                         (child->d_flags & DCACHE_DENTRY_KILLED) ||
triggering while ascending to the parent, on the pass with rename_lock already
held exclusive.  In that case we go to rename_retry and either return without
unlocking rename_lock, or try to take it exclusive one more time, again
without unlocking it first.

> Hmm...  Actually, the comment in there is simply wrong - if the child
> got killed between unlocking the child and locking the parent, the
> problem is not that we ascend to the wrong parent, it's that we have no
> way to find the next sibling.
> 
> OK, so basically it came from Nick's "fs: dcache avoid starvation in dcache
> multi-step operations", and the mistake was in the assumption that once we
> hold rename_lock, nothing is going to need rename_retry.  Which isn't
> true - dentry_kill() on the child while we are trying to get ->d_lock on
> the parent requires a restart, and that isn't excluded by rename_lock at
> all.
> 
> Well, brute-force fix would be this, but I wonder if it's going to
> create livelocks...
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 3ffef7f..e3d8499 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1118,6 +1118,7 @@ out_unlock:
>  	return;
>  
>  rename_retry:
> +	done_seqretry(&rename_lock, seq);
>  	if (!retry)
>  		return;
>  	seq = 1;

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Linus Torvalds @ 2014-10-26 18:56 UTC
  To: Al Viro; +Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones, Eric W. Biederman

On Sat, Oct 25, 2014 at 8:57 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> [context for Linus]
>
> The fuzzer has triggered a deadlock in d_walk() with rename_lock taken twice.
> AFAICS, the plausible scenario is
>                          (child->d_flags & DCACHE_DENTRY_KILLED) ||
> triggering while ascending to the parent, on the pass with rename_lock already
> held exclusive.  In that case we go to rename_retry and either return without
> unlocking rename_lock, or try to take it exclusive one more time, again
> without unlocking it first.

Your patch looks fine, and I don't think we can livelock - because we
always set 'seq' to 1 if we retry, and that causes us to get the
exclusive lock, so we'd better not then enter the retry loop again.
Although I guess not only renames cause it - looks like we get to that
mis-named "rename_retry" for a deletion too according to the comment.
Although I'm not sure that comment is correct - I don't think we
change d_parent for deletes, do we?

So at a quick glance, the one-liner patch looks fine. That said, I
really detest how that code is written. Especially if the rename_retry
really can only happen once, I get the feeling that we would be *much*
better off doing this explicitly with a wrapper function, and do
something like

   void d_walk(...)
   {
       /* Try non-exclusive first */
       seq = read_seqbegin(lock);
       retry = __d_walk(.. seq);
       if (retry) {
              read_seqlock_excl(lock);
              // lock->seqcount is now guaranteed stable
              retry = __d_walk(.. lock->seqcount);
              read_sequnlock_excl(lock);
              WARN_ONCE(retry, "Really?");
       }
   }

or whatever. But maybe I'm missing some reason why the above is just
crazy, and I'm a moron. Entirely possible.

                     Linus

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Al Viro @ 2014-10-26 19:13 UTC
  To: Linus Torvalds
  Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones, Eric W. Biederman

On Sun, Oct 26, 2014 at 11:56:08AM -0700, Linus Torvalds wrote:
> On Sat, Oct 25, 2014 at 8:57 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> Your patch looks fine, and I don't think we can livelock - because we
> always set 'seq' to 1 if we retry, and that causes us to get the
> exclusive lock, so we'd better not then enter the retry loop again.
> Although I guess not only renames cause it - looks like we get to that
> mis-named "rename_retry" for a deletion too according to the comment.
> Although I'm not sure that comment is correct - I don't think we
> change d_parent for deletes, do we?

The comment is not correct.  dentry_kill() won't screw the pointer to
the parent; it will, however, screw the pointer to the next sibling.

It used to screw the pointer to the parent (which is what the first part
of the condition was about).  After Nick's series back in January 2011
that stopped being true.  However, dentry_kill() does
list_del(&dentry->d_u.d_child), which means that we can't continue
past that point if it has happened.  Trond noticed the breakage
a bit later and added an explicit check for ->d_flags, but the damage
was more extensive - Nick had missed the restart-on-killed logic
hidden in the check for a changed ->d_parent and assumed that it's
all about renames, meaning that once rename_lock has been taken exclusive
we won't have restarts at all.  With restart-on-killed restored that
wasn't true anymore, invalidating the assumption that we only get to
rename_retry without rename_lock held exclusive - with deadlocks happening
if we ever get there on such a pass.

We used to have several copies of that logic, before it all got consolidated
into d_walk(), which probably contributed to the mess...

BTW, the first part of the condition (->d_parent hasn't changed since before
the "unlock child, lock parent" sequence) is really pointless these days -
the check for rename_lock will fail if ->d_parent has changed...

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Al Viro @ 2014-10-26 21:57 UTC
  To: Linus Torvalds
  Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones, Eric W. Biederman

On Sun, Oct 26, 2014 at 07:13:32PM +0000, Al Viro wrote:
> The comment is not correct.  dentry_kill() won't screw the pointer to
> the parent; it will, however, screw the pointer to the next sibling.
> 
> It used to screw the pointer to the parent (which is what the first part
> of the condition was about).  After Nick's series back in January 2011
> that stopped being true.  However, dentry_kill() does
> list_del(&dentry->d_u.d_child), which means that we can't continue
> past that point if it has happened.  Trond noticed the breakage
> a bit later and added an explicit check for ->d_flags, but the damage
> was more extensive - Nick had missed the restart-on-killed logic
> hidden in the check for a changed ->d_parent and assumed that it's
> all about renames, meaning that once rename_lock has been taken exclusive
> we won't have restarts at all.  With restart-on-killed restored that
> wasn't true anymore, invalidating the assumption that we only get to
> rename_retry without rename_lock held exclusive - with deadlocks happening
> if we ever get there on such a pass.

Actually, it's even worse than just list_del() possibly screwing .next -
that could be worked around by use of __list_del() (and skipping the killed
entries until we come to one that isn't marked DCACHE_DENTRY_KILLED).  Note
that d_child shares memory with d_rcu, and _that_ is really nasty - if
__dentry_kill() has progressed to dentry_free(), we have ->d_u.d_child.next
hopelessly trashed.
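
For anyone without the source in front of them, the relevant tail of
struct dentry currently looks roughly like this (field order and comments
approximate, from memory):

	struct dentry {
		...
		union {
			struct list_head d_child;	/* child of parent list */
			struct rcu_head d_rcu;		/* reused once the dentry is dead */
		} d_u;
		struct list_head d_subdirs;		/* our children */
		struct hlist_node d_alias;		/* inode alias list */
		...
	};

Once dentry_free() has handed the thing to RCU, d_rcu has overwritten
exactly the d_child linkage a lagging walker would want to follow.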

OTOH, we could make d_rcu share memory with d_alias instead.  Hrm...
OK, then we'd have
	rcu_read_lock();
ascend:
	if (this_parent != parent) {
		struct dentry *child = this_parent;
		this_parent = child->d_parent;

		spin_unlock(&child->d_lock);
		spin_lock(&this_parent->d_lock);

		if (need_seqretry(&rename_lock, seq)) {
			spin_unlock(&this_parent->d_lock);
			rcu_read_unlock();
			goto rename_retry;
		}
		/* step to the next sibling, skipping killed entries */
		next = child->d_child.next;
		while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED)) {
			if (next == &this_parent->d_subdirs)
				goto ascend;
			child = list_entry(next, struct dentry, d_child);
			next = next->next;
		}
		rcu_read_unlock();
		goto resume;
	}
	rcu_read_unlock();
	if (need_seqretry(&rename_lock, seq)) {
		spin_unlock(&this_parent->d_lock);
		goto rename_retry;
	}
in d_walk(), __list_del() instead of list_del() in __dentry_kill(), d_u.d_child
turning into d_child everywhere, while d_alias turns into d_u.d_alias...

It looks like that way we would get no retries on the second pass.  Comments?
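
(For reference, the list_del()/__list_del() distinction this hinges on,
as in include/linux/list.h - quoting from memory, so treat the exact
bodies as approximate:

	static inline void __list_del(struct list_head *prev,
				      struct list_head *next)
	{
		next->prev = prev;
		prev->next = next;
	}

	static inline void list_del(struct list_head *entry)
	{
		__list_del(entry->prev, entry->next);
		entry->next = LIST_POISON1;
		entry->prev = LIST_POISON2;
	}

i.e. plain list_del() poisons the removed entry's own pointers, while
__list_del() only relinks the neighbours, leaving the dead entry's ->next
intact for the skip loop above to follow.)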

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Linus Torvalds @ 2014-10-26 23:33 UTC
  To: Al Viro; +Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones, Eric W. Biederman

On Sun, Oct 26, 2014 at 2:57 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>  .. snip ..
> in d_walk(), __list_del() instead of list_del() in __dentry_kill(), d_u.d_child
> turning into d_child everywhere, while d_alias turns into d_u.d_alias...
>
> It looks like that way we would get no retries on the second pass.  Comments?

Since I missed the whole issue with d_child.next, I'm not sure any
comments from me would be helpful.

It does sound like trying to be more careful with d_child and using
d_alias instead is a good idea. d_alias is only used under the dentry
lock, and not in any horribly subtle situations, right? So that sounds
like a good change regardless of this particular fix - making the
union happen in a less nasty place...

                     Linus

* Re: fs: lockup on rename_mutex in fs/dcache.c:1035
From: Al Viro @ 2014-10-26 23:42 UTC
  To: Linus Torvalds
  Cc: Sasha Levin, linux-fsdevel, LKML, Dave Jones, Eric W. Biederman

On Sun, Oct 26, 2014 at 04:33:11PM -0700, Linus Torvalds wrote:
> On Sun, Oct 26, 2014 at 2:57 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >  .. snip ..
> > in d_walk(), __list_del() instead of list_del() in __dentry_kill(), d_u.d_child
> > turning into d_child everywhere, while d_alias turns into d_u.d_alias...
> >
> > It looks like that way we would get no retries on the second pass.  Comments?
> 
> Since I missed the whole issue with d_child.next, I'm not sure any
> comments from me would be helpful.
> 
> It does sound like trying to be more careful with d_child and using
> d_alias instead is a good idea. d_alias is only used under the dentry
> lock, and not in any horribly subtle situations, right? So that sounds
> like a good change regardless of this particular fix - making the
> union happen in a less nasty place..

OK...  See vfs.git#experimental-d_walk - the first commit there moves
d_rcu from d_child to d_alias, the second is basically "skip the killed
siblings rather than restarting the tree walk".
