* vfs-scale, chroot
@ 2011-01-12 19:47 J. R. Okajima
From: J. R. Okajima @ 2011-01-12 19:47 UTC
To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

Hello Nick,

I've got a crash around d_lock.

  # mount -t nfs host:/dir /nfs
  # chroot /nfs

BUG: spinlock recursion on CPU#1, chroot/2524
lock: ffff88001d106880, .magic: dead4ead, .owner: chroot/2524, .owner_cpu: 1
Call Trace:
 [<ffffffff8128ca52>] ? spin_bug+0xa2/0xf0
 [<ffffffff8128cd53>] ? do_raw_spin_lock+0x193/0x1b0
 [<ffffffff814eaa5e>] ? _raw_spin_lock_nested+0x4e/0x60
 [<ffffffff81137aef>] ? nameidata_dentry_drop_rcu+0xcf/0x1b0
 [<ffffffff814eaab3>] ? _raw_spin_lock+0x43/0x50
 [<ffffffff81137aef>] ? nameidata_dentry_drop_rcu+0xcf/0x1b0
 [<ffffffff81137c1b>] ? d_revalidate+0x4b/0x70
 [<ffffffff81139a75>] ? link_path_walk+0x655/0x1210
 [<ffffffff81138702>] ? path_init_rcu+0x1c2/0x370
 [<ffffffff811387e5>] ? path_init_rcu+0x2a5/0x370
 [<ffffffff81138702>] ? path_init_rcu+0x1c2/0x370
 [<ffffffff811059b3>] ? might_fault+0x53/0xb0
 [<ffffffff8113a9fe>] ? do_path_lookup+0x8e/0x1d0
 [<ffffffff8113b936>] ? user_path_at+0xa6/0xe0
 [<ffffffff8114c167>] ? vfsmount_lock_local_unlock+0x77/0x90
 [<ffffffff814ebab9>] ? retint_swapgs+0x13/0x1b
 [<ffffffff81090b15>] ? trace_hardirqs_on_caller+0x145/0x190
 [<ffffffff8112841e>] ? sys_chdir+0x2e/0x90
 [<ffffffff8100bfd2>] ? system_call_fastpath+0x16/0x1b

It looks like nameidata_dentry_drop_rcu() tries to spin_lock() the same
dentry twice when parent == dentry:

- NFS ->d_revalidate() returns -ECHILD for LOOKUP_RCU.
- The VFS d_revalidate() will try ->d_revalidate() again after dropping
  LOOKUP_RCU via nameidata_dentry_drop_rcu().
- nameidata_dentry_drop_rcu() calls
	spin_lock(&parent->d_lock);
	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
- It may happen on any fs which specifies FS_REVAL_DOT.

If we had a pair of functions like the ones below, they might be useful.
But are there really many cases like this problem? If there are not
many, then the fix would be to add a few "if (!IS_ROOT(dentry))" tests
to nameidata_dentry_drop_rcu(), I think.

int d_lock_parent_child(parent, dentry)
{
	err = Success;
	spin_lock(&parent->d_lock);
	if (!IS_ROOT(dentry)) {
		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
		if (unlikely(parent != dentry->d_parent)) {
			spin_unlock(&dentry->d_lock);
			spin_unlock(&parent->d_lock);
			err = Error_Unmatch;
		}
	} else
		err = Success_Root;
	return err;
}

void d_unlock_parent_child(stat, parent, dentry)
{
	Assert(stat != Error_Unmatch);
	if (stat == Success)
		spin_unlock(&dentry->d_lock);
	spin_unlock(&parent->d_lock);
}

J. R. Okajima
* Re: vfs-scale, chroot
@ 2011-01-12 20:04 ` Trond Myklebust
From: Trond Myklebust @ 2011-01-12 20:04 UTC
To: J. R. Okajima; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Thu, 2011-01-13 at 04:47 +0900, J. R. Okajima wrote:
> Hello Nick,
>
> I've got a crash around d_lock.
>
>   # mount -t nfs host:/dir /nfs
>   # chroot /nfs
>
> BUG: spinlock recursion on CPU#1, chroot/2524
> lock: ffff88001d106880, .magic: dead4ead, .owner: chroot/2524, .owner_cpu: 1
> Call Trace:
>  [<ffffffff8128ca52>] ? spin_bug+0xa2/0xf0
>  [<ffffffff8128cd53>] ? do_raw_spin_lock+0x193/0x1b0
>  [<ffffffff814eaa5e>] ? _raw_spin_lock_nested+0x4e/0x60
>  [<ffffffff81137aef>] ? nameidata_dentry_drop_rcu+0xcf/0x1b0
>  [<ffffffff814eaab3>] ? _raw_spin_lock+0x43/0x50
>  [<ffffffff81137aef>] ? nameidata_dentry_drop_rcu+0xcf/0x1b0
>  [<ffffffff81137c1b>] ? d_revalidate+0x4b/0x70
>  [<ffffffff81139a75>] ? link_path_walk+0x655/0x1210
>  [<ffffffff81138702>] ? path_init_rcu+0x1c2/0x370
>  [<ffffffff811387e5>] ? path_init_rcu+0x2a5/0x370
>  [<ffffffff81138702>] ? path_init_rcu+0x1c2/0x370
>  [<ffffffff811059b3>] ? might_fault+0x53/0xb0
>  [<ffffffff8113a9fe>] ? do_path_lookup+0x8e/0x1d0
>  [<ffffffff8113b936>] ? user_path_at+0xa6/0xe0
>  [<ffffffff8114c167>] ? vfsmount_lock_local_unlock+0x77/0x90
>  [<ffffffff814ebab9>] ? retint_swapgs+0x13/0x1b
>  [<ffffffff81090b15>] ? trace_hardirqs_on_caller+0x145/0x190
>  [<ffffffff8112841e>] ? sys_chdir+0x2e/0x90
>  [<ffffffff8100bfd2>] ? system_call_fastpath+0x16/0x1b
>
> It looks like nameidata_dentry_drop_rcu() tries to spin_lock() the same
> dentry twice when parent == dentry:
>
> - NFS ->d_revalidate() returns -ECHILD for LOOKUP_RCU.
> - The VFS d_revalidate() will try ->d_revalidate() again after dropping
>   LOOKUP_RCU via nameidata_dentry_drop_rcu().
> - nameidata_dentry_drop_rcu() calls
> 	spin_lock(&parent->d_lock);
> 	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> - It may happen on any fs which specifies FS_REVAL_DOT.
>
> If we had a pair of functions like the ones below, they might be useful.
> But are there really many cases like this problem? If there are not
> many, then the fix would be to add a few "if (!IS_ROOT(dentry))" tests
> to nameidata_dentry_drop_rcu(), I think.
>
> int d_lock_parent_child(parent, dentry)
> {
> 	err = Success;
> 	spin_lock(&parent->d_lock);
> 	if (!IS_ROOT(dentry)) {
> 		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> 		if (unlikely(parent != dentry->d_parent)) {
> 			spin_unlock(&dentry->d_lock);
> 			spin_unlock(&parent->d_lock);
> 			err = Error_Unmatch;
> 		}
> 	} else
> 		err = Success_Root;
> 	return err;
> }
>
> void d_unlock_parent_child(stat, parent, dentry)
> {
> 	Assert(stat != Error_Unmatch);
> 	if (stat == Success)
> 		spin_unlock(&dentry->d_lock);
> 	spin_unlock(&parent->d_lock);
> }

BTW, Nick: Given that some filesystems such as NFS are _always_ going to
reject LOOKUP_RCU, it would appear to be completely out of place to use
the 'unlikely()' keyword when testing the results of path_walk_rcu() and
friends. In particular, when the kernel is running with nfsroot, we're
saying that 100% of all cases are 'unlikely'...

Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
* Re: vfs-scale, chroot
@ 2011-01-13 1:25 ` Nick Piggin
From: Nick Piggin @ 2011-01-13 1:25 UTC
To: Trond Myklebust; +Cc: J. R. Okajima, linux-fsdevel, linux-kernel

On Thu, Jan 13, 2011 at 7:04 AM, Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:

> BTW, Nick: Given that some filesystems such as NFS are _always_ going to
> reject LOOKUP_RCU,

That's not very optimistic of you... why is that a given, may I ask?
Or do you just mean as-in the current code?

> it would appear to be completely out of place to use
> the 'unlikely()' keyword when testing the results of path_walk_rcu() and
> friends. In particular when the kernel is running with nfsroot, we're
> saying that 100% of all cases are 'unlikely'...

Well, that _is_ an accepted use of branch annotations. For example, it
is used when scheduling realtime tasks, because even if some systems
will do 99.9% of their scheduling on realtime tasks, it is not the
common case.

I'm not saying that applies here, but: if path-walk performance is
important, then we should use local caching and rcu-walk. If not, then
why do we care about a slightly slower branch?

The annotations really help to reduce the icache penalty of added
complexity, which is why I like them, but I'm happy to remove them where
they don't make sense, of course.

Thanks,
Nick
* Re: vfs-scale, chroot
@ 2011-01-13 3:55 ` Trond Myklebust
From: Trond Myklebust @ 2011-01-13 3:55 UTC
To: Nick Piggin; +Cc: J. R. Okajima, linux-fsdevel, linux-kernel

On Thu, 2011-01-13 at 12:25 +1100, Nick Piggin wrote:
> On Thu, Jan 13, 2011 at 7:04 AM, Trond Myklebust
> <Trond.Myklebust@netapp.com> wrote:
>
> > BTW, Nick: Given that some filesystems such as NFS are _always_ going to
> > reject LOOKUP_RCU,
>
> That's not very optimistic of you... why is that a given, may I ask?
> Or do you just mean as-in the current code?

In the current code, the -ECHILD is unconditional, so adding the
unlikely() is clearly pre-empting the reality on the ground.

In the longer run, we can convert the NFS d_revalidate() to accept
LOOKUP_RCU in the cases where we believe we can trust the cache without
having to issue an RPC call. That will mainly be when we hold an NFSv4
delegation, or when we believe the parent directory's mtime has not
changed. However, that is not something I'm going to have the cycles to
do in this merge window, and I assume there will be other filesystems
whose maintainers have similar constraints.

However, even if we were to make these changes, we're still not going to
be accepting LOOKUP_RCU in the 99.999% of cases that is usually a
requirement for unlikely() to be an acceptable optimisation. For a
number of common workloads, directories change all the time, and in NFS
that inevitably means we need to revalidate their contents with an RPC
call (unless we happen to hold an NFSv4 delegation).

> > it would appear to be completely out of place to use
> > the 'unlikely()' keyword when testing the results of path_walk_rcu() and
> > friends. In particular when the kernel is running with nfsroot, we're
> > saying that 100% of all cases are 'unlikely'...
>
> Well, that _is_ an accepted use of branch annotations. For example, it
> is used when scheduling realtime tasks, because even if some systems
> will do 99.9% of their scheduling on realtime tasks, it is not the
> common case.
>
> I'm not saying that applies here, but: if path-walk performance is
> important, then we should use local caching and rcu-walk. If not, then
> why do we care about a slightly slower branch?
>
> The annotations really help to reduce the icache penalty of added
> complexity, which is why I like them, but I'm happy to remove them
> where they don't make sense, of course.

You are adding overhead to existing common paths by doubling the calls
to d_revalidate() in ways which, afaics, will break branch prediction
(first setting, then clearing LOOKUP_RCU). You can at least try not to
maximise that impact by adding further branch-prediction impediments.

Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
* Re: vfs-scale, chroot
@ 2011-01-13 4:27 ` Nick Piggin
From: Nick Piggin @ 2011-01-13 4:27 UTC
To: Trond Myklebust; +Cc: J. R. Okajima, linux-fsdevel, linux-kernel

On Thu, Jan 13, 2011 at 2:55 PM, Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> On Thu, 2011-01-13 at 12:25 +1100, Nick Piggin wrote:
>> On Thu, Jan 13, 2011 at 7:04 AM, Trond Myklebust
>> <Trond.Myklebust@netapp.com> wrote:
>>
>> > BTW, Nick: Given that some filesystems such as NFS are _always_ going to
>> > reject LOOKUP_RCU,
>>
>> That's not very optimistic of you... why is that a given, may I ask?
>> Or do you just mean as-in the current code?
>
> In the current code, the -ECHILD is unconditional, so adding the
> unlikely() is clearly pre-empting the reality on the ground.

Sure, I just hoped you didn't mean NFS cannot be converted to use it in
future. NFS in particular is one of the more "difficult" implementations
I would like to see able to use rcu-walk.

> In the longer run, we can convert the NFS d_revalidate() to accept
> LOOKUP_RCU in the cases where we believe we can trust the cache without
> having to issue an RPC call. That will mainly be when we hold an NFSv4
> delegation, or when we believe the parent directory's mtime has not
> changed. However, that is not something I'm going to have the cycles to
> do in this merge window, and I assume there will be other filesystems
> whose maintainers have similar constraints.

Sure. And in the cases where a lookup cannot be satisfied from RAM, the
filesystem simply should not be deployed for applications where
path-lookup performance (in terms of CPU cycles) is critical. It's a
nice self-confirming loop for the rcu-walk performance tradeoff.

> However, even if we were to make these changes, we're still not going to
> be accepting LOOKUP_RCU in the 99.999% of cases that is usually a
> requirement for unlikely() to be an acceptable optimisation. For a
> number of common workloads, directories change all the time, and in NFS
> that inevitably means we need to revalidate their contents with an RPC
> call (unless we happen to hold an NFSv4 delegation).

99.999%? No. One out of every 100,000 operations? That would mean that
if a correct branch annotation saves 0.1 cycles (being pessimistic),
then an incorrect one would have to cost 10,000 cycles more than if the
annotation had not existed (on the order of 100 cache misses or 500
branch mispredicts) before it became a net loss. That is totally out of
proportion.

It really depends on a lot of things. Branch annotation tends not to
defeat dynamic branch-prediction hardware; rather, it provides a default
so that the hardware falls back to the more common path when the branch
is not in its history. The main impact is moving the uncommon case out
of the icache, at the cost of a bit of extra code and an extra
unconditional branch when the hint is wrong. I would say the useful
range is more like 90%-99% (i.e. on the order of 0.1s of cycles of
improvement in the correct case, and 1s or 10s of cycles added in the
incorrect case).

>> The annotations really help to reduce the icache penalty of added
>> complexity, which is why I like them, but I'm happy to remove them
>> where they don't make sense, of course.
>
> You are adding overhead to existing common paths by doubling the calls
> to d_revalidate() in ways which, afaics, will break branch prediction
> (first setting, then clearing LOOKUP_RCU).

Well, it should be set once and then clear for the rest of the path
walk, so only a dumb bimodal predictor would break. But without going
into details, I'm not denying that path walking is suboptimal for
rcu-walk-incapable filesystems.

> You can at least try not to
> maximise that impact by adding further branch-prediction impediments.

There is simply more code, more branches, more loads, etc. for
unconverted filesystems as well. The goal was to *first* minimise the
cost of rcu-walk, and *then* minimise the cost of ref-walk.

So if we're looking at micro-optimising performance for specific
filesystems and re-tuning some of my assumptions based on real-world
cases, I would like to wait for filesystems to be converted. Yes, that
leaves 2.6.38 a tiny bit slower for unconverted filesystems, but it
really shouldn't be a big deal. If you desperately care about a few
cycles here, then (I applaud you, but) convert to rcu-walk and it will
blow your mind. I am not abandoning your plight :)

The fact is that I did spend a long time looking at profiles and
assembly for the rcu-walk-converted cases, and only on a very particular
set of workloads. So I am really keen to re-tune some of those
assumptions as we see how things get used in the real world; I just
don't know whether it is worth redoing before fs developers get a bit of
time to explore how rcu-walk might work.
* Re: vfs-scale, chroot
@ 2011-01-13 1:06 ` Nick Piggin
From: Nick Piggin @ 2011-01-13 1:06 UTC
To: J. R. Okajima; +Cc: linux-fsdevel, linux-kernel

On Thu, Jan 13, 2011 at 6:47 AM, J. R. Okajima <hooanon05@yahoo.co.jp> wrote:
>
> Hello Nick,
>
> I've got a crash around d_lock.
>
>   # mount -t nfs host:/dir /nfs
>   # chroot /nfs
>
> BUG: spinlock recursion on CPU#1, chroot/2524
> lock: ffff88001d106880, .magic: dead4ead, .owner: chroot/2524, .owner_cpu: 1
> Call Trace:
>  [<ffffffff8128ca52>] ? spin_bug+0xa2/0xf0
>  [<ffffffff8128cd53>] ? do_raw_spin_lock+0x193/0x1b0
>  [<ffffffff814eaa5e>] ? _raw_spin_lock_nested+0x4e/0x60
>  [<ffffffff81137aef>] ? nameidata_dentry_drop_rcu+0xcf/0x1b0
>  [<ffffffff814eaab3>] ? _raw_spin_lock+0x43/0x50
>  [<ffffffff81137aef>] ? nameidata_dentry_drop_rcu+0xcf/0x1b0
>  [<ffffffff81137c1b>] ? d_revalidate+0x4b/0x70
>  [<ffffffff81139a75>] ? link_path_walk+0x655/0x1210
>  [<ffffffff81138702>] ? path_init_rcu+0x1c2/0x370
>  [<ffffffff811387e5>] ? path_init_rcu+0x2a5/0x370
>  [<ffffffff81138702>] ? path_init_rcu+0x1c2/0x370
>  [<ffffffff811059b3>] ? might_fault+0x53/0xb0
>  [<ffffffff8113a9fe>] ? do_path_lookup+0x8e/0x1d0
>  [<ffffffff8113b936>] ? user_path_at+0xa6/0xe0
>  [<ffffffff8114c167>] ? vfsmount_lock_local_unlock+0x77/0x90
>  [<ffffffff814ebab9>] ? retint_swapgs+0x13/0x1b
>  [<ffffffff81090b15>] ? trace_hardirqs_on_caller+0x145/0x190
>  [<ffffffff8112841e>] ? sys_chdir+0x2e/0x90
>  [<ffffffff8100bfd2>] ? system_call_fastpath+0x16/0x1b
>
> It looks like nameidata_dentry_drop_rcu() tries to spin_lock() the same
> dentry twice when parent == dentry:
>
> - NFS ->d_revalidate() returns -ECHILD for LOOKUP_RCU.
> - The VFS d_revalidate() will try ->d_revalidate() again after dropping
>   LOOKUP_RCU via nameidata_dentry_drop_rcu().
> - nameidata_dentry_drop_rcu() calls
> 	spin_lock(&parent->d_lock);
> 	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> - It may happen on any fs which specifies FS_REVAL_DOT.
>
> If we had a pair of functions like the ones below, they might be useful.
> But are there really many cases like this problem? If there are not
> many, then the fix would be to add a few "if (!IS_ROOT(dentry))" tests
> to nameidata_dentry_drop_rcu(), I think.

Hey, thanks for the report and the excellent debugging work!

I agree with your analysis. I'll prepare a patch for it shortly.

Thanks,
Nick