* vfs-scale, chroot
@ 2011-01-12 19:47 J. R. Okajima
From: J. R. Okajima @ 2011-01-12 19:47 UTC
To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

Hello Nick,

I've got a crash around d_lock.

  # mount -t nfs host:/dir /nfs
  # chroot /nfs

BUG: spinlock recursion on CPU#1, chroot/2524
lock: ffff88001d106880, .magic: dead4ead, .owner: chroot/2524, .owner_cpu: 1
Call Trace:
 [<ffffffff8128ca52>] ? spin_bug+0xa2/0xf0
 [<ffffffff8128cd53>] ? do_raw_spin_lock+0x193/0x1b0
 [<ffffffff814eaa5e>] ? _raw_spin_lock_nested+0x4e/0x60
 [<ffffffff81137aef>] ? nameidata_dentry_drop_rcu+0xcf/0x1b0
 [<ffffffff814eaab3>] ? _raw_spin_lock+0x43/0x50
 [<ffffffff81137aef>] ? nameidata_dentry_drop_rcu+0xcf/0x1b0
 [<ffffffff81137c1b>] ? d_revalidate+0x4b/0x70
 [<ffffffff81139a75>] ? link_path_walk+0x655/0x1210
 [<ffffffff81138702>] ? path_init_rcu+0x1c2/0x370
 [<ffffffff811387e5>] ? path_init_rcu+0x2a5/0x370
 [<ffffffff81138702>] ? path_init_rcu+0x1c2/0x370
 [<ffffffff811059b3>] ? might_fault+0x53/0xb0
 [<ffffffff8113a9fe>] ? do_path_lookup+0x8e/0x1d0
 [<ffffffff8113b936>] ? user_path_at+0xa6/0xe0
 [<ffffffff8114c167>] ? vfsmount_lock_local_unlock+0x77/0x90
 [<ffffffff814ebab9>] ? retint_swapgs+0x13/0x1b
 [<ffffffff81090b15>] ? trace_hardirqs_on_caller+0x145/0x190
 [<ffffffff8112841e>] ? sys_chdir+0x2e/0x90
 [<ffffffff8100bfd2>] ? system_call_fastpath+0x16/0x1b

It looks like nameidata_dentry_drop_rcu() tries to spin_lock() the same
dentry twice when parent == dentry:

- NFS ->d_revalidate() returns -ECHILD for LOOKUP_RCU.
- The VFS d_revalidate() will try ->d_revalidate() again after dropping
  LOOKUP_RCU via nameidata_dentry_drop_rcu().
- nameidata_dentry_drop_rcu() calls
	spin_lock(&parent->d_lock);
	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
- It may happen on any fs which specifies FS_REVAL_DOT.

If we had a pair of functions like the ones below, they might be useful.
But are there really many cases like this problem? If there are not
many, then the fix would be to add a few "if (!IS_ROOT(dentry))" tests
to nameidata_dentry_drop_rcu(), I think.

int d_lock_parent_child(parent, dentry)
{
	err = Success;
	spin_lock(&parent->d_lock);
	if (!IS_ROOT(dentry)) {
		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
		if (unlikely(parent != dentry->d_parent)) {
			spin_unlock(&dentry->d_lock);
			spin_unlock(&parent->d_lock);
			err = Error_Unmatch;
		}
	} else
		err = Success_Root;
	return err;
}

void d_unlock_parent_child(stat, parent, dentry)
{
	Assert(stat != Error_Unmatch);
	if (stat == Success)
		spin_unlock(&dentry->d_lock);
	spin_unlock(&parent->d_lock);
}

J. R. Okajima
* Re: vfs-scale, chroot
@ 2011-01-12 20:04 ` Trond Myklebust
From: Trond Myklebust @ 2011-01-12 20:04 UTC
To: J. R. Okajima; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Thu, 2011-01-13 at 04:47 +0900, J. R. Okajima wrote:
> Hello Nick,
>
> I've got a crash around d_lock.
>
>   # mount -t nfs host:/dir /nfs
>   # chroot /nfs
>
> BUG: spinlock recursion on CPU#1, chroot/2524
> lock: ffff88001d106880, .magic: dead4ead, .owner: chroot/2524, .owner_cpu: 1
> Call Trace:
>  [<ffffffff8128ca52>] ? spin_bug+0xa2/0xf0
>  [<ffffffff8128cd53>] ? do_raw_spin_lock+0x193/0x1b0
>  [<ffffffff814eaa5e>] ? _raw_spin_lock_nested+0x4e/0x60
>  [<ffffffff81137aef>] ? nameidata_dentry_drop_rcu+0xcf/0x1b0
>  [<ffffffff814eaab3>] ? _raw_spin_lock+0x43/0x50
>  [<ffffffff81137aef>] ? nameidata_dentry_drop_rcu+0xcf/0x1b0
>  [<ffffffff81137c1b>] ? d_revalidate+0x4b/0x70
>  [<ffffffff81139a75>] ? link_path_walk+0x655/0x1210
>  [<ffffffff81138702>] ? path_init_rcu+0x1c2/0x370
>  [<ffffffff811387e5>] ? path_init_rcu+0x2a5/0x370
>  [<ffffffff81138702>] ? path_init_rcu+0x1c2/0x370
>  [<ffffffff811059b3>] ? might_fault+0x53/0xb0
>  [<ffffffff8113a9fe>] ? do_path_lookup+0x8e/0x1d0
>  [<ffffffff8113b936>] ? user_path_at+0xa6/0xe0
>  [<ffffffff8114c167>] ? vfsmount_lock_local_unlock+0x77/0x90
>  [<ffffffff814ebab9>] ? retint_swapgs+0x13/0x1b
>  [<ffffffff81090b15>] ? trace_hardirqs_on_caller+0x145/0x190
>  [<ffffffff8112841e>] ? sys_chdir+0x2e/0x90
>  [<ffffffff8100bfd2>] ? system_call_fastpath+0x16/0x1b
>
> It looks like nameidata_dentry_drop_rcu() tries to spin_lock() the same
> dentry twice when parent == dentry:
>
> - NFS ->d_revalidate() returns -ECHILD for LOOKUP_RCU.
> - The VFS d_revalidate() will try ->d_revalidate() again after dropping
>   LOOKUP_RCU via nameidata_dentry_drop_rcu().
> - nameidata_dentry_drop_rcu() calls
> 	spin_lock(&parent->d_lock);
> 	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> - It may happen on any fs which specifies FS_REVAL_DOT.
>
> If we had a pair of functions like the ones below, they might be useful.
> But are there really many cases like this problem? If there are not
> many, then the fix would be to add a few "if (!IS_ROOT(dentry))" tests
> to nameidata_dentry_drop_rcu(), I think.
>
> int d_lock_parent_child(parent, dentry)
> {
> 	err = Success;
> 	spin_lock(&parent->d_lock);
> 	if (!IS_ROOT(dentry)) {
> 		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> 		if (unlikely(parent != dentry->d_parent)) {
> 			spin_unlock(&dentry->d_lock);
> 			spin_unlock(&parent->d_lock);
> 			err = Error_Unmatch;
> 		}
> 	} else
> 		err = Success_Root;
> 	return err;
> }
>
> void d_unlock_parent_child(stat, parent, dentry)
> {
> 	Assert(stat != Error_Unmatch);
> 	if (stat == Success)
> 		spin_unlock(&dentry->d_lock);
> 	spin_unlock(&parent->d_lock);
> }

BTW, Nick: Given that some filesystems such as NFS are _always_ going to
reject LOOKUP_RCU, it would appear to be completely out of place to use
the 'unlikely()' keyword when testing the results of path_walk_rcu() and
friends. In particular, when the kernel is running with nfsroot, we're
saying that 100% of all cases are 'unlikely'...

Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
* Re: vfs-scale, chroot
@ 2011-01-13 1:25 ` Nick Piggin
From: Nick Piggin @ 2011-01-13 1:25 UTC
To: Trond Myklebust; +Cc: J. R. Okajima, linux-fsdevel, linux-kernel

On Thu, Jan 13, 2011 at 7:04 AM, Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:

> BTW, Nick: Given that some filesystems such as NFS are _always_ going to
> reject LOOKUP_RCU,

That's not very optimistic of you... why is that a given, may I ask?
Or do you just mean as-in the current code?

> it would appear to be completely out of place to use
> the 'unlikely()' keyword when testing the results of path_walk_rcu() and
> friends. In particular when the kernel is running with nfsroot, we're
> saying that 100% of all cases are 'unlikely'...

Well, that _is_ an accepted use of branch annotations. For example, it
is used when scheduling realtime tasks, because even if some systems
will do 99.9% of their scheduling on realtime tasks, it is not the
common case.

I'm not saying that applies here, but: if path-walk performance is
important, then we should use local caching and rcu-walk. If not, then
why do we care about a slightly slower branch?

The annotations really help to reduce the icache penalty of added
complexity, which is why I like them, but I'm happy to remove them where
they don't make sense, of course.

Thanks,
Nick
* Re: vfs-scale, chroot
@ 2011-01-13 3:55 ` Trond Myklebust
From: Trond Myklebust @ 2011-01-13 3:55 UTC
To: Nick Piggin; +Cc: J. R. Okajima, linux-fsdevel, linux-kernel

On Thu, 2011-01-13 at 12:25 +1100, Nick Piggin wrote:
> On Thu, Jan 13, 2011 at 7:04 AM, Trond Myklebust
> <Trond.Myklebust@netapp.com> wrote:
>
> > BTW, Nick: Given that some filesystems such as NFS are _always_ going to
> > reject LOOKUP_RCU,
>
> That's not very optimistic of you... why is that a given, may I ask?
> Or do you just mean as-in the current code?

In the current code, the -ECHILD is unconditional, so adding the
unlikely() is clearly pre-empting the reality on the ground.

In the longer run, we can convert the NFS d_revalidate() to accept
LOOKUP_RCU in the cases where we believe we can trust the cache without
having to issue an RPC call. That will mainly be when we hold an NFSv4
delegation, or when we believe the parent directory's mtime has not
changed. However, that is not something I'm going to have the cycles to
do in this merge window, and I assume there will be other filesystems
whose maintainers have similar constraints.

However, even if we were to make these changes, we're still not going to
be accepting LOOKUP_RCU in the 99.999% of cases that is usually a
requirement for unlikely() to be an acceptable optimisation. For a
number of common workloads, directories change all the time, and in NFS
that inevitably means we need to revalidate their contents with an RPC
call (unless we happen to hold an NFSv4 delegation).

> > it would appear to be completely out of place to use
> > the 'unlikely()' keyword when testing the results of path_walk_rcu() and
> > friends. In particular when the kernel is running with nfsroot, we're
> > saying that 100% of all cases are 'unlikely'...
>
> Well, that _is_ an accepted use of branch annotations. For example, it
> is used when scheduling realtime tasks, because even if some systems
> will do 99.9% of their scheduling on realtime tasks, it is not the
> common case.
>
> I'm not saying that applies here, but: if path-walk performance is
> important, then we should use local caching and rcu-walk. If not, then
> why do we care about a slightly slower branch?
>
> The annotations really help to reduce the icache penalty of added
> complexity, which is why I like them, but I'm happy to remove them
> where they don't make sense, of course.

You are adding overhead to existing common paths by doubling the calls
to d_revalidate() in ways which, afaics, will break branch prediction
(first setting, then clearing LOOKUP_RCU). You can at least try not to
maximise that impact by adding further branch-prediction impediments.

Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
* Re: vfs-scale, chroot
@ 2011-01-13 4:27 ` Nick Piggin
From: Nick Piggin @ 2011-01-13 4:27 UTC
To: Trond Myklebust; +Cc: J. R. Okajima, linux-fsdevel, linux-kernel

On Thu, Jan 13, 2011 at 2:55 PM, Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> On Thu, 2011-01-13 at 12:25 +1100, Nick Piggin wrote:
>> On Thu, Jan 13, 2011 at 7:04 AM, Trond Myklebust
>> <Trond.Myklebust@netapp.com> wrote:
>>
>> > BTW, Nick: Given that some filesystems such as NFS are _always_ going to
>> > reject LOOKUP_RCU,
>>
>> That's not very optimistic of you... why is that a given, may I ask?
>> Or do you just mean as-in the current code?
>
> In the current code, the -ECHILD is unconditional, so adding the
> unlikely() is clearly pre-empting the reality on the ground.

Sure, I just hoped you didn't mean NFS cannot be converted to use it in
future. NFS in particular is one of the more "difficult" implementations
I would like to see able to use rcu-walk.

> In the longer run, we can convert the NFS d_revalidate() to accept
> LOOKUP_RCU in the cases where we believe we can trust the cache without
> having to issue an RPC call. That will mainly be when we hold an NFSv4
> delegation, or when we believe the parent directory's mtime has not
> changed. However, that is not something I'm going to have the cycles to
> do in this merge window, and I assume there will be other filesystems
> whose maintainers have similar constraints.

Sure. And in the cases where a lookup cannot be satisfied from RAM, the
filesystem simply should not be deployed for applications where
path-lookup performance (in terms of CPU cycles) is critical. It's a
nice self-confirming loop for the rcu-walk performance tradeoff.

> However, even if we were to make these changes, we're still not going to
> be accepting LOOKUP_RCU in the 99.999% of cases that is usually a
> requirement for unlikely() to be an acceptable optimisation. For a
> number of common workloads, directories change all the time, and in NFS
> that inevitably means we need to revalidate their contents with an RPC
> call (unless we happen to hold an NFSv4 delegation).

99.999%? No. One out of every 100,000 operations? That would mean that
if a correct branch annotation saves 0.1 cycles (being pessimistic),
then an incorrect one would have to cost 10,000 cycles more than if the
annotation had not existed (on the order of 100 cache misses or 500
branch mispredicts) before it became a net loss. That is totally out of
proportion.

It really depends on a lot of things. Branch annotation tends not to
defeat dynamic branch-prediction hardware; rather, it provides a default
so that the hardware falls back to the more common path when the branch
is not in its history. The main impact is moving the uncommon case out
of the icache, at the cost of a bit of extra code and an extra
unconditional branch when the hint is wrong. I would say the useful
range is more like 90%-99% (i.e. on the order of 0.1s of cycles of
improvement in the correct case, and 1s or 10s of cycles added in the
incorrect case).

>> The annotations really help to reduce the icache penalty of added
>> complexity, which is why I like them, but I'm happy to remove them
>> where they don't make sense, of course.
>
> You are adding overhead to existing common paths by doubling the calls
> to d_revalidate() in ways which, afaics, will break branch prediction
> (first setting, then clearing LOOKUP_RCU).

Well, it should be set once and then clear for the rest of the path
walk, so only a dumb bimodal predictor would break. But without going
into details, I'm not denying that path walking is suboptimal for
rcu-walk-incapable filesystems.

> You can at least try not to
> maximise that impact by adding further branch-prediction impediments.

There is simply more code, more branches, more loads, etc. for
unconverted filesystems as well. The goal was to *first* minimise the
cost of rcu-walk, and *then* minimise the cost of ref-walk.

So if we're looking at micro-optimising performance for specific
filesystems and re-tuning some of my assumptions based on real-world
cases, I would like to wait for filesystems to be converted. Yes, that
leaves 2.6.38 a tiny bit slower for unconverted filesystems, but it
really shouldn't be a big deal. If you desperately care about a few
cycles here, then (I applaud you, but) convert to rcu-walk and it will
blow your mind. I am not abandoning your plight :)

The fact is that I did spend a long time looking at profiles and
assembly for the rcu-walk-converted cases, and only on a very particular
set of workloads. So I am really keen to re-tune some of those
assumptions as we see how things get used in the real world; I just
don't know whether it is worth redoing before fs developers get a bit of
time to explore how rcu-walk might work.
* Re: vfs-scale, chroot
@ 2011-01-13 1:06 ` Nick Piggin
From: Nick Piggin @ 2011-01-13 1:06 UTC
To: J. R. Okajima; +Cc: linux-fsdevel, linux-kernel

On Thu, Jan 13, 2011 at 6:47 AM, J. R. Okajima <hooanon05@yahoo.co.jp> wrote:
>
> Hello Nick,
>
> I've got a crash around d_lock.
>
>   # mount -t nfs host:/dir /nfs
>   # chroot /nfs
>
> BUG: spinlock recursion on CPU#1, chroot/2524
> lock: ffff88001d106880, .magic: dead4ead, .owner: chroot/2524, .owner_cpu: 1
> Call Trace:
>  [<ffffffff8128ca52>] ? spin_bug+0xa2/0xf0
>  [<ffffffff8128cd53>] ? do_raw_spin_lock+0x193/0x1b0
>  [<ffffffff814eaa5e>] ? _raw_spin_lock_nested+0x4e/0x60
>  [<ffffffff81137aef>] ? nameidata_dentry_drop_rcu+0xcf/0x1b0
>  [<ffffffff814eaab3>] ? _raw_spin_lock+0x43/0x50
>  [<ffffffff81137aef>] ? nameidata_dentry_drop_rcu+0xcf/0x1b0
>  [<ffffffff81137c1b>] ? d_revalidate+0x4b/0x70
>  [<ffffffff81139a75>] ? link_path_walk+0x655/0x1210
>  [<ffffffff81138702>] ? path_init_rcu+0x1c2/0x370
>  [<ffffffff811387e5>] ? path_init_rcu+0x2a5/0x370
>  [<ffffffff81138702>] ? path_init_rcu+0x1c2/0x370
>  [<ffffffff811059b3>] ? might_fault+0x53/0xb0
>  [<ffffffff8113a9fe>] ? do_path_lookup+0x8e/0x1d0
>  [<ffffffff8113b936>] ? user_path_at+0xa6/0xe0
>  [<ffffffff8114c167>] ? vfsmount_lock_local_unlock+0x77/0x90
>  [<ffffffff814ebab9>] ? retint_swapgs+0x13/0x1b
>  [<ffffffff81090b15>] ? trace_hardirqs_on_caller+0x145/0x190
>  [<ffffffff8112841e>] ? sys_chdir+0x2e/0x90
>  [<ffffffff8100bfd2>] ? system_call_fastpath+0x16/0x1b
>
> It looks like nameidata_dentry_drop_rcu() tries to spin_lock() the same
> dentry twice when parent == dentry:
>
> - NFS ->d_revalidate() returns -ECHILD for LOOKUP_RCU.
> - The VFS d_revalidate() will try ->d_revalidate() again after dropping
>   LOOKUP_RCU via nameidata_dentry_drop_rcu().
> - nameidata_dentry_drop_rcu() calls
> 	spin_lock(&parent->d_lock);
> 	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> - It may happen on any fs which specifies FS_REVAL_DOT.
>
> If we had a pair of functions like the ones below, they might be useful.
> But are there really many cases like this problem? If there are not
> many, then the fix would be to add a few "if (!IS_ROOT(dentry))" tests
> to nameidata_dentry_drop_rcu(), I think.

Hey, thanks for the report and the excellent debugging work!

I agree with your analysis. I'll prepare a patch for it shortly.

Thanks,
Nick