* [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
[not found] ` <1336201977.7346.22.camel@marge.simpson.net>
@ 2012-05-07 21:51 ` Eric W. Biederman
2012-05-07 22:17 ` Al Viro
0 siblings, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2012-05-07 21:51 UTC (permalink / raw)
To: Al Viro
Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
Cyrill Gorcunov, Louis Rilling, Paul E. McKenney, Mike Galbraith,
Christoph Hellwig, linux-fsdevel
Recently it was observed that a distilled version of vsftpd was taking a
surprising amount of time reaping zombies. A measurement was taken
and vsftpd was taking about 4ms (one jiffy) to reap each zombie, and
those 4ms were spent sleeping in rcu_barrier in deactivate_locked_super.
The reason vsftpd was sleeping in deactivate_locked_super is that
vsftpd creates a pid namespace for each connection, and with that
pid namespace comes an internal mount of /proc. That internal mount
of proc is unmounted when the last process in the pid namespace is
reaped.
/proc and similar non-modular filesystems do not need an rcu_barrier
in deactivate_locked_super. Being non-modular, there is no danger
of the rcu callback running after the module is unloaded.
Therefore do the easy thing and remove 4ms+ from unmount times by
calling rcu_barrier only for modular filesystems on unmount.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
fs/super.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/fs/super.c b/fs/super.c
index cf00177..c739ef8 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -261,7 +261,8 @@ void deactivate_locked_super(struct super_block *s)
* We need to call rcu_barrier so all the delayed rcu free
* inodes are flushed before we release the fs module.
*/
- rcu_barrier();
+ if (fs->owner)
+ rcu_barrier();
put_filesystem(fs);
put_super(s);
} else {
--
1.7.5.4
* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
2012-05-07 21:51 ` [PATCH] vfs: Speed up deactivate_super for non-modular filesystems Eric W. Biederman
@ 2012-05-07 22:17 ` Al Viro
2012-05-07 23:56 ` Paul E. McKenney
0 siblings, 1 reply; 10+ messages in thread
From: Al Viro @ 2012-05-07 22:17 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
Cyrill Gorcunov, Louis Rilling, Paul E. McKenney, Mike Galbraith,
Christoph Hellwig, linux-fsdevel
On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
> /proc and similar non-modular filesystems do not need a rcu_barrier
> in deactivate_locked_super. Being non-modular there is no danger
> of the rcu callback running after the module is unloaded.
There's more than just a module unload there, though - actual freeing
struct super_block also happens past that rcu_barrier()...
* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
2012-05-07 22:17 ` Al Viro
@ 2012-05-07 23:56 ` Paul E. McKenney
2012-05-08 1:07 ` Eric W. Biederman
0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2012-05-07 23:56 UTC (permalink / raw)
To: Al Viro
Cc: Eric W. Biederman, Andrew Morton, Oleg Nesterov, LKML,
Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling, Mike Galbraith,
Christoph Hellwig, linux-fsdevel
On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
>
> > /proc and similar non-modular filesystems do not need a rcu_barrier
> > in deactivate_locked_super. Being non-modular there is no danger
> > of the rcu callback running after the module is unloaded.
>
> There's more than just a module unload there, though - actual freeing
> struct super_block also happens past that rcu_barrier()...
Is there anything in there for which synchronous operation is required?
If not, one approach would be to drop the rcu_barrier() calls to a
workqueue or something similar.
Thanx, Paul
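A rough sketch of what that deferral might look like (the destroy_work field and the helper name here are hypothetical, not existing VFS code):

	/* Hypothetical: hand the rcu_barrier() and the final puts to a
	 * workqueue so deactivate_locked_super() need not sleep for a
	 * grace period itself. */
	static void superblock_destroy_workfn(struct work_struct *work)
	{
		struct super_block *s =
			container_of(work, struct super_block, destroy_work);

		rcu_barrier();			/* flush delayed rcu-freed inodes */
		put_filesystem(s->s_type);
		put_super(s);
	}

	/* ...and in deactivate_locked_super(), instead of the inline
	 * rcu_barrier():
	 *	INIT_WORK(&s->destroy_work, superblock_destroy_workfn);
	 *	schedule_work(&s->destroy_work);
	 */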
* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
2012-05-07 23:56 ` Paul E. McKenney
@ 2012-05-08 1:07 ` Eric W. Biederman
2012-05-08 4:53 ` Mike Galbraith
2012-05-09 7:55 ` Nick Piggin
0 siblings, 2 replies; 10+ messages in thread
From: Eric W. Biederman @ 2012-05-08 1:07 UTC (permalink / raw)
To: paulmck
Cc: Al Viro, Andrew Morton, Oleg Nesterov, LKML, Pavel Emelyanov,
Cyrill Gorcunov, Louis Rilling, Mike Galbraith, Christoph Hellwig,
linux-fsdevel
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
>> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
>>
>> > /proc and similar non-modular filesystems do not need a rcu_barrier
>> > in deactivate_locked_super. Being non-modular there is no danger
>> > of the rcu callback running after the module is unloaded.
>>
>> There's more than just a module unload there, though - actual freeing
>> struct super_block also happens past that rcu_barrier()...
Al, I have not closely audited the entire code path, but from a quick
sample I see no evidence that anything depends on inode->i_sb being
rcu safe. Do you know of any such location?
It has only been a year and a half since Nick added this code, which
isn't very much time to have grown strange dependencies like that.
> Is there anything in there for which synchronous operation is required?
> If not, one approach would be to drop the rcu_barrier() calls to a
> workqueue or something similar.
We need to drain all of the rcu callbacks before we free the slab
and unload the module.
This actually makes deactivate_locked_super the totally wrong place
for the rcu_barrier. We want the rcu_barrier in the module exit
routine where we destroy the inode cache.
What I see as the real need is that filesystem modules need to do:
rcu_barrier()
kmem_cache_destroy(cache);
Perhaps we can add some helpers to make it easy. But I think
I would be happy today with simply moving the rcu_barrier into
every filesystem's module exit path, just before the filesystem
module destroys its inode cache.
Eric
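For illustration, the pattern described above would look roughly like this in a filesystem's module exit routine (the examplefs names are made up):

	static void __exit examplefs_exit(void)
	{
		unregister_filesystem(&examplefs_fs_type);
		/*
		 * Make sure all delayed rcu-freed inodes have been flushed
		 * before the inode cache (and the module) goes away.
		 */
		rcu_barrier();
		kmem_cache_destroy(examplefs_inode_cachep);
	}
	module_exit(examplefs_exit);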
* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
2012-05-08 1:07 ` Eric W. Biederman
@ 2012-05-08 4:53 ` Mike Galbraith
2012-05-09 7:55 ` Nick Piggin
1 sibling, 0 replies; 10+ messages in thread
From: Mike Galbraith @ 2012-05-08 4:53 UTC (permalink / raw)
To: Eric W. Biederman
Cc: paulmck, Al Viro, Andrew Morton, Oleg Nesterov, LKML,
Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling,
Christoph Hellwig, linux-fsdevel
On Mon, 2012-05-07 at 18:07 -0700, Eric W. Biederman wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> What I see as the real need is the filesystem modules need to do:
> rcu_barrier()
> kmem_cache_destroy(cache);
>
> Perhaps we can add some helpers to make it easy. But I think
> I would be happy today with simply moving the rcu_barrier into
> every filesystems module exit path, just before the file system
> module destroys its inode cache.
The one-liner kills the reap bottleneck and 99.999% of cache bloat. 1000
backgrounded vsftpd testcases finished ~instantly and left one
persistent pid namespace, vs. taking ages and bloating very badly.
Hacked-up hackbench still hurts with all (except user) namespaces, but
that's a different problem (modulo hackbench wonderfulness).
Previous numbers:
default flags = SIGCHLD
-namespace: flag |= CLONE_NEWPID
-all: flags |= CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUSER
marge:/usr/local/tmp/starvation # ./hackbench
Running with 10*40 (== 400) tasks.
Time: 2.636
marge:/usr/local/tmp/starvation # ./hackbench -namespace
Running with 10*40 (== 400) tasks.
Time: 11.624
marge:/usr/local/tmp/starvation # ./hackbench -namespace -all
Running with 10*40 (== 400) tasks.
Time: 51.474
New numbers:
marge:/usr/local/tmp/starvation # time ./hackbench
Running with 10*40 (== 400) tasks.
Time: 2.718
real 0m2.877s
user 0m0.060s
sys 0m10.057s
marge:/usr/local/tmp/starvation # time ./hackbench -namespace
Running with 10*40 (== 400) tasks.
Time: 2.689
real 0m2.878s
user 0m0.060s
sys 0m9.945s
marge:/usr/local/tmp/starvation # time ./hackbench -namespace -all
Running with 10*40 (== 400) tasks.
Time: 2.521
real 0m27.774s
user 0m0.048s
sys 0m21.681s
marge:/usr/local/tmp/starvation #
* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
2012-05-08 1:07 ` Eric W. Biederman
2012-05-08 4:53 ` Mike Galbraith
@ 2012-05-09 7:55 ` Nick Piggin
2012-05-09 11:02 ` Eric W. Biederman
2012-05-09 13:59 ` Paul E. McKenney
1 sibling, 2 replies; 10+ messages in thread
From: Nick Piggin @ 2012-05-09 7:55 UTC (permalink / raw)
To: Eric W. Biederman
Cc: paulmck, Al Viro, Andrew Morton, Oleg Nesterov, LKML,
Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling, Mike Galbraith,
Christoph Hellwig, linux-fsdevel
On 8 May 2012 11:07, Eric W. Biederman <ebiederm@xmission.com> wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>
>> On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
>>> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
>>>
>>> > /proc and similar non-modular filesystems do not need a rcu_barrier
>>> > in deactivate_locked_super. Being non-modular there is no danger
>>> > of the rcu callback running after the module is unloaded.
>>>
>>> There's more than just a module unload there, though - actual freeing
>>> struct super_block also happens past that rcu_barrier()...
>
> Al. I have not closely audited the entire code path but at a quick
> sample I see no evidence that anything depends on inode->i_sb being
> rcu safe. Do you know of any such location?
>
> It has only been a year and a half since Nick added this code which
> isn't very much time to have grown strange dependencies like that.
No, it has always depended on this.
Look at ncp_compare_dentry(), for example.
>> Is there anything in there for which synchronous operation is required?
>> If not, one approach would be to drop the rcu_barrier() calls to a
>> workqueue or something similar.
>
> We need to drain all of the rcu callbacks before we free the slab
> and unload the module.
>
> This actually makes deactivate_locked_super the totally wrong place
> for the rcu_barrier. We want the rcu_barrier in the module exit
> routine where we destroy the inode cache.
>
> What I see as the real need is the filesystem modules need to do:
> rcu_barrier()
> kmem_cache_destroy(cache);
>
> Perhaps we can add some helpers to make it easy. But I think
> I would be happy today with simply moving the rcu_barrier into
> every filesystems module exit path, just before the file system
> > module destroys its inode cache.
No, because that's not the only requirement for the rcu_barrier.
Making it asynchronous is not something I wanted to do, because
then we potentially have a process exiting from kernel space after
releasing the last reference on a mount, but the mount does not go
away until "some time" later. Which is crazy.
However. We are holding vfsmount_lock for read at the point
where we ever actually do anything with an "rcu-referenced"
dentry/inode. I wonder if we could use this to get i_sb pinned.
* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
2012-05-09 7:55 ` Nick Piggin
@ 2012-05-09 11:02 ` Eric W. Biederman
2012-05-15 8:40 ` Nick Piggin
2012-05-09 13:59 ` Paul E. McKenney
1 sibling, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2012-05-09 11:02 UTC (permalink / raw)
To: Nick Piggin
Cc: paulmck, Al Viro, Andrew Morton, Oleg Nesterov, LKML,
Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling, Mike Galbraith,
Christoph Hellwig, linux-fsdevel
Nick Piggin <npiggin@gmail.com> writes:
> On 8 May 2012 11:07, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>>
>>> On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
>>>> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
>>>>
>>>> > /proc and similar non-modular filesystems do not need a rcu_barrier
>>>> > in deactivate_locked_super. Being non-modular there is no danger
>>>> > of the rcu callback running after the module is unloaded.
>>>>
>>>> There's more than just a module unload there, though - actual freeing
>>>> struct super_block also happens past that rcu_barrier()...
>>
>> Al. I have not closely audited the entire code path but at a quick
>> sample I see no evidence that anything depends on inode->i_sb being
>> rcu safe. Do you know of any such location?
>>
>> It has only been a year and a half since Nick added this code which
>> isn't very much time to have grown strange dependencies like that.
>
> No, it has always depended on this.
>
> Look at ncp_compare_dentry(), for example.
Interesting. The logic in ncp_compare_dentry is broken.
Accessing i_sb->s_fs_info for parameters does seem reasonable.
Unfortunately ncp_put_super frees the server directly.
Meaning that if we are depending only on rcu protection, a badly timed
ncp_compare_dentry will oops the kernel.
I am going to go out on a limb and guess that every other filesystem
with a similar dependency follows the same pattern and is likely
broken as well.
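The hazard described above, sketched in rough form (the names and the simplified d_compare signature are illustrative, not the actual ncpfs code):

	/* rcu-walk side: ->d_compare() may be called under rcu_read_lock()
	 * alone, with nothing pinning the superblock's private data. */
	static int example_compare_dentry(const struct dentry *dentry,
					  unsigned int len, const char *str,
					  const struct qstr *name)
	{
		struct example_server *server = dentry->d_sb->s_fs_info;

		/* If ->put_super() already freed s_fs_info, this dereference
		 * is a use-after-free under rcu-only protection. */
		if (server->flags & EXAMPLE_CASE_INSENSITIVE)
			return strncasecmp(str, name->name, len) || len != name->len;
		return memcmp(str, name->name, len) || len != name->len;
	}

	/* umount side */
	static void example_put_super(struct super_block *sb)
	{
		kfree(sb->s_fs_info);	/* freed directly, no RCU delay */
	}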
>> We need to drain all of the rcu callbacks before we free the slab
>> and unload the module.
>>
>> This actually makes deactivate_locked_super the totally wrong place
>> for the rcu_barrier. We want the rcu_barrier in the module exit
>> routine where we destroy the inode cache.
>>
>> What I see as the real need is the filesystem modules need to do:
>> rcu_barrier()
>> kmem_cache_destroy(cache);
>>
>> Perhaps we can add some helpers to make it easy. But I think
>> I would be happy today with simply moving the rcu_barrier into
>> every filesystems module exit path, just before the file system
>> module destroys its inode cache.
>
> No, because that's not the only requirement for the rcu_barrier.
>
> Making it asynchronous is not something I wanted to do, because
> then we potentially have a process exiting from kernel space after
> releasing last reference on a mount, but the mount does not go
> away until "some time" later. Which is crazy.
Well, we certainly want a deliberate unmount of a filesystem to safely
and successfully put the filesystem in a sane state before the unmount
returns.
If we have a few lingering data structures waiting for an rcu grace period
after a process exits, I'm not certain that is bad. Although I would not
mind it much.
> However. We are holding vfsmount_lock for read at the point
> where we ever actually do anything with an "rcu-referenced"
> dentry/inode. I wonder if we could use this to get i_sb pinned.
Interesting observation.
Taking that observation farther, we have a mount reference count that
pins the super block. So at first glance the super block looks safe
without any rcu protections.
I'm not certain what pins the inodes. Let's see:
mnt->mnt_root has the root dentry of the dentry tree, and that
dentry's count is protected by the vfsmount_lock.
Beyond that we have kill_sb.
kill_sb() typically calls generic_shutdown_super()
From generic_shutdown_super() we call:
shrink_dcache_for_umount() which flushes lingering dentries.
evict_inodes() which flushes lingering inodes.
So in some sense the reference counts on mounts and dentries protect
the cache.
So the only case I can see where rcu appears to matter is when we are
freeing dentries.
When freeing dentries the idiom is:
	dentry_iput(dentry);
	d_free(dentry);
d_free does: if (dentry->d_flags & DCACHE_RCUACCESS) call_rcu(..., __d_free);
So while most of the time dentries hold onto inodes reliably with a
reference count, and most of the time dentries are kept alive by
dentry->d_count, part of the time there is this gray zone where only
rcu references to dentries are keeping them alive.
Which explains the need for rcu freeing of inodes.
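For reference, the d_free() path being described looked roughly like this in fs/dcache.c of that era (paraphrased, details trimmed):

	static void d_free(struct dentry *dentry)
	{
		if (dentry->d_op && dentry->d_op->d_release)
			dentry->d_op->d_release(dentry);

		/* If the dentry was ever visible to rcu-walk, defer the
		 * actual free until after an rcu grace period. */
		if (!(dentry->d_flags & DCACHE_RCUACCESS))
			__d_free(&dentry->d_u.d_rcu);
		else
			call_rcu(&dentry->d_u.d_rcu, __d_free);
	}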
This makes me wonder why we think calling d_release is safe
before we wait for the rcu grace period.
Documentation/filesystems/vfs.txt seems to duplicate this reasoning
of why the superblock is safe: because we hold a real reference to it
from the vfsmount.
The strangest case is calling __lookup_mnt during an "rcu-path-walk".
But mounts are reference counted from the mount namespace, and
are protected during an "rcu-path-walk" by vfsmount_lock read locked,
and are only changed with vfsmount_lock write locked.
Which leads again (with stronger reasons now) to the conclusions that:
a) We don't depend on rcu_barrier to protect the superblock.
b) My trivial patch is safe.
c) We probably should move rcu_barrier to the filesystem module exit
routines, just to make things clear and to make everything faster.
Eric
* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
2012-05-09 7:55 ` Nick Piggin
2012-05-09 11:02 ` Eric W. Biederman
@ 2012-05-09 13:59 ` Paul E. McKenney
1 sibling, 0 replies; 10+ messages in thread
From: Paul E. McKenney @ 2012-05-09 13:59 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, Al Viro, Andrew Morton, Oleg Nesterov, LKML,
Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling, Mike Galbraith,
Christoph Hellwig, linux-fsdevel
On Wed, May 09, 2012 at 05:55:57PM +1000, Nick Piggin wrote:
> On 8 May 2012 11:07, Eric W. Biederman <ebiederm@xmission.com> wrote:
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
[ . . . ]
> >> Is there anything in there for which synchronous operation is required?
> >> If not, one approach would be to drop the rcu_barrier() calls to a
> >> workqueue or something similar.
> >
> > We need to drain all of the rcu callbacks before we free the slab
> > and unload the module.
> >
> > This actually makes deactivate_locked_super the totally wrong place
> > for the rcu_barrier. We want the rcu_barrier in the module exit
> > routine where we destroy the inode cache.
> >
> > What I see as the real need is the filesystem modules need to do:
> > rcu_barrier()
> > kmem_cache_destroy(cache);
> >
> > Perhaps we can add some helpers to make it easy. But I think
> > I would be happy today with simply moving the rcu_barrier into
> > every filesystems module exit path, just before the file system
> > module destroys its inode cache.
>
> No, because that's not the only requirement for the rcu_barrier.
>
> Making it asynchronous is not something I wanted to do, because
> then we potentially have a process exiting from kernel space after
> releasing last reference on a mount, but the mount does not go
> away until "some time" later. Which is crazy.
In any case, I am looking into making concurrent calls to rcu_barrier()
share each others' work, so if asynchronous turns out to be needed,
it will be efficient.
Thanx, Paul
> However. We are holding vfsmount_lock for read at the point
> where we ever actually do anything with an "rcu-referenced"
> dentry/inode. I wonder if we could use this to get i_sb pinned.
>
* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
2012-05-09 11:02 ` Eric W. Biederman
@ 2012-05-15 8:40 ` Nick Piggin
2012-05-16 0:34 ` Eric W. Biederman
0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2012-05-15 8:40 UTC (permalink / raw)
To: Eric W. Biederman
Cc: paulmck, Al Viro, Andrew Morton, Oleg Nesterov, LKML,
Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling, Mike Galbraith,
Christoph Hellwig, linux-fsdevel
On 9 May 2012 21:02, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Nick Piggin <npiggin@gmail.com> writes:
>
>> On 8 May 2012 11:07, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>>>
>>>> On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
>>>>> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
>>>>>
>>>>> > /proc and similar non-modular filesystems do not need a rcu_barrier
>>>>> > in deactivate_locked_super. Being non-modular there is no danger
>>>>> > of the rcu callback running after the module is unloaded.
>>>>>
>>>>> There's more than just a module unload there, though - actual freeing
>>>>> struct super_block also happens past that rcu_barrier()...
>>>
>>> Al. I have not closely audited the entire code path but at a quick
>>> sample I see no evidence that anything depends on inode->i_sb being
>>> rcu safe. Do you know of any such location?
>>>
>>> It has only been a year and a half since Nick added this code which
>>> isn't very much time to have grown strange dependencies like that.
>>
>> No, it has always depended on this.
>>
>> Look at ncp_compare_dentry(), for example.
>
> Interesting. ncp_compare_dentry this logic is broken.
>
> Accessing i_sb->s_fs_info for parameters does seem reasonable.
> Unfortunately ncp_put_super frees server directly.
>
> Meaning if we are depending on only rcu protections a badly timed
> ncp_compare_dentry will oops the kernel.
>
> I am going to go out on a limb and guess that every other filesystem
> with a similar dependency follows the same pattern and is likely
> broken as well.
But ncp_put_super should be called after the rcu_barrier(), no?
How is it broken?
>>> We need to drain all of the rcu callbacks before we free the slab
>>> and unload the module.
>>>
>>> This actually makes deactivate_locked_super the totally wrong place
>>> for the rcu_barrier. We want the rcu_barrier in the module exit
>>> routine where we destroy the inode cache.
>>>
>>> What I see as the real need is the filesystem modules need to do:
>>> rcu_barrier()
>>> kmem_cache_destroy(cache);
>>>
>>> Perhaps we can add some helpers to make it easy. But I think
>>> I would be happy today with simply moving the rcu_barrier into
>>> every filesystems module exit path, just before the file system
>>> module destroys its inode cache.
>>
>> No, because that's not the only requirement for the rcu_barrier.
>>
>> Making it asynchronous is not something I wanted to do, because
>> then we potentially have a process exiting from kernel space after
>> releasing last reference on a mount, but the mount does not go
>> away until "some time" later. Which is crazy.
>
> Well we certainly want a deliberate unmount of a filesystem to safely
> and successfully put the filesystem in a sane state before the unmount
> returns.
>
> If we have a few lingering data structures waiting for an rcu grace period
> after a process exits I'm not certain that is bad. Although I would not
> mind it much.
>
>> However. We are holding vfsmount_lock for read at the point
>> where we ever actually do anything with an "rcu-referenced"
>> dentry/inode. I wonder if we could use this to get i_sb pinned.
>
> Interesting observation.
>
> Taking that observation farther we have a mount reference count, that
> pins the super block. So at first glance the super block looks safe
> without any rcu protections.
Well yes, that's what I'm getting at. But I don't think it's quite complete...
>
> I'm not certain what pins the inodes. Let's see:
>
>> mnt->mnt_root has the root dentry of the dentry tree, and that
> dentry count is protected by the vfsmount_lock.
If the mount is already detached from the namespace when we start
to do a path walk, AFAIKS it can be freed up from underneath us at
that point.
This would require cycling vfsmount_lock for write in such a path. It's
better than rcu_barrier probably, but not terribly nice.
>
> Beyond that we have kill_sb.
> kill_sb() typically calls generic_shutdown_super()
> From generic_shutdown_super() we call:
> shrink_dcache_for_umount() which flushes lingering dentries.
> evict_inodes() which flushes lingering inodes.
>
> So in some sense the reference counts on mounts and dentries protect
> the cache.
>
> So the only case I can see where rcu appears to matter is when we are
> freeing dentries.
>
> When freeing dentries the idiom is:
> dentry_iput(dentry);
> d_free(dentry);
>
> d_free does if (dentry->d_flags & DCACHE_RCUACCESS) call_rcu(... __d_free);
>
> So while most of the time dentries hold onto inodes reliably with a
> reference count and most of the time dentries are kept alive by the
> dentry->d_count part of the time there is this gray zone where only
> rcu references to dentries are keeping them alive.
>
> Which explains the need for rcu freeing of inodes.
>
> This makes me wonder why we think calling d_release is safe
> before we want the rcu grace period.
Why wouldn't it be? The superblock cannot go away until all dentries
are freed.
>
> Documentation/filesystems/vfs.txt seems to duplicate this reasoning
> of why the superblock is safe. Because we hold a real reference to it
> from the vfsmount.
rcu walk does not hold a reference to the vfsmount, however. It can
go away. This is why functions which can be called from rcu-walk
must go through synchronize_rcu() before they go away, also before
the superblock goes away.
The other way we could change the rule is to require the barrier only for
those filesystems which access the superblock or other info from rcu-walk.
I would prefer not to have such a rule, but it could be pragmatic.
>
> The strangest case is calling __lookup_mnt during an "rcu-path-walk".
> But mounts are reference counted from the mount namespace, and
> are protected during an "rcu-path-walk" by vfsmount_lock read locked,
> and are only changed with vfsmount_lock write locked.
>
> Which leads again (with stronger reasons now) to the conclusions that:
> a) We don't depend on rcu_barrier to protect the superblock.
> b) My trivial patch is safe.
> c) We probably should move rcu_barrier to the filesystem module exit
> routines, just to make things clear and to make everything faster.
Still not convinced.
* Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
2012-05-15 8:40 ` Nick Piggin
@ 2012-05-16 0:34 ` Eric W. Biederman
0 siblings, 0 replies; 10+ messages in thread
From: Eric W. Biederman @ 2012-05-16 0:34 UTC (permalink / raw)
To: Nick Piggin
Cc: paulmck, Al Viro, Andrew Morton, Oleg Nesterov, LKML,
Pavel Emelyanov, Cyrill Gorcunov, Louis Rilling, Mike Galbraith,
Christoph Hellwig, linux-fsdevel
Nick Piggin <npiggin@gmail.com> writes:
> On 9 May 2012 21:02, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Nick Piggin <npiggin@gmail.com> writes:
>>
>>> On 8 May 2012 11:07, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>>>>
>>>>> On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
>>>>>> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
>>>>>>
>>>>>> > /proc and similar non-modular filesystems do not need a rcu_barrier
>>>>>> > in deactivate_locked_super. Being non-modular there is no danger
>>>>>> > of the rcu callback running after the module is unloaded.
>>>>>>
>>>>>> There's more than just a module unload there, though - actual freeing
>>>>>> struct super_block also happens past that rcu_barrier()...
>>>>
>>>> Al. I have not closely audited the entire code path but at a quick
>>>> sample I see no evidence that anything depends on inode->i_sb being
>>>> rcu safe. Do you know of any such location?
>>>>
>>>> It has only been a year and a half since Nick added this code which
>>>> isn't very much time to have grown strange dependencies like that.
>>>
>>> No, it has always depended on this.
>>>
>>> Look at ncp_compare_dentry(), for example.
>>
>> Interesting. ncp_compare_dentry this logic is broken.
>>
>> Accessing i_sb->s_fs_info for parameters does seem reasonable.
>> Unfortunately ncp_put_super frees server directly.
>>
>> Meaning if we are depending on only rcu protections a badly timed
>> ncp_compare_dentry will oops the kernel.
>>
>> I am going to go out on a limb and guess that every other filesystem
>> with a similar dependency follows the same pattern and is likely
>> broken as well.
>
> But ncp_put_super should be called after the rcu_barrier(), no?
>
> How is it broken?
The interesting hunk of code from deactivate_locked_super is:
> cleancache_invalidate_fs(s);
> fs->kill_sb(s);
^^^^^^^^^^^^^^ This is where ncp_put_super() is called.
>
> /* caches are now gone, we can safely kill the shrinker now */
> unregister_shrinker(&s->s_shrink);
>
> /*
> * We need to call rcu_barrier so all the delayed rcu free
> * inodes are flushed before we release the fs module.
> */
> rcu_barrier();
> put_filesystem(fs);
> put_super(s);
Which guarantees ncp_put_super() happens before the rcu_barrier.
>> Taking that observation farther we have a mount reference count, that
>> pins the super block. So at first glance the super block looks safe
>> without any rcu protections.
>
> Well yes, that's what I'm getting at. But I don't think it's quite complete...
>
>>
>> I'm not certain what pins the inodes. Let's see:
>>
>> mnt->d_mnt_root has the root dentry of the dentry tree, and that
>> dentry count is protected by the vfsmount_lock.
>
> If the mount is already detached from the namespace when we start
> to do a path walk, AFAIKS it can be freed up from underneath us at
> that point.
>
> This would require cycling vfsmount_lock for write in such path. It's
> better than rcu_barrier probably, but not terribly nice.
Where do you see the possibility of a mount detached from a namespace
causing problems? Simply having any count on a mount ensures we cycle
the vfsmount in mntput_no_expire.
Or if you want to see what I am seeing:
The rcu_path_walk starts at one of ".", "/", or file->f_path, all of
which hold a reference on a struct vfsmount.
We perform an rcu_path_walk with the locking:
	br_read_lock(vfsmount_lock);
	rcu_read_lock();
We can transition to another vfs mount via follow_mount_rcu,
which consults the mount hash table, which can only be modified
under br_write_lock(vfsmount_lock).
We can also transition to another vfs mount via follow_up_rcu,
which simply goes to mnt->mnt_parent, where our starting vfsmount
holds a reference to the target vfsmount.
When we complete the rcu_path_walk we do:
	rcu_read_unlock();
	br_read_unlock(vfsmount_lock);
mntput_no_expire, which decrements mount counts, takes and releases
br_write_lock before we put the final mount reference. Which means
that it is impossible for the final mntput on a mount to complete
while we are in the middle of an rcu path walk.
Once we have taken and released br_write_lock(vfsmount_lock)
in mntput_no_expire we call mntfree. mntfree calls
deactivate_super. And deactivate_super calls deactivate_locked_super.
Which is a long-winded way of saying we always call
deactivate_locked_super after we put our final mount count.
I don't see how a mount can possibly be freed while we are in
the middle of an rcu path walk. Not while we hold
br_read_lock(vfsmount_lock), and the final mntput takes
br_write_lock(vfsmount_lock).
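Condensing the ordering described above (paraphrasing the code of that era, not quoting it exactly):

	/* rcu-path-walk side */
	br_read_lock(vfsmount_lock);
	rcu_read_lock();
	/* ... walk, possibly crossing mounts via the mount hash ... */
	rcu_read_unlock();
	br_read_unlock(vfsmount_lock);

	/* final-put side: mntput_no_expire() roughly does */
	br_write_lock(vfsmount_lock);	/* excluded while any walker holds the read lock */
	/* ... drop mnt_count; if it was the last reference ... */
	br_write_unlock(vfsmount_lock);
	mntfree(mnt);	/* -> deactivate_super() -> deactivate_locked_super() */

So the final teardown of the mount (and with it the superblock) cannot overlap an rcu path walk that still holds the read lock.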
>> Documentation/filesystems/vfs.txt seems to duplicate this reasoning
>> of why the superblock is safe. Because we hold a real reference to it
>> from the vfsmount.
>
> rcu walk does not hold a reference to the vfsmount, however. It can
> go away. This is why functions which can be called from rcu-walk
> must go through synchronize_rcu() before they go away, also before
> the superblock goes away.
Not at all.
The rcu walk itself does not hold a reference to the vfsmount, but
something holds a reference to the vfsmount, and to drop the final
reference on a vfsmount we must hold the vfsmount_lock for write.
The rcu walk holds the vfsmount_lock for read, which prevents us from
grabbing the vfsmount_lock for write.
We need to wait for an rcu grace period before freeing dentries and
inodes because for dentries and inodes we only have rcu protection.
For vfsmounts and the superblock we have a lock-protected reference
count.
> The other way we could change the rule is to require barrier only for
> those filesystems which access superblock or other info from rcu-walk.
> I would prefer not to have such a rule, but it could be pragmatic.
I don't see that we need to change a rule.
Eric