Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems

All of lore.kernel.org
 help / color / mirror / Atom feed

From: ebiederm@xmission.com (Eric W. Biederman)
To: Nick Piggin <npiggin@gmail.com>
Cc: paulmck@linux.vnet.ibm.com, Al Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Oleg Nesterov <oleg@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Pavel Emelyanov <xemul@parallels.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Louis Rilling <louis.rilling@kerlabs.com>,
	Mike Galbraith <efault@gmx.de>,
	Christoph Hellwig <hch@infradead.org>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
Date: Wed, 09 May 2012 04:02:52 -0700	[thread overview]
Message-ID: <m1havp62hv.fsf@fess.ebiederm.org> (raw)
In-Reply-To: <CAPa8GCBE--64++-B3H5K-y-sUtyeVrhKwKG-eazVx8Dc7RH+YQ@mail.gmail.com> (Nick Piggin's message of "Wed, 9 May 2012 17:55:57 +1000")

Nick Piggin <npiggin@gmail.com> writes:

> On 8 May 2012 11:07, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>>
>>> On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
>>>> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
>>>>
>>>> > /proc and similar non-modular filesystems do not need a rcu_barrier
>>>> > in deactivate_locked_super.  Being non-modular there is no danger
>>>> > of the rcu callback running after the module is unloaded.
>>>>
>>>> There's more than just a module unload there, though - actual freeing
>>>>  struct super_block also happens past that rcu_barrier()...
>>
>> Al.  I have not closely audited the entire code path but at a quick
>> sample I see no evidence that anything depends on inode->i_sb being
>> rcu safe.  Do you know of any such location?
>>
>> It has only been a year and a half since Nick added this code which
>> isn't very much time to have grown strange dependencies like that.
>
> No, it has always depended on this.
>
> Look at ncp_compare_dentry(), for example.

Interesting. ncp_compare_dentry this logic is broken.

Accessing i_sb->s_fs_info for parameters does seem reasonable.
Unfortunately ncp_put_super frees server directly.

Meaning if we are depending on only rcu protections a badly timed
ncp_compare_dentry will oops the kernel.

I am going to go out on a limb and guess that every other filesystem
with a similar dependency follows the same pattern and is likely
broken as well.

>> We need to drain all of the rcu callbacks before we free the slab
>> and unload the module.
>>
>> This actually makes deactivate_locked_super the totally wrong place
>> for the rcu_barrier.  We want the rcu_barrier in the module exit
>> routine where we destroy the inode cache.
>>
>> What I see as the real need is the filesystem modules need to do:
>>        rcu_barrier()
>>        kmem_cache_destroy(cache);
>>
>> Perhaps we can add some helpers to make it easy.  But I think
>> I would be happy today with simply moving the rcu_barrier into
>> every filesystems module exit path, just before the file system
>> module destoryed it's inode cache.
>
> No, because that's not the only requirement for the rcu_barrier.
>
> Making it asynchronous is not something I wanted to do, because
> then we potentially have a process exiting from kernel space after
> releasing last reference on a mount, but the mount does not go
> away until "some time" later. Which is crazy.

Well we certainly want a deliberate unmount of a filesystem to safely
and successfully put the filesystem in a sane state before the unmount
returns.

If we have a few linger data structures waiting for an rcu grace period
after a process exits I'm not certain that is bad.  Although I would not
mind it much.

> However. We are holding vfsmount_lock for read at the point
> where we ever actually do anything with an "rcu-referenced"
> dentry/inode. I wonder if we could use this to get i_sb pinned.

Interesting observation.

Taking that observation farther we have a mount reference count, that
pins the super block.  So at first glance the super block looks safe
without any rcu protections.

I'm not certain what pins the inodes. Let's see:

mnt->d_mnt_root has the root dentry of the dentry tree, and that
dentry count is protected by the vfsmount_lock.

Beyond that we have kill_sb.
  kill_sb() typically calls generic_shutdown_super()
  From generic_shutdown_super() we call:
     shrink_dcache_for_umount() which flushes lingering dentries.
     evict_inodes() which flushes lingering inodes.

So in some sense the reference counts on mounts and dentries protect
the cache.

So the only case I can see where rcu appears to matter is when we are
freeing dentries.

When freeing dentries the idiom is:
dentry_iput(dentry);
d_free(dentry);

d_free does if (dentry->d_flags & DCACHE_RCUACCESS) call_rcu(... __d_free);

So while most of the time dentries hold onto inodes reliably with a
reference count and most of the time dentries are kept alive by the
dentry->d_count part of the time there is this gray zone where only
rcu references to dentries are keeping them alive.

Which explains the need for rcu freeing of inodes. 

This makes me wonder why we think calling d_release is safe
before we want the rcu grace period.

Documentation/filesystems/vfs.txt seems to duplicate this reasoning
of why the superblock is safe.  Because we hold a real reference to it
from the vfsmount.

The strangest case is calling __lookup_mnt during an "rcu-path-walk".
But mounts are reference counted from the mount namespace, and
are protected during an "rcu-path-walk" by vfsmount_lock read locked,
and are only changed with vfsmount_lock write locked.

Which leads again (with stronger reasons now) to the conclusions that:
a) We don't depend on rcu_barrier to protect the superblock.
b) My trivial patch is safe.
c) We probably should move rcu_barrier to the filesystem module exit
   routines, just to make things clear and to make everything faster.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

WARNING: multiple messages have this Message-ID (diff)

From: ebiederm@xmission.com (Eric W. Biederman)
To: Nick Piggin <npiggin@gmail.com>
Cc: paulmck@linux.vnet.ibm.com, Al Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Oleg Nesterov <oleg@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Pavel Emelyanov <xemul@parallels.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Louis Rilling <louis.rilling@kerlabs.com>,
	Mike Galbraith <efault@gmx.de>,
	Christoph Hellwig <hch@infradead.org>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
Date: Wed, 09 May 2012 04:02:52 -0700	[thread overview]
Message-ID: <m1havp62hv.fsf@fess.ebiederm.org> (raw)
In-Reply-To: <CAPa8GCBE--64++-B3H5K-y-sUtyeVrhKwKG-eazVx8Dc7RH+YQ@mail.gmail.com> (Nick Piggin's message of "Wed, 9 May 2012 17:55:57 +1000")

Nick Piggin <npiggin@gmail.com> writes:

> On 8 May 2012 11:07, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>>
>>> On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
>>>> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
>>>>
>>>> > /proc and similar non-modular filesystems do not need a rcu_barrier
>>>> > in deactivate_locked_super.  Being non-modular there is no danger
>>>> > of the rcu callback running after the module is unloaded.
>>>>
>>>> There's more than just a module unload there, though - actual freeing
>>>>  struct super_block also happens past that rcu_barrier()...
>>
>> Al.  I have not closely audited the entire code path but at a quick
>> sample I see no evidence that anything depends on inode->i_sb being
>> rcu safe.  Do you know of any such location?
>>
>> It has only been a year and a half since Nick added this code which
>> isn't very much time to have grown strange dependencies like that.
>
> No, it has always depended on this.
>
> Look at ncp_compare_dentry(), for example.

Interesting. ncp_compare_dentry this logic is broken.

Accessing i_sb->s_fs_info for parameters does seem reasonable.
Unfortunately ncp_put_super frees server directly.

Meaning if we are depending on only rcu protections a badly timed
ncp_compare_dentry will oops the kernel.

I am going to go out on a limb and guess that every other filesystem
with a similar dependency follows the same pattern and is likely
broken as well.

>> We need to drain all of the rcu callbacks before we free the slab
>> and unload the module.
>>
>> This actually makes deactivate_locked_super the totally wrong place
>> for the rcu_barrier.  We want the rcu_barrier in the module exit
>> routine where we destroy the inode cache.
>>
>> What I see as the real need is the filesystem modules need to do:
>>        rcu_barrier()
>>        kmem_cache_destroy(cache);
>>
>> Perhaps we can add some helpers to make it easy.  But I think
>> I would be happy today with simply moving the rcu_barrier into
>> every filesystems module exit path, just before the file system
>> module destoryed it's inode cache.
>
> No, because that's not the only requirement for the rcu_barrier.
>
> Making it asynchronous is not something I wanted to do, because
> then we potentially have a process exiting from kernel space after
> releasing last reference on a mount, but the mount does not go
> away until "some time" later. Which is crazy.

Well we certainly want a deliberate unmount of a filesystem to safely
and successfully put the filesystem in a sane state before the unmount
returns.

If we have a few linger data structures waiting for an rcu grace period
after a process exits I'm not certain that is bad.  Although I would not
mind it much.

> However. We are holding vfsmount_lock for read at the point
> where we ever actually do anything with an "rcu-referenced"
> dentry/inode. I wonder if we could use this to get i_sb pinned.

Interesting observation.

Taking that observation farther we have a mount reference count, that
pins the super block.  So at first glance the super block looks safe
without any rcu protections.

I'm not certain what pins the inodes. Let's see:

mnt->d_mnt_root has the root dentry of the dentry tree, and that
dentry count is protected by the vfsmount_lock.

Beyond that we have kill_sb.
  kill_sb() typically calls generic_shutdown_super()
  From generic_shutdown_super() we call:
     shrink_dcache_for_umount() which flushes lingering dentries.
     evict_inodes() which flushes lingering inodes.

So in some sense the reference counts on mounts and dentries protect
the cache.

So the only case I can see where rcu appears to matter is when we are
freeing dentries.

When freeing dentries the idiom is:
dentry_iput(dentry);
d_free(dentry);

d_free does if (dentry->d_flags & DCACHE_RCUACCESS) call_rcu(... __d_free);

So while most of the time dentries hold onto inodes reliably with a
reference count and most of the time dentries are kept alive by the
dentry->d_count part of the time there is this gray zone where only
rcu references to dentries are keeping them alive.

Which explains the need for rcu freeing of inodes. 

This makes me wonder why we think calling d_release is safe
before we want the rcu grace period.

Documentation/filesystems/vfs.txt seems to duplicate this reasoning
of why the superblock is safe.  Because we hold a real reference to it
from the vfsmount.

The strangest case is calling __lookup_mnt during an "rcu-path-walk".
But mounts are reference counted from the mount namespace, and
are protected during an "rcu-path-walk" by vfsmount_lock read locked,
and are only changed with vfsmount_lock write locked.

Which leads again (with stronger reasons now) to the conclusions that:
a) We don't depend on rcu_barrier to protect the superblock.
b) My trivial patch is safe.
c) We probably should move rcu_barrier to the filesystem module exit
   routines, just to make things clear and to make everything faster.

Eric

next prev parent reply	other threads:[~2012-05-09 10:58 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-28  9:19 [RFC PATCH] namespaces: fix leak on fork() failure Mike Galbraith
2012-04-28 14:26 ` Oleg Nesterov
2012-04-29  4:13   ` Mike Galbraith
2012-04-29  7:57   ` Eric W. Biederman
2012-04-29  9:49     ` Mike Galbraith
2012-04-29 16:58     ` Oleg Nesterov
2012-04-30  2:59       ` Eric W. Biederman
2012-04-30  3:25         ` Mike Galbraith
2012-05-02 12:40         ` Oleg Nesterov
2012-05-02 17:37           ` Eric W. Biederman
2012-04-30  3:01       ` [PATCH] " Mike Galbraith
     [not found]         ` <m1zk9rmyh4.fsf@fess.ebiederm.org>
2012-05-01 20:42           ` Andrew Morton
2012-05-03  3:12             ` Mike Galbraith
2012-05-03 14:56               ` Mike Galbraith
2012-05-04  4:27                 ` Mike Galbraith
2012-05-04  7:55                   ` Eric W. Biederman
2012-05-04  8:34                     ` Mike Galbraith
2012-05-04  9:45                     ` Mike Galbraith
2012-05-04 14:13                       ` Eric W. Biederman
2012-05-04 14:49                         ` Mike Galbraith
2012-05-04 15:36                           ` Eric W. Biederman
2012-05-04 16:57                             ` Mike Galbraith
2012-05-04 20:29                               ` Eric W. Biederman
2012-05-05  5:56                                 ` Mike Galbraith
2012-05-05  6:08                                   ` Mike Galbraith
2012-05-05  7:12                                     ` Mike Galbraith
2012-05-05 11:37                                       ` Eric W. Biederman
2012-05-07 21:51                                       ` [PATCH] vfs: Speed up deactivate_super for non-modular filesystems Eric W. Biederman
2012-05-07 22:17                                         ` Al Viro
2012-05-07 23:56                                           ` Paul E. McKenney
2012-05-08  1:07                                             ` Eric W. Biederman
2012-05-08  4:53                                               ` Mike Galbraith
2012-05-09  7:55                                               ` Nick Piggin
2012-05-09 11:02                                                 ` Eric W. Biederman [this message]
2012-05-09 11:02                                                   ` Eric W. Biederman
2012-05-15  8:40                                                   ` Nick Piggin
2012-05-16  0:34                                                     ` Eric W. Biederman
2012-05-16  0:34                                                       ` Eric W. Biederman
2012-05-09 13:59                                                 ` Paul E. McKenney
2012-05-04  8:03                 ` [PATCH] Re: [RFC PATCH] namespaces: fix leak on fork() failure Eric W. Biederman
2012-05-04  8:19                   ` Mike Galbraith
2012-05-04  8:54                     ` Mike Galbraith
2012-05-07  0:32             ` [PATCH 0/3] pidns: Closing the pid namespace exit race Eric W. Biederman
2012-05-07  0:33               ` [PATCH 1/3] pidns: Use task_active_pid_ns in do_notify_parent Eric W. Biederman
2012-05-07  0:35               ` [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped Eric W. Biederman
2012-05-08 22:50                 ` Andrew Morton
2012-05-16 18:39                 ` Oleg Nesterov
2012-05-16 19:34                   ` Oleg Nesterov
2012-05-16 20:54                   ` Eric W. Biederman
2012-05-17 17:00                     ` Oleg Nesterov
2012-05-17 21:46                       ` Eric W. Biederman
2012-05-18 12:39                         ` Oleg Nesterov
2012-05-19  0:03                           ` Eric W. Biederman
2012-05-21 12:44                             ` Oleg Nesterov
2012-05-22  0:16                               ` Eric W. Biederman
2012-05-22  0:20                               ` [PATCH] pidns: Guarantee that the pidns init will be the last pidns process reaped. v2 Eric W. Biederman
2012-05-22 16:54                                 ` Oleg Nesterov
2012-05-22 19:23                                 ` Andrew Morton
2012-05-23 14:52                                   ` Oleg Nesterov
2012-05-25 15:15                                     ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-r eaped-v2-fix-fix Oleg Nesterov
2012-05-25 15:59                                       ` [PATCH -mm 0/1] pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper Oleg Nesterov
2012-05-25 16:00                                         ` [PATCH -mm 1/1] " Oleg Nesterov
2012-05-25 21:43                                           ` Eric W. Biederman
2012-05-27 19:10                                             ` [PATCH v2 -mm 0/1] " Oleg Nesterov
2012-05-27 19:11                                               ` [PATCH v2 -mm 1/1] " Oleg Nesterov
2012-05-29  6:34                                                 ` Eric W. Biederman
2012-05-25 21:25                                       ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-r eaped-v2-fix-fix Eric W. Biederman
2012-05-27 18:41                                         ` [PATCH -mm v2] " Oleg Nesterov
2012-05-07  0:35               ` [PATCH 3/3] pidns: Make killed children autoreap Eric W. Biederman
2012-05-08 22:51                 ` Andrew Morton
2012-04-30 13:57 ` [RFC PATCH] namespaces: fix leak on fork() failure Mike Galbraith

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=m1havp62hv.fsf@fess.ebiederm.org \
    --to=ebiederm@xmission.com \
    --cc=akpm@linux-foundation.org \
    --cc=efault@gmx.de \
    --cc=gorcunov@openvz.org \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=louis.rilling@kerlabs.com \
    --cc=npiggin@gmail.com \
    --cc=oleg@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=xemul@parallels.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.