All of lore.kernel.org
 help / color / mirror / Atom feed
From: ebiederm@xmission.com (Eric W. Biederman)
To: Nick Piggin <npiggin@gmail.com>
Cc: paulmck@linux.vnet.ibm.com, Al Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Oleg Nesterov <oleg@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Pavel Emelyanov <xemul@parallels.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Louis Rilling <louis.rilling@kerlabs.com>,
	Mike Galbraith <efault@gmx.de>,
	Christoph Hellwig <hch@infradead.org>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
Date: Tue, 15 May 2012 17:34:50 -0700	[thread overview]
Message-ID: <87k40d2cb9.fsf@xmission.com> (raw)
In-Reply-To: <CAPa8GCDvdoprH3PD5-Rd8G3_ROe9NqerxxKudqUpk+ULVQQnkg@mail.gmail.com> (Nick Piggin's message of "Tue, 15 May 2012 18:40:46 +1000")

Nick Piggin <npiggin@gmail.com> writes:

> On 9 May 2012 21:02, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Nick Piggin <npiggin@gmail.com> writes:
>>
>>> On 8 May 2012 11:07, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>>>>
>>>>> On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
>>>>>> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
>>>>>>
>>>>>> > /proc and similar non-modular filesystems do not need a rcu_barrier
>>>>>> > in deactivate_locked_super.  Being non-modular there is no danger
>>>>>> > of the rcu callback running after the module is unloaded.
>>>>>>
>>>>>> There's more than just a module unload there, though - actual freeing
>>>>>>  struct super_block also happens past that rcu_barrier()...
>>>>
>>>> Al.  I have not closely audited the entire code path but at a quick
>>>> sample I see no evidence that anything depends on inode->i_sb being
>>>> rcu safe.  Do you know of any such location?
>>>>
>>>> It has only been a year and a half since Nick added this code which
>>>> isn't very much time to have grown strange dependencies like that.
>>>
>>> No, it has always depended on this.
>>>
>>> Look at ncp_compare_dentry(), for example.
>>
>> Interesting. ncp_compare_dentry this logic is broken.
>>
>> Accessing i_sb->s_fs_info for parameters does seem reasonable.
>> Unfortunately ncp_put_super frees server directly.
>>
>> Meaning if we are depending on only rcu protections a badly timed
>> ncp_compare_dentry will oops the kernel.
>>
>> I am going to go out on a limb and guess that every other filesystem
>> with a similar dependency follows the same pattern and is likely
>> broken as well.
>
> But ncp_put_super should be called after the rcu_barrier(), no?
>
> How is it broken?

The interesting hunk of code from deactivate_locked_super is:
>	cleancache_invalidate_fs(s);
>	fs->kill_sb(s);
 	^^^^^^^^^^^^^^  This is where ncp_put_super() is called.
>
>	/* caches are now gone, we can safely kill the shrinker now */
>	unregister_shrinker(&s->s_shrink);
>
>	/*
>	 * We need to call rcu_barrier so all the delayed rcu free
>	 * inodes are flushed before we release the fs module.
>	 */
>	rcu_barrier();
>	put_filesystem(fs);
>	put_super(s);

Which guarantees ncp_put_super() happens before the rcu_barrier.

>> Taking that observation farther we have a mount reference count, that
>> pins the super block.  So at first glance the super block looks safe
>> without any rcu protections.
>
> Well yes, that's what I'm getting at. But I don't think it's quite complete...
>
>>
>> I'm not certain what pins the inodes. Let's see:
>>
>> mnt->d_mnt_root has the root dentry of the dentry tree, and that
>> dentry count is protected by the vfsmount_lock.
>
> If the mount is already detached from the namespace when we start
> to do a path walk, AFAIKS it can be freed up from underneath us at
> that point.
>
> This would require cycling vfsmount_lock for write in such path. It's
> better than rcu_barrier probably, but not terribly nice.

Where do you see the possibility of a mount detached from a namespace
causing problems?   Simply having any count on a mount ensures we cycle
the vfsmount in mntput_no_expire.


Or if you want to see what I am seeing:

The rcu_path_walk starts at one of.  "." "/" or file->f_path, all of
which hold a reference on a struct vfsmount.

We perform an rcu_path_walk with the locking.
br_read_lock(vfsmount_lock);
rcu_read_lock();

We can transition to another vfs mount via follow_mount_rcu 
which consults the mount hash table which can only be modified
under the br_write_lock(vfsmount_lock);

We can also transition to another vfs mount via follow_up_rcu
which simply goes to mnt->mnt_parent.  Where our starting vfsmount
holds a reference to the target vfsmount.

When we complete the rcu_path_walk we do:
rcu_read_unlock()
br_write_lock(vfsmount_lock)

mntput_no_expire, which decrements mount counts takes and releases
br_write_lock before we put the final mount reference.  Which means
that it is impossible for the final mntput on a mount to complete
while we are in the middle of an rcu path walk.

Once we have take and released br_write_lock(vfsmount_lock)
in mntput_no_expire we call mntfree.  mntfree calls
deactivate_super.  And deactivate_super calls deactivate_locked_super.

Which is a long winded way of saying we always call
deactivate_locked_super after we put our final mount count.

I don't possibly see how a mount can be freed while we are in
the middle of a rcu path walk.  Not while we hold the
br_read_lock(vfsmount_lock), and the final mntput takes
br_write_lock(vfsmount_lock).


>> Documentation/filesystems/vfs.txt seems to duplicate this reasoning
>> of why the superblock is safe.  Because we hold a real reference to it
>> from the vfsmount.
>
> rcu walk does not hold a reference to the vfsmount, however. It can
> go away. This is why functions which can be called from rcu-walk
> must go through synchronize_rcu() before they go away, also before
> the superblock goes away.

Not at all.

The rcu walk itself does not hold a reference to the vfsmount, but
something holds a reference to the vfsmount and to drop the final
reference on a vfsmount we must hold the vfsmount_lock for write.
The rcu walk holds the vfsmount_lock for read which prevents us from
grabbing the vfsmount_lock for write.

We need to wait an rcu grace period before freeing dentries and inodes
becuase for dentries and inodes we only have rcu protection for them.
For vfsmounts and the superblock we have a lock protected reference
count.

> The other way we could change the rule is to require barrier only for
> those filesystems which access superblock or other info from rcu-walk.
> I would prefer not to have such a rule, but it could be pragmatic.

I don't see that we need to change a rule.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

WARNING: multiple messages have this Message-ID (diff)
From: ebiederm@xmission.com (Eric W. Biederman)
To: Nick Piggin <npiggin@gmail.com>
Cc: paulmck@linux.vnet.ibm.com, Al Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Oleg Nesterov <oleg@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Pavel Emelyanov <xemul@parallels.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Louis Rilling <louis.rilling@kerlabs.com>,
	Mike Galbraith <efault@gmx.de>,
	Christoph Hellwig <hch@infradead.org>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] vfs: Speed up deactivate_super for non-modular filesystems
Date: Tue, 15 May 2012 17:34:50 -0700	[thread overview]
Message-ID: <87k40d2cb9.fsf@xmission.com> (raw)
In-Reply-To: <CAPa8GCDvdoprH3PD5-Rd8G3_ROe9NqerxxKudqUpk+ULVQQnkg@mail.gmail.com> (Nick Piggin's message of "Tue, 15 May 2012 18:40:46 +1000")

Nick Piggin <npiggin@gmail.com> writes:

> On 9 May 2012 21:02, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Nick Piggin <npiggin@gmail.com> writes:
>>
>>> On 8 May 2012 11:07, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>>>>
>>>>> On Mon, May 07, 2012 at 11:17:06PM +0100, Al Viro wrote:
>>>>>> On Mon, May 07, 2012 at 02:51:08PM -0700, Eric W. Biederman wrote:
>>>>>>
>>>>>> > /proc and similar non-modular filesystems do not need a rcu_barrier
>>>>>> > in deactivate_locked_super.  Being non-modular there is no danger
>>>>>> > of the rcu callback running after the module is unloaded.
>>>>>>
>>>>>> There's more than just a module unload there, though - actual freeing
>>>>>>  struct super_block also happens past that rcu_barrier()...
>>>>
>>>> Al.  I have not closely audited the entire code path but at a quick
>>>> sample I see no evidence that anything depends on inode->i_sb being
>>>> rcu safe.  Do you know of any such location?
>>>>
>>>> It has only been a year and a half since Nick added this code which
>>>> isn't very much time to have grown strange dependencies like that.
>>>
>>> No, it has always depended on this.
>>>
>>> Look at ncp_compare_dentry(), for example.
>>
>> Interesting. ncp_compare_dentry this logic is broken.
>>
>> Accessing i_sb->s_fs_info for parameters does seem reasonable.
>> Unfortunately ncp_put_super frees server directly.
>>
>> Meaning if we are depending on only rcu protections a badly timed
>> ncp_compare_dentry will oops the kernel.
>>
>> I am going to go out on a limb and guess that every other filesystem
>> with a similar dependency follows the same pattern and is likely
>> broken as well.
>
> But ncp_put_super should be called after the rcu_barrier(), no?
>
> How is it broken?

The interesting hunk of code from deactivate_locked_super is:
>	cleancache_invalidate_fs(s);
>	fs->kill_sb(s);
 	^^^^^^^^^^^^^^  This is where ncp_put_super() is called.
>
>	/* caches are now gone, we can safely kill the shrinker now */
>	unregister_shrinker(&s->s_shrink);
>
>	/*
>	 * We need to call rcu_barrier so all the delayed rcu free
>	 * inodes are flushed before we release the fs module.
>	 */
>	rcu_barrier();
>	put_filesystem(fs);
>	put_super(s);

Which guarantees ncp_put_super() happens before the rcu_barrier.

>> Taking that observation farther we have a mount reference count, that
>> pins the super block.  So at first glance the super block looks safe
>> without any rcu protections.
>
> Well yes, that's what I'm getting at. But I don't think it's quite complete...
>
>>
>> I'm not certain what pins the inodes. Let's see:
>>
>> mnt->d_mnt_root has the root dentry of the dentry tree, and that
>> dentry count is protected by the vfsmount_lock.
>
> If the mount is already detached from the namespace when we start
> to do a path walk, AFAIKS it can be freed up from underneath us at
> that point.
>
> This would require cycling vfsmount_lock for write in such path. It's
> better than rcu_barrier probably, but not terribly nice.

Where do you see the possibility of a mount detached from a namespace
causing problems?   Simply having any count on a mount ensures we cycle
the vfsmount in mntput_no_expire.


Or if you want to see what I am seeing:

The rcu_path_walk starts at one of.  "." "/" or file->f_path, all of
which hold a reference on a struct vfsmount.

We perform an rcu_path_walk with the locking.
br_read_lock(vfsmount_lock);
rcu_read_lock();

We can transition to another vfs mount via follow_mount_rcu 
which consults the mount hash table which can only be modified
under the br_write_lock(vfsmount_lock);

We can also transition to another vfs mount via follow_up_rcu
which simply goes to mnt->mnt_parent.  Where our starting vfsmount
holds a reference to the target vfsmount.

When we complete the rcu_path_walk we do:
rcu_read_unlock()
br_write_lock(vfsmount_lock)

mntput_no_expire, which decrements mount counts takes and releases
br_write_lock before we put the final mount reference.  Which means
that it is impossible for the final mntput on a mount to complete
while we are in the middle of an rcu path walk.

Once we have take and released br_write_lock(vfsmount_lock)
in mntput_no_expire we call mntfree.  mntfree calls
deactivate_super.  And deactivate_super calls deactivate_locked_super.

Which is a long winded way of saying we always call
deactivate_locked_super after we put our final mount count.

I don't possibly see how a mount can be freed while we are in
the middle of a rcu path walk.  Not while we hold the
br_read_lock(vfsmount_lock), and the final mntput takes
br_write_lock(vfsmount_lock).


>> Documentation/filesystems/vfs.txt seems to duplicate this reasoning
>> of why the superblock is safe.  Because we hold a real reference to it
>> from the vfsmount.
>
> rcu walk does not hold a reference to the vfsmount, however. It can
> go away. This is why functions which can be called from rcu-walk
> must go through synchronize_rcu() before they go away, also before
> the superblock goes away.

Not at all.

The rcu walk itself does not hold a reference to the vfsmount, but
something holds a reference to the vfsmount and to drop the final
reference on a vfsmount we must hold the vfsmount_lock for write.
The rcu walk holds the vfsmount_lock for read which prevents us from
grabbing the vfsmount_lock for write.

We need to wait an rcu grace period before freeing dentries and inodes
becuase for dentries and inodes we only have rcu protection for them.
For vfsmounts and the superblock we have a lock protected reference
count.

> The other way we could change the rule is to require barrier only for
> those filesystems which access superblock or other info from rcu-walk.
> I would prefer not to have such a rule, but it could be pragmatic.

I don't see that we need to change a rule.

Eric

  reply	other threads:[~2012-05-16  0:35 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-28  9:19 [RFC PATCH] namespaces: fix leak on fork() failure Mike Galbraith
2012-04-28 14:26 ` Oleg Nesterov
2012-04-29  4:13   ` Mike Galbraith
2012-04-29  7:57   ` Eric W. Biederman
2012-04-29  9:49     ` Mike Galbraith
2012-04-29 16:58     ` Oleg Nesterov
2012-04-30  2:59       ` Eric W. Biederman
2012-04-30  3:25         ` Mike Galbraith
2012-05-02 12:40         ` Oleg Nesterov
2012-05-02 17:37           ` Eric W. Biederman
2012-04-30  3:01       ` [PATCH] " Mike Galbraith
     [not found]         ` <m1zk9rmyh4.fsf@fess.ebiederm.org>
2012-05-01 20:42           ` Andrew Morton
2012-05-03  3:12             ` Mike Galbraith
2012-05-03 14:56               ` Mike Galbraith
2012-05-04  4:27                 ` Mike Galbraith
2012-05-04  7:55                   ` Eric W. Biederman
2012-05-04  8:34                     ` Mike Galbraith
2012-05-04  9:45                     ` Mike Galbraith
2012-05-04 14:13                       ` Eric W. Biederman
2012-05-04 14:49                         ` Mike Galbraith
2012-05-04 15:36                           ` Eric W. Biederman
2012-05-04 16:57                             ` Mike Galbraith
2012-05-04 20:29                               ` Eric W. Biederman
2012-05-05  5:56                                 ` Mike Galbraith
2012-05-05  6:08                                   ` Mike Galbraith
2012-05-05  7:12                                     ` Mike Galbraith
2012-05-05 11:37                                       ` Eric W. Biederman
2012-05-07 21:51                                       ` [PATCH] vfs: Speed up deactivate_super for non-modular filesystems Eric W. Biederman
2012-05-07 22:17                                         ` Al Viro
2012-05-07 23:56                                           ` Paul E. McKenney
2012-05-08  1:07                                             ` Eric W. Biederman
2012-05-08  4:53                                               ` Mike Galbraith
2012-05-09  7:55                                               ` Nick Piggin
2012-05-09 11:02                                                 ` Eric W. Biederman
2012-05-09 11:02                                                   ` Eric W. Biederman
2012-05-15  8:40                                                   ` Nick Piggin
2012-05-16  0:34                                                     ` Eric W. Biederman [this message]
2012-05-16  0:34                                                       ` Eric W. Biederman
2012-05-09 13:59                                                 ` Paul E. McKenney
2012-05-04  8:03                 ` [PATCH] Re: [RFC PATCH] namespaces: fix leak on fork() failure Eric W. Biederman
2012-05-04  8:19                   ` Mike Galbraith
2012-05-04  8:54                     ` Mike Galbraith
2012-05-07  0:32             ` [PATCH 0/3] pidns: Closing the pid namespace exit race Eric W. Biederman
2012-05-07  0:33               ` [PATCH 1/3] pidns: Use task_active_pid_ns in do_notify_parent Eric W. Biederman
2012-05-07  0:35               ` [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped Eric W. Biederman
2012-05-08 22:50                 ` Andrew Morton
2012-05-16 18:39                 ` Oleg Nesterov
2012-05-16 19:34                   ` Oleg Nesterov
2012-05-16 20:54                   ` Eric W. Biederman
2012-05-17 17:00                     ` Oleg Nesterov
2012-05-17 21:46                       ` Eric W. Biederman
2012-05-18 12:39                         ` Oleg Nesterov
2012-05-19  0:03                           ` Eric W. Biederman
2012-05-21 12:44                             ` Oleg Nesterov
2012-05-22  0:16                               ` Eric W. Biederman
2012-05-22  0:20                               ` [PATCH] pidns: Guarantee that the pidns init will be the last pidns process reaped. v2 Eric W. Biederman
2012-05-22 16:54                                 ` Oleg Nesterov
2012-05-22 19:23                                 ` Andrew Morton
2012-05-23 14:52                                   ` Oleg Nesterov
2012-05-25 15:15                                     ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-r eaped-v2-fix-fix Oleg Nesterov
2012-05-25 15:59                                       ` [PATCH -mm 0/1] pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper Oleg Nesterov
2012-05-25 16:00                                         ` [PATCH -mm 1/1] " Oleg Nesterov
2012-05-25 21:43                                           ` Eric W. Biederman
2012-05-27 19:10                                             ` [PATCH v2 -mm 0/1] " Oleg Nesterov
2012-05-27 19:11                                               ` [PATCH v2 -mm 1/1] " Oleg Nesterov
2012-05-29  6:34                                                 ` Eric W. Biederman
2012-05-25 21:25                                       ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-r eaped-v2-fix-fix Eric W. Biederman
2012-05-27 18:41                                         ` [PATCH -mm v2] " Oleg Nesterov
2012-05-07  0:35               ` [PATCH 3/3] pidns: Make killed children autoreap Eric W. Biederman
2012-05-08 22:51                 ` Andrew Morton
2012-04-30 13:57 ` [RFC PATCH] namespaces: fix leak on fork() failure Mike Galbraith

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87k40d2cb9.fsf@xmission.com \
    --to=ebiederm@xmission.com \
    --cc=akpm@linux-foundation.org \
    --cc=efault@gmx.de \
    --cc=gorcunov@openvz.org \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=louis.rilling@kerlabs.com \
    --cc=npiggin@gmail.com \
    --cc=oleg@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=xemul@parallels.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.