[rfc][possible solution] RCU vfsmounts

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [rfc][possible solution] RCU vfsmounts
@ 2013-09-28 20:27 Al Viro
  2013-09-28 20:43 ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Al Viro @ 2013-09-28 20:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-fsdevel, linux-kernel

	FWIW, I think I have a kinda-sorta solution for that and I'd like
to hear your comments on that.  I want to replace vfsmount_lock with seqlock
and store additional seq number in nameidata, set to vfsmount_seq in the
beginning and rechecked in unlazy_walk/complete_walk.

	The obvious variant would be to have unlazy_walk/complete_walk to
grab refcount, check vfsmount_seq and mntput on mismatch.  The trouble
with that is race with what would've been the final mntput() done by
umount(2); complete_walk() would drop that temporary reference and
fail, all right, but... we would get a umount(2) returning without having
actually shut the filesystem down.  Said shutdown would happen in whoever
had been doing pathname resolution that stepped into the race.

	I _think_ I have a workable variant:
	* new vfsmount flag (MNT_SYNC_UMOUNT or something like that) and
ability to tell umount_tree() to set that on all victims; done on
non-lazy umount and on expiry.  Never cleared once set, and set only
when propagate_mount_busy() has been called and returned true.
Set before bumping vfsmount_seq.
	* rcu_barrier() added in namespace_unlock(), between
dropping namespace_sem and doing mntput() on the victims.
	* unlazy_walk() and complete_walk() use the common helper along
the lines of

legitimize_mnt(struct vfsmount *mnt, unsigned seq)
{
	if (read_seqcount_retry(&vfsmount_seq, seq)) {
		rcu_read_unlock();
		return false;
	}
	mntget(mnt);
	if (!read_seqcount_retry(&vfsmount_seq, seq)) {
		rcu_read_unlock();
		return true;
	}
	if (mnt->mnt_flags & MNT_SYNC_UMOUNT) {
		/* it couldn't have gotten through rcu_barrier() yet */
		mnt_add_count(real_mount(mnt), -1);
		rcu_read_unlock();
		return false;
	}
	rcu_read_unlock();
	mntput(mnt);
	return false;
}

Freeing vfsmounts would be done with rcu delay, vfsmount hash lookups,
d_path(), etc. do the obvious things as we do with rename_lock for dentry
side of things - that stuff is all obvious.  Not ending up with final
mntput() stolen from something that really expects it to be final is the
hard part and it looks like the above would be a solution.

Comments?  AFAICS, that would've killed *all* vfsmount-related locked stores
in RCU-mode pathwalks...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [rfc][possible solution] RCU vfsmounts
  2013-09-28 20:27 [rfc][possible solution] RCU vfsmounts Al Viro
@ 2013-09-28 20:43 ` Linus Torvalds
  2013-09-29  6:06   ` Al Viro
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2013-09-28 20:43 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linux Kernel Mailing List

On Sat, Sep 28, 2013 at 1:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>         FWIW, I think I have a kinda-sorta solution for that and I'd like
> to hear your comments on that.  I want to replace vfsmount_lock with seqlock
> and store additional seq number in nameidata, set to vfsmount_seq in the
> beginning and rechecked in unlazy_walk/complete_walk.

Yes, that would be lovely.

>         The obvious variant would be to have unlazy_walk/complete_walk to
> grab refcount, check vfsmount_seq and mntput on mismatch.  The trouble
> with that is race with what would've been the final mntput() done by
> umount(2); complete_walk() would drop that temporary reference and
> fail, all right, but... we would get a umount(2) returning without having
> actually shut the filesystem down.  Said shutdown would happen in whoever
> had been doing pathname resolution that stepped into the race.

Sounds reasonable to to me.

Side note: I really wish there was some way to avoid having to
finalize the vfsmount entirely for some common things. For example,
"[l]stat[at]()" really doesn't need it for the common cases (network
filesystems may need to revalidate), and is a very critical operation,
and we *could* just look up the inode under RCU and never finalize the
dentry _or_ the vfsmount. However, very annoyingly, the security layer
wants the vfsmount, and we don't know if that is RCU-safe...

                Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [rfc][possible solution] RCU vfsmounts
  2013-09-28 20:43 ` Linus Torvalds
@ 2013-09-29  6:06   ` Al Viro
  2013-09-29 17:19     ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Al Viro @ 2013-09-29  6:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-fsdevel, Linux Kernel Mailing List

On Sat, Sep 28, 2013 at 01:43:49PM -0700, Linus Torvalds wrote:

> Sounds reasonable to to me.

Sigh...  Looks like there's a lot of fun in shrink_dcache_for_umount() -
at the very least, it needs to bump ->d_seq on everything, because with
that change we *can* walk into a filesystem in the middle of that.
We obviously don't want to slap rcu_barrier() into the final mntput() -
it's far too costly; even one in deactivate_locked_super() (in the
wrong place and gone since a while back) had been causing problems.

Moreover, any filesystem that has e.g. ->d_hash() use an object hanging
off private data of superblock and freed by its ->kill_sb() before
generic_shutdown_super() will have an additional set of PITA; there
shouldn't be many of those, though.

Oh, well - this is going to be a fun series, by the look of it...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [rfc][possible solution] RCU vfsmounts
  2013-09-29  6:06   ` Al Viro
@ 2013-09-29 17:19     ` Linus Torvalds
  2013-09-29 18:10       ` Al Viro
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2013-09-29 17:19 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linux Kernel Mailing List

On Sat, Sep 28, 2013 at 11:06 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Sigh...  Looks like there's a lot of fun in shrink_dcache_for_umount() -
> at the very least, it needs to bump ->d_seq on everything, because with
> that change we *can* walk into a filesystem in the middle of that.
> We obviously don't want to slap rcu_barrier() into the final mntput() -

I have to say, that when I was working with the dcache lockref code, I
absolutely _detested_ the magical shrink_dcache_for_umount() code that
violated all the locking rules.

So I actually wouldn't mind at all if that was all forced to follow
all the same rules that the live filesystem code is forced to follow.
Yes, yes, it's going to slow things down, but it's not like umount()
is _that_ performance critical. And I think the whole "let's ignore
locking rules" actually comes from back when we had one global
dcache_lock: we used to have batching in order to not hold the
dcache_lock over long periods, and then it got converted to the
per-dentry locking, and then that got removed entirely with the whole
RCU lookup etc.

So I would be *entirely* ok with having
shrink_dcache_for_umount_subtree() take the d_lock on the dentry as it
shrinks it etc etc.

                     Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [rfc][possible solution] RCU vfsmounts
  2013-09-29 17:19     ` Linus Torvalds
@ 2013-09-29 18:10       ` Al Viro
  2013-09-29 18:26         ` Linus Torvalds
                           ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Al Viro @ 2013-09-29 18:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-fsdevel, Linux Kernel Mailing List, Miklos Szeredi

On Sun, Sep 29, 2013 at 10:19:59AM -0700, Linus Torvalds wrote:
> I have to say, that when I was working with the dcache lockref code, I
> absolutely _detested_ the magical shrink_dcache_for_umount() code that
> violated all the locking rules.

... and duplicated random-half-of-an-arseload of stuff done in other
shrinking paths.  You are not alone at that - it's been a serious
source of annoyances all along.

> So I actually wouldn't mind at all if that was all forced to follow
> all the same rules that the live filesystem code is forced to follow.
> Yes, yes, it's going to slow things down, but it's not like umount()
> is _that_ performance critical. And I think the whole "let's ignore
> locking rules" actually comes from back when we had one global
> dcache_lock: we used to have batching in order to not hold the
> dcache_lock over long periods, and then it got converted to the
> per-dentry locking, and then that got removed entirely with the whole
> RCU lookup etc.
> 
> So I would be *entirely* ok with having
> shrink_dcache_for_umount_subtree() take the d_lock on the dentry as it
> shrinks it etc etc.

I'm not even sure it will slow the things down that much these days; needs
to be tested, obviously...

FWIW, right now I'm reviewing the subset of fs code that can be hit in
RCU mode.  Not a pretty sight, that... ;-/  First catch: in
fuse_dentry_revalidate() we have a case (reachable with LOOKUP_RCU) where
we do this:
        } else if (inode) {
                fc = get_fuse_conn(inode);
                if (fc->readdirplus_auto) {
                        parent = dget_parent(entry);
                        fuse_advise_use_readdirplus(parent->d_inode);
                        dput(parent);
                }
        }
First of all, that'll lead to obvious nastiness if we get here when
->s_fs_info has already been freed in process of fs shutdown; fc will
be pointing to kfreed object and no, freeing it isn't RCU-delayed.
That's not a problem with the current tree, of course, but this
dput(parent) very much is - doing that under rcu_read_lock() is
a Bloody Bad Idea(tm).

If my reading of that code is right, the proper fix would be to
turn that else if (inode) into else if (inode && !(flags & LOOKUP_RCU))

Miklos, could you confirm that?  Or would you prefer to deal with that
in some other way?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [rfc][possible solution] RCU vfsmounts
  2013-09-29 18:10       ` Al Viro
@ 2013-09-29 18:26         ` Linus Torvalds
  2013-09-30 10:48           ` Miklos Szeredi
  2013-09-29 18:49         ` Al Viro
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2013-09-29 18:26 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linux Kernel Mailing List, Miklos Szeredi

On Sun, Sep 29, 2013 at 11:10 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, right now I'm reviewing the subset of fs code that can be hit in
> RCU mode.  Not a pretty sight, that... ;-/  First catch: in
> fuse_dentry_revalidate() we have a case (reachable with LOOKUP_RCU) where
> we do this:
>         } else if (inode) {
>                 fc = get_fuse_conn(inode);
>                 if (fc->readdirplus_auto) {
>                         parent = dget_parent(entry);
>                         fuse_advise_use_readdirplus(parent->d_inode);
>                         dput(parent);
>                 }
>         }

Ugh, yes, that dget/dput(parent) looks wrong in RCU mode.

That said, in RCU mode you simply shouldn't _need_ it at all, you
should be able to just use dentry->d_parent without any refcount
games. Put an ACCESS_ONCE there to be safe. You might want to make
sure that you do the same for the inode, and check for NULL, to be
safe against racing with a cross-directory rename/rmdir. I don't know
if there are then any internal fuse races with the whole
get_fuse_conn() etc, so...

It does look bad. In practice, of course, it will never hit anything.

> If my reading of that code is right, the proper fix would be to
> turn that else if (inode) into else if (inode && !(flags & LOOKUP_RCU))

That sounds safer, but then the fuse_advise_use_readdirplus() bit
wouldn't get set. But why _is_ that bit set there in the first place?
It sounds stupid. I think the bit should be set in the lookup path (or
the revalidation slow-path when the timeout is over and the thing gets
properly revalidated), why the hell does it do it in the fast-path
revalidation in the first place? That's just odd. Maybe there is some
odd internal fuse logic.

Miklos, please do give that a look..

            Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [rfc][possible solution] RCU vfsmounts
  2013-09-29 18:26         ` Linus Torvalds
@ 2013-09-30 10:48           ` Miklos Szeredi
  0 siblings, 0 replies; 12+ messages in thread
From: Miklos Szeredi @ 2013-09-30 10:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List

On Sun, Sep 29, 2013 at 11:26:21AM -0700, Linus Torvalds wrote:

> > If my reading of that code is right, the proper fix would be to
> > turn that else if (inode) into else if (inode && !(flags & LOOKUP_RCU))
> 
> That sounds safer, but then the fuse_advise_use_readdirplus() bit
> wouldn't get set. But why _is_ that bit set there in the first place?
> It sounds stupid. I think the bit should be set in the lookup path (or
> the revalidation slow-path when the timeout is over and the thing gets
> properly revalidated), why the hell does it do it in the fast-path
> revalidation in the first place? That's just odd. Maybe there is some
> odd internal fuse logic.

The logic advises future use of READDIRPLUS on either

 - lookup slowpath (READDIR used, but READDIRPLUS may have been useful)
 - revalidate fastpath (READDIRPLUS used and was found to be useful)

Revalidate slowpath possibly means we have the icache primed by READDIRPLUS, but
something went wrong and the lookup needed to be redone.  Advising use of
READDIRPLUS does not make much sense in that case, since the result of the
"PLUS" part wasn't actually used.

But...  this will always slow down the lookup, even if revalidating something
that was not recently read with READDIRPLUS.  How about something like this:
when setting up inodes with READDIRPLUS set a flag FUSE_I_INIT_RDPLUS.  On
revalidate only advise readdirplus on parent if this flag is set (and clear it
once used).  I think it would be okay to dereference d_parent and even
d_parent->d_inode in rcu mode.  But, to be safe, just drop out of rcu mode
instead in this case, since it won't be the common cached case.

The patch also fixes the invalid path to not call check_submounts_and_drop() in
rcu mode, which is definitely not safe to do.

Untested...

Thanks,
Miklos


----
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 62b43b5..a751598 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -182,6 +182,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
 	struct inode *inode;
 	struct dentry *parent;
 	struct fuse_conn *fc;
+	struct fuse_inode *fi;
 	int ret;
 
 	inode = ACCESS_ONCE(entry->d_inode);
@@ -228,7 +229,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
 		if (!err && !outarg.nodeid)
 			err = -ENOENT;
 		if (!err) {
-			struct fuse_inode *fi = get_fuse_inode(inode);
+			fi = get_fuse_inode(inode);
 			if (outarg.nodeid != get_node_id(inode)) {
 				fuse_queue_forget(fc, forget, outarg.nodeid, 1);
 				goto invalid;
@@ -246,8 +247,11 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
 				       attr_version);
 		fuse_change_entry_timeout(entry, &outarg);
 	} else if (inode) {
-		fc = get_fuse_conn(inode);
-		if (fc->readdirplus_auto) {
+		fi = get_fuse_inode(inode);
+		if (flags & LOOKUP_RCU) {
+			if (test_bit(FUSE_I_INIT_RDPLUS, &fi->state))
+				return -ECHILD;
+		} else if (test_and_clear_bit(FUSE_I_INIT_RDPLUS, &fi->state)) {
 			parent = dget_parent(entry);
 			fuse_advise_use_readdirplus(parent->d_inode);
 			dput(parent);
@@ -259,7 +263,8 @@ out:
 
 invalid:
 	ret = 0;
-	if (check_submounts_and_drop(entry) != 0)
+
+	if (!(flags & LOOKUP_RCU) && check_submounts_and_drop(entry) != 0)
 		ret = 1;
 	goto out;
 }
@@ -1291,6 +1296,8 @@ static int fuse_direntplus_link(struct file *file,
 	}
 
 found:
+	if (fc->readdirplus_auto)
+		set_bit(FUSE_I_INIT_RDPLUS, &get_fuse_inode(inode)->state);
 	fuse_change_entry_timeout(dentry, o);
 
 	err = 0;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 5ced199..5b9e6f3 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -115,6 +115,8 @@ struct fuse_inode {
 enum {
 	/** Advise readdirplus  */
 	FUSE_I_ADVISE_RDPLUS,
+	/** Initialized with readdirplus */
+	FUSE_I_INIT_RDPLUS,
 	/** An operation changing file size is in progress  */
 	FUSE_I_SIZE_UNSTABLE,
 };


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [rfc][possible solution] RCU vfsmounts
  2013-09-29 18:10       ` Al Viro
  2013-09-29 18:26         ` Linus Torvalds
@ 2013-09-29 18:49         ` Al Viro
  2013-09-29 19:04         ` Al Viro
  2013-09-30 19:49         ` Al Viro
  3 siblings, 0 replies; 12+ messages in thread
From: Al Viro @ 2013-09-29 18:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-fsdevel, Linux Kernel Mailing List, Miklos Szeredi

On Sun, Sep 29, 2013 at 07:10:47PM +0100, Al Viro wrote:
> FWIW, right now I'm reviewing the subset of fs code that can be hit in
> RCU mode.  Not a pretty sight, that... ;-/  First catch: in
> fuse_dentry_revalidate() we have a case (reachable with LOOKUP_RCU) where
> we do this:
>         } else if (inode) {
>                 fc = get_fuse_conn(inode);
>                 if (fc->readdirplus_auto) {
>                         parent = dget_parent(entry);
>                         fuse_advise_use_readdirplus(parent->d_inode);
>                         dput(parent);
>                 }
>         }
> First of all, that'll lead to obvious nastiness if we get here when
> ->s_fs_info has already been freed in process of fs shutdown; fc will
> be pointing to kfreed object and no, freeing it isn't RCU-delayed.
> That's not a problem with the current tree, of course, but this
> dput(parent) very much is - doing that under rcu_read_lock() is
> a Bloody Bad Idea(tm).

Another one: 
int ll_revalidate_nd(struct dentry *dentry, unsigned int flags)
{
        struct inode *parent = dentry->d_parent->d_inode;
        int unplug = 0;

        CDEBUG(D_VFSTRACE, "VFS Op:name=%s,flags=%u\n",  
               dentry->d_name.name, flags);

        if (!(flags & (LOOKUP_PARENT|LOOKUP_OPEN|LOOKUP_CREATE)) &&
            ll_need_statahead(parent, dentry) > 0) {
                if (flags & LOOKUP_RCU)
                        return -ECHILD;

... and ll_need_statahead(NULL, ...) will oops.  Doesn't even need
LOOKUP_RCU to barf - ->d_revalidate() can be called without ->i_mutex
on parent, so we can race with e.g. rename() followed by rmdir() of
old parent.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [rfc][possible solution] RCU vfsmounts
  2013-09-29 18:10       ` Al Viro
  2013-09-29 18:26         ` Linus Torvalds
  2013-09-29 18:49         ` Al Viro
@ 2013-09-29 19:04         ` Al Viro
  2013-09-30 19:49         ` Al Viro
  3 siblings, 0 replies; 12+ messages in thread
From: Al Viro @ 2013-09-29 19:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Linux Kernel Mailing List, Mark Fasheh,
	Joel Becker

On Sun, Sep 29, 2013 at 07:10:47PM +0100, Al Viro wrote:

> FWIW, right now I'm reviewing the subset of fs code that can be hit in
> RCU mode.  Not a pretty sight, that... ;-/  First catch: in
[snip]

and another, this one completely unrelated to RCU:
                unsigned long gen = (unsigned long) dentry->d_fsdata;
                unsigned long pgen =
                        OCFS2_I(dentry->d_parent->d_inode)->ip_dir_lock_gen;
in ocfs2_dentry_revalidate() needs dentry->d_lock around fetching pgen, as
in the diff below.  I can put it into the next vfs.git pull request
tonight, or pass the buck to ocfs2 folks.  Folks, which tree would
you prefer that to go through?  I'm fine with either variant.  Strictly
speaking, that's -stable fodder, but the race window is quite narrow,
so it's not something earth-shattering...

diff --git a/fs/ocfs2/dcache.c b/fs/ocfs2/dcache.c
index ef99972..0d3a97d 100644
--- a/fs/ocfs2/dcache.c
+++ b/fs/ocfs2/dcache.c
@@ -70,9 +70,10 @@ static int ocfs2_dentry_revalidate(struct dentry *dentry, unsigned int flags)
 	 */
 	if (inode == NULL) {
 		unsigned long gen = (unsigned long) dentry->d_fsdata;
-		unsigned long pgen =
-			OCFS2_I(dentry->d_parent->d_inode)->ip_dir_lock_gen;
-
+		unsigned long pgen;
+		spin_lock(&dentry->d_lock);
+		pgen = OCFS2_I(dentry->d_parent->d_inode)->ip_dir_lock_gen;
+		spin_unlock(&dentry->d_lock);
 		trace_ocfs2_dentry_revalidate_negative(dentry->d_name.len,
 						       dentry->d_name.name,
 						       pgen, gen);

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [rfc][possible solution] RCU vfsmounts
  2013-09-29 18:10       ` Al Viro
                           ` (2 preceding siblings ...)
  2013-09-29 19:04         ` Al Viro
@ 2013-09-30 19:49         ` Al Viro
  2013-10-02  1:30           ` Al Viro
  2013-10-03  6:14           ` Al Viro
  3 siblings, 2 replies; 12+ messages in thread
From: Al Viro @ 2013-09-30 19:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-fsdevel, Linux Kernel Mailing List, Miklos Szeredi

On Sun, Sep 29, 2013 at 07:10:47PM +0100, Al Viro wrote:

> FWIW, right now I'm reviewing the subset of fs code that can be hit in
> RCU mode.

OK...  AFAICS, we are not too far from being able to handle RCU pathwalk
straying into fs in the middle of being shut down.
	* There are 5 methods that can be called:
->d_hash(...)
->d_compare(...)
->d_revalidate(..., LOOKUP_RCU | ...)
->d_manage(..., true)
->permission(..., MAY_NOT_BLOCK | MAY_EXEC)
Filesystem needs to be able to survive those during shutdown.  The stuff
needed for that should _not_ be freed without synchronize_rcu() (or via
call_rcu()); usually ->s_fs_info is involved (when anything is needed
at all).  In any case, we shouldn't allow rmmod without making sure that
everything in RCU mode has run out, but most of the filesystems have
rcu_barrier() in their exit_module anyway.
	* __put_super() probably ought to delay actual freeing via
call_rcu(); might not be strictly necessary, but probably a good idea
anyway.
	* shrink_dcache_for_umount() ought to use d_walk(), a-la
shrink_dcache_parent().

Note that most of the filesystems don't have any of these methods or
don't look at anything outside of inode/dentry involved in RCU case.
Zoo:

* adfs: has the name length limit in fs-private part of superblock; used
by ->d_hash() and ->d_compare().  No other methods involved, synchronize_rcu()
before doing kfree() in adfs_put_super() will suffice.

* autofs4: wants fs-private part of superblock in ->d_manage().
synchronize_rcu() in autofs4_kill_sb() would do it, or we could delay
freeing that sucker via call_rcu() (in that case we want delayed
freeing in __put_super() as well).

* btrfs: wants btrfs_root_readonly(BTRFS_I(inode)->root) usable in
->permission().  Delayed freeing of struct btrfs_root, perhaps?

* cifs: wants nls, refered to from fs-private part of superblock.
->permission() wants fs-private part of superblock as well.  Just
synchronize_rcu() before unload_nls() in cifs_umount()...

* fat: same situation as with cifs

* fuse: delayed freeing of struct fuse_conn?  BTW, Miklos, just what is
        } else if (mask & (MAY_ACCESS | MAY_CHDIR)) {
                if (mask & MAY_NOT_BLOCK)
                        return -ECHILD;
about, when we never pass such combinations?  Oh, well...

* hpfs: similar to cifs and fat, only without use of nls (a homegrown table
of some sort).

* ncpfs: _probably_ similar to cifs et.al., but there might be dragons

* procfs: delayed freeing of pid_namespace?

* lustre: messy, haven't looked through that.

Overall, it looks doable.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [rfc][possible solution] RCU vfsmounts
  2013-09-30 19:49         ` Al Viro
@ 2013-10-02  1:30           ` Al Viro
  2013-10-03  6:14           ` Al Viro
  1 sibling, 0 replies; 12+ messages in thread
From: Al Viro @ 2013-10-02  1:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-fsdevel, Linux Kernel Mailing List, Miklos Szeredi

On Mon, Sep 30, 2013 at 08:49:21PM +0100, Al Viro wrote:
> OK...  AFAICS, we are not too far from being able to handle RCU pathwalk
> straying into fs in the middle of being shut down.
> 	* There are 5 methods that can be called:
> ->d_hash(...)
> ->d_compare(...)
> ->d_revalidate(..., LOOKUP_RCU | ...)
> ->d_manage(..., true)
> ->permission(..., MAY_NOT_BLOCK | MAY_EXEC)
> Filesystem needs to be able to survive those during shutdown.  The stuff
> needed for that should _not_ be freed without synchronize_rcu() (or via
> call_rcu()); usually ->s_fs_info is involved (when anything is needed
> at all).  In any case, we shouldn't allow rmmod without making sure that
> everything in RCU mode has run out, but most of the filesystems have
> rcu_barrier() in their exit_module anyway.

... and unregister_filesystem() has synchronize_rcu(), which leaves takes
care of everything other than callbacks in fs code fed to call_rcu().
So the things are even less painful.

> 	* __put_super() probably ought to delay actual freeing via
> call_rcu(); might not be strictly necessary, but probably a good idea
> anyway.
> 	* shrink_dcache_for_umount() ought to use d_walk(), a-la
> shrink_dcache_parent().
> 
> Note that most of the filesystems don't have any of these methods or
> don't look at anything outside of inode/dentry involved in RCU case.
> Zoo:
> 
> * adfs: has the name length limit in fs-private part of superblock; used
> by ->d_hash() and ->d_compare().  No other methods involved, synchronize_rcu()
> before doing kfree() in adfs_put_super() will suffice.
> 
> * autofs4: wants fs-private part of superblock in ->d_manage().
> synchronize_rcu() in autofs4_kill_sb() would do it, or we could delay
> freeing that sucker via call_rcu() (in that case we want delayed
> freeing in __put_super() as well).
> 
> * btrfs: wants btrfs_root_readonly(BTRFS_I(inode)->root) usable in
> ->permission().  Delayed freeing of struct btrfs_root, perhaps?
> 
> * cifs: wants nls, refered to from fs-private part of superblock.
> ->permission() wants fs-private part of superblock as well.  Just
> synchronize_rcu() before unload_nls() in cifs_umount()...
> 
> * fat: same situation as with cifs
> 
> * fuse: delayed freeing of struct fuse_conn?  BTW, Miklos, just what is
>         } else if (mask & (MAY_ACCESS | MAY_CHDIR)) {
>                 if (mask & MAY_NOT_BLOCK)
>                         return -ECHILD;
> about, when we never pass such combinations?  Oh, well...
> 
> * hpfs: similar to cifs and fat, only without use of nls (a homegrown table
> of some sort).
> 
> * ncpfs: _probably_ similar to cifs et.al., but there might be dragons
> 
> * procfs: delayed freeing of pid_namespace?
> 
> * lustre: messy, haven't looked through that.

Extremely preliminary version is in vfs.git #experimental.  No fs-specific
fixes mentioned above are in there; relevant stuff is in the end of
queue -
      initialize namespace_sem statically
      fs_is_visible only needs namespace_sem held shared
      dup_mnt_ns(): get rid of pointless grabbing of vfsmount_lock
      do_remount(): pull touch_mnt_namespace() up
      fold mntfree() into mntput_no_expire()
      fs/namespace.c: bury long-dead define
      finish_automount() doesn't need vfsmount_lock for removal from expiry list
      mnt_set_expiry() doesn't need vfsmount_lock
      fold dup_mnt_ns() into its only surviving caller
      namespace.c: get rid of mnt_ghosts
      don't bother with vfsmount_lock in mounts_poll()
      new helpers: lock_mount_hash/unlock_mount_hash
      isofs: don't pass dentry to isofs_hash{i,}_common()
      uninline destroy_super(), consolidate alloc_super()
      split __lookup_mnt() in two functions
      move taking vfsmount_lock down into prepend_path()
      RCU'd vfsmounts
 fs/dcache.c           |  221 +++++++++++++----------------
 fs/internal.h         |    4 -
 fs/isofs/inode.c      |   12 +-
 fs/mount.h            |   20 +++-
 fs/namei.c            |   87 +++++------
 fs/namespace.c        |  386 +++++++++++++++++++++++++------------------------
 fs/pnode.c            |   13 +-
 fs/proc_namespace.c   |    8 +-
 fs/super.c            |  206 +++++++++++---------------
 include/linux/mount.h |    2 +
 include/linux/namei.h |    2 +-
 11 files changed, 455 insertions(+), 506 deletions(-)

With fs fixes added it'll probably still end up with slightly negative
balance...

It's barely tested (i.e. no beating for races, etc. - boots and shuts down
without any apparent problems, but that's it) and the last commit almost
certainly needs a splitup.  Everything prior to it can probably go into
-next and I hope to carve standalone pieces from it as well.  Review and
comments are welcome; personally, I prefer to use git fetch for review,
but if somebody prefers posted mailbomb, yell and I'll send it...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [rfc][possible solution] RCU vfsmounts
  2013-09-30 19:49         ` Al Viro
  2013-10-02  1:30           ` Al Viro
@ 2013-10-03  6:14           ` Al Viro
  1 sibling, 0 replies; 12+ messages in thread
From: Al Viro @ 2013-10-03  6:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-fsdevel, Linux Kernel Mailing List, Miklos Szeredi

On Mon, Sep 30, 2013 at 08:49:21PM +0100, Al Viro wrote:
> * btrfs: wants btrfs_root_readonly(BTRFS_I(inode)->root) usable in
> ->permission().  Delayed freeing of struct btrfs_root, perhaps?

Not needed, actually - it's only checked with MAY_WRITE, and we don't
pass that in RCU mode.

Anyway, I've slapped what looks like a sufficient set of synchronize_rcu()
in affected filesystems and re-pushed.  Result is at least supposed to cover
all of them.  It still might bugger your memory, chew the disks, etc., so
it's only for testing on scratch boxen at that point.

Still, review and (if you are really brave) testing would be very much
appreciated.

I'll post the patchset on top of the current mainline in a few.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2013-10-03  6:14 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-28 20:27 [rfc][possible solution] RCU vfsmounts Al Viro
2013-09-28 20:43 ` Linus Torvalds
2013-09-29  6:06   ` Al Viro
2013-09-29 17:19     ` Linus Torvalds
2013-09-29 18:10       ` Al Viro
2013-09-29 18:26         ` Linus Torvalds
2013-09-30 10:48           ` Miklos Szeredi
2013-09-29 18:49         ` Al Viro
2013-09-29 19:04         ` Al Viro
2013-09-30 19:49         ` Al Viro
2013-10-02  1:30           ` Al Viro
2013-10-03  6:14           ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).