* Re: [rfc][possible solution] RCU vfsmounts
2013-09-29 18:10 ` Al Viro
@ 2013-09-29 18:26 ` Linus Torvalds
2013-09-30 10:48 ` Miklos Szeredi
2013-09-29 18:49 ` Al Viro
` (2 subsequent siblings)
3 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2013-09-29 18:26 UTC (permalink / raw)
To: Al Viro; +Cc: linux-fsdevel, Linux Kernel Mailing List, Miklos Szeredi
On Sun, Sep 29, 2013 at 11:10 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, right now I'm reviewing the subset of fs code that can be hit in
> RCU mode. Not a pretty sight, that... ;-/ First catch: in
> fuse_dentry_revalidate() we have a case (reachable with LOOKUP_RCU) where
> we do this:
> } else if (inode) {
> fc = get_fuse_conn(inode);
> if (fc->readdirplus_auto) {
> parent = dget_parent(entry);
> fuse_advise_use_readdirplus(parent->d_inode);
> dput(parent);
> }
> }
Ugh, yes, that dget/dput(parent) looks wrong in RCU mode.
That said, in RCU mode you simply shouldn't _need_ it at all, you
should be able to just use dentry->d_parent without any refcount
games. Put an ACCESS_ONCE there to be safe. You might want to make
sure that you do the same for the inode, and check for NULL, to be
safe against racing with a cross-directory rename/rmdir. I don't know
if there are then any internal fuse races with the whole
get_fuse_conn() etc, so...
It does look bad. In practice, of course, it will never hit anything.
> If my reading of that code is right, the proper fix would be to
> turn that else if (inode) into else if (inode && !(flags & LOOKUP_RCU))
That sounds safer, but then the fuse_advise_use_readdirplus() bit
wouldn't get set. But why _is_ that bit set there in the first place?
It sounds stupid. I think the bit should be set in the lookup path (or
the revalidation slow-path when the timeout is over and the thing gets
properly revalidated), why the hell does it do it in the fast-path
revalidation in the first place? That's just odd. Maybe there is some
odd internal fuse logic.
Miklos, please do give that a look..
Linus
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [rfc][possible solution] RCU vfsmounts
2013-09-29 18:26 ` Linus Torvalds
@ 2013-09-30 10:48 ` Miklos Szeredi
0 siblings, 0 replies; 12+ messages in thread
From: Miklos Szeredi @ 2013-09-30 10:48 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List
On Sun, Sep 29, 2013 at 11:26:21AM -0700, Linus Torvalds wrote:
> > If my reading of that code is right, the proper fix would be to
> > turn that else if (inode) into else if (inode && !(flags & LOOKUP_RCU))
>
> That sounds safer, but then the fuse_advise_use_readdirplus() bit
> wouldn't get set. But why _is_ that bit set there in the first place?
> It sounds stupid. I think the bit should be set in the lookup path (or
> the revalidation slow-path when the timeout is over and the thing gets
> properly revalidated), why the hell does it do it in the fast-path
> revalidation in the first place? That's just odd. Maybe there is some
> odd internal fuse logic.
The logic advises future use of READDIRPLUS on either
- lookup slowpath (READDIR used, but READDIRPLUS may have been useful)
- revalidate fastpath (READDIRPLUS used and was found to be useful)
Revalidate slowpath possibly means we have the icache primed by READDIRPLUS, but
something went wrong and the lookup needed to be redone. Advising use of
READDIRPLUS does not make much sense in that case, since the result of the
"PLUS" part wasn't actually used.
But... this will always slow down the lookup, even if revalidating something
that was not recently read with READDIRPLUS. How about something like this:
when setting up inodes with READDIRPLUS set a flag FUSE_I_INIT_RDPLUS. On
revalidate only advise readdirplus on parent if this flag is set (and clear it
once used). I think it would be okay to dereference d_parent and even
d_parent->d_inode in rcu mode. But, to be safe, just drop out of rcu mode
instead in this case, since it won't be the common cached case.
The patch also fixes the invalid path to not call check_submounts_and_drop() in
rcu mode, which is definitely not safe to do.
Untested...
Thanks,
Miklos
----
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 62b43b5..a751598 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -182,6 +182,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
struct inode *inode;
struct dentry *parent;
struct fuse_conn *fc;
+ struct fuse_inode *fi;
int ret;
inode = ACCESS_ONCE(entry->d_inode);
@@ -228,7 +229,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
if (!err && !outarg.nodeid)
err = -ENOENT;
if (!err) {
- struct fuse_inode *fi = get_fuse_inode(inode);
+ fi = get_fuse_inode(inode);
if (outarg.nodeid != get_node_id(inode)) {
fuse_queue_forget(fc, forget, outarg.nodeid, 1);
goto invalid;
@@ -246,8 +247,11 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
attr_version);
fuse_change_entry_timeout(entry, &outarg);
} else if (inode) {
- fc = get_fuse_conn(inode);
- if (fc->readdirplus_auto) {
+ fi = get_fuse_inode(inode);
+ if (flags & LOOKUP_RCU) {
+ if (test_bit(FUSE_I_INIT_RDPLUS, &fi->state))
+ return -ECHILD;
+ } else if (test_and_clear_bit(FUSE_I_INIT_RDPLUS, &fi->state)) {
parent = dget_parent(entry);
fuse_advise_use_readdirplus(parent->d_inode);
dput(parent);
@@ -259,7 +263,8 @@ out:
invalid:
ret = 0;
- if (check_submounts_and_drop(entry) != 0)
+
+ if (!(flags & LOOKUP_RCU) && check_submounts_and_drop(entry) != 0)
ret = 1;
goto out;
}
@@ -1291,6 +1296,8 @@ static int fuse_direntplus_link(struct file *file,
}
found:
+ if (fc->readdirplus_auto)
+ set_bit(FUSE_I_INIT_RDPLUS, &get_fuse_inode(inode)->state);
fuse_change_entry_timeout(dentry, o);
err = 0;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 5ced199..5b9e6f3 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -115,6 +115,8 @@ struct fuse_inode {
enum {
/** Advise readdirplus */
FUSE_I_ADVISE_RDPLUS,
+ /** Initialized with readdirplus */
+ FUSE_I_INIT_RDPLUS,
/** An operation changing file size is in progress */
FUSE_I_SIZE_UNSTABLE,
};
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [rfc][possible solution] RCU vfsmounts
2013-09-29 18:10 ` Al Viro
2013-09-29 18:26 ` Linus Torvalds
@ 2013-09-29 18:49 ` Al Viro
2013-09-29 19:04 ` Al Viro
2013-09-30 19:49 ` Al Viro
3 siblings, 0 replies; 12+ messages in thread
From: Al Viro @ 2013-09-29 18:49 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Linux Kernel Mailing List, Miklos Szeredi
On Sun, Sep 29, 2013 at 07:10:47PM +0100, Al Viro wrote:
> FWIW, right now I'm reviewing the subset of fs code that can be hit in
> RCU mode. Not a pretty sight, that... ;-/ First catch: in
> fuse_dentry_revalidate() we have a case (reachable with LOOKUP_RCU) where
> we do this:
> } else if (inode) {
> fc = get_fuse_conn(inode);
> if (fc->readdirplus_auto) {
> parent = dget_parent(entry);
> fuse_advise_use_readdirplus(parent->d_inode);
> dput(parent);
> }
> }
> First of all, that'll lead to obvious nastiness if we get here when
> ->s_fs_info has already been freed in process of fs shutdown; fc will
> be pointing to kfreed object and no, freeing it isn't RCU-delayed.
> That's not a problem with the current tree, of course, but this
> dput(parent) very much is - doing that under rcu_read_lock() is
> a Bloody Bad Idea(tm).
Another one:
int ll_revalidate_nd(struct dentry *dentry, unsigned int flags)
{
struct inode *parent = dentry->d_parent->d_inode;
int unplug = 0;
CDEBUG(D_VFSTRACE, "VFS Op:name=%s,flags=%u\n",
dentry->d_name.name, flags);
if (!(flags & (LOOKUP_PARENT|LOOKUP_OPEN|LOOKUP_CREATE)) &&
ll_need_statahead(parent, dentry) > 0) {
if (flags & LOOKUP_RCU)
return -ECHILD;
... and ll_need_statahead(NULL, ...) will oops. Doesn't even need
LOOKUP_RCU to barf - ->d_revalidate() can be called without ->i_mutex
on parent, so we can race with e.g. rename() followed by rmdir() of
old parent.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [rfc][possible solution] RCU vfsmounts
2013-09-29 18:10 ` Al Viro
2013-09-29 18:26 ` Linus Torvalds
2013-09-29 18:49 ` Al Viro
@ 2013-09-29 19:04 ` Al Viro
2013-09-30 19:49 ` Al Viro
3 siblings, 0 replies; 12+ messages in thread
From: Al Viro @ 2013-09-29 19:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-fsdevel, Linux Kernel Mailing List, Mark Fasheh,
Joel Becker
On Sun, Sep 29, 2013 at 07:10:47PM +0100, Al Viro wrote:
> FWIW, right now I'm reviewing the subset of fs code that can be hit in
> RCU mode. Not a pretty sight, that... ;-/ First catch: in
[snip]
and another, this one completely unrelated to RCU:
unsigned long gen = (unsigned long) dentry->d_fsdata;
unsigned long pgen =
OCFS2_I(dentry->d_parent->d_inode)->ip_dir_lock_gen;
in ocfs2_dentry_revalidate() needs dentry->d_lock around fetching pgen, as
in the diff below. I can put it into the next vfs.git pull request
tonight, or pass the buck to ocfs2 folks. Folks, which tree would
you prefer that to go through? I'm fine with either variant. Strictly
speaking, that's -stable fodder, but the race window is quite narrow,
so it's not something earth-shattering...
diff --git a/fs/ocfs2/dcache.c b/fs/ocfs2/dcache.c
index ef99972..0d3a97d 100644
--- a/fs/ocfs2/dcache.c
+++ b/fs/ocfs2/dcache.c
@@ -70,9 +70,10 @@ static int ocfs2_dentry_revalidate(struct dentry *dentry, unsigned int flags)
*/
if (inode == NULL) {
unsigned long gen = (unsigned long) dentry->d_fsdata;
- unsigned long pgen =
- OCFS2_I(dentry->d_parent->d_inode)->ip_dir_lock_gen;
-
+ unsigned long pgen;
+ spin_lock(&dentry->d_lock);
+ pgen = OCFS2_I(dentry->d_parent->d_inode)->ip_dir_lock_gen;
+ spin_unlock(&dentry->d_lock);
trace_ocfs2_dentry_revalidate_negative(dentry->d_name.len,
dentry->d_name.name,
pgen, gen);
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [rfc][possible solution] RCU vfsmounts
2013-09-29 18:10 ` Al Viro
` (2 preceding siblings ...)
2013-09-29 19:04 ` Al Viro
@ 2013-09-30 19:49 ` Al Viro
2013-10-02 1:30 ` Al Viro
2013-10-03 6:14 ` Al Viro
3 siblings, 2 replies; 12+ messages in thread
From: Al Viro @ 2013-09-30 19:49 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Linux Kernel Mailing List, Miklos Szeredi
On Sun, Sep 29, 2013 at 07:10:47PM +0100, Al Viro wrote:
> FWIW, right now I'm reviewing the subset of fs code that can be hit in
> RCU mode.
OK... AFAICS, we are not too far from being able to handle RCU pathwalk
straying into fs in the middle of being shut down.
* There are 5 methods that can be called:
->d_hash(...)
->d_compare(...)
->d_revalidate(..., LOOKUP_RCU | ...)
->d_manage(..., true)
->permission(..., MAY_NOT_BLOCK | MAY_EXEC)
Filesystem needs to be able to survive those during shutdown. The stuff
needed for that should _not_ be freed without synchronize_rcu() (or via
call_rcu()); usually ->s_fs_info is involved (when anything is needed
at all). In any case, we shouldn't allow rmmod without making sure that
everything in RCU mode has run out, but most of the filesystems have
rcu_barrier() in their exit_module anyway.
* __put_super() probably ought to delay actual freeing via
call_rcu(); might not be strictly necessary, but probably a good idea
anyway.
* shrink_dcache_for_umount() ought to use d_walk(), a-la
shrink_dcache_parent().
Note that most of the filesystems don't have any of these methods or
don't look at anything outside of inode/dentry involved in RCU case.
Zoo:
* adfs: has the name length limit in fs-private part of superblock; used
by ->d_hash() and ->d_compare(). No other methods involved, synchronize_rcu()
before doing kfree() in adfs_put_super() will suffice.
* autofs4: wants fs-private part of superblock in ->d_manage().
synchronize_rcu() in autofs4_kill_sb() would do it, or we could delay
freeing that sucker via call_rcu() (in that case we want delayed
freeing in __put_super() as well).
* btrfs: wants btrfs_root_readonly(BTRFS_I(inode)->root) usable in
->permission(). Delayed freeing of struct btrfs_root, perhaps?
* cifs: wants nls, refered to from fs-private part of superblock.
->permission() wants fs-private part of superblock as well. Just
synchronize_rcu() before unload_nls() in cifs_umount()...
* fat: same situation as with cifs
* fuse: delayed freeing of struct fuse_conn? BTW, Miklos, just what is
} else if (mask & (MAY_ACCESS | MAY_CHDIR)) {
if (mask & MAY_NOT_BLOCK)
return -ECHILD;
about, when we never pass such combinations? Oh, well...
* hpfs: similar to cifs and fat, only without use of nls (a homegrown table
of some sort).
* ncpfs: _probably_ similar to cifs et.al., but there might be dragons
* procfs: delayed freeing of pid_namespace?
* lustre: messy, haven't looked through that.
Overall, it looks doable.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [rfc][possible solution] RCU vfsmounts
2013-09-30 19:49 ` Al Viro
@ 2013-10-02 1:30 ` Al Viro
2013-10-03 6:14 ` Al Viro
1 sibling, 0 replies; 12+ messages in thread
From: Al Viro @ 2013-10-02 1:30 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Linux Kernel Mailing List, Miklos Szeredi
On Mon, Sep 30, 2013 at 08:49:21PM +0100, Al Viro wrote:
> OK... AFAICS, we are not too far from being able to handle RCU pathwalk
> straying into fs in the middle of being shut down.
> * There are 5 methods that can be called:
> ->d_hash(...)
> ->d_compare(...)
> ->d_revalidate(..., LOOKUP_RCU | ...)
> ->d_manage(..., true)
> ->permission(..., MAY_NOT_BLOCK | MAY_EXEC)
> Filesystem needs to be able to survive those during shutdown. The stuff
> needed for that should _not_ be freed without synchronize_rcu() (or via
> call_rcu()); usually ->s_fs_info is involved (when anything is needed
> at all). In any case, we shouldn't allow rmmod without making sure that
> everything in RCU mode has run out, but most of the filesystems have
> rcu_barrier() in their exit_module anyway.
... and unregister_filesystem() has synchronize_rcu(), which leaves takes
care of everything other than callbacks in fs code fed to call_rcu().
So the things are even less painful.
> * __put_super() probably ought to delay actual freeing via
> call_rcu(); might not be strictly necessary, but probably a good idea
> anyway.
> * shrink_dcache_for_umount() ought to use d_walk(), a-la
> shrink_dcache_parent().
>
> Note that most of the filesystems don't have any of these methods or
> don't look at anything outside of inode/dentry involved in RCU case.
> Zoo:
>
> * adfs: has the name length limit in fs-private part of superblock; used
> by ->d_hash() and ->d_compare(). No other methods involved, synchronize_rcu()
> before doing kfree() in adfs_put_super() will suffice.
>
> * autofs4: wants fs-private part of superblock in ->d_manage().
> synchronize_rcu() in autofs4_kill_sb() would do it, or we could delay
> freeing that sucker via call_rcu() (in that case we want delayed
> freeing in __put_super() as well).
>
> * btrfs: wants btrfs_root_readonly(BTRFS_I(inode)->root) usable in
> ->permission(). Delayed freeing of struct btrfs_root, perhaps?
>
> * cifs: wants nls, refered to from fs-private part of superblock.
> ->permission() wants fs-private part of superblock as well. Just
> synchronize_rcu() before unload_nls() in cifs_umount()...
>
> * fat: same situation as with cifs
>
> * fuse: delayed freeing of struct fuse_conn? BTW, Miklos, just what is
> } else if (mask & (MAY_ACCESS | MAY_CHDIR)) {
> if (mask & MAY_NOT_BLOCK)
> return -ECHILD;
> about, when we never pass such combinations? Oh, well...
>
> * hpfs: similar to cifs and fat, only without use of nls (a homegrown table
> of some sort).
>
> * ncpfs: _probably_ similar to cifs et.al., but there might be dragons
>
> * procfs: delayed freeing of pid_namespace?
>
> * lustre: messy, haven't looked through that.
Extremely preliminary version is in vfs.git #experimental. No fs-specific
fixes mentioned above are in there; relevant stuff is in the end of
queue -
initialize namespace_sem statically
fs_is_visible only needs namespace_sem held shared
dup_mnt_ns(): get rid of pointless grabbing of vfsmount_lock
do_remount(): pull touch_mnt_namespace() up
fold mntfree() into mntput_no_expire()
fs/namespace.c: bury long-dead define
finish_automount() doesn't need vfsmount_lock for removal from expiry list
mnt_set_expiry() doesn't need vfsmount_lock
fold dup_mnt_ns() into its only surviving caller
namespace.c: get rid of mnt_ghosts
don't bother with vfsmount_lock in mounts_poll()
new helpers: lock_mount_hash/unlock_mount_hash
isofs: don't pass dentry to isofs_hash{i,}_common()
uninline destroy_super(), consolidate alloc_super()
split __lookup_mnt() in two functions
move taking vfsmount_lock down into prepend_path()
RCU'd vfsmounts
fs/dcache.c | 221 +++++++++++++----------------
fs/internal.h | 4 -
fs/isofs/inode.c | 12 +-
fs/mount.h | 20 +++-
fs/namei.c | 87 +++++------
fs/namespace.c | 386 +++++++++++++++++++++++++------------------------
fs/pnode.c | 13 +-
fs/proc_namespace.c | 8 +-
fs/super.c | 206 +++++++++++---------------
include/linux/mount.h | 2 +
include/linux/namei.h | 2 +-
11 files changed, 455 insertions(+), 506 deletions(-)
With fs fixes added it'll probably still end up with slightly negative
balance...
It's barely tested (i.e. no beating for races, etc. - boots and shuts down
without any apparent problems, but that's it) and the last commit almost
certainly needs a splitup. Everything prior to it can probably go into
-next and I hope to carve standalone pieces from it as well. Review and
comments are welcome; personally, I prefer to use git fetch for review,
but if somebody prefers posted mailbomb, yell and I'll send it...
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [rfc][possible solution] RCU vfsmounts
2013-09-30 19:49 ` Al Viro
2013-10-02 1:30 ` Al Viro
@ 2013-10-03 6:14 ` Al Viro
1 sibling, 0 replies; 12+ messages in thread
From: Al Viro @ 2013-10-03 6:14 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Linux Kernel Mailing List, Miklos Szeredi
On Mon, Sep 30, 2013 at 08:49:21PM +0100, Al Viro wrote:
> * btrfs: wants btrfs_root_readonly(BTRFS_I(inode)->root) usable in
> ->permission(). Delayed freeing of struct btrfs_root, perhaps?
Not needed, actually - it's only checked with MAY_WRITE, and we don't
pass that in RCU mode.
Anyway, I've slapped what looks like a sufficient set of synchronize_rcu()
in affected filesystems and re-pushed. Result is at least supposed to cover
all of them. It still might bugger your memory, chew the disks, etc., so
it's only for testing on scratch boxen at that point.
Still, review and (if you are really brave) testing would be very much
appreciated.
I'll post the patchset on top of the current mainline in a few.
^ permalink raw reply [flat|nested] 12+ messages in thread