linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Al Viro <viro@ZenIV.linux.org.uk>
To: "Drokin, Oleg" <oleg.drokin@intel.com>
Cc: "Dilger, Andreas" <andreas.dilger@intel.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	"<linux-fsdevel@vger.kernel.org>" <linux-fsdevel@vger.kernel.org>,
	Theodore Ts'o <tytso@mit.edu>, Mark Fasheh <mfasheh@suse.com>
Subject: Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
Date: Thu, 10 Mar 2016 23:23:14 +0000	[thread overview]
Message-ID: <20160310232314.GP17997@ZenIV.linux.org.uk> (raw)
In-Reply-To: <11114F89-310D-460E-9758-C061E94B7811@intel.com>

On Thu, Mar 10, 2016 at 09:22:21PM +0000, Drokin, Oleg wrote:
> In fact I was surprised this was not the case already and so people
> needed to call it by hand in every possible case where a new dentry is possibly
> originated leading to a bit of a mess (and some bugs).

Most of the filesystems do not need that at all - just look at the
d_splice_alias() callers.  In mainline we have
	* 22 callers of form return d_splice_alias(inode, dentry); in
some ->lookup() instance.
	* 1 caller in d_add_ci(), all callers of which are in ->lookup()
instances, with return value directly returned to caller of ->lookup().
	* 1 caller in kernfs ->lookup(), with return value directly
returned after releasing a global mutex.
	* ceph - returned in a very convoluted way, but return values is
not dereferenced until it gets passed to caller of ->lookup()
	* gfs2 - passed to caller (or fed to finish_open() when it's in
atomic open) with some unlocks along the way.
	* ocfs2 - broken, and in a way that wouldn't be fixable at dentry
allocation time.
	* cifs - seeding dcache on readdir.  No postprocessing of dentry.
fs/9p/vfs_inode.c:834:  res = d_splice_alias(inode, dentry);
	* fuse - essentially touching a timestamp of dentry, be it new or
old one.  Both on ->lookup() and when seeding dcache on readdir.  Same
as done on successful revalidate, so the worst we are risking is extra work
in ->d_revalidate() for somebody finding it in dcache before we'd touched
the timestamp.
	* nfs - ditto.
	* 9p - the fid we'd obtained is hung on whatever dentry we get.
If revalidate doesn't find a fid, it'll try to get one from server and
hang it on the dentry.

d_obtain_alias() callers make it even more clear - 27 call sites where it's
directly returned, 5 more where it's returned without being dereferenced
and 5 callers in 3 filesystems (ceph, fuse, lustre) where something is
(or ought to be) done.  ceph might benefit from that thing as well - looks
like its users of d_obtain_alias() might be racy wrt d_splice_alias() from
->lookup() picking the alias before ceph_init_dentry() finishes.  Or not -
the call chain in there is convoluted as hell and I might miss the call
of ceph_init_denty() along the way.  ocfs2 might or might not be correct -
it doesn't do any postprocessing after d_obtain_alias(); no idea if it
really needs one there.

So we have two filesystems that could benefit from having ->d_init() -
ceph and lustre.  Note that e.g. ocfs2 wouldn't benefit from that - it needs
a lot more than just allocation *and* it uses ->d_fsdata differently for
positives and negatives.

FWIW, there's a troubling thing in ceph_init_dentry() -
        if (ceph_snap(d_inode(dentry->d_parent)) == CEPH_NOSNAP)
                d_set_d_op(dentry, &ceph_dentry_ops);
        else if (ceph_snap(d_inode(dentry->d_parent)) == CEPH_SNAPDIR)
                d_set_d_op(dentry, &ceph_snapdir_dentry_ops);
        else   
                d_set_d_op(dentry, &ceph_snap_dentry_ops);

_That_ can't be done in __d_alloc() and, what's worse, it doesn't look good
in their __fh_to_dentry() call of ceph_init_dentry().  ->d_parent has
a good chance to be pointing to the dentry itself.  Not sure how much trouble
does it lead to - I'm not familiar with ceph snapshots or ceph reexports over
NFS, let alone the combination...

> The downside is that we do the allocation at all times even if the dentry is
> going to be promptly destroyed because we found a better alias.

The cost of kmalloc() (or kmem_cache_alloc() - it might make sense
to introduce a specialized slab for those beasts) is trivial compared to the
cost of talking to server to find which inode you are going to deal with.

  reply	other threads:[~2016-03-10 23:23 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-08 16:05 races in ll_splice_alias() Al Viro
2016-03-08 20:44 ` Drokin, Oleg
2016-03-08 21:11   ` Al Viro
2016-03-08 23:18     ` Drokin, Oleg
2016-03-09  0:34       ` Al Viro
2016-03-09  0:53         ` Drokin, Oleg
2016-03-09  1:26           ` Al Viro
2016-03-09  5:20             ` Drokin, Oleg
2016-03-09 23:47             ` Drokin, Oleg
2016-03-10  2:20               ` races in ll_splice_alias() and elsewhere (ext4, ocfs2) Al Viro
2016-03-10  2:59                 ` Al Viro
2016-03-10 23:55                   ` Theodore Ts'o
2016-03-11  3:18                     ` Al Viro
2016-03-11 15:42                       ` Theodore Ts'o
2016-03-10  3:08                 ` Drokin, Oleg
2016-03-10  3:34                   ` Al Viro
2016-03-10  3:46                     ` Drokin, Oleg
2016-03-10  4:22                       ` Drokin, Oleg
2016-03-10  4:43                       ` Al Viro
2016-03-10  5:15                         ` Al Viro
2016-03-11  3:47                           ` Drokin, Oleg
2016-03-10  5:47                         ` Drokin, Oleg
2016-03-10 19:59                 ` Al Viro
2016-03-10 20:34                   ` do we need that smp_wmb() in __d_alloc()? Al Viro
2016-03-10 21:17                     ` Al Viro
2016-03-10 21:22                   ` races in ll_splice_alias() and elsewhere (ext4, ocfs2) Drokin, Oleg
2016-03-10 23:23                     ` Al Viro [this message]
2016-03-11  3:25                       ` Drokin, Oleg
2016-03-12 17:22                         ` Al Viro
2016-03-13 14:35                           ` Sage Weil

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160310232314.GP17997@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=andreas.dilger@intel.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=mfasheh@suse.com \
    --cc=oleg.drokin@intel.com \
    --cc=torvalds@linux-foundation.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).