From: Dave Chinner <david@fromorbit.com>
To: Al Viro <viro@ZenIV.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely
Date: Fri, 11 May 2018 11:32:08 +1000 [thread overview]
Message-ID: <20180511013208.GV23861@dastard> (raw)
In-Reply-To: <20180511003901.GW30522@ZenIV.linux.org.uk>
On Fri, May 11, 2018 at 01:39:01AM +0100, Al Viro wrote:
> On Fri, May 11, 2018 at 08:56:07AM +1000, Dave Chinner wrote:
>
> > > For anything NFS-exported we do _not_ want to unlock new inode
> > > before it has grown an alias; original set of fixes got the
> > > ordering right, but missed the nasty complication in case of
> > > lockdep being enabled - unlock_new_inode() does
> > > lockdep_annotate_inode_mutex_key(inode)
> > > which can only be done before anyone gets a chance to touch
> > > ->i_mutex. Unfortunately, flipping the order and doing
> > > unlock_new_inode() before d_instantiate() opens a window when
> > > mkdir can race with open-by-fhandle on a guessed fhandle, leading
> > > to multiple aliases for a directory inode and all the breakage
> > > that follows from that.
> > >
> > > Correct solution: a new primitive (d_instantiate_new())
> > > combining these two in the right order - lockdep annotate, then
> > > d_instantiate(), then the rest of unlock_new_inode(). All
> > > combinations of d_instantiate() with unlock_new_inode() should
> > > be converted to that.
> >
> > Ok, so this seems to touch only the paths that create new inodes
> > (mkdir, mknod, etc). Is the lookup path that does:
> >
> >
> > unlock_new_inode()
> > .....
> > d_splice_alias(inode, dentry);
> >
> > OK?
>
> Yes. d_splice_alias()
> * will do the right thing when it runs into directory inode
> that already has an alias
> * is called from ->d_lookup(), which has calling conventions
> allowing to return a preexisting alias
>
> The race in question is between mkdir() and open-by-fhandle that manages
> to guess an fhandle for directory about to be created. mkdir() side
> creates a new inode, inserts it into icache (locked) and proceeds towards
> unlock_new_inode()/d_instantiate(). Suppose it loses CPU right after
> unlock_new_inode() and open-by-fhandle picks the inode from icache
> (either having just gotten there, or finally gets woken up after having
> waited for the sucker to get unlocked). inode is valid, everything's
> set up properply, so we pass it to d_obtain_alias(), which sees that
> there's no exiting dentries, allocates one, rechecks, finds that there's
> still nothing and proceeds to attach its new anon dentry to that inode.
> Now mkdir regains CPU and does d_instantiate(). And we are fucked -
> there are *two* dentries for given directory inode.
Ok, thanks for the description of the race, Al. I understand it now.
:)
>
> The window is narrow - to have a chance to hit it you need either
> to run it in a VM or have security_d_instantiate() (from d_instantiate())
> to do something slow (ideally - blocking). It's non-empty, though.
>
> Doing it in the opposite order (as XFS does on mkdir et.al.) plugs that
> window - open-by-fhandle won't get to the inode until after mkdir has
> attached a dentry to it. Then d_obtain_alias() will simply return that
> dentry and we are done. It's only d_instantiate() (or d_add()) that is
> a problem - d_splice_alias() is fine, so on the lookup path we don't
> need anything like that. d_add_ci() is like d_splice_alias() in that
> respect, so the lookup is OK in case-insensitive variant as well.
>
> So it would appear that XFS doesn't need to be touched. HOWEVER,
> lockdep shite *can't* be done after something has had a chance to grab
> the damn rwsem. I really wonder if
> d_instantiate(dentry, inode);
> xfs_finish_inode_setup(cip);
> doesn't lead to unpleasantness with lockdep enabled:
> xfs_finish_inode_setup() -> unlock_new_inode() ->
> lockdep_annotate_inode_mutex_key() -> init_rwsem(&inode->i_rwsem)
> which does wonders if something has already gotten to the inode
> via that dentry and tried e.g. lock_inode() on it.
Could well do. Though it seems fixable.
i.e. we already have code in xfs_setup_inode() that sets the xfs
inode ILOCK rwsem dir/non-dir lockdep class before the new inode is
unlocked - we could just do the i_rwsem lockdep setup there, too.
Then, if we were to factor unlock_new_inode() as Andreas suggested,
we could call __unlock_new_inode() from xfs_finish_inode_setup().
I might be missing something subtle, but that looks to me like it
would work.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2018-05-11 1:32 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-05-10 18:20 [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely Al Viro
2018-05-10 19:11 ` Andreas Dilger
2018-05-10 19:32 ` Al Viro
2018-05-10 20:44 ` Mike Marshall
2018-05-10 22:56 ` Dave Chinner
2018-05-11 0:39 ` Al Viro
2018-05-11 1:32 ` Dave Chinner [this message]
2018-05-11 2:18 ` Al Viro
2018-05-11 3:00 ` Dave Chinner
2018-05-11 19:56 ` Al Viro
2018-05-11 6:15 ` Ritesh Harjani
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180511013208.GV23861@dastard \
--to=david@fromorbit.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=torvalds@linux-foundation.org \
--cc=viro@ZenIV.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.