linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* fs/namei.c: Misuse of sequence counts?
@ 2014-10-11 22:58 Eric Biggers
  2014-10-11 23:46 ` Al Viro
  2014-10-12  0:12 ` Al Viro
  0 siblings, 2 replies; 9+ messages in thread
From: Eric Biggers @ 2014-10-11 22:58 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, linux-kernel

Hi, I've been reading through the path lookup code and I believe there may have
been bugs introduced by commit 4023bfc9 ("be careful with nd->inode in
path_init() and follow_dotdot_rcu()").  And I may have found a pre-existing bug
as well.

In follow_dotdot_rcu(), said commit moved loads of the inode to just before
read_seqcount_begin(), in several instances.  I don't think this is correct,
because (as I understand it) read_seqcount_begin() is opening a seq-read
critical section on the new dentry.  So the inode load should come *after* it,
as in the original, to ensure the inode pointer is correctly matched with the
sequence count.

In path_init(), said commit added a call to read_seqcount_retry() after loading
the inode.  I see two problems with this:

  - The read_seqcount_retry() isn't needed just to load the inode pointer, so
    the change doesn't seem to accomplish anything.
  - If the -ECHILD code path actually runs, the reference to the 'struct file'
    can be leaked.

Also: if there were actual problems that were "fixed" by this commit, I wonder
if they were/are actually caused by the fd-relative case in path_init() using:

	nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);

instead of

	nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);

since the former is missing a memory barrier before the starting inode is
loaded.

Eric

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: fs/namei.c: Misuse of sequence counts?
  2014-10-11 22:58 fs/namei.c: Misuse of sequence counts? Eric Biggers
@ 2014-10-11 23:46 ` Al Viro
  2014-10-12  3:55   ` Eric Biggers
  2014-10-12  0:12 ` Al Viro
  1 sibling, 1 reply; 9+ messages in thread
From: Al Viro @ 2014-10-11 23:46 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-fsdevel, linux-kernel

On Sat, Oct 11, 2014 at 05:58:08PM -0500, Eric Biggers wrote:

> In follow_dotdot_rcu(), said commit moved loads of the inode to just before
> read_seqcount_begin(), in several instances.  I don't think this is correct,
> because (as I understand it) read_seqcount_begin() is opening a seq-read
> critical section on the new dentry.  So the inode load should come *after* it,
> as in the original, to ensure the inode pointer is correctly matched with the
> sequence count.

Nope.  What we do is
	* pick parent inode and seqcount (in whatever order)
	* THEN check that child is still unchanged.
The second part guarantees that parent dentry had been the parent of
child all along, since the moment we'd first fetched _child's_ seqcount.
And since a pinned positive dentry can't have its ->d_inode changed,
we know that the value of parent's inode we'd fetched remained valid
at least until we'd checked the child's seqcount and found it unchanged.
Which means that we had it valid at some point after we'd fetched parent's
seqcount.

The crucial part is that dentry cannot change its ->d_inode for as long as
there are references to it.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: fs/namei.c: Misuse of sequence counts?
  2014-10-11 22:58 fs/namei.c: Misuse of sequence counts? Eric Biggers
  2014-10-11 23:46 ` Al Viro
@ 2014-10-12  0:12 ` Al Viro
  2014-10-12  4:01   ` Eric Biggers
  1 sibling, 1 reply; 9+ messages in thread
From: Al Viro @ 2014-10-12  0:12 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-fsdevel, linux-kernel

On Sat, Oct 11, 2014 at 05:58:08PM -0500, Eric Biggers wrote:

> In path_init(), said commit added a call to read_seqcount_retry() after loading
> the inode.  I see two problems with this:
> 
>   - The read_seqcount_retry() isn't needed just to load the inode pointer, so
>     the change doesn't seem to accomplish anything.

Huh?  What's to guarantee that dentry hasn't become negative since the
moment we'd fetched the seqcount?  _That_ is the problem we are dealing
with here - link_path_walk() relies on nd->inode being non-NULL.

>   - If the -ECHILD code path actually runs, the reference to the 'struct file'
>     can be leaked.

Umm...  That, OTOH, might be true, _if_ we could hit it on the path where
nd->path comes from file (and descriptor table had been shared).  Which
could happen, if we could get d_drop() done to that sucker for some
reason - other codepaths bumping the seqcount are impossible for pinned
dentry (including that of an opened file).

Which can, indeed, happen on something like NFS.  OK, so all callers of
path_init() should go to the same place where they go on link_path_walk()
failure, like this.  And that's probably what we want in backports; however,
longer term we would be better off with a combined path_init + link_path_walk -
all callers do exactly the same thing there...  OK, that can be in a followup...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/namei.c b/fs/namei.c
index d20d579..0f64aa4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1950,7 +1950,7 @@ static int path_lookupat(int dfd, const char *name,
 	err = path_init(dfd, name, flags | LOOKUP_PARENT, nd, &base);
 
 	if (unlikely(err))
-		return err;
+		goto out;
 
 	current->total_link_count = 0;
 	err = link_path_walk(name, nd);
@@ -1982,6 +1982,7 @@ static int path_lookupat(int dfd, const char *name,
 		}
 	}
 
+out:
 	if (base)
 		fput(base);
 
@@ -2301,7 +2302,7 @@ path_mountpoint(int dfd, const char *name, struct path *path, unsigned int flags
 
 	err = path_init(dfd, name, flags | LOOKUP_PARENT, &nd, &base);
 	if (unlikely(err))
-		return err;
+		goto out;
 
 	current->total_link_count = 0;
 	err = link_path_walk(name, &nd);

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: fs/namei.c: Misuse of sequence counts?
  2014-10-11 23:46 ` Al Viro
@ 2014-10-12  3:55   ` Eric Biggers
  2014-10-12  4:29     ` Al Viro
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Biggers @ 2014-10-12  3:55 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel

On Sun, Oct 12, 2014 at 12:46:35AM +0100, Al Viro wrote:
> 
> Nope.  What we do is
> 	* pick parent inode and seqcount (in whatever order)
> 	* THEN check that child is still unchanged.
> The second part guarantees that parent dentry had been the parent of
> child all along, since the moment we'd first fetched _child's_ seqcount.
> And since a pinned positive dentry can't have its ->d_inode changed,
> we know that the value of parent's inode we'd fetched remained valid
> at least until we'd checked the child's seqcount and found it unchanged.
> Which means that we had it valid at some point after we'd fetched parent's
> seqcount.

Ah, very tricky.  And I take it that the other two fetches of d_inode in
follow_dotdot_rcu() can likewise be unordered with respect to
read_seqcount_begin(), because the underlying dentries are pinned as either
mnt_mountpoint or mnt_root ---  which in RCU mode, is only guaranteed because of
the call to synchronize_rcu() in namespace_unlock() prior to dropping
references?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: fs/namei.c: Misuse of sequence counts?
  2014-10-12  0:12 ` Al Viro
@ 2014-10-12  4:01   ` Eric Biggers
  2014-10-12  4:37     ` Al Viro
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Biggers @ 2014-10-12  4:01 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel

On Sun, Oct 12, 2014 at 01:12:59AM +0100, Al Viro wrote:
> 
> Huh?  What's to guarantee that dentry hasn't become negative since the
> moment we'd fetched the seqcount?  _That_ is the problem we are dealing
> with here - link_path_walk() relies on nd->inode being non-NULL.

Hmm, I guess that makes sense.  So the code is actually verifying that the inode
is still the inode that was referenced from the current or root directory when
nd->path was set.  But couldn't the problem also be solved by setting nd->inode
directly in the fs->seq retry loops?  (The file descriptor case could be
'nd->inode = file_inode(f.file);'.)  Then there would be no need for the extra
read_seqcount_retry() just for the inode.  The patch you posted looks correct,
but I wonder if this approach would be better.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: fs/namei.c: Misuse of sequence counts?
  2014-10-12  3:55   ` Eric Biggers
@ 2014-10-12  4:29     ` Al Viro
  0 siblings, 0 replies; 9+ messages in thread
From: Al Viro @ 2014-10-12  4:29 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-fsdevel, linux-kernel

On Sat, Oct 11, 2014 at 10:55:10PM -0500, Eric Biggers wrote:
> On Sun, Oct 12, 2014 at 12:46:35AM +0100, Al Viro wrote:
> > 
> > Nope.  What we do is
> > 	* pick parent inode and seqcount (in whatever order)
> > 	* THEN check that child is still unchanged.
> > The second part guarantees that parent dentry had been the parent of
> > child all along, since the moment we'd first fetched _child's_ seqcount.
> > And since a pinned positive dentry can't have its ->d_inode changed,
> > we know that the value of parent's inode we'd fetched remained valid
> > at least until we'd checked the child's seqcount and found it unchanged.
> > Which means that we had it valid at some point after we'd fetched parent's
> > seqcount.
> 
> Ah, very tricky.  And I take it that the other two fetches of d_inode in
> follow_dotdot_rcu() can likewise be unordered with respect to
> read_seqcount_begin(), because the underlying dentries are pinned as either
> mnt_mountpoint or mnt_root ---  which in RCU mode, is only guaranteed because of
> the call to synchronize_rcu() in namespace_unlock() prior to dropping
> references?

The last one is actually covered by read_seqretry(&mount_lock, nd->m_seq) -
if it still matches, we know that whatever we got from __lookup_mnt() must
have been valid through fetching ->d_inode and ->d_seq of its mnt_root.
Which means that those two are consistent regardless of that synchronize_rcu().

The one before it would probably be better off with similar check on mount_lock
as well.  That code *is* correct for the reason you've mentioned, but I wonder
if explicit check of mount_lock would be better - right now it's more subtle
than I'd like it to be.  I don't think the cost would be noticable - it's
smp_rmb() + fetch + comparison when we cross a mountpoint while following ..
in lazy pathwalk, but that needs profiling - handwaving is not good enough...

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: fs/namei.c: Misuse of sequence counts?
  2014-10-12  4:01   ` Eric Biggers
@ 2014-10-12  4:37     ` Al Viro
  2014-10-12  4:51       ` Eric Biggers
  2014-10-12  5:08       ` Eric Biggers
  0 siblings, 2 replies; 9+ messages in thread
From: Al Viro @ 2014-10-12  4:37 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-fsdevel, linux-kernel

On Sat, Oct 11, 2014 at 11:01:42PM -0500, Eric Biggers wrote:
> On Sun, Oct 12, 2014 at 01:12:59AM +0100, Al Viro wrote:
> > 
> > Huh?  What's to guarantee that dentry hasn't become negative since the
> > moment we'd fetched the seqcount?  _That_ is the problem we are dealing
> > with here - link_path_walk() relies on nd->inode being non-NULL.
> 
> Hmm, I guess that makes sense.  So the code is actually verifying that the inode
> is still the inode that was referenced from the current or root directory when
> nd->path was set.  But couldn't the problem also be solved by setting nd->inode
> directly in the fs->seq retry loops?

Gets clumsy in set_root_rcu() - you do *not* want it to bugger nd->inode
when done by follow_dotdot_rcu(), so we'd need either some indication which
caller it is, or something like struct inode **inode in argument list,
with NULL passed from follow_dotdot_rcu(), while path_init() would give
it &nd->inode...

Doable, but unpleasant.  And the price of that check is trivial - after all,
in case we *don't* bugger off immediately, we have that ->d_seq in cache -
we'd fetched it just before.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: fs/namei.c: Misuse of sequence counts?
  2014-10-12  4:37     ` Al Viro
@ 2014-10-12  4:51       ` Eric Biggers
  2014-10-12  5:08       ` Eric Biggers
  1 sibling, 0 replies; 9+ messages in thread
From: Eric Biggers @ 2014-10-12  4:51 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel

On Sun, Oct 12, 2014 at 05:37:37AM +0100, Al Viro wrote:
> 
> Gets clumsy in set_root_rcu() - you do *not* want it to bugger nd->inode
> when done by follow_dotdot_rcu(), so we'd need either some indication which
> caller it is, or something like struct inode **inode in argument list,
> with NULL passed from follow_dotdot_rcu(), while path_init() would give
> it &nd->inode...
> 
> Doable, but unpleasant.  And the price of that check is trivial - after all,
> in case we *don't* bugger off immediately, we have that ->d_seq in cache -
> we'd fetched it just before.

Or set_root_rcu() can be hand-inlined, like the AT_FDCWD case.  Then the only
caller of set_root_rcu() would be follow_dotdot_rcu(), and the unnecessary
__read_seqcount_begin() could be removed.  (Probably gcc can't optimize it out
currently, because of the ACCESS_ONCE().)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: fs/namei.c: Misuse of sequence counts?
  2014-10-12  4:37     ` Al Viro
  2014-10-12  4:51       ` Eric Biggers
@ 2014-10-12  5:08       ` Eric Biggers
  1 sibling, 0 replies; 9+ messages in thread
From: Eric Biggers @ 2014-10-12  5:08 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel

Here's a patch for reference (untested):

diff --git a/fs/namei.c b/fs/namei.c
index a7b05bf..84704e4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -649,17 +649,15 @@ static __always_inline void set_root(struct nameidata *nd)
 
 static int link_path_walk(const char *, struct nameidata *);
 
-static __always_inline unsigned set_root_rcu(struct nameidata *nd)
+static __always_inline void set_root_rcu(struct nameidata *nd)
 {
 	struct fs_struct *fs = current->fs;
-	unsigned seq, res;
+	unsigned seq;
 
 	do {
 		seq = read_seqcount_begin(&fs->seq);
 		nd->root = fs->root;
-		res = __read_seqcount_begin(&nd->root.dentry->d_seq);
 	} while (read_seqcount_retry(&fs->seq, seq));
-	return res;
 }
 
 static void path_put_conditional(struct path *path, struct nameidata *nd)
@@ -1854,11 +1852,21 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 	nd->m_seq = read_seqbegin(&mount_lock);
 	if (*name=='/') {
 		if (flags & LOOKUP_RCU) {
+			struct fs_struct *fs = current->fs;
+			unsigned seq;
+
 			rcu_read_lock();
-			nd->seq = set_root_rcu(nd);
+
+			do {
+				seq = read_seqcount_begin(&fs->seq);
+				nd->root = fs->root;
+				nd->inode = nd->root.dentry->d_inode;
+				nd->seq = __read_seqcount_begin(&nd->root.dentry->d_seq);
+			} while (read_seqcount_retry(&fs->seq, seq));
 		} else {
 			set_root(nd);
 			path_get(&nd->root);
+			nd->inode = nd->root.dentry->d_inode;
 		}
 		nd->path = nd->root;
 	} else if (dfd == AT_FDCWD) {
@@ -1871,10 +1879,12 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 			do {
 				seq = read_seqcount_begin(&fs->seq);
 				nd->path = fs->pwd;
+				nd->inode = nd->path.dentry->d_inode;
 				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
 			} while (read_seqcount_retry(&fs->seq, seq));
 		} else {
 			get_fs_pwd(current->fs, &nd->path);
+			nd->inode = nd->path.dentry->d_inode;
 		}
 	} else {
 		/* Caller must check execute permissions on the starting path component */
@@ -1894,6 +1904,7 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 		}
 
 		nd->path = f.file->f_path;
+		nd->inode = file_inode(f.file);
 		if (flags & LOOKUP_RCU) {
 			if (f.flags & FDPUT_FPUT)
 				*fp = f.file;
@@ -1904,16 +1915,7 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 			fdput(f);
 		}
 	}
-
-	nd->inode = nd->path.dentry->d_inode;
-	if (!(flags & LOOKUP_RCU))
-		return 0;
-	if (likely(!read_seqcount_retry(&nd->path.dentry->d_seq, nd->seq)))
-		return 0;
-	if (!(nd->flags & LOOKUP_ROOT))
-		nd->root.mnt = NULL;
-	rcu_read_unlock();
-	return -ECHILD;
+	return 0;
 }
 
 static inline int lookup_last(struct nameidata *nd, struct path *path)

^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2014-10-12  5:08 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-10-11 22:58 fs/namei.c: Misuse of sequence counts? Eric Biggers
2014-10-11 23:46 ` Al Viro
2014-10-12  3:55   ` Eric Biggers
2014-10-12  4:29     ` Al Viro
2014-10-12  0:12 ` Al Viro
2014-10-12  4:01   ` Eric Biggers
2014-10-12  4:37     ` Al Viro
2014-10-12  4:51       ` Eric Biggers
2014-10-12  5:08       ` Eric Biggers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).