linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Al Viro <viro@zeniv.linux.org.uk>
To: Ritesh Harjani <riteshh@linux.ibm.com>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	wugyuan@cn.ibm.com, jlayton@kernel.org, hsiangkao@aol.com,
	Jan Kara <jack@suse.cz>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
Date: Tue, 22 Oct 2019 15:37:36 +0100	[thread overview]
Message-ID: <20191022143736.GX26530@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20191022133855.B1B4752050@d06av21.portsmouth.uk.ibm.com>

On Tue, Oct 22, 2019 at 07:08:54PM +0530, Ritesh Harjani wrote:
> I think we have still not taken this patch. Al?

I'm still not sure it's a good way to deal with the whole
mess, TBH ;-/  Consider e.g. walk_component(), with its

                if (unlikely(d_is_negative(path.dentry))) {
                        path_to_nameidata(&path, nd);
                        return -ENOENT;
                }

                seq = 0;        /* we are already out of RCU mode */
                inode = d_backing_inode(path.dentry);

or mountpoint_last()
        if (d_is_negative(path.dentry)) {
                dput(path.dentry);
                return -ENOENT;
        }
        path.mnt = nd->path.mnt;
        return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0);
or last_open()
        if (unlikely(d_is_negative(path.dentry))) {
                path_to_nameidata(&path, nd);
                return -ENOENT; 
        }

        /*
         * create/update audit record if it already exists.
         */
        audit_inode(nd->name, path.dentry, 0);

        if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
                path_to_nameidata(&path, nd);
                return -EEXIST;
        }

        seq = 0;        /* out of RCU mode, so the value doesn't matter */
        inode = d_backing_inode(path.dentry);
or, for that matter, any callers of filename_lookup() assuming that the
lack of ENOENT means that the last call of walk_component() (from lookup_last())
has not failed with the same ENOENT and thus the result has been observed
positive.

You've picked the easiest one to hit, but on e.g. KVM setups you can have the
host thread representing the CPU where __d_set_inode_and_type() runs get
preempted (by the host kernel), leaving others with much wider window.

Sure, we can do that to all callers of d_is_negative/d_is_positive, but...
the same goes for any places that assumes that d_is_dir() implies that
the sucker is positive, etc.

What we have guaranteed is
	* ->d_lock serializes ->d_flags/->d_inode changes
	* ->d_seq is bumped before/after such changes
	* positive dentry never changes ->d_inode as long as you hold
a reference (negative dentry *can* become positive right under you)

So there are 3 classes of valid users: those holding ->d_lock, those
sampling and rechecking ->d_seq and those relying upon having observed
the sucker they've pinned to be positive.

What you've been hitting is "we have it pinned, ->d_flags says it's
positive but we still observe the value of ->d_inode from before the
store to ->d_flags that has made it look positive".

I'm somewhat tempted to make __d_set_inode_and_type() do smp_store_release()
for setting ->d_flags and __d_entry_type() - smp_load_acquire(); that would
be pretty much free for x86 (as well as sparc, s390 and itanic) and reasonably
cheap on ppc and arm64.  How badly would e.g. SMP arm suffer from the below
(completely untested)?

diff --git a/fs/dcache.c b/fs/dcache.c
index e88cf0554e65..35368316c582 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -319,7 +319,7 @@ static inline void __d_set_inode_and_type(struct dentry *dentry,
 	flags = READ_ONCE(dentry->d_flags);
 	flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
 	flags |= type_flags;
-	WRITE_ONCE(dentry->d_flags, flags);
+	smp_store_release(&dentry->d_flags, flags);
 }
 
 static inline void __d_clear_type_and_inode(struct dentry *dentry)
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 10090f11ab95..bee28076a3fd 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -386,7 +386,7 @@ static inline bool d_mountpoint(const struct dentry *dentry)
  */
 static inline unsigned __d_entry_type(const struct dentry *dentry)
 {
-	return dentry->d_flags & DCACHE_ENTRY_TYPE;
+	return smp_load_acquire(&dentry->d_flags) & DCACHE_ENTRY_TYPE;
 }
 
 static inline bool d_is_miss(const struct dentry *dentry)

  reply	other threads:[~2019-10-22 14:37 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-27  4:42 [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast Ritesh Harjani
2019-10-15  4:07 ` Ritesh Harjani
2019-10-22 13:38   ` Ritesh Harjani
2019-10-22 14:37     ` Al Viro [this message]
2019-10-22 14:50       ` Al Viro
2019-10-22 20:11       ` Al Viro
2019-10-23 11:05         ` Ritesh Harjani
2019-11-01 23:46           ` Al Viro
2019-11-02  6:17             ` Al Viro
2019-11-02 17:24               ` Paul E. McKenney
2019-11-02 17:22             ` Paul E. McKenney
2019-11-02 18:08               ` Al Viro
2019-11-03 14:41                 ` Paul E. McKenney
2019-11-03 16:35                 ` [RFC] lookup_one_len_unlocked() lousy calling conventions Al Viro
2019-11-03 18:20                   ` Al Viro
2019-11-03 18:51                     ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable Al Viro
2019-11-03 19:03                       ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either Al Viro
2019-11-13  7:01                       ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable Amir Goldstein
2019-11-13 12:52                         ` Al Viro
2019-11-13 16:22                           ` Amir Goldstein
2019-11-13 20:18                           ` Jean-Louis Biasini
2019-11-03 17:05                 ` [PATCH][RFC] ecryptfs unlink/rmdir breakage (similar to caught in ecryptfs rename last year) Al Viro
2019-11-09  3:13                 ` [PATCH][RFC] race in exportfs_decode_fh() Al Viro
2019-11-09 16:55                   ` Linus Torvalds
2019-11-09 18:26                     ` Al Viro
2019-11-11  9:16                   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191022143736.GX26530@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=hsiangkao@aol.com \
    --cc=jack@suse.cz \
    --cc=jlayton@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=riteshh@linux.ibm.com \
    --cc=torvalds@linux-foundation.org \
    --cc=wugyuan@cn.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).