VFS: change ->atomic_open() calling to always have exclusive access

All of lore.kernel.org
 help / color / mirror / Atom feed

* VFS: change ->atomic_open() calling to always have exclusive access
@ 2025-09-15  3:01 NeilBrown
  2025-09-15  3:01 ` [PATCH 1/2] NFS: remove d_drop()/d_alloc_paralle() from nfs_atomic_open() NeilBrown
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: NeilBrown @ 2025-09-15  3:01 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Trond Myklebust,
	Anna Schumaker
  Cc: linux-nfs, linux-fsdevel, Jeff Layton, Amir Goldstein, Jan Kara

->atomic_open() is called with only a shared lock on the directory when
O_CREAT wasn't requested.  If the dentry is negative it is still called
because the filesystem might want to revalidate-and-open in an atomic
operations.  NFS does this to fullfil close-to-open consistency
requirements.

NFS has complex code to drop the dentry and reallocate with
d_alloc_parallel() in this case to ensure that only one lookup/open
happens at a time.  It would be simpler to have NFS return zero from
->d_revalidate in this case so that d_alloc_parallel() will be calling
in lookup_open() and NFS wan't need to worry about concurrency.

So this series makes that change to NFS to simplify atomic_open and remove
the d_drop() and d_alloc_parallel(), and then changes lookup_open() so that
atomic_open() can never be called without exclusive access to the dentry.

NeilBrown

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] NFS: remove d_drop()/d_alloc_paralle() from nfs_atomic_open()
  2025-09-15  3:01 VFS: change ->atomic_open() calling to always have exclusive access NeilBrown
@ 2025-09-15  3:01 ` NeilBrown
  2025-09-15  3:01 ` [PATCH 2/2] VFS: don't call ->atomic_open on cached negative without O_CREAT NeilBrown
  2025-09-17  4:34 ` VFS: change ->atomic_open() calling to always have exclusive access Al Viro
  2 siblings, 0 replies; 7+ messages in thread
From: NeilBrown @ 2025-09-15  3:01 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Trond Myklebust,
	Anna Schumaker
  Cc: linux-nfs, linux-fsdevel, Jeff Layton, Amir Goldstein, Jan Kara

From: NeilBrown <neil@brown.name>

It is important that two non-create NFS "open"s of a negative dentry don't race.
They both have a shared lock on i_rwsem and so could run concurrently, but
they might both try to call d_exact_alias() or d_splice_alias() at the
same time which is confusing at best.

nfs_atomic_open() currently avoids this by discarding the negative dentry
and creating a new one with d_alloc_parallel().  Only one thread can
successfully get the d_in_lookup() dentry, the other will wait for the
first to finish, and can use the result of that first lookup.

Dropping the dentry like this will defeat a proposed new locking scheme
which locks the dentry and requires it to remain hashed.

We can achieve the same effect by causing ->d_revalidate to invalidate a
negative dentry when LOOKUP_OPEN is set.  Doing this is consistent with
the "close to open" caching semantics of NFS which require the server to
be queried whenever opening a file - cached information must not be
trusted.

With this change to d_revaliate (implemented in nfs_neg_need_reval)
we can be sure that if the dentry that reaches nfs_atomic_open() is
not d_in_lookup and is negative, then some other lookup or atomic_open
must have happened since d_revalidate was called, so we have recent
confirmation from the server that the name doesn't exist, and we can
safely fail the open request (which finish_no_open() will effectively do
when passed a negative dentry).

Note that none of the above applies to O_CREAT opens.  These are fully
serialised with an exclusive lock on i_rwsem and there is no attempt to
d_drop() the dentry in that case.  d_revalidate() does not need to
request invalidation of a negative dentry as a create will proceed in
that case anyway.

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/nfs/dir.c | 30 +++++++-----------------------
 1 file changed, 7 insertions(+), 23 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index d81217923936..bc417566508e 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1615,6 +1615,11 @@ int nfs_neg_need_reval(struct inode *dir, struct dentry *dentry,
 {
 	if (flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET))
 		return 0;
+	if (flags & LOOKUP_OPEN)
+		/* close-to-open semantics require we go to server
+		 * on each open.
+		 */
+		return 1;
 	if (NFS_SERVER(dir)->flags & NFS_MOUNT_LOOKUP_CACHE_NONEG)
 		return 1;
 	/* Case insensitive server? Revalidate negative dentries */
@@ -2060,14 +2065,12 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
 		    struct file *file, unsigned open_flags,
 		    umode_t mode)
 {
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 	struct nfs_open_context *ctx;
 	struct dentry *res;
 	struct iattr attr = { .ia_valid = ATTR_OPEN };
 	struct inode *inode;
 	unsigned int lookup_flags = 0;
 	unsigned long dir_verifier;
-	bool switched = false;
 	int created = 0;
 	int err;

@@ -2112,16 +2115,8 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
 		attr.ia_size = 0;
 	}

-	if (!(open_flags & O_CREAT) && !d_in_lookup(dentry)) {
-		d_drop(dentry);
-		switched = true;
-		dentry = d_alloc_parallel(dentry->d_parent,
-					  &dentry->d_name, &wq);
-		if (IS_ERR(dentry))
-			return PTR_ERR(dentry);
-		if (unlikely(!d_in_lookup(dentry)))
-			return finish_no_open(file, dentry);
-	}
+	if (!(open_flags & O_CREAT) && !d_in_lookup(dentry))
+		return finish_no_open(file, dentry);

 	ctx = create_nfs_open_context(dentry, open_flags, file);
 	err = PTR_ERR(ctx);
@@ -2165,10 +2160,6 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
 	trace_nfs_atomic_open_exit(dir, ctx, open_flags, err);
 	put_nfs_open_context(ctx);
 out:
-	if (unlikely(switched)) {
-		d_lookup_done(dentry);
-		dput(dentry);
-	}
 	return err;

 no_open:
@@ -2191,13 +2182,6 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
 			res = ERR_PTR(-EOPENSTALE);
 		}
 	}
-	if (switched) {
-		d_lookup_done(dentry);
-		if (!res)
-			res = dentry;
-		else
-			dput(dentry);
-	}
 	if (IS_ERR(res))
 		return PTR_ERR(res);
 	return finish_no_open(file, res);
-- 
2.50.0.107.gf914562f5916.dirty

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/2] VFS: don't call ->atomic_open on cached negative without O_CREAT
  2025-09-15  3:01 VFS: change ->atomic_open() calling to always have exclusive access NeilBrown
  2025-09-15  3:01 ` [PATCH 1/2] NFS: remove d_drop()/d_alloc_paralle() from nfs_atomic_open() NeilBrown
@ 2025-09-15  3:01 ` NeilBrown
  2025-09-17  4:23   ` Al Viro
  2025-09-17  4:34 ` VFS: change ->atomic_open() calling to always have exclusive access Al Viro
  2 siblings, 1 reply; 7+ messages in thread
From: NeilBrown @ 2025-09-15  3:01 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Trond Myklebust,
	Anna Schumaker
  Cc: linux-nfs, linux-fsdevel, Jeff Layton, Amir Goldstein, Jan Kara

From: NeilBrown <neil@brown.name>

If the dentry is found to be negative (not d_in_lookup() and not
positive) and if O_CREATE wasn't requested then we do not have exclusive
access the dentry.

If we pass it to ->atomic_open() the filesystem will need to ensure any
lookup+open operations are serialised so that two threads don't both try
to instantiate the dentry.  This is an unnecessary burden to put on the
filesystem.

If the filesystem wants to perform such a lookup+open operation when a
negative dentry is found, it should return 0 from ->d_revalidate in that
case (when LOOKUP_OPEN) so that the calls serialise in
d_alloc_parallel().

All filesystems with ->atomic_open() currently handle the case of a
negative dentry without O_CREAT either by returning -ENOENT or by
calling finish_no_open(), either with NULL or with the negative dentry.
All of these have the same effect.

For filesystems without ->atomic_open(), lookup_open() will, in this
case, also call finish_no_open().

So this patch removes the burden from filesystems by calling
finish_no_open() early on a negative cached dentry when O_CREAT isn't
requested.

With this change any ->atomic_open() function can be certain that it has
exclusive access to the dentry, either because an exclusive lock is held
on the parent directory or because DCACHE_PAR_LOOKUP is set implying an
exclusive lock on the dentry itself.

Signed-off-by: NeilBrown <neil@brown.name>
---
 Documentation/filesystems/vfs.rst | 4 ++++
 fs/namei.c                        | 8 ++++++++
 2 files changed, 12 insertions(+)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 486a91633474..be7dd654f5fd 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -678,6 +678,10 @@ otherwise noted.
 	flag should be set in file->f_mode.  In case of O_EXCL the
 	method must only succeed if the file didn't exist and hence
 	FMODE_CREATED shall always be set on success.
+	atomic_open() will always have exclusive access to the dentry
+	as if O_CREAT hasn't caused the directory to be locked exclusively,
+	then the dentry will have DCACHE_PAR_LOOKUP will also
+	provides exclusivity.

 ``tmpfile``
 	called in the end of O_TMPFILE open().  Optional, equivalent to
diff --git a/fs/namei.c b/fs/namei.c
index ba8bf73d2f9c..5f732b9cd2db 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3647,6 +3647,14 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 		/* Cached positive dentry: will open in f_op->open */
 		return dentry;
 	}
+	if ((open_flag & O_CREAT) == 0 && !d_in_lookup(dentry)) {
+		/* Cached negative dentry and no create requested.
+		 * If a filesystem wants to be called in this case
+		 * it should trigger dentry invalidation in
+		 * ->d_revalidate.
+		 */
+		return dentry;
+	}

 	if (open_flag & O_CREAT)
 		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
-- 
2.50.0.107.gf914562f5916.dirty

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] VFS: don't call ->atomic_open on cached negative without O_CREAT
  2025-09-15  3:01 ` [PATCH 2/2] VFS: don't call ->atomic_open on cached negative without O_CREAT NeilBrown
@ 2025-09-17  4:23   ` Al Viro
  2025-09-17  7:36     ` NeilBrown
  0 siblings, 1 reply; 7+ messages in thread
From: Al Viro @ 2025-09-17  4:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Trond Myklebust, Anna Schumaker, linux-nfs,
	linux-fsdevel, Jeff Layton, Amir Goldstein, Jan Kara

On Mon, Sep 15, 2025 at 01:01:08PM +1000, NeilBrown wrote:

> All filesystems with ->atomic_open() currently handle the case of a
> negative dentry without O_CREAT either by returning -ENOENT or by
> calling finish_no_open(), either with NULL or with the negative dentry.

Wait a sec...  Just who is passing finish_no_open() a negative dentry?
Any such is very likely to be a bug...

> All of these have the same effect.
> 
> For filesystems without ->atomic_open(), lookup_open() will, in this
> case, also call finish_no_open().
> 
> So this patch removes the burden from filesystems by calling
> finish_no_open() early on a negative cached dentry when O_CREAT isn't
> requested.

Re "removing the burden" - it still can be called with negative cached without
O_CREAT.

O_CREAT in open(2) arguments might not survive to the call of atomic_open() -
in case when you don't have write permissions on parent.  In that case
we strip O_CREAT and call into atomic_open() (if the method is there).
In that case -ENOENT from atomic_open() is translated into -EACCES or -EROFS:
                dentry = atomic_open(nd, dentry, file, open_flag, mode);
		if (unlikely(create_error) && dentry == ERR_PTR(-ENOENT))
			dentry = ERR_PTR(create_error);
		return dentry;
In case when no ->atomic_open() is there the same logics is
        if (unlikely(create_error) && !dentry->d_inode) {
		error = create_error;
		goto out_dput;
	}
in the very end of lookup_open().  The point is, you might have had a call
of ->d_revalidate() with LOOKUP_CREATE|LOOKUP_OPEN and then have the damn
thing passed to ->atomic_open() with no O_CREAT in flags.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] VFS: don't call ->atomic_open on cached negative without O_CREAT
  2025-09-17  4:23   ` Al Viro
@ 2025-09-17  7:36     ` NeilBrown
  0 siblings, 0 replies; 7+ messages in thread
From: NeilBrown @ 2025-09-17  7:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, Trond Myklebust, Anna Schumaker, linux-nfs,
	linux-fsdevel, Jeff Layton, Amir Goldstein, Jan Kara

On Wed, 17 Sep 2025, Al Viro wrote:
> On Mon, Sep 15, 2025 at 01:01:08PM +1000, NeilBrown wrote:
> 
> > All filesystems with ->atomic_open() currently handle the case of a
> > negative dentry without O_CREAT either by returning -ENOENT or by
> > calling finish_no_open(), either with NULL or with the negative dentry.
> 
> Wait a sec...  Just who is passing finish_no_open() a negative dentry?
> Any such is very likely to be a bug...

You're right, that never happens.  It always NULL being passed or
-ENOENT being returned in the cases I was considering.

> 
> > All of these have the same effect.
> > 
> > For filesystems without ->atomic_open(), lookup_open() will, in this
> > case, also call finish_no_open().
> > 
> > So this patch removes the burden from filesystems by calling
> > finish_no_open() early on a negative cached dentry when O_CREAT isn't
> > requested.
> 
> Re "removing the burden" - it still can be called with negative cached without
> O_CREAT.

You, but it will also have exclusive access to the dentry - either an
exclusive lock on the parent, or DCACHE_PAR_LOOKUP.  That is the
important outcome that I wanted to achieve.

> 
> O_CREAT in open(2) arguments might not survive to the call of atomic_open() -
> in case when you don't have write permissions on parent.  In that case
> we strip O_CREAT and call into atomic_open() (if the method is there).
> In that case -ENOENT from atomic_open() is translated into -EACCES or -EROFS:
>                 dentry = atomic_open(nd, dentry, file, open_flag, mode);
> 		if (unlikely(create_error) && dentry == ERR_PTR(-ENOENT))
> 			dentry = ERR_PTR(create_error);
> 		return dentry;
> In case when no ->atomic_open() is there the same logics is
>         if (unlikely(create_error) && !dentry->d_inode) {
> 		error = create_error;
> 		goto out_dput;
> 	}
> in the very end of lookup_open().  The point is, you might have had a call
> of ->d_revalidate() with LOOKUP_CREATE|LOOKUP_OPEN and then have the damn
> thing passed to ->atomic_open() with no O_CREAT in flags.
> 

That inconsistency could be confusing for the filesystem.  It would be
hard to make it consistent through.

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: VFS: change ->atomic_open() calling to always have exclusive access
  2025-09-15  3:01 VFS: change ->atomic_open() calling to always have exclusive access NeilBrown
  2025-09-15  3:01 ` [PATCH 1/2] NFS: remove d_drop()/d_alloc_paralle() from nfs_atomic_open() NeilBrown
  2025-09-15  3:01 ` [PATCH 2/2] VFS: don't call ->atomic_open on cached negative without O_CREAT NeilBrown
@ 2025-09-17  4:34 ` Al Viro
  2025-09-17  7:25   ` NeilBrown
  2 siblings, 1 reply; 7+ messages in thread
From: Al Viro @ 2025-09-17  4:34 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Trond Myklebust, Anna Schumaker, linux-nfs,
	linux-fsdevel, Jeff Layton, Amir Goldstein, Jan Kara

On Mon, Sep 15, 2025 at 01:01:06PM +1000, NeilBrown wrote:
> ->atomic_open() is called with only a shared lock on the directory when
> O_CREAT wasn't requested.  If the dentry is negative it is still called
> because the filesystem might want to revalidate-and-open in an atomic
> operations.  NFS does this to fullfil close-to-open consistency
> requirements.
> 
> NFS has complex code to drop the dentry and reallocate with
> d_alloc_parallel() in this case to ensure that only one lookup/open
> happens at a time.  It would be simpler to have NFS return zero from
> ->d_revalidate in this case so that d_alloc_parallel() will be calling
> in lookup_open() and NFS wan't need to worry about concurrency.
> 
> So this series makes that change to NFS to simplify atomic_open and remove
> the d_drop() and d_alloc_parallel(), and then changes lookup_open() so that
> atomic_open() can never be called without exclusive access to the dentry.

What will happen if you have a stale negative dentry of /mnt/foo/bar, with
/mnt/foo not writable to you (e.g. because /mnt is read-only mount) and
foo/bar having actually been created by another client since your old lookup
had happened?  What's more, assume that this file is owned by you and has 0660
for mode.

Calling open("/mnt/foo/bar", O_CREAT|O_RDWR, 0666) should succeed - the file
in question exists, so O_CREAT and the third argument of open(2) should be
ignored.

What happens is that we get ->d_revalidate() still with effects of O_CREAT
(i.e. LOOKUP_CREATE|LOOKUP_OPEN), but ->atomic_open() does *not* get O_CREAT.

Your series is broken in that case, AFAICS.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: VFS: change ->atomic_open() calling to always have exclusive access
  2025-09-17  4:34 ` VFS: change ->atomic_open() calling to always have exclusive access Al Viro
@ 2025-09-17  7:25   ` NeilBrown
  0 siblings, 0 replies; 7+ messages in thread
From: NeilBrown @ 2025-09-17  7:25 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, Trond Myklebust, Anna Schumaker, linux-nfs,
	linux-fsdevel, Jeff Layton, Amir Goldstein, Jan Kara

On Wed, 17 Sep 2025, Al Viro wrote:
> On Mon, Sep 15, 2025 at 01:01:06PM +1000, NeilBrown wrote:
> > ->atomic_open() is called with only a shared lock on the directory when
> > O_CREAT wasn't requested.  If the dentry is negative it is still called
> > because the filesystem might want to revalidate-and-open in an atomic
> > operations.  NFS does this to fullfil close-to-open consistency
> > requirements.
> > 
> > NFS has complex code to drop the dentry and reallocate with
> > d_alloc_parallel() in this case to ensure that only one lookup/open
> > happens at a time.  It would be simpler to have NFS return zero from
> > ->d_revalidate in this case so that d_alloc_parallel() will be calling
> > in lookup_open() and NFS wan't need to worry about concurrency.
> > 
> > So this series makes that change to NFS to simplify atomic_open and remove
> > the d_drop() and d_alloc_parallel(), and then changes lookup_open() so that
> > atomic_open() can never be called without exclusive access to the dentry.
> 
> What will happen if you have a stale negative dentry of /mnt/foo/bar, with
> /mnt/foo not writable to you (e.g. because /mnt is read-only mount) and
> foo/bar having actually been created by another client since your old lookup
> had happened?  What's more, assume that this file is owned by you and has 0660
> for mode.
> 
> Calling open("/mnt/foo/bar", O_CREAT|O_RDWR, 0666) should succeed - the file
> in question exists, so O_CREAT and the third argument of open(2) should be
> ignored.
> 
> What happens is that we get ->d_revalidate() still with effects of O_CREAT
> (i.e. LOOKUP_CREATE|LOOKUP_OPEN), but ->atomic_open() does *not* get O_CREAT.
> 
> Your series is broken in that case, AFAICS.
> 

True, thanks, but not fatally so I think.

Even though O_CREAT was cleared, we still have an exclusive lock on the
directory so there is no need to d_drop() and reallocate the dentry
in nfs_atomic_open().

The change I need to make is to remove

+	if (!(open_flags & O_CREAT) && !d_in_lookup(dentry))
+		return finish_no_open(file, dentry);

so that we still go ahead with the NFS OPEN request in any case.

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-09-17  7:36 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-15  3:01 VFS: change ->atomic_open() calling to always have exclusive access NeilBrown
2025-09-15  3:01 ` [PATCH 1/2] NFS: remove d_drop()/d_alloc_paralle() from nfs_atomic_open() NeilBrown
2025-09-15  3:01 ` [PATCH 2/2] VFS: don't call ->atomic_open on cached negative without O_CREAT NeilBrown
2025-09-17  4:23   ` Al Viro
2025-09-17  7:36     ` NeilBrown
2025-09-17  4:34 ` VFS: change ->atomic_open() calling to always have exclusive access Al Viro
2025-09-17  7:25   ` NeilBrown

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.