linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] a possible way of reducing the PITA of ->d_name audits
@ 2025-09-07 20:32 Al Viro
  2025-09-07 21:51 ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2025-09-07 20:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, Jan Kara, NeilBrown

	I would like to discuss a flagday change to calling conventions
of ->create()/->unlink()/->rename()/etc. - all directory methods.
It would have no impact on the code generation, it would involve
quite a bit of localized churn and it would allow to deal with catching
->d_name races on compiler level.

	A bit of background:

	Unlike file->f_path, dentry->d_name *can* change for a live
dentry.  We had quite a few bugs where it had been used unsafely and
new ones almost certainly will crop up.

	A part of a problem is that locking rules are fairly
convoluted; they do cover most of the uses in filesystem code these days
(->d_revalidate() was the last major source of headache), but they are
not fun to verify.  I've no better solution than a code audit once in a while,
and every time it's really unpleasant.

	It starts with verifying that ->d_name contents of a live dentry
can only be changed by __d_move().  The next step is tracing the callers
of that and verifying several predicates regarding the locks held
by callers.  Next we need to go over the places that access ->d_name
and check that locking in those does give sufficient exclusion.
That's about 800 locations to look through right now, and examining those
(and the call chains leading to them) is very distinctly Not Fun(tm).

	It used to be worse; reducing the number of places that needed
to be examined is something I'd been doing for quite a while (debugfs
bunch was the latest mostly taken out).  The problem is, most of the
remaining stuff is genuinely needed - foofs_unlink() *does* need to
know which directory entry to remove, etc.

	The typical part of that audit goes like that:
* foo_do_something(whatever, dentry, flags) uses dentry->d_name.{name,len}.
* we need to examine all callchains that might lead to it and prove
  that locking environment in each of those is sufficient to stabilize
  these values.  Note that locking environments may vary between the
  callchains, with their intersection being too weak.
* one callchain is
	call of ->i_op->symlink() hits foo_symlink()
	foo_symlink() calls foo_mknod()
	foo_mknod() call foo_create_object()
	foo_create_object() calls foo_do_something()
  with dentry in the last one coming from the third argument of ->symlink().
  All callers of ->symlink() guarantee the sufficient locking environment
  for stability of that argument - both its ->d_name and ->d_parent are
  not going to change under us (that, BTW, is a separate audit, thankfully
  a much smaller one).  Therefore we know that this callchain is OK.
  Lather, rinse, repeat...

	None of that is all that complicated (well, unless you are looking
at something with obscene call chains *cough*ceph*cough*).  The trouble
is, there are literally hundreds of places to examine, and that needs to
be multiplied by the number of callchains.  Doing that manually is bloody
painful; "AI" (s)tools are worse than useless in that area - verifying the
output is actually _harder_ than doing the whole thing manually.

	We need some annotations ("this dentry is guaranteed to be stable"),
some way to verify their correctness and helpers that would give access
to members in question (both ->d_name and ->d_parent), with some way to
check that they are only used for stable ones.  IMO that's a good fit for
type system.  And AFAICS, the C type system, weak as it is, can be used
for that.

	Suppose we introduce something like
struct stable_dentry {
	struct dentry *__touch_that_and_suffer;
};

static inline struct stable_dentry claim_stability(struct dentry *dentry)
{
	return (struct stable_dentry){dentry};
}

static inline struct dentry *unwrap_dentry(struct stable_dentry v)
{
	/* this would better be the only place using that identifier */
	return v.__touch_that_and_suffer;
}

static const struct qstr *dentry_name(struct stable_dentry v)
{
	return &unwrap_dentry(v)->d_name;
}

static struct dentry *dentry_parent(struct stable_dentry v)
{
	return unwrap_dentry(v)->d_parent;
}

	Those things get passed as argument and returned in the same way
dentry pointers are, AFAICS on all ABIs we care about.  And they can
serve as annotations - to pass such object (by value) instead of passing
a dentry reference == claim that dentry in question is stable and will
remain such for duration of call.

	So we can do a series of patches along the lines of
* switch ->symlink() to use of struct stable_dentry
* replace
static int ramfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
			 struct dentry *dentry, const char *symname)
{
with
static int ramfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
			 struct stable_dentry child, const char *symname)
{
	struct dentry *dentry = unwrap_dentry(child);
and similar for all instances
* replace the caller(s) (there's only one, in this case - in vfs_symlink())
so that
	error = dir->i_op->symlink(idmap, dir, dentry, oldname);
becomes
	error = dir->i_op->symlink(idmap, dir, claim_stability(dentry), oldname);
possibly with a followup that switches vfs_symlink() to stable_dentry as well
(in which case claim_stability() shifts to the callers).
* Add to D/f/porting.rst:
**mandatory**

->symlink() takes struct stable_dentry now; if you are affected, replace
the third argument with struct stable_dentry and use unwrap_dentry()
to obtain the dentry reference from it.  If your ->symlink() instance
happens to be called directly, wrap the argument into claim_stability()
at the call site (and check that you *do* have sufficient locking
environment there - if you didn't, it's a bug right there and you'd
need to fix it first, so that it could be backported without dragging
the calling conventions change along).

That's the flagday part; it would be fairly mechanical, with very easy
way to conform for anything outside of mainline tree.

Once that is done, we can eliminate the direct uses of ->d_name
- dentry_name(child) would do it; might make sense to propagate
stable_dentry down into helpers - depends upon the filesystem.

What we get out of that is a much smaller set of places to audit, all
easily catchable by grep.
	1) claim_stability() calls.  There are _far_ fewer of those,
and each does need to be verified anyway.
	2) places where we use ->d_name in filesystems.  Hopefully to
be much reduced.  Something that e.g. grabs ->d_lock and looks at ->d_name
is safe, and that ->d_lock would typically be taken within a few lines
before the use.
	3) places that simulate claim_stability; catchable by grep -
that can be defeated by preprocessor-level obfuscation, but anything
like _that_ would be a confession of malicious intent.  And sparse would
not be hard to teach catching those anyway.
Everything else would be caught by compiler.

IMO getting rid of that headache (several days of unpleasant mechanical
work every time) would be very tempting.  OTOH, it is a flagday change
and such calling conventions changes risk the bikeshedding fest from hell...

Comments?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-07 20:32 [RFC] a possible way of reducing the PITA of ->d_name audits Al Viro
@ 2025-09-07 21:51 ` Linus Torvalds
  2025-09-08  0:06   ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2025-09-07 21:51 UTC (permalink / raw)
  To: Al Viro; +Cc: Christian Brauner, linux-fsdevel, Jan Kara, NeilBrown

On Sun, 7 Sept 2025 at 13:32, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         I would like to discuss a flagday change to calling conventions
> of ->create()/->unlink()/->rename()/etc. - all directory methods.
> It would have no impact on the code generation, it would involve
> quite a bit of localized churn and it would allow to deal with catching
> ->d_name races on compiler level.

Can you make this more concrete by actually sending an example patch.

Well, two patches: first the patch for the "claim_stability" helper
type and functions, and then a separate patch for converting _one_ of
the users (eg 'symlink').

Because I have a hard time visualizing just how noisy the result would
be (and whether it would be legible end result).

And I do wonder if it might not be simpler to have a model where
filesystems always get a stable dentry name - either because we hold
the parent lock on a VFS level (fairly common, I think), or because we
pass a separate copy to the filesystem

You did that with the d_revalidate() callback, and I think that was a
clear success. Can we extend on *that* pattern, perhaps?

            Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-07 21:51 ` Linus Torvalds
@ 2025-09-08  0:06   ` Al Viro
  2025-09-08  0:47     ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2025-09-08  0:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, Jan Kara, NeilBrown

On Sun, Sep 07, 2025 at 02:51:18PM -0700, Linus Torvalds wrote:
> On Sun, 7 Sept 2025 at 13:32, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >         I would like to discuss a flagday change to calling conventions
> > of ->create()/->unlink()/->rename()/etc. - all directory methods.
> > It would have no impact on the code generation, it would involve
> > quite a bit of localized churn and it would allow to deal with catching
> > ->d_name races on compiler level.
> 
> Can you make this more concrete by actually sending an example patch.
> 
> Well, two patches: first the patch for the "claim_stability" helper
> type and functions, and then a separate patch for converting _one_ of
> the users (eg 'symlink').
> 
> Because I have a hard time visualizing just how noisy the result would
> be (and whether it would be legible end result).

Will do...  BTW, for a non-obvious example of the things that show up
in process: ceph_wait_on_conflict_unlink() is actually guaranteed that
its argument is stable.  The way it copies ->d_name locally (with
a good reason - there's a loop using that for comparisons, so it's
a valid optimization) would otherwise be seriously racy.

> And I do wonder if it might not be simpler to have a model where
> filesystems always get a stable dentry name - either because we hold
> the parent lock on a VFS level (fairly common, I think), or because we
> pass a separate copy to the filesystem
> 
> You did that with the d_revalidate() callback, and I think that was a
> clear success. Can we extend on *that* pattern, perhaps?

Umm...  For one thing, there's something wrong with passing two arguments
that always differ by constant offset (and type, of course)...  More
serious reason is that in quite a few cases we want both dentry and stable
name accessible for a helper called from those (ceph_wait_on_conflict_unlink(),
for example).  And for those we either have to pass both dentry and const
struct qstr * or we still have to trace their call chains every time we
do that audit.  Passing struct stable_dentry instead avoids that...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-08  0:06   ` Al Viro
@ 2025-09-08  0:47     ` Linus Torvalds
  2025-09-08  2:51       ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2025-09-08  0:47 UTC (permalink / raw)
  To: Al Viro; +Cc: Christian Brauner, linux-fsdevel, Jan Kara, NeilBrown

On Sun, 7 Sept 2025 at 17:06, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > You did that with the d_revalidate() callback, and I think that was a
> > clear success. Can we extend on *that* pattern, perhaps?
>
> Umm...  For one thing, there's something wrong with passing two arguments
> that always differ by constant offset (and type, of course)...

I would expect that you *only* do this for the functions where the
name isn't stable (ie it's called without the parent locked).

So rmdir/mknod/link/etc would continue to just pass the dentry,
because the name is stable and filesystems using dentry->d_name is
perfectly fine.

Only the cases where we pass a dentry otherwise, would we do that
separate 'struct qname' - and so then the name would *not* be
'dentry->d_name', but a copy. Exactly like d_revalidate.

Of course, d_revalidate() already *had* that separate copy (created
for the lookup), and when that isn't true you would need to waste time
and effort in creating it.

Ie we'd end up using take_dentry_name_snapshot().

Would that be horribly bad?

Don't a lot of routines already have the parent locked - all the
normal functions where we actually _modify_ the directory - so the
name is stable. No?

Then, the other thing we could do is just mark "struct qstr d_name" as
'const' in struct dentry, and then the (very very few) places that
actually modify the name will have to use an alias to do the
modification.

Wouldn't that deal with at least a majority of the worries of people
playing games?

This is where you go "Oh, Linus, you sweet summer child", and shake
your head. You've been walking through the call-chains, I haven't.

            Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-08  0:47     ` Linus Torvalds
@ 2025-09-08  2:51       ` Al Viro
  2025-09-08  3:57         ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2025-09-08  2:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, Jan Kara, NeilBrown

On Sun, Sep 07, 2025 at 05:47:31PM -0700, Linus Torvalds wrote:
> On Sun, 7 Sept 2025 at 17:06, Al Viro <viro@zeniv.linux.org.uk> wrote:

> I would expect that you *only* do this for the functions where the
> name isn't stable (ie it's called without the parent locked).
> 
> So rmdir/mknod/link/etc would continue to just pass the dentry,
> because the name is stable and filesystems using dentry->d_name is
> perfectly fine.

Absolute majority of those ~800 hits *are* in those or in functions
called only from them.

> Ie we'd end up using take_dentry_name_snapshot().
> 
> Would that be horribly bad?

In some cases we'll have to (trace shite, mostly - racy, but not "the
sky is falling" stuff), but that (and any real crap) drowns in the
false positives.

> Don't a lot of routines already have the parent locked - all the
> normal functions where we actually _modify_ the directory - so the
> name is stable. No?

Yes, it is.

> Then, the other thing we could do is just mark "struct qstr d_name" as
> 'const' in struct dentry, and then the (very very few) places that
> actually modify the name will have to use an alias to do the
> modification.

Yes, and I have that (that's what I mentioned as the first and easy part
of those audits - see #work.qstr, doing exactly that).

> Wouldn't that deal with at least a majority of the worries of people
> playing games?
> 
> This is where you go "Oh, Linus, you sweet summer child", and shake
> your head. You've been walking through the call-chains, I haven't.

The problem is that a regression (e.g. somebody suddenly starting to read
->d_name in their ->open() or ->setattr(), or... - we had such cases) is
always possible and the way the things are it's fucking hard to spot.

Most of the uses *are* done to stable dentries; it's just that we have no
way to tell which ones are like that.  That's exactly the reason why I'm
thinking about such annotations.  The only way to catch regressions is
to grep, go over the list and manually verify that all of those places
still are reachable only from the "safe" methods.  Worse, comparing
the grep output from the previous cycle (modulo line number shifts,
etc.) doesn't help - if you have one in a function 3 levels down from a
method and its caller grows an extra call site, you would see nothing
in the diff anywhere near the places where ->d_name is used or any of
the directory methods.

I'm not so worried about anyone malicious - honest fuckups happen and
AFAICS all of them so far had been of that sort.  I would really like
to be able to find such fuckups without ~8--12 hours of digging;
needs breaks, too, since it's monotonous enough to cause something
similar to highway hypnosis, IME ;-/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-08  2:51       ` Al Viro
@ 2025-09-08  3:57         ` Al Viro
  2025-09-08  4:50           ` NeilBrown
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2025-09-08  3:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, Jan Kara, NeilBrown

On Mon, Sep 08, 2025 at 03:51:35AM +0100, Al Viro wrote:
 
> Most of the uses *are* done to stable dentries; it's just that we have no
> way to tell which ones are like that.

Random example to get the taste of that joy:

static void
xfs_dentry_to_name( 
        struct xfs_name *namep,
        struct dentry   *dentry)
{
        namep->name = dentry->d_name.name;
        namep->len = dentry->d_name.len;
        namep->type = XFS_DIR3_FT_UNKNOWN;
}

OK, fetches from ->d_name.  Callers:
	xfs_cleanup_inode()
		xfs_generic_create()
			xfs_vn_mknod()
				== xfs_dir_inode_operations.mknod
				== xfs_dir_ci_inode_operations.mknod
			xfs_vn_create()
				== xfs_dir_inode_operations.create
				== xfs_dir_ci_inode_operations.create
			xfs_vn_mkdir()
				== xfs_dir_inode_operations.mkdir
				== xfs_dir_ci_inode_operations.mkdir
			xfs_vn_tmpfile()	# WTF?
		xfs_vn_symlink()
			== xfs_dir_inode_operations.symlink
			== xfs_dir_ci_inode_operations.symlink
	xfs_vn_lookup()
		== xfs_dir_inode_operations.lookup
	xfs_vn_ci_lookup()
		== xfs_dir_ci_inode_operations.lookup
	xfs_vn_unlink()
		== xfs_dir_inode_operations.unlink
		== xfs_dir_inode_operations.rmdir
		== xfs_dir_ci_inode_operations.unlink
		== xfs_dir_ci_inode_operations.rmdir
(+ checking that in all cases dentry has come from the method argument)
WTF is going on with xfs_vn_tmpfile()?  It doesn't *have* any useful
name... looking... Aha.  vfs_generic_create(_, _, _, _, _, p) only calls
xfs_cleanup_inode() in case when p is NULL; xfs_vn_tmpfile() is called
only as ->tmpfile(), and the only caller of that is this:
        file->f_path.mnt = parentpath->mnt;
        file->f_path.dentry = child;
        mode = vfs_prepare_mode(idmap, dir, mode, mode, mode);
        error = dir->i_op->tmpfile(idmap, dir, file, mode);
so the method never gets called with NULL as the 3rd argument.  Safe...

And that's just one example - two grep hits.  Right next to them,
static int
xfs_dentry_mode_to_name(
        struct xfs_name *namep,
        struct dentry   *dentry,
        int             mode)
{
        namep->name = dentry->d_name.name;
        namep->len = dentry->d_name.len;
        namep->type = xfs_mode_to_ftype(mode);
 
        if (unlikely(namep->type == XFS_DIR3_FT_UNKNOWN))
                return -EFSCORRUPTED;
 
        return 0;
}

Callers:
	xfs_generic_create()
		same callers as above, but this time it's *not* conditional:
------------------------------------------------------------------
        /* Verify mode is valid also for tmpfile case */
        error = xfs_dentry_mode_to_name(&name, dentry, args.mode);
        if (unlikely(error))
                goto out_free_acl;
------------------------------------------------------------------
		presumably it's "we may fetch shite for tmpfile, but in that case
		we won't use that shite".
	xfs_vn_link()
		== xfs_dir_inode_operations.link
		== xfs_dir_ci_inode_operations.link
	xfs_vn_symlink()
		seen above, same dentry as above
	xfs_vn_rename()	# for odentry
		== xfs_dir_inode_operations.rename
		== xfs_dir_ci_inode_operations.rename
	xfs_vn_rename() # for ndentry
		seen above

That's not all for xfs, though - there's also
        error = xfs_inode_init_security(inode, dir, &dentry->d_name);
in the same xfs_generic_create() - and also called for tmpfile case,
AFAISC.  Which is quite likely a bug - ->d_name is stable there, all
right, but at that stage it's "/"; what selinux (the only thing that
cares about the basename of object being created) would do to that
is an interesting question, might depend upon the policy.  Non-tmpfile
callers are OK, as seen above.
Another one:
        error = xfs_inode_init_security(inode, dir, &dentry->d_name);
in xfs_vn_symlink(), safe per above.
Another:
        if (dentry->d_name.len >= MAXNAMELEN)
                return ERR_PTR(-ENAMETOOLONG);
in xfs_vn_lookup() and xfs_vn_ci_lookup().  Safe.

... and finally there's this, in all its foul glory:
DECLARE_EVENT_CLASS(xrep_dentry_class,
        TP_PROTO(struct xfs_mount *mp, const struct dentry *dentry),
        TP_ARGS(mp, dentry),
        TP_STRUCT__entry(
                __field(dev_t, dev)
                __field(unsigned int, flags)
                __field(unsigned long, ino)
                __field(bool, positive)
                __field(unsigned long, parent_ino)
                __field(unsigned int, namelen)
                __dynamic_array(char, name, dentry->d_name.len)
        ),
        TP_fast_assign(
                __entry->dev = mp->m_super->s_dev;
                __entry->flags = dentry->d_flags;
                __entry->positive = d_is_positive(dentry);
                if (dentry->d_parent && d_inode(dentry->d_parent))
                        __entry->parent_ino = d_inode(dentry->d_parent)->i_ino;
                else
                        __entry->parent_ino = -1UL;
                __entry->ino = d_inode(dentry) ? d_inode(dentry)->i_ino : 0;
                __entry->namelen = dentry->d_name.len;
                memcpy(__get_str(name), dentry->d_name.name, dentry->d_name.len);
        ),
        TP_printk("dev %d:%d flags 0x%x positive? %d parent_ino 0x%lx ino 0x%lx name '%.*s'",
                  MAJOR(__entry->dev), MINOR(__entry->dev),
                  __entry->flags,
                  __entry->positive,
                  __entry->parent_ino,
                  __entry->ino,
                  __entry->namelen,
                  __get_str(name))
);
#define DEFINE_REPAIR_DENTRY_EVENT(name) \
DEFINE_EVENT(xrep_dentry_class, name, \
        TP_PROTO(struct xfs_mount *mp, const struct dentry *dentry), \
        TP_ARGS(mp, dentry))
DEFINE_REPAIR_DENTRY_EVENT(xrep_adoption_check_child);
DEFINE_REPAIR_DENTRY_EVENT(xrep_adoption_invalidate_child);
DEFINE_REPAIR_DENTRY_EVENT(xrep_dirtree_delete_child);

used by
                trace_xrep_adoption_check_child(sc->mp, d_child);
in xrep_adoption_check_dcache(), two calls of
                trace_xrep_adoption_invalidate_child(sc->mp, d_child);
in xrep_adoption_zap_dcache() and 
        trace_xrep_dirtree_delete_child(dp->i_mount, child_dentry);
in xrep_dirtree_purge_dentry()

The last 3 are *not* stable - fuck knows if they can happen in parallel with
lookups from other threads (those can end up moving dentries on sufficiently
buggered filesystem), but IMO these deserve take_dentry_name_snapshot()
treatment - if tracepoint is active, that is.

The rest all get stable dentries; I would really prefer to have that checked
by compiler, with sufficient annotations given to it.  In this case -
struct stable_dentry arguments for lookup/create/mkdir/mknod/symlink/unlink/rmdir,
rename and link as part of calling conventions change
+ stable_dentry as argument of xfs_generic_create(), xfs_dentry_mode_to_name(),
xfs_dentry_to_name() and xfs_cleanup_inode()
+ claim of stability in xfs_vn_tmpfile():
STATIC int
xfs_vn_tmpfile(
        struct mnt_idmap        *idmap,
        struct inode            *dir,
        struct file             *file,
        umode_t                 mode)
{
	int err = xfs_generic_create(idmap, dir,
			claim_stability(file->f_path.dentry), mode, 0, file);
	// at this point in ->tmpfile() dentry is nameless and negative;
	// nothing can move it until we get to finish_open_...()
        return finish_open_simple(file, err);
}

That way xfs hits will be down to that claim_stability() and the obscenity in
trace.h - until the users of the latter get wrapped into something that would
take snapshots and pass those instead of messing with ->d_name.  Considering
the fun quoted above, not having to repeat that digging is something I'd
count as a win...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-08  3:57         ` Al Viro
@ 2025-09-08  4:50           ` NeilBrown
  2025-09-08  5:19             ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: NeilBrown @ 2025-09-08  4:50 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara

On Mon, 08 Sep 2025, Al Viro wrote:
> That way xfs hits will be down to that claim_stability() and the obscenity in
> trace.h - until the users of the latter get wrapped into something that would
> take snapshots and pass those instead of messing with ->d_name.  Considering
> the fun quoted above, not having to repeat that digging is something I'd
> count as a win...
> 

What would you think of providing an accessor function and insisting
everyone use it - and have some sort of lockdep_assert_held() to that
function so that developers who test their code will see these problem?

Then a simple grep can find any unapproved uses.

NeilBrown

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-08  4:50           ` NeilBrown
@ 2025-09-08  5:19             ` Al Viro
  2025-09-08  6:25               ` NeilBrown
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2025-09-08  5:19 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara

On Mon, Sep 08, 2025 at 02:50:10PM +1000, NeilBrown wrote:
> On Mon, 08 Sep 2025, Al Viro wrote:
> > That way xfs hits will be down to that claim_stability() and the obscenity in
> > trace.h - until the users of the latter get wrapped into something that would
> > take snapshots and pass those instead of messing with ->d_name.  Considering
> > the fun quoted above, not having to repeat that digging is something I'd
> > count as a win...
> > 
> 
> What would you think of providing an accessor function and insisting
> everyone use it - and have some sort of lockdep_assert_held() to that
> function so that developers who test their code will see these problem?
> 
> Then a simple grep can find any unapproved uses.

Really?  Consider ->link().  Both arguments *are* stable, but the reasons
are not just different - they don't even intersect.

Old: known to be a regular file, held locked.  The former guarantees that
it can't be moved by d_splice_alias(), the latter prevents that being done
by vfs_rename(), which also locks the objects being moved.

New: has been looked up after its parent had been locked.  Note that _after_
is important here - you can't just blindly fetch ->d_parent and check if
its inode is locked (not to mention anything else, you'd need to check it
being non-NULL, do it under rcu_read_lock(), et sodding cetera - ->d_parent
stability rules are not that different from ->d_name ones).

And this "everyone use it" is not going to fly - you still have places where
it's done simply under ->d_lock.  Or ->d_lock on known parent - either would
suffice.

Besides, there's "which primitive do I use here?" - with the annotation approach
it's as simple as "if I have it as stable_dentry, just use stable_dentry_name(),
otherwise think hard - it may be tricky".

I don't believe that lockdep is an answer here - annotations (and these *are*
annotations - no code generation changes at all) give better coverage, and
bitrot tends to happen in rarely taken failure exits.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-08  5:19             ` Al Viro
@ 2025-09-08  6:25               ` NeilBrown
  2025-09-08  9:05                 ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: NeilBrown @ 2025-09-08  6:25 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara

On Mon, 08 Sep 2025, Al Viro wrote:
> On Mon, Sep 08, 2025 at 02:50:10PM +1000, NeilBrown wrote:
> > On Mon, 08 Sep 2025, Al Viro wrote:
> > > That way xfs hits will be down to that claim_stability() and the obscenity in
> > > trace.h - until the users of the latter get wrapped into something that would
> > > take snapshots and pass those instead of messing with ->d_name.  Considering
> > > the fun quoted above, not having to repeat that digging is something I'd
> > > count as a win...
> > > 
> > 
> > What would you think of providing an accessor function and insisting
> > everyone use it - and have some sort of lockdep_assert_held() to that
> > function so that developers who test their code will see these problem?
> > 
> > Then a simple grep can find any unapproved uses.
> 
> Really?  Consider ->link().  Both arguments *are* stable, but the reasons
> are not just different - they don't even intersect.

Might the locking rules be too complex?  Are they even documented?

As you know I want to change directory locking so that a ->d_flags bit
locks a dentry in much the same way that locking the parent directory
currently does.

I had wondered why vfs_link() locks the inode being linked and thought it was
only to protect ->i_nlink.  If it is needed to protect against rename
too, that could usefully be documented - or we could use the same
->d_flags bit to ensure that lock.

I guess I'm a bit concerned that your goal here is to transition from
"lots of auditing" to "much less auditing" and I would rather it was "no
auditing needed" - one day you won't want to (or be able to) audit any
more.

Fudging some type-state with C may well be useful but I suspect it is at
most part of a solution.  Simplification, documentation, run-time checks
might also be important parts.  As the type-state flag-day is a big
thing, maybe it shouldn't be first.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-08  6:25               ` NeilBrown
@ 2025-09-08  9:05                 ` Al Viro
  2025-09-10  2:45                   ` NeilBrown
  2025-09-13 21:28                   ` [RFC] a possible way of reducing the PITA of ->d_name audits Al Viro
  0 siblings, 2 replies; 43+ messages in thread
From: Al Viro @ 2025-09-08  9:05 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara

On Mon, Sep 08, 2025 at 04:25:26PM +1000, NeilBrown wrote:

> Might the locking rules be too complex?  Are they even documented?

For ->d_name and ->d_parent?  Very poorly.  That's the absolute worst
part of struct dentry locking rules.  _That_ is the place where attempts
to write a coherent documentation stall every time, often diverting
into yet another side story from hell.  ->d_revalidate() calling convention
change came from one of those; at least that got dealt with for good
now - and took a couple of years in the making, what with getting sidetracked
to other stuff.

Keep in mind that there are weird filesystems that manage to get away with
playing very odd games.  Or do not manage, really - apparmorfs is the
most recent weird thing.  Locking of their own, *interspersed* between
the directory locks.  With ->mkdir() and ->rmdir() both unlocking and
relocking the parent.  At least apparmor folks have finally admitted that
there's a problem...

I don't believe that any of us can get away with imposing changes that
would break configfs (another weird horror).  Or apparmor.  Or autofs,
for that matter...

Anyway, the current rules are:
	* only __d_move() ever changes name/parent of live dentry (there's
d_mark_tmpfile(), but that's done to a dentry that is invisible to anyone
other than caller - it flips name from "/" to "#<inumber>" just before
attaching the inode to it; part of ->tmpfile())
	* any change happens under rename_lock.
	* any change happens under ->d_lock on everything involved (parent(s)
included).
	* move from one parent to another happens only under ->s_vfs_rename_mutex.
	* exclusive lock on directory is enough to prevent moves to or from
that directory.
	* only positive dentries ever get moved/renamed
	* d_splice_alias() may move an existing directory dentry, but no
non-directory is going to be touched by it.
	* with a couple of exceptions (told you it's messy) d_move() and d_exchange()
are called in locking conditions equivalent to vfs_rename() - ->s_vfs_rename_mutex
if cross-directory, parent(s) exclusive, etc.  Exceptions are vfat_lookup()
and exfat_lookup() - both would be in trouble if ->link() was supported there.

What it boils down to for filesystems is
	* any dentry passed to directory-modifying operation has stable ->d_name
and ->d_parent.  That also applies to ->atomic_open() even without O_CREAT in
flags, except for the mess with directories in some cases (without O_CREAT
if you are given an in-lookup dentry and it turns out to be a directory,
from that point on you have no promise that it won't be reparented by d_splice_alias();
whether it can actually happens depends upon the filesystem, but instances tend to
just call finish_no_open() in that case and bugger off).
	* any dentry passed to ->lookup() has stable ->d_name/->d_parent until it
had been passed to d_splice_alias() (which is normally the last time we look at
it).
	* ->d_revalidate() and ->d_compare() get stable name passed as argument;
they should not look at dentry->d_name at all.

> As you know I want to change directory locking so that a ->d_flags bit
> locks a dentry in much the same way that locking the parent directory
> currently does.
> 
> I had wondered why vfs_link() locks the inode being linked and thought it was
> only to protect ->i_nlink.  If it is needed to protect against rename
> too, that could usefully be documented - or we could use the same
> ->d_flags bit to ensure that lock.

We could.  FWIW, I *like* the notion of "dentry that is held in place and won't
go away/get renamed/dropped until we say so".  That's the major reason why I'm
interested in your patchset, actually.

> I guess I'm a bit concerned that your goal here is to transition from
> "lots of auditing" to "much less auditing" and I would rather it was "no
> auditing needed" - one day you won't want to (or be able to) audit any
> more.
> 
> Fudging some type-state with C may well be useful but I suspect it is at
> most part of a solution.  Simplification, documentation, run-time checks
> might also be important parts.  As the type-state flag-day is a big
> thing, maybe it shouldn't be first.

All of that requires being able to answer questions about what's there in
the existing filesystems.  Which is pretty much the same problem as
those audits, obviously.  And static annotations are way easier to
reason about.

Anyway, I'm about to fall asleep (it's nearly 5am here); I've put an
initial sketch of infrastructure and of symlink calling conventions change
to #experimenta.stable_dentry, but it's very raw at the moment - and lacks
the followups that would make use of that stuff.  I'll continue tomorrow
(right now I've got ->create() half-done as well), but by now one thing
I'm certain about is that it will be trivial to reorder wrt your stuff -
in either direction.  So that worry can be discarded - change might or
might not be a good idea, but as flagdays go it's trivial.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-08  9:05                 ` Al Viro
@ 2025-09-10  2:45                   ` NeilBrown
  2025-09-10  7:24                     ` Al Viro
  2025-09-13 21:28                   ` [RFC] a possible way of reducing the PITA of ->d_name audits Al Viro
  1 sibling, 1 reply; 43+ messages in thread
From: NeilBrown @ 2025-09-10  2:45 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara

On Mon, 08 Sep 2025, Al Viro wrote:
> On Mon, Sep 08, 2025 at 04:25:26PM +1000, NeilBrown wrote:
> 
> > Might the locking rules be too complex?  Are they even documented?
> 
> For ->d_name and ->d_parent?  Very poorly.  That's the absolute worst
> part of struct dentry locking rules.  _That_ is the place where attempts
> to write a coherent documentation stall every time, often diverting
> into yet another side story from hell.  ->d_revalidate() calling convention
> change came from one of those; at least that got dealt with for good
> now - and took a couple of years in the making, what with getting sidetracked
> to other stuff.
> 
> Keep in mind that there are weird filesystems that manage to get away with
> playing very odd games.  Or do not manage, really - apparmorfs is the
> most recent weird thing.  Locking of their own, *interspersed* between
> the directory locks.  With ->mkdir() and ->rmdir() both unlocking and
> relocking the parent.  At least apparmor folks have finally admitted that
> there's a problem...
> 
> I don't believe that any of us can get away with imposing changes that
> would break configfs (another weird horror).  Or apparmor.  Or autofs,
> for that matter...
> 
> Anyway, the current rules are:
> 	* only __d_move() ever changes name/parent of live dentry (there's
> d_mark_tmpfile(), but that's done to a dentry that is invisible to anyone
> other than caller - it flips name from "/" to "#<inumber>" just before
> attaching the inode to it; part of ->tmpfile())
> 	* any change happens under rename_lock.
> 	* any change happens under ->d_lock on everything involved (parent(s)
> included).
> 	* move from one parent to another happens only under ->s_vfs_rename_mutex.
> 	* exclusive lock on directory is enough to prevent moves to or from
> that directory.
> 	* only positive dentries ever get moved/renamed
> 	* d_splice_alias() may move an existing directory dentry, but no
> non-directory is going to be touched by it.
> 	* with a couple of exceptions (told you it's messy) d_move() and d_exchange()
> are called in locking conditions equivalent to vfs_rename() - ->s_vfs_rename_mutex
> if cross-directory, parent(s) exclusive, etc.  Exceptions are vfat_lookup()
> and exfat_lookup() - both would be in trouble if ->link() was supported there.
> 
> What it boils down to for filesystems is
> 	* any dentry passed to directory-modifying operation has stable ->d_name
> and ->d_parent.  That also applies to ->atomic_open() even without O_CREAT in
> flags, except for the mess with directories in some cases (without O_CREAT
> if you are given an in-lookup dentry and it turns out to be a directory,
> from that point on you have no promise that it won't be reparented by d_splice_alias();
> whether it can actually happens depends upon the filesystem, but instances tend to
> just call finish_no_open() in that case and bugger off).
> 	* any dentry passed to ->lookup() has stable ->d_name/->d_parent until it
> had been passed to d_splice_alias() (which is normally the last time we look at
> it).
> 	* ->d_revalidate() and ->d_compare() get stable name passed as argument;
> they should not look at dentry->d_name at all.

So if we were to provide kdoc documentation for claim_stability() it
could say something like:

 It is only safe to claim_stability() for a dentry if:
   - ->d_parent->d_inode is locked exclusively
      This is guaranteed for target dentries passed to directory-modifying
      inode_operations, but not e.g. the old_dentry passed to ->link.
      It is also guaranteed for dentry passed to ->atomic_open when create it set.
   - ->d_inode is locked exclusively. This is guaranteed for the
       old_dentry passed to ->link.
   - DCACHE_PAR_LOOKUP is set.  Dentries passed to ->lookup will
     either have this flag or the parent will be locked exclusively
   - dentry is negative and a shared lock is held on parent inode.
     This is guaranteed for dentries passed to ->atomic_open when create
     is NOT set.
   - the ->d_lock is held on the dentry or its ->d_parent

 All uses of claim_stability() should be preceded by a comment
 identifying which condition is claimed to be met.

> 
> > As you know I want to change directory locking so that a ->d_flags bit
> > locks a dentry in much the same way that locking the parent directory
> > currently does.
> > 
> > I had wondered why vfs_link() locks the inode being linked and thought it was
> > only to protect ->i_nlink.  If it is needed to protect against rename
> > too, that could usefully be documented - or we could use the same
> > ->d_flags bit to ensure that lock.
> 
> We could.  FWIW, I *like* the notion of "dentry that is held in place and won't
> go away/get renamed/dropped until we say so".  That's the major reason why I'm
> interested in your patchset, actually.
> 
> > I guess I'm a bit concerned that your goal here is to transition from
> > "lots of auditing" to "much less auditing" and I would rather it was "no
> > auditing needed" - one day you won't want to (or be able to) audit any
> > more.
> > 
> > Fudging some type-state with C may well be useful but I suspect it is at
> > most part of a solution.  Simplification, documentation, run-time checks
> > might also be important parts.  As the type-state flag-day is a big
> > thing, maybe it shouldn't be first.
> 
> All of that requires being able to answer questions about what's there in
> the existing filesystems.  Which is pretty much the same problem as
> those audits, obviously.  And static annotations are way easier to
> reason about.

And they are good places to encourage people to put comments.  I'm
liking this idea more because it provides a focus for documenting a
non-trivial dependency.

> 
> Anyway, I'm about to fall asleep (it's nearly 5am here); I've put an
> initial sketch of infrastructure and of symlink calling conventions change
> to #experimenta.stable_dentry, but it's very raw at the moment - and lacks
> the followups that would make use of that stuff.  I'll continue tomorrow
> (right now I've got ->create() half-done as well), but by now one thing
> I'm certain about is that it will be trivial to reorder wrt your stuff -
> in either direction.  So that worry can be discarded - change might or
> might not be a good idea, but as flagdays go it's trivial.
> 

One thing I don't like is the name "unwrap_dentry()".  It says what is
done rather than what it means or what the purpose is.
Maybe "access_dentry()" (a bit like rcu_access_pointer()).
Maybe "dentry_of()" - then we would want to call stable dentries
"stable_foo" or similar.  So:

 static int afs_symlink(struct mnt_idmap *idmap, struct inode *dir,
                      struct stable_dentry stable_child, const char *content)
 {
       struct dentry *dentry = dentry_of(stable_child);


Passing around a struct that contains a single pointer feels strange and
anything we can do to give it a more natural feel and make it easy to
read would likely be a win.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-10  2:45                   ` NeilBrown
@ 2025-09-10  7:24                     ` Al Viro
  2025-09-10 22:52                       ` NeilBrown
  2025-09-12  5:49                       ` ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) Al Viro
  0 siblings, 2 replies; 43+ messages in thread
From: Al Viro @ 2025-09-10  7:24 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara

On Wed, Sep 10, 2025 at 12:45:41PM +1000, NeilBrown wrote:

>    - dentry is negative and a shared lock is held on parent inode.
>      This is guaranteed for dentries passed to ->atomic_open when create
>      is NOT set.

Umm...  The same goes for tmpfile dentry while it's still negative (nobody
else could make it positive - it's only reachable via the parent's list
of children and those who traverse such will ignore anything negative unhashed
they find there.

> One thing I don't like is the name "unwrap_dentry()".  It says what is
> done rather than what it means or what the purpose is.
> Maybe "access_dentry()" (a bit like rcu_access_pointer()).
> Maybe "dentry_of()" - then we would want to call stable dentries
> "stable_foo" or similar.  So:
> 
>  static int afs_symlink(struct mnt_idmap *idmap, struct inode *dir,
>                       struct stable_dentry stable_child, const char *content)

That has too much of over-the-top hungarian notation feel for my taste, TBH...

Note that these unwrap_dentry() are very likely to move into helpers - if some
function is always called with unwrapped_dentry(something) as an argument,
great, that's probably a candidate for struct stable_dentry.

I'll hold onto the current variant for now...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-10  7:24                     ` Al Viro
@ 2025-09-10 22:52                       ` NeilBrown
  2025-09-12  5:49                       ` ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) Al Viro
  1 sibling, 0 replies; 43+ messages in thread
From: NeilBrown @ 2025-09-10 22:52 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara

On Wed, 10 Sep 2025, Al Viro wrote:
> On Wed, Sep 10, 2025 at 12:45:41PM +1000, NeilBrown wrote:
> 
> >    - dentry is negative and a shared lock is held on parent inode.
> >      This is guaranteed for dentries passed to ->atomic_open when create
> >      is NOT set.
> 
> Umm...  The same goes for tmpfile dentry while it's still negative (nobody
> else could make it positive - it's only reachable via the parent's list
> of children and those who traverse such will ignore anything negative unhashed
> they find there.
> 
> > One thing I don't like is the name "unwrap_dentry()".  It says what is
> > done rather than what it means or what the purpose is.
> > Maybe "access_dentry()" (a bit like rcu_access_pointer()).
> > Maybe "dentry_of()" - then we would want to call stable dentries
> > "stable_foo" or similar.  So:
> > 
> >  static int afs_symlink(struct mnt_idmap *idmap, struct inode *dir,
> >                       struct stable_dentry stable_child, const char *content)
> 
> That has too much of over-the-top hungarian notation feel for my taste, TBH...
> 
> Note that these unwrap_dentry() are very likely to move into helpers - if some
> function is always called with unwrapped_dentry(something) as an argument,
> great, that's probably a candidate for struct stable_dentry.
> 
> I'll hold onto the current variant for now...
> 

Another idea - maybe we don't need "unwrap_dentry()" at all.
Just have
  struct stable_dentry {
       struct dentry *dentry;
  };

and code can use child.dentry.

There is no concern about people misusing unwrap_dentry() so there is no
need to be able to grep for it.  So we don't need it.

It is only uses of claim_stability() and of ->d_name and ->d_parent that we
might want to monitor.  So I imagine that one day all accesses to
->d_name and ->d_parent will be centralised either dentry_name() and
dentry_parent() or in fs/*.c code - and then we can rename them to
->_private_d_name and ->_private_d_parent or whatever to discourage
unwrapped access.

But accessing the dentry in a stable_dentry is safe and uninteresting so
there is no need to wrap it.

NeilBrown

^ permalink raw reply	[flat|nested] 43+ messages in thread

* ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits)
  2025-09-10  7:24                     ` Al Viro
  2025-09-10 22:52                       ` NeilBrown
@ 2025-09-12  5:49                       ` Al Viro
  2025-09-12  8:23                         ` Miklos Szeredi
  2025-09-12 18:55                         ` Al Viro
  1 sibling, 2 replies; 43+ messages in thread
From: Al Viro @ 2025-09-12  5:49 UTC (permalink / raw)
  To: NeilBrown
  Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara,
	Miklos Szeredi

On Wed, Sep 10, 2025 at 08:24:23AM +0100, Al Viro wrote:

> Note that these unwrap_dentry() are very likely to move into helpers - if some
> function is always called with unwrapped_dentry(something) as an argument,
> great, that's probably a candidate for struct stable_dentry.
> 
> I'll hold onto the current variant for now...

BTW, fun fallout from that experiment once I've got to ->atomic_open() - things
get nicer if we teach finish_no_open() to accept ERR_PTR() for dentry:

int finish_no_open(struct file *file, struct dentry *dentry)
{
	if (IS_ERR(dentry))
		return PTR_ERR(dentry);
        file->f_path.dentry = dentry;
	return 0;
}

For example, we get
int nfs_atomic_open_v23(struct inode *dir, struct dentry *dentry,
                        struct file *file, unsigned int open_flags,
                        umode_t mode)
{
	struct dentry *res;
	/* Same as look+open from lookup_open(), but with different O_TRUNC
	 * handling.
	 */
	int error = 0;

	if (dentry->d_name.len > NFS_SERVER(dir)->namelen)
		return -ENAMETOOLONG;

	if (open_flags & O_CREAT) {
		file->f_mode |= FMODE_CREATED;
		error = nfs_do_create(dir, dentry, mode, open_flags);
		if (error)
			return error;
		return finish_open(file, dentry, NULL);
	}
	if (d_in_lookup(dentry)) {
		/* The only flags nfs_lookup considers are
		 * LOOKUP_EXCL and LOOKUP_RENAME_TARGET, and
		 * we want those to be zero so the lookup isn't skipped.
		 */
		res = nfs_lookup(dir, dentry, 0);
	}
	return finish_no_open(file, res);
}

and in cifs !O_CREAT case folds into
        if (!(oflags & O_CREAT)) {
                /*
                 * Check for hashed negative dentry. We have already revalidated
                 * the dentry and it is fine. No need to perform another lookup.
                 */
                if (!d_in_lookup(direntry))
                        return -ENOENT;
 
                return finish_no_open(cifs_lookup(inode, direntry, 0));
        }

In vboxsf, and similar in 9p and fuse:
        if (d_in_lookup(dentry)) {
                struct dentry *res = vboxsf_dir_lookup(parent, dentry, 0);
                if (res || d_really_is_positive(dentry))
			return finish_no_open(file, res);
        }
 
        /* Only creates */
        if (!(flags & O_CREAT))
                return finish_no_open(file, NULL);
	...

The thing is, in that form it's really clear that dentry stays stable if
we proceed past those chunks; basically, d_splice_alias() only returns
non-NULL if it's got an error or a preexisting directory alias, in which
case result will be positive.  And it's more compact and easier to follow
in that form...

While we are at it, Miklos mentioned some plans for changing ->atomic_open()
calling conventions.  Might be a good time to revisit that...  Miklos,
could you give a braindump on those plans?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits)
  2025-09-12  5:49                       ` ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) Al Viro
@ 2025-09-12  8:23                         ` Miklos Szeredi
  2025-09-12 18:29                           ` Al Viro
  2025-09-12 18:55                         ` Al Viro
  1 sibling, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2025-09-12  8:23 UTC (permalink / raw)
  To: Al Viro
  Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel,
	Jan Kara, Bernd Schubert

On Fri, 12 Sept 2025 at 07:49, Al Viro <viro@zeniv.linux.org.uk> wrote:
> While we are at it, Miklos mentioned some plans for changing ->atomic_open()
> calling conventions.  Might be a good time to revisit that...  Miklos,
> could you give a braindump on those plans?

[Cc: Bernd]

What we want is ->atomic_open() being able to do an atomic revalidate
+ open (cached positive) case.  This is the only case currently that
can't be done with a single ATOMIC_OPEN request but needs two
roundtrips to the server.

The ->atomic_open() interface shouldn't need any changes, since it's
already allowed to use a different dentry from the supplied one.

Based on (flags & LOOKUP_OPEN) ->revalidate() needs to tell the caller
that it's expecting the subsequent ->atomic_open() call to do the
actual revalidation.  The proposed interface for that was to add a
D_REVALIDATE_ATOMIC =  2 constant to use as a return value in this
case.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits)
  2025-09-12  8:23                         ` Miklos Szeredi
@ 2025-09-12 18:29                           ` Al Viro
  2025-09-12 19:22                             ` Miklos Szeredi
  2025-09-13  3:36                             ` NeilBrown
  0 siblings, 2 replies; 43+ messages in thread
From: Al Viro @ 2025-09-12 18:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel,
	Jan Kara, Bernd Schubert

On Fri, Sep 12, 2025 at 10:23:39AM +0200, Miklos Szeredi wrote:
> On Fri, 12 Sept 2025 at 07:49, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > While we are at it, Miklos mentioned some plans for changing ->atomic_open()
> > calling conventions.  Might be a good time to revisit that...  Miklos,
> > could you give a braindump on those plans?
> 
> [Cc: Bernd]
> 
> What we want is ->atomic_open() being able to do an atomic revalidate
> + open (cached positive) case.  This is the only case currently that
> can't be done with a single ATOMIC_OPEN request but needs two
> roundtrips to the server.
> 
> The ->atomic_open() interface shouldn't need any changes, since it's
> already allowed to use a different dentry from the supplied one.
> 
> Based on (flags & LOOKUP_OPEN) ->revalidate() needs to tell the caller
> that it's expecting the subsequent ->atomic_open() call to do the
> actual revalidation.  The proposed interface for that was to add a
> D_REVALIDATE_ATOMIC =  2 constant to use as a return value in this
> case.

	Umm...	Unless I'm misunderstanding you, that *would* change the
calling conventions - dentry could bloody well be positive, couldn't it?
And that changes quite a bit - without O_CREAT in flags you could get
parent locked only shared and pass a positive hashed dentry attached
to a directory inode to ->atomic_open().  The thing is, in that case it
can be moved by d_splice_alias()...

	Or am I misreading you?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits)
  2025-09-12  5:49                       ` ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) Al Viro
  2025-09-12  8:23                         ` Miklos Szeredi
@ 2025-09-12 18:55                         ` Al Viro
  2025-09-12 18:59                           ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro
  1 sibling, 1 reply; 43+ messages in thread
From: Al Viro @ 2025-09-12 18:55 UTC (permalink / raw)
  To: NeilBrown
  Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara,
	Miklos Szeredi

On Fri, Sep 12, 2025 at 06:49:07AM +0100, Al Viro wrote:
> On Wed, Sep 10, 2025 at 08:24:23AM +0100, Al Viro wrote:
> 
> > Note that these unwrap_dentry() are very likely to move into helpers - if some
> > function is always called with unwrapped_dentry(something) as an argument,
> > great, that's probably a candidate for struct stable_dentry.
> > 
> > I'll hold onto the current variant for now...
> 
> BTW, fun fallout from that experiment once I've got to ->atomic_open() - things
> get nicer if we teach finish_no_open() to accept ERR_PTR() for dentry:

See git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.finish_no_open
Patches in followups.

Shortlog:
      allow finish_no_open(file, ERR_PTR(-E...))
      9p: simplify v9fs_vfs_atomic_open()
      9p: simplify v9fs_vfs_atomic_open_dotl()
      simplify cifs_atomic_open()
      simplify vboxsf_dir_atomic_open()
      simplify nfs_atomic_open_v23()
      simplify fuse_atomic_open()
      simplify gfs2_atomic_open()
      slightly simplify nfs_atomic_open()
Diffstat:
 fs/9p/vfs_inode.c      | 34 ++++++++++++----------------------
 fs/9p/vfs_inode_dotl.c | 15 +++++----------
 fs/fuse/dir.c          | 21 +++++++--------------
 fs/gfs2/inode.c        | 26 +++++++++-----------------
 fs/nfs/dir.c           | 18 +++++-------------
 fs/open.c              | 10 ++++++----
 fs/smb/client/dir.c    |  8 +-------
 fs/vboxsf/dir.c        | 25 +++++++++----------------
 8 files changed, 54 insertions(+), 103 deletions(-)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...))
  2025-09-12 18:55                         ` Al Viro
@ 2025-09-12 18:59                           ` Al Viro
  2025-09-12 18:59                             ` [PATCH 2/9] 9p: simplify v9fs_vfs_atomic_open() Al Viro
                                               ` (9 more replies)
  0 siblings, 10 replies; 43+ messages in thread
From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw)
  To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos

... allowing any ->lookup() return value to be passed to it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/open.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 9655158c3885..4890b13461c7 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1059,18 +1059,20 @@ EXPORT_SYMBOL(finish_open);
  * finish_no_open - finish ->atomic_open() without opening the file
  *
  * @file: file pointer
- * @dentry: dentry or NULL (as returned from ->lookup())
+ * @dentry: dentry, ERR_PTR(-E...) or NULL (as returned from ->lookup())
  *
- * This can be used to set the result of a successful lookup in ->atomic_open().
+ * This can be used to set the result of a lookup in ->atomic_open().
  *
  * NB: unlike finish_open() this function does consume the dentry reference and
  * the caller need not dput() it.
  *
- * Returns "0" which must be the return value of ->atomic_open() after having
- * called this function.
+ * Returns 0 or -E..., which must be the return value of ->atomic_open() after
+ * having called this function.
  */
 int finish_no_open(struct file *file, struct dentry *dentry)
 {
+	if (IS_ERR(dentry))
+		return PTR_ERR(dentry);
 	file->f_path.dentry = dentry;
 	return 0;
 }
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/9] 9p: simplify v9fs_vfs_atomic_open()
  2025-09-12 18:59                           ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro
@ 2025-09-12 18:59                             ` Al Viro
  2025-09-12 18:59                             ` [PATCH 3/9] 9p: simplify v9fs_vfs_atomic_open_dotl() Al Viro
                                               ` (8 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw)
  To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos

if v9fs_vfs_lookup() returns a preexisting alias, it is guaranteed to be
positive.  IOW, in that case we will immediately return finish_no_open(),
leaving only the case res == NULL past that point.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/9p/vfs_inode.c | 34 ++++++++++++----------------------
 1 file changed, 12 insertions(+), 22 deletions(-)

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 399d455d50d6..d0c77ec31b1d 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -768,22 +768,18 @@ v9fs_vfs_atomic_open(struct inode *dir, struct dentry *dentry,
 	struct v9fs_inode __maybe_unused *v9inode;
 	struct v9fs_session_info *v9ses;
 	struct p9_fid *fid;
-	struct dentry *res = NULL;
 	struct inode *inode;
 	int p9_omode;
 
 	if (d_in_lookup(dentry)) {
-		res = v9fs_vfs_lookup(dir, dentry, 0);
-		if (IS_ERR(res))
-			return PTR_ERR(res);
-
-		if (res)
-			dentry = res;
+		struct dentry *res = v9fs_vfs_lookup(dir, dentry, 0);
+		if (res || d_really_is_positive(dentry))
+			return finish_no_open(file, res);
 	}
 
 	/* Only creates */
-	if (!(flags & O_CREAT) || d_really_is_positive(dentry))
-		return finish_no_open(file, res);
+	if (!(flags & O_CREAT))
+		return finish_no_open(file, NULL);
 
 	v9ses = v9fs_inode2v9ses(dir);
 	perm = unixmode2p9mode(v9ses, mode);
@@ -795,17 +791,17 @@ v9fs_vfs_atomic_open(struct inode *dir, struct dentry *dentry,
 			"write-only file with writeback enabled, creating w/ O_RDWR\n");
 	}
 	fid = v9fs_create(v9ses, dir, dentry, NULL, perm, p9_omode);
-	if (IS_ERR(fid)) {
-		err = PTR_ERR(fid);
-		goto error;
-	}
+	if (IS_ERR(fid))
+		return PTR_ERR(fid);
 
 	v9fs_invalidate_inode_attr(dir);
 	inode = d_inode(dentry);
 	v9inode = V9FS_I(inode);
 	err = finish_open(file, dentry, generic_file_open);
-	if (err)
-		goto error;
+	if (unlikely(err)) {
+		p9_fid_put(fid);
+		return err;
+	}
 
 	file->private_data = fid;
 #ifdef CONFIG_9P_FSCACHE
@@ -818,13 +814,7 @@ v9fs_vfs_atomic_open(struct inode *dir, struct dentry *dentry,
 	v9fs_open_fid_add(inode, &fid);
 
 	file->f_mode |= FMODE_CREATED;
-out:
-	dput(res);
-	return err;
-
-error:
-	p9_fid_put(fid);
-	goto out;
+	return 0;
 }
 
 /**
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 3/9] 9p: simplify v9fs_vfs_atomic_open_dotl()
  2025-09-12 18:59                           ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro
  2025-09-12 18:59                             ` [PATCH 2/9] 9p: simplify v9fs_vfs_atomic_open() Al Viro
@ 2025-09-12 18:59                             ` Al Viro
  2025-09-12 18:59                             ` [PATCH 4/9] simplify cifs_atomic_open() Al Viro
                                               ` (7 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw)
  To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos

again, preexisting aliases will always be positive

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/9p/vfs_inode_dotl.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c
index 5b5fda617b80..be297e335468 100644
--- a/fs/9p/vfs_inode_dotl.c
+++ b/fs/9p/vfs_inode_dotl.c
@@ -238,20 +238,16 @@ v9fs_vfs_atomic_open_dotl(struct inode *dir, struct dentry *dentry,
 	struct p9_fid *dfid = NULL, *ofid = NULL;
 	struct v9fs_session_info *v9ses;
 	struct posix_acl *pacl = NULL, *dacl = NULL;
-	struct dentry *res = NULL;
 
 	if (d_in_lookup(dentry)) {
-		res = v9fs_vfs_lookup(dir, dentry, 0);
-		if (IS_ERR(res))
-			return PTR_ERR(res);
-
-		if (res)
-			dentry = res;
+		struct dentry *res = v9fs_vfs_lookup(dir, dentry, 0);
+		if (res || d_really_is_positive(dentry))
+			return	finish_no_open(file, res);
 	}
 
 	/* Only creates */
-	if (!(flags & O_CREAT) || d_really_is_positive(dentry))
-		return	finish_no_open(file, res);
+	if (!(flags & O_CREAT))
+		return	finish_no_open(file, NULL);
 
 	v9ses = v9fs_inode2v9ses(dir);
 
@@ -337,7 +333,6 @@ v9fs_vfs_atomic_open_dotl(struct inode *dir, struct dentry *dentry,
 	p9_fid_put(ofid);
 	p9_fid_put(fid);
 	v9fs_put_acl(dacl, pacl);
-	dput(res);
 	return err;
 }
 
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 4/9] simplify cifs_atomic_open()
  2025-09-12 18:59                           ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro
  2025-09-12 18:59                             ` [PATCH 2/9] 9p: simplify v9fs_vfs_atomic_open() Al Viro
  2025-09-12 18:59                             ` [PATCH 3/9] 9p: simplify v9fs_vfs_atomic_open_dotl() Al Viro
@ 2025-09-12 18:59                             ` Al Viro
  2025-09-12 18:59                             ` [PATCH 5/9] simplify vboxsf_dir_atomic_open() Al Viro
                                               ` (6 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw)
  To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos

now that finish_no_open() does the right thing if it's given ERR_PTR() as
dentry...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/smb/client/dir.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c
index 5223edf6d11a..47710aa13822 100644
--- a/fs/smb/client/dir.c
+++ b/fs/smb/client/dir.c
@@ -484,8 +484,6 @@ cifs_atomic_open(struct inode *inode, struct dentry *direntry,
 	 * in network traffic in the other paths.
 	 */
 	if (!(oflags & O_CREAT)) {
-		struct dentry *res;
-
 		/*
 		 * Check for hashed negative dentry. We have already revalidated
 		 * the dentry and it is fine. No need to perform another lookup.
@@ -493,11 +491,7 @@ cifs_atomic_open(struct inode *inode, struct dentry *direntry,
 		if (!d_in_lookup(direntry))
 			return -ENOENT;
 
-		res = cifs_lookup(inode, direntry, 0);
-		if (IS_ERR(res))
-			return PTR_ERR(res);
-
-		return finish_no_open(file, res);
+		return finish_no_open(file, cifs_lookup(inode, direntry, 0));
 	}
 
 	xid = get_xid();
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 5/9] simplify vboxsf_dir_atomic_open()
  2025-09-12 18:59                           ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro
                                               ` (2 preceding siblings ...)
  2025-09-12 18:59                             ` [PATCH 4/9] simplify cifs_atomic_open() Al Viro
@ 2025-09-12 18:59                             ` Al Viro
  2025-09-12 18:59                             ` [PATCH 6/9] simplify nfs_atomic_open_v23() Al Viro
                                               ` (5 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw)
  To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos

similar to 9p et.al.
---
 fs/vboxsf/dir.c | 25 +++++++++----------------
 1 file changed, 9 insertions(+), 16 deletions(-)

diff --git a/fs/vboxsf/dir.c b/fs/vboxsf/dir.c
index 770e29ec3557..42bedc4ec7af 100644
--- a/fs/vboxsf/dir.c
+++ b/fs/vboxsf/dir.c
@@ -315,46 +315,39 @@ static int vboxsf_dir_atomic_open(struct inode *parent, struct dentry *dentry,
 {
 	struct vboxsf_sbi *sbi = VBOXSF_SBI(parent->i_sb);
 	struct vboxsf_handle *sf_handle;
-	struct dentry *res = NULL;
 	u64 handle;
 	int err;
 
 	if (d_in_lookup(dentry)) {
-		res = vboxsf_dir_lookup(parent, dentry, 0);
-		if (IS_ERR(res))
-			return PTR_ERR(res);
-
-		if (res)
-			dentry = res;
+		struct dentry *res = vboxsf_dir_lookup(parent, dentry, 0);
+		if (res || d_really_is_positive(dentry))
+			return finish_no_open(file, res);
 	}
 
 	/* Only creates */
-	if (!(flags & O_CREAT) || d_really_is_positive(dentry))
-		return finish_no_open(file, res);
+	if (!(flags & O_CREAT))
+		return finish_no_open(file, NULL);
 
 	err = vboxsf_dir_create(parent, dentry, mode, false, flags & O_EXCL, &handle);
 	if (err)
-		goto out;
+		return err;
 
 	sf_handle = vboxsf_create_sf_handle(d_inode(dentry), handle, SHFL_CF_ACCESS_READWRITE);
 	if (IS_ERR(sf_handle)) {
 		vboxsf_close(sbi->root, handle);
-		err = PTR_ERR(sf_handle);
-		goto out;
+		return PTR_ERR(sf_handle);
 	}
 
 	err = finish_open(file, dentry, generic_file_open);
 	if (err) {
 		/* This also closes the handle passed to vboxsf_create_sf_handle() */
 		vboxsf_release_sf_handle(d_inode(dentry), sf_handle);
-		goto out;
+		return err;
 	}
 
 	file->private_data = sf_handle;
 	file->f_mode |= FMODE_CREATED;
-out:
-	dput(res);
-	return err;
+	return 0;
 }
 
 static int vboxsf_dir_unlink(struct inode *parent, struct dentry *dentry)
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 6/9] simplify nfs_atomic_open_v23()
  2025-09-12 18:59                           ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro
                                               ` (3 preceding siblings ...)
  2025-09-12 18:59                             ` [PATCH 5/9] simplify vboxsf_dir_atomic_open() Al Viro
@ 2025-09-12 18:59                             ` Al Viro
  2025-09-12 18:59                             ` [PATCH 7/9] simplify fuse_atomic_open() Al Viro
                                               ` (4 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw)
  To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos

1) finish_no_open() takes ERR_PTR() as dentry now.
2) caller of ->atomic_open() will call d_lookup_done() itself, no
need to do it here.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/nfs/dir.c | 16 +++++-----------
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index d81217923936..c8dd1d0b8d85 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -2260,7 +2260,7 @@ int nfs_atomic_open_v23(struct inode *dir, struct dentry *dentry,
 			struct file *file, unsigned int open_flags,
 			umode_t mode)
 {
-
+	struct dentry *res = NULL;
 	/* Same as look+open from lookup_open(), but with different O_TRUNC
 	 * handling.
 	 */
@@ -2275,21 +2275,15 @@ int nfs_atomic_open_v23(struct inode *dir, struct dentry *dentry,
 		if (error)
 			return error;
 		return finish_open(file, dentry, NULL);
-	} else if (d_in_lookup(dentry)) {
+	}
+	if (d_in_lookup(dentry)) {
 		/* The only flags nfs_lookup considers are
 		 * LOOKUP_EXCL and LOOKUP_RENAME_TARGET, and
 		 * we want those to be zero so the lookup isn't skipped.
 		 */
-		struct dentry *res = nfs_lookup(dir, dentry, 0);
-
-		d_lookup_done(dentry);
-		if (unlikely(res)) {
-			if (IS_ERR(res))
-				return PTR_ERR(res);
-			return finish_no_open(file, res);
-		}
+		res = nfs_lookup(dir, dentry, 0);
 	}
-	return finish_no_open(file, NULL);
+	return finish_no_open(file, res);
 
 }
 EXPORT_SYMBOL_GPL(nfs_atomic_open_v23);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 7/9] simplify fuse_atomic_open()
  2025-09-12 18:59                           ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro
                                               ` (4 preceding siblings ...)
  2025-09-12 18:59                             ` [PATCH 6/9] simplify nfs_atomic_open_v23() Al Viro
@ 2025-09-12 18:59                             ` Al Viro
  2025-09-12 18:59                             ` [PATCH 8/9] simplify gfs2_atomic_open() Al Viro
                                               ` (3 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw)
  To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/fuse/dir.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 2d817d7cab26..d3076bfddb89 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -739,22 +739,18 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
 	int err;
 	struct mnt_idmap *idmap = file_mnt_idmap(file);
 	struct fuse_conn *fc = get_fuse_conn(dir);
-	struct dentry *res = NULL;
 
 	if (fuse_is_bad(dir))
 		return -EIO;
 
 	if (d_in_lookup(entry)) {
-		res = fuse_lookup(dir, entry, 0);
-		if (IS_ERR(res))
-			return PTR_ERR(res);
-
-		if (res)
-			entry = res;
+		struct dentry *res = fuse_lookup(dir, entry, 0);
+		if (res || d_really_is_positive(entry))
+			return finish_no_open(file, res);
 	}
 
-	if (!(flags & O_CREAT) || d_really_is_positive(entry))
-		goto no_open;
+	if (!(flags & O_CREAT))
+		return finish_no_open(file, NULL);
 
 	/* Only creates */
 	file->f_mode |= FMODE_CREATED;
@@ -768,16 +764,13 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
 		goto mknod;
 	} else if (err == -EEXIST)
 		fuse_invalidate_entry(entry);
-out_dput:
-	dput(res);
 	return err;
 
 mknod:
 	err = fuse_mknod(idmap, dir, entry, mode, 0);
 	if (err)
-		goto out_dput;
-no_open:
-	return finish_no_open(file, res);
+		return err;
+	return finish_no_open(file, NULL);
 }
 
 /*
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 8/9] simplify gfs2_atomic_open()
  2025-09-12 18:59                           ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro
                                               ` (5 preceding siblings ...)
  2025-09-12 18:59                             ` [PATCH 7/9] simplify fuse_atomic_open() Al Viro
@ 2025-09-12 18:59                             ` Al Viro
  2025-09-12 18:59                             ` [PATCH 9/9] slightly simplify nfs_atomic_open() Al Viro
                                               ` (2 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw)
  To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos

the difference from 9p et.al. is that on gfs2 the lookup side might
end up opening the file.  That's what the FMODE_OPENED check there
is about - and it might actually be seen with finish_open() having
failed, if it fails late enough.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/gfs2/inode.c | 26 +++++++++-----------------
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 8760e7e20c9d..8a7ed80d9f2d 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -1368,27 +1368,19 @@ static int gfs2_atomic_open(struct inode *dir, struct dentry *dentry,
 			    struct file *file, unsigned flags,
 			    umode_t mode)
 {
-	struct dentry *d;
 	bool excl = !!(flags & O_EXCL);
 
-	if (!d_in_lookup(dentry))
-		goto skip_lookup;
-
-	d = __gfs2_lookup(dir, dentry, file);
-	if (IS_ERR(d))
-		return PTR_ERR(d);
-	if (d != NULL)
-		dentry = d;
-	if (d_really_is_positive(dentry)) {
-		if (!(file->f_mode & FMODE_OPENED))
+	if (d_in_lookup(dentry)) {
+		struct dentry *d = __gfs2_lookup(dir, dentry, file);
+		if (file->f_mode & FMODE_OPENED) {
+			if (IS_ERR(d))
+				return PTR_ERR(d);
+			dput(d);
+			return excl && (flags & O_CREAT) ? -EEXIST : 0;
+		}
+		if (d || d_really_is_positive(dentry))
 			return finish_no_open(file, d);
-		dput(d);
-		return excl && (flags & O_CREAT) ? -EEXIST : 0;
 	}
-
-	BUG_ON(d != NULL);
-
-skip_lookup:
 	if (!(flags & O_CREAT))
 		return -ENOENT;
 
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 9/9] slightly simplify nfs_atomic_open()
  2025-09-12 18:59                           ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro
                                               ` (6 preceding siblings ...)
  2025-09-12 18:59                             ` [PATCH 8/9] simplify gfs2_atomic_open() Al Viro
@ 2025-09-12 18:59                             ` Al Viro
  2025-09-12 22:23                             ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Linus Torvalds
  2025-09-13  3:34                             ` NeilBrown
  9 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw)
  To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/nfs/dir.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index c8dd1d0b8d85..5f7d9be6f022 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -2198,8 +2198,6 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
 		else
 			dput(dentry);
 	}
-	if (IS_ERR(res))
-		return PTR_ERR(res);
 	return finish_no_open(file, res);
 }
 EXPORT_SYMBOL_GPL(nfs_atomic_open);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits)
  2025-09-12 18:29                           ` Al Viro
@ 2025-09-12 19:22                             ` Miklos Szeredi
  2025-09-12 20:36                               ` Al Viro
  2025-09-13  3:36                             ` NeilBrown
  1 sibling, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2025-09-12 19:22 UTC (permalink / raw)
  To: Al Viro
  Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel,
	Jan Kara, Bernd Schubert

On Fri, 12 Sept 2025 at 20:29, Al Viro <viro@zeniv.linux.org.uk> wrote:

> > Based on (flags & LOOKUP_OPEN) ->revalidate() needs to tell the caller
> > that it's expecting the subsequent ->atomic_open() call to do the
> > actual revalidation.  The proposed interface for that was to add a
> > D_REVALIDATE_ATOMIC =  2 constant to use as a return value in this
> > case.
>
>         Umm...  Unless I'm misunderstanding you, that *would* change the
> calling conventions - dentry could bloody well be positive, couldn't it?
> And that changes quite a bit - without O_CREAT in flags you could get
> parent locked only shared and pass a positive hashed dentry attached
> to a directory inode to ->atomic_open().  The thing is, in that case it
> can be moved by d_splice_alias()...

You are talking about the disconnected directory case?  That can't
happen in this call path, since it's following a normal component from
a parent directory, which by definition isn't disconnected.

Just realized, that open_last_lookups() will bypass lookup_open() on
cached positive anyway, so really no point in handing that  inside
lookup_open().

Should this be done via a new  i_op->reval_open() called without inode
lock, with a positive dentry and gets a name just like d_revalidate?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits)
  2025-09-12 19:22                             ` Miklos Szeredi
@ 2025-09-12 20:36                               ` Al Viro
  2025-09-12 20:50                                 ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2025-09-12 20:36 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel,
	Jan Kara, Bernd Schubert

On Fri, Sep 12, 2025 at 09:22:05PM +0200, Miklos Szeredi wrote:

> You are talking about the disconnected directory case?  That can't
> happen in this call path, since it's following a normal component from
> a parent directory, which by definition isn't disconnected.

Huh?  No, just a lookup in a different place with server giving you
the nodeid of the directory you had in that place in dcache.
d_splice_alias() has no choice other than "move the existing alias
into the right place" - directory inodes *can't* have multiple
hashed dentries, period.

> Just realized, that open_last_lookups() will bypass lookup_open() on
> cached positive anyway, so really no point in handing that  inside
> lookup_open().

IIRC, when the last time it came up, it was along the lines of "could
we call ->atomic_open() for positive dentries as well?" and I think
it had been about FUSE.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits)
  2025-09-12 20:36                               ` Al Viro
@ 2025-09-12 20:50                                 ` Al Viro
  0 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2025-09-12 20:50 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel,
	Jan Kara, Bernd Schubert

On Fri, Sep 12, 2025 at 09:36:18PM +0100, Al Viro wrote:
> On Fri, Sep 12, 2025 at 09:22:05PM +0200, Miklos Szeredi wrote:
> 
> > You are talking about the disconnected directory case?  That can't
> > happen in this call path, since it's following a normal component from
> > a parent directory, which by definition isn't disconnected.
> 
> Huh?  No, just a lookup in a different place with server giving you
> the nodeid of the directory you had in that place in dcache.
> d_splice_alias() has no choice other than "move the existing alias
> into the right place" - directory inodes *can't* have multiple
> hashed dentries, period.
> 
> > Just realized, that open_last_lookups() will bypass lookup_open() on
> > cached positive anyway, so really no point in handing that  inside
> > lookup_open().
> 
> IIRC, when the last time it came up, it was along the lines of "could
> we call ->atomic_open() for positive dentries as well?" and I think
> it had been about FUSE.

The last iteration of those threads I've seen was in October 2023; has
anything changed since then?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...))
  2025-09-12 18:59                           ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro
                                               ` (7 preceding siblings ...)
  2025-09-12 18:59                             ` [PATCH 9/9] slightly simplify nfs_atomic_open() Al Viro
@ 2025-09-12 22:23                             ` Linus Torvalds
  2025-09-13  3:34                             ` NeilBrown
  9 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2025-09-12 22:23 UTC (permalink / raw)
  To: Al Viro; +Cc: neil, linux-fsdevel, brauner, jack, miklos

On Fri, 12 Sept 2025 at 11:59, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> ... allowing any ->lookup() return value to be passed to it.

Ack, the whole series looks like a nice cleanup.

           Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...))
  2025-09-12 18:59                           ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro
                                               ` (8 preceding siblings ...)
  2025-09-12 22:23                             ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Linus Torvalds
@ 2025-09-13  3:34                             ` NeilBrown
  9 siblings, 0 replies; 43+ messages in thread
From: NeilBrown @ 2025-09-13  3:34 UTC (permalink / raw)
  To: Al Viro; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos

On Sat, 13 Sep 2025, Al Viro wrote:
> ... allowing any ->lookup() return value to be passed to it.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/open.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/open.c b/fs/open.c
> index 9655158c3885..4890b13461c7 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1059,18 +1059,20 @@ EXPORT_SYMBOL(finish_open);
>   * finish_no_open - finish ->atomic_open() without opening the file
>   *
>   * @file: file pointer
> - * @dentry: dentry or NULL (as returned from ->lookup())
> + * @dentry: dentry, ERR_PTR(-E...) or NULL (as returned from ->lookup())
>   *
> - * This can be used to set the result of a successful lookup in ->atomic_open().
> + * This can be used to set the result of a lookup in ->atomic_open().

This is my favourite part of the series - removing the word
"successful".  It makes the API more general.

My only other thought is that

	if (d_in_lookup(dentry)) {
		struct dentry *res = dir->i_op->lookup(dir, dentry, 0);
		if (res || d_really_is_positive(dentry))
			return finish_no_open(file, res);
	}

is a common pattern.  It would be nice if we could factor it out into a
help but I cannot think of a clean way to do that.
You can add
Reviewed-by: NeilBrown <neil@brown.name>
if you like.

Thanks,
NeilBrown

>   *
>   * NB: unlike finish_open() this function does consume the dentry reference and
>   * the caller need not dput() it.
>   *
> - * Returns "0" which must be the return value of ->atomic_open() after having
> - * called this function.
> + * Returns 0 or -E..., which must be the return value of ->atomic_open() after
> + * having called this function.
>   */
>  int finish_no_open(struct file *file, struct dentry *dentry)
>  {
> +	if (IS_ERR(dentry))
> +		return PTR_ERR(dentry);
>  	file->f_path.dentry = dentry;
>  	return 0;
>  }
> -- 
> 2.47.2
> 
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits)
  2025-09-12 18:29                           ` Al Viro
  2025-09-12 19:22                             ` Miklos Szeredi
@ 2025-09-13  3:36                             ` NeilBrown
  2025-09-13  5:07                               ` Al Viro
  1 sibling, 1 reply; 43+ messages in thread
From: NeilBrown @ 2025-09-13  3:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Linus Torvalds, Christian Brauner, linux-fsdevel,
	Jan Kara, Bernd Schubert

On Sat, 13 Sep 2025, Al Viro wrote:
> On Fri, Sep 12, 2025 at 10:23:39AM +0200, Miklos Szeredi wrote:
> > On Fri, 12 Sept 2025 at 07:49, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > While we are at it, Miklos mentioned some plans for changing ->atomic_open()
> > > calling conventions.  Might be a good time to revisit that...  Miklos,
> > > could you give a braindump on those plans?
> > 
> > [Cc: Bernd]
> > 
> > What we want is ->atomic_open() being able to do an atomic revalidate
> > + open (cached positive) case.  This is the only case currently that
> > can't be done with a single ATOMIC_OPEN request but needs two
> > roundtrips to the server.
> > 
> > The ->atomic_open() interface shouldn't need any changes, since it's
> > already allowed to use a different dentry from the supplied one.
> > 
> > Based on (flags & LOOKUP_OPEN) ->revalidate() needs to tell the caller
> > that it's expecting the subsequent ->atomic_open() call to do the
> > actual revalidation.  The proposed interface for that was to add a
> > D_REVALIDATE_ATOMIC =  2 constant to use as a return value in this
> > case.
> 
> 	Umm...	Unless I'm misunderstanding you, that *would* change the
> calling conventions - dentry could bloody well be positive, couldn't it?
> And that changes quite a bit - without O_CREAT in flags you could get
> parent locked only shared and pass a positive hashed dentry attached
> to a directory inode to ->atomic_open().  The thing is, in that case it
> can be moved by d_splice_alias()...

Once we get per-dentry locking for dirops this will cease to be a
problem.  The dentry would be locked exclusively whether we create or
not and the lock would prevent the d_splice_alias() rename.

NeilBrown


> 
> 	Or am I misreading you?
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits)
  2025-09-13  3:36                             ` NeilBrown
@ 2025-09-13  5:07                               ` Al Viro
  2025-09-13  5:50                                 ` NeilBrown
  2025-09-14 19:01                                 ` Miklos Szeredi
  0 siblings, 2 replies; 43+ messages in thread
From: Al Viro @ 2025-09-13  5:07 UTC (permalink / raw)
  To: NeilBrown
  Cc: Miklos Szeredi, Linus Torvalds, Christian Brauner, linux-fsdevel,
	Jan Kara, Bernd Schubert

On Sat, Sep 13, 2025 at 01:36:49PM +1000, NeilBrown wrote:

> > 	Umm...	Unless I'm misunderstanding you, that *would* change the
> > calling conventions - dentry could bloody well be positive, couldn't it?
> > And that changes quite a bit - without O_CREAT in flags you could get
> > parent locked only shared and pass a positive hashed dentry attached
> > to a directory inode to ->atomic_open().  The thing is, in that case it
> > can be moved by d_splice_alias()...
> 
> Once we get per-dentry locking for dirops this will cease to be a
> problem.  The dentry would be locked exclusively whether we create or
> not and the lock would prevent the d_splice_alias() rename.

Umm...  Interesting, but... what would happen in such case, really?

You have one thread with allegedly directory dentry it had skipped
verification for.  It tries to combine open with revalidation, and
sends such request.  In the meanwhile, another thread has found the
same thing in a different place - possibly because of fs corruption,
possibly because the damn thing got moved on server, right after it has
sent "OK, I've opened it" reply to the first thread.  What would you do?
Have the second thread spin/fail with some error/something else?

Alternatively, move succeeds first, the lookup in the new place arrives,
then revalidate+open.  The first thread gets a nodeid mismatch, and...?

How would that combined revalidate+open work for fuse, anyway?  The former
is basically a lookup - you send nodeid of parent + name, get nodeid +
attributes of child.  The latter goes strictly by nodeid of child and
gets a 64bit number that apparently tells one opened file from another
(not to be confused with fhandle).  Combined request of some sort?

I think we need Bernd to bring a fresh braindump on the plans in that
area.  And we'd damn better *NOT* make it another "this filesystem is
special, the locking works for it because $REASONS (half missing from
the discussion) and nobody else returns that special ->d_revalidate()
return value (at the moment)" - every time something of that sort had
been kludged in we ended up paying for that later.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits)
  2025-09-13  5:07                               ` Al Viro
@ 2025-09-13  5:50                                 ` NeilBrown
  2025-09-14 19:01                                 ` Miklos Szeredi
  1 sibling, 0 replies; 43+ messages in thread
From: NeilBrown @ 2025-09-13  5:50 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Linus Torvalds, Christian Brauner, linux-fsdevel,
	Jan Kara, Bernd Schubert

On Sat, 13 Sep 2025, Al Viro wrote:
> On Sat, Sep 13, 2025 at 01:36:49PM +1000, NeilBrown wrote:
> 
> > > 	Umm...	Unless I'm misunderstanding you, that *would* change the
> > > calling conventions - dentry could bloody well be positive, couldn't it?
> > > And that changes quite a bit - without O_CREAT in flags you could get
> > > parent locked only shared and pass a positive hashed dentry attached
> > > to a directory inode to ->atomic_open().  The thing is, in that case it
> > > can be moved by d_splice_alias()...
> > 
> > Once we get per-dentry locking for dirops this will cease to be a
> > problem.  The dentry would be locked exclusively whether we create or
> > not and the lock would prevent the d_splice_alias() rename.
> 
> Umm...  Interesting, but... what would happen in such case, really?
> 
> You have one thread with allegedly directory dentry it had skipped
> verification for.  It tries to combine open with revalidation, and
> sends such request.  In the meanwhile, another thread has found the
> same thing in a different place - possibly because of fs corruption,
> possibly because the damn thing got moved on server, right after it has
> sent "OK, I've opened it" reply to the first thread.  What would you do?
> Have the second thread spin/fail with some error/something else?

If there is a race with d_splice_alias() wanting the move the directory
while open_atomic has it locked, -ESTALE will be returned and the lookup
will be retried with LOOKUP_REVAL which should allow the filesystem to
do any needed synchronisation.

If the directory has simply been moved on the server, then the
revalidate in atomic_open will notice this and invalidate the old dentry
and use lookup to create a new one.  Importantly it will have a stable
d_name so that it can do that.

This "invalidate and create a new one with the same name" will require
careful locking but should be achievable.

NeilBrown

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-08  9:05                 ` Al Viro
  2025-09-10  2:45                   ` NeilBrown
@ 2025-09-13 21:28                   ` Al Viro
  2025-09-14  1:05                     ` NeilBrown
  1 sibling, 1 reply; 43+ messages in thread
From: Al Viro @ 2025-09-13 21:28 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara

On Mon, Sep 08, 2025 at 10:05:57AM +0100, Al Viro wrote:

> > Fudging some type-state with C may well be useful but I suspect it is at
> > most part of a solution.  Simplification, documentation, run-time checks
> > might also be important parts.  As the type-state flag-day is a big
> > thing, maybe it shouldn't be first.
> 
> All of that requires being able to answer questions about what's there in
> the existing filesystems.  Which is pretty much the same problem as
> those audits, obviously.  And static annotations are way easier to
> reason about.

Speaking of annoyances, d_exact_alias() is gone, and good riddance, but there's
another fun issue in the same area - environment for d_splice_alias() call *and*
for one of those d_drop()-just-in-case.

The call tree is still the same:
_nfs4_open_and_get_state()
	<- _nfs4_do_open()
		<- nfs4_do_open()
			<- nfs4_atomic_open()
				== nfs_rpc_ops:open_context
					<- nfs_atomic_open()
						== ->atomic_open
					<- nfs4_file_open()
						== ->open
			<- nfs4_proc_create()
				== nfs_rpc_ops:create
					<- nfs_do_create()
						<- nfs_create()
							== ->create
						<- nfs_atomic_open_v23(), with O_CREAT
							== ->atomic_open
							# won't reach nfs4 stuff?

->create() and ->atomic_open() have the parent held at least shared;
->open() does not, but the chunk in question is hit only if dentry
is negative, which won't happen in case of ->open().

Additional complication comes from the possibility for _nfs4_open_and_get_state()
to fail after that d_splice_alias().  In that case we have _nfs4_do_open()
return an error; its caller is inside a do-while loop in nfs4_do_open() and
I think we can't end up going around the loop after such late failure (the
only error possible after that is -EACCES/-NFS4ERR_ACCESS and that's not one
of those that can lead to more iterations.

	However, looking at that late failure, that's the only call of
nfs4_opendata_access(), and that function seems to expect the possibility
of state->inode being a directory; can that really happen?

	Because if it can, we have a problem:
                alias = d_splice_alias(igrab(state->inode), dentry);
                /* d_splice_alias() can't fail here - it's a non-directory */
                if (alias) {
                        dput(ctx->dentry);
                        ctx->dentry = dentry = alias;
                }
very much *can* fail if it's reached with state->inode being a directory -
we can get ERR_PTR() out of that d_splice_alias() and that will oops at

        if (d_inode(dentry) == state->inode)
                nfs_inode_attach_open_context(ctx);
shortly afterwards (incidentally, what is that check about?  It can only
fail in case of nfs4_file_open(); should we have open(2) succeed in such
situation?)

Sigh...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-13 21:28                   ` [RFC] a possible way of reducing the PITA of ->d_name audits Al Viro
@ 2025-09-14  1:05                     ` NeilBrown
  2025-09-14  1:37                       ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: NeilBrown @ 2025-09-14  1:05 UTC (permalink / raw)
  To: Al Viro, Trond Myklebust
  Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara

On Sun, 14 Sep 2025, Al Viro wrote:
> On Mon, Sep 08, 2025 at 10:05:57AM +0100, Al Viro wrote:
> 
> > > Fudging some type-state with C may well be useful but I suspect it is at
> > > most part of a solution.  Simplification, documentation, run-time checks
> > > might also be important parts.  As the type-state flag-day is a big
> > > thing, maybe it shouldn't be first.
> > 
> > All of that requires being able to answer questions about what's there in
> > the existing filesystems.  Which is pretty much the same problem as
> > those audits, obviously.  And static annotations are way easier to
> > reason about.
> 
> Speaking of annoyances, d_exact_alias() is gone, and good riddance, but there's
> another fun issue in the same area - environment for d_splice_alias() call *and*
> for one of those d_drop()-just-in-case.
> 
> The call tree is still the same:
> _nfs4_open_and_get_state()
> 	<- _nfs4_do_open()
> 		<- nfs4_do_open()
> 			<- nfs4_atomic_open()
> 				== nfs_rpc_ops:open_context
> 					<- nfs_atomic_open()
> 						== ->atomic_open
> 					<- nfs4_file_open()
> 						== ->open
> 			<- nfs4_proc_create()
> 				== nfs_rpc_ops:create
> 					<- nfs_do_create()
> 						<- nfs_create()
> 							== ->create
> 						<- nfs_atomic_open_v23(), with O_CREAT
> 							== ->atomic_open
> 							# won't reach nfs4 stuff?
> 
> ->create() and ->atomic_open() have the parent held at least shared;
> ->open() does not, but the chunk in question is hit only if dentry
> is negative, which won't happen in case of ->open().
> 
> Additional complication comes from the possibility for _nfs4_open_and_get_state()
> to fail after that d_splice_alias().  In that case we have _nfs4_do_open()
> return an error; its caller is inside a do-while loop in nfs4_do_open() and
> I think we can't end up going around the loop after such late failure (the
> only error possible after that is -EACCES/-NFS4ERR_ACCESS and that's not one
> of those that can lead to more iterations.
> 
> 	However, looking at that late failure, that's the only call of
> nfs4_opendata_access(), and that function seems to expect the possibility
> of state->inode being a directory; can that really happen?

No that cannot happen.  In the NFSv4 protocol, OPEN is only used for
files.
Directories use the same model as v3 where you pass a filehandle to
READDIR without establishing and "open" state.

Why do you think nfs4_opendata_access() expects the possibilty of a
directory?  The purpose of the function is to handle to Posix weirdness
around being able to exec() a file that you cannot read().  Over NFS the
client needs to be able to read a file in order to execute it, so NFS
blurs the two permissions together.  This function is check which
permission was asked for and then checking access permissions to make
sure it is allowed.  i.e.  the server might let the client read the file
storing the application, but the client might not let the application be
run.


> 
> 	Because if it can, we have a problem:
>                 alias = d_splice_alias(igrab(state->inode), dentry);
>                 /* d_splice_alias() can't fail here - it's a non-directory */
>                 if (alias) {
>                         dput(ctx->dentry);
>                         ctx->dentry = dentry = alias;
>                 }
> very much *can* fail if it's reached with state->inode being a directory -
> we can get ERR_PTR() out of that d_splice_alias() and that will oops at
> 
>         if (d_inode(dentry) == state->inode)
>                 nfs_inode_attach_open_context(ctx);
> shortly afterwards (incidentally, what is that check about?  It can only
> fail in case of nfs4_file_open(); should we have open(2) succeed in such
> situation?)

I don't know what is going on here.
Based on the commit that introduced this code

Commit: c45ffdd26961 ("NFSv4: Close another NFSv4 recovery race")

there is presumably some race that can cause the test to fail.
Maybe Trond (cc:ed) could help explain.

NeilBrown

> 
> Sigh...
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-14  1:05                     ` NeilBrown
@ 2025-09-14  1:37                       ` Al Viro
  2025-09-14  5:56                         ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2025-09-14  1:37 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Linus Torvalds, Christian Brauner, linux-fsdevel,
	Jan Kara

On Sun, Sep 14, 2025 at 11:05:08AM +1000, NeilBrown wrote:
> READDIR without establishing and "open" state.
> 
> Why do you think nfs4_opendata_access() expects the possibilty of a
> directory?

static int nfs4_opendata_access(const struct cred *cred,
			struct nfs4_opendata *opendata,
			struct nfs4_state *state, fmode_t fmode)
{
	struct nfs_access_entry cache;
	u32 mask, flags;

	/* access call failed or for some reason the server doesn't
	 * support any access modes -- defer access call until later */
	if (opendata->o_res.access_supported == 0)
		return 0;

	mask = 0;
	if (fmode & FMODE_EXEC) {
		/* ONLY check for exec rights */
		if (S_ISDIR(state->inode->i_mode))
		    ^^^^^^^
How else would you describe this?
			mask = NFS4_ACCESS_LOOKUP;
		else
			mask = NFS4_ACCESS_EXECUTE;

> >         if (d_inode(dentry) == state->inode)
> >                 nfs_inode_attach_open_context(ctx);
> > shortly afterwards (incidentally, what is that check about?  It can only
> > fail in case of nfs4_file_open(); should we have open(2) succeed in such
> > situation?)
> 
> I don't know what is going on here.
> Based on the commit that introduced this code
> 
> Commit: c45ffdd26961 ("NFSv4: Close another NFSv4 recovery race")
> 
> there is presumably some race that can cause the test to fail.
> Maybe Trond (cc:ed) could help explain.

AFAICS, it can happen if you are there from nfs4_file_open(), hit
_nfs4_opendata_to_nfs4_state(opendata), find ->rpc_done to be true
in there, hit nfs4_opendata_find_nfs4_state(), have it call
nfs4_opendata_get_inode() and run into a server without
NFS_CAP_ATOMIC_OPEN_V1.  Then you get ->o_arg.claim set to
NFS4_OPEN_CLAIM_NULL and hit this:
                inode = nfs_fhget(data->dir->d_sb, &data->o_res.fh,
                                &data->f_attr);
finding not the same inode as your dentry has attached to it.

So the test might end up not being true, at least from my reading of
that code.

What I don't understand is the reasons for not failing immediately
with EOPENSTALE in that case.

TBH, I would be a lot more comfortable if the "attach inode to dentry"
logics in there had been taken several levels up the call chains - analysis
would be much easier that way...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-14  1:37                       ` Al Viro
@ 2025-09-14  5:56                         ` Al Viro
  2025-09-14 23:07                           ` NeilBrown
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2025-09-14  5:56 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Linus Torvalds, Christian Brauner, linux-fsdevel,
	Jan Kara

On Sun, Sep 14, 2025 at 02:37:30AM +0100, Al Viro wrote:

> AFAICS, it can happen if you are there from nfs4_file_open(), hit
> _nfs4_opendata_to_nfs4_state(opendata), find ->rpc_done to be true
> in there, hit nfs4_opendata_find_nfs4_state(), have it call
> nfs4_opendata_get_inode() and run into a server without
> NFS_CAP_ATOMIC_OPEN_V1.  Then you get ->o_arg.claim set to
> NFS4_OPEN_CLAIM_NULL and hit this:
>                 inode = nfs_fhget(data->dir->d_sb, &data->o_res.fh,
>                                 &data->f_attr);
> finding not the same inode as your dentry has attached to it.
> 
> So the test might end up not being true, at least from my reading of
> that code.
> 
> What I don't understand is the reasons for not failing immediately
> with EOPENSTALE in that case.
> 
> TBH, I would be a lot more comfortable if the "attach inode to dentry"
> logics in there had been taken several levels up the call chains - analysis
> would be much easier that way...
 
BTW, that's one place where your scheme with locking dentry might cause
latency problems - two opens on the same cached dentry could be sent
in parallel, but if you hold it against renames, etc., you might end up
with those two roundtrips serialized against each other...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits)
  2025-09-13  5:07                               ` Al Viro
  2025-09-13  5:50                                 ` NeilBrown
@ 2025-09-14 19:01                                 ` Miklos Szeredi
  2025-09-14 19:50                                   ` Al Viro
  1 sibling, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2025-09-14 19:01 UTC (permalink / raw)
  To: Al Viro
  Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel,
	Jan Kara, Bernd Schubert

On Sat, 13 Sept 2025 at 07:07, Al Viro <viro@zeniv.linux.org.uk> wrote:

> How would that combined revalidate+open work for fuse, anyway?  The former
> is basically a lookup - you send nodeid of parent + name, get nodeid +
> attributes of child.  The latter goes strictly by nodeid of child and
> gets a 64bit number that apparently tells one opened file from another
> (not to be confused with fhandle).  Combined request of some sort?

There are already two combined ones: FUSE_CREATE and FUSE_TMPFILE both
get nodeid of parent + name and return attributes of child plus opened
file.  FUSE_CREATE gets invoked in the uncached or cached negative
case from ->atomic_open() with inode lock for held exclusive.

That leaves 2 cases:

- uncached plain open: FUSE_OPEN_ATOMIC with same semantics as
FUSE_CREATE, inode lock held shared

- cached positive (plain or O_CREAT): FUSE_OPEN_REVAL getting nodeid
of parent + name + nodeid of child and return opened file or -ESTALE,
no locking required

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits)
  2025-09-14 19:01                                 ` Miklos Szeredi
@ 2025-09-14 19:50                                   ` Al Viro
  2025-09-14 20:05                                     ` Miklos Szeredi
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2025-09-14 19:50 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel,
	Jan Kara, Bernd Schubert

On Sun, Sep 14, 2025 at 09:01:40PM +0200, Miklos Szeredi wrote:
> On Sat, 13 Sept 2025 at 07:07, Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> > How would that combined revalidate+open work for fuse, anyway?  The former
> > is basically a lookup - you send nodeid of parent + name, get nodeid +
> > attributes of child.  The latter goes strictly by nodeid of child and
> > gets a 64bit number that apparently tells one opened file from another
> > (not to be confused with fhandle).  Combined request of some sort?
> 
> There are already two combined ones: FUSE_CREATE and FUSE_TMPFILE both
> get nodeid of parent + name and return attributes of child plus opened
> file.  FUSE_CREATE gets invoked in the uncached or cached negative
> case from ->atomic_open() with inode lock for held exclusive.
> 
> That leaves 2 cases:
> 
> - uncached plain open: FUSE_OPEN_ATOMIC with same semantics as
> FUSE_CREATE, inode lock held shared
> 
> - cached positive (plain or O_CREAT): FUSE_OPEN_REVAL getting nodeid
> of parent + name + nodeid of child and return opened file or -ESTALE,
> no locking required

What happens if the latter overlaps with rename?  Or, for a cached
directory inode, with lookup elsewhere picking the same inode as our
cached (and possibly invalid, but we don't know that) dentry?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits)
  2025-09-14 19:50                                   ` Al Viro
@ 2025-09-14 20:05                                     ` Miklos Szeredi
  2025-09-15  8:54                                       ` Bernd Schubert
  0 siblings, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2025-09-14 20:05 UTC (permalink / raw)
  To: Al Viro
  Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel,
	Jan Kara, Bernd Schubert

On Sun, 14 Sept 2025 at 21:50, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sun, Sep 14, 2025 at 09:01:40PM +0200, Miklos Szeredi wrote:

> > - cached positive (plain or O_CREAT): FUSE_OPEN_REVAL getting nodeid
> > of parent + name + nodeid of child and return opened file or -ESTALE,
> > no locking required
>
> What happens if the latter overlaps with rename?  Or, for a cached
> directory inode, with lookup elsewhere picking the same inode as our
> cached (and possibly invalid, but we don't know that) dentry?

In the overlapping rename case either result is okay: if the open
succeeded, that means open won the race against rename.  If on the
other hand open_reval lost and didn't find the expected child, then
dentry is going to get invalidated and the whole thing retried ending
up with open_atomic or open_creat with no further races possible.

With the stale dentry getting moved by d_splice_alias() it's a
slightly different story.  The dentry starts out stale, but by being
moved to its proper place it becomes uptodate again.  So before
invalidating it, it would make sense to check whether the parent and
the name still matches the one we are looking up.   But that race
happens with plain ->d_revalidate() as well AFAICS, so that wouldn't
be a new thing.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC] a possible way of reducing the PITA of ->d_name audits
  2025-09-14  5:56                         ` Al Viro
@ 2025-09-14 23:07                           ` NeilBrown
  0 siblings, 0 replies; 43+ messages in thread
From: NeilBrown @ 2025-09-14 23:07 UTC (permalink / raw)
  To: Al Viro
  Cc: Trond Myklebust, Linus Torvalds, Christian Brauner, linux-fsdevel,
	Jan Kara

On Sun, 14 Sep 2025, Al Viro wrote:
> On Sun, Sep 14, 2025 at 02:37:30AM +0100, Al Viro wrote:
> 
> > AFAICS, it can happen if you are there from nfs4_file_open(), hit
> > _nfs4_opendata_to_nfs4_state(opendata), find ->rpc_done to be true
> > in there, hit nfs4_opendata_find_nfs4_state(), have it call
> > nfs4_opendata_get_inode() and run into a server without
> > NFS_CAP_ATOMIC_OPEN_V1.  Then you get ->o_arg.claim set to
> > NFS4_OPEN_CLAIM_NULL and hit this:
> >                 inode = nfs_fhget(data->dir->d_sb, &data->o_res.fh,
> >                                 &data->f_attr);
> > finding not the same inode as your dentry has attached to it.
> > 
> > So the test might end up not being true, at least from my reading of
> > that code.
> > 
> > What I don't understand is the reasons for not failing immediately
> > with EOPENSTALE in that case.
> > 
> > TBH, I would be a lot more comfortable if the "attach inode to dentry"
> > logics in there had been taken several levels up the call chains - analysis
> > would be much easier that way...
>  
> BTW, that's one place where your scheme with locking dentry might cause
> latency problems - two opens on the same cached dentry could be sent
> in parallel, but if you hold it against renames, etc., you might end up
> with those two roundtrips serialized against each other...
> 

That is certainly something we need to explore when the time comes,
though at the moment I'm mostly focuses on trying to land APIs to
centralise the locking so that we can easily see what locking is
actually being done and can change it all in one place.

Currently all calls to ->atomic_open are exclusive on the dentry, either
because O_CREATE means the directory is locked exclusively or because
d_alloc_parallel() locked the dentry with DCACHE_PAR_LOOKUP.  ...except
a cached-negative (which was accepted by d_revalidate) without O_CREATE.
I would rather we didn't call ->atomic_open in that case.  If the
filesystem wants us to (which NFS does) they can fail ->d_revalidate
for LOOKUP_OPEN.  I have a patch to do that for NFS so we can drop the
d_alloc_parallel() from nfs_atomic_open().  I think we need that case to
be exclusive anyway.  You achieved that for NFS by using
d_alloc_parallel().

All other atomic_open functions simply call finish_no_open() or return
-ENOENT when given a non-d_in_lookup() dentry and are not asked to
CREATE.  They could all be simplified if we avoided the atomic_open call
in that case.

The proposal is to add a case when O_CREATE isn't requested and the
dentry is no longer d_in_lookup().  Probably that needs exclusive access
to the dentry anyway?  Either way it wouldn't be a regression.

I wonder if maybe we should transition from offering ->atomic_open() to
offering ->lookup_open() which is called instead of the current
lookup_open() (though with locking moved inside the function).

So for a LAST_NORM open the filesystem can take complete control after
the parent has been found.  The VFS can export a bunch of appropriate
helpers which the filesystem can mix and match.  This seems cleaner than
having a ->lookup() which bypasses LOOKUP_OPEN requests and a
->d_revalidate which does the same.

NeilBrown

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits)
  2025-09-14 20:05                                     ` Miklos Szeredi
@ 2025-09-15  8:54                                       ` Bernd Schubert
  0 siblings, 0 replies; 43+ messages in thread
From: Bernd Schubert @ 2025-09-15  8:54 UTC (permalink / raw)
  To: Miklos Szeredi, Al Viro
  Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel,
	Jan Kara, Horst Birthelmer



On 9/14/25 22:05, Miklos Szeredi wrote:
> On Sun, 14 Sept 2025 at 21:50, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> On Sun, Sep 14, 2025 at 09:01:40PM +0200, Miklos Szeredi wrote:
> 
>>> - cached positive (plain or O_CREAT): FUSE_OPEN_REVAL getting nodeid
>>> of parent + name + nodeid of child and return opened file or -ESTALE,
>>> no locking required
>>
>> What happens if the latter overlaps with rename?  Or, for a cached
>> directory inode, with lookup elsewhere picking the same inode as our
>> cached (and possibly invalid, but we don't know that) dentry?
> 
> In the overlapping rename case either result is okay: if the open
> succeeded, that means open won the race against rename.  If on the
> other hand open_reval lost and didn't find the expected child, then
> dentry is going to get invalidated and the whole thing retried ending
> up with open_atomic or open_creat with no further races possible.
> 
> With the stale dentry getting moved by d_splice_alias() it's a
> slightly different story.  The dentry starts out stale, but by being
> moved to its proper place it becomes uptodate again.  So before
> invalidating it, it would make sense to check whether the parent and
> the name still matches the one we are looking up.   But that race
> happens with plain ->d_revalidate() as well AFAICS, so that wouldn't
> be a new thing.

Thanks Miklos for bringing this up again! I'm going to try to update
the patches this week. One issue with open-atomic and open-revalidate
was testing - Horst (in CC) had started to create fuse specific
xfstests for that.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2025-09-15  8:54 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-07 20:32 [RFC] a possible way of reducing the PITA of ->d_name audits Al Viro
2025-09-07 21:51 ` Linus Torvalds
2025-09-08  0:06   ` Al Viro
2025-09-08  0:47     ` Linus Torvalds
2025-09-08  2:51       ` Al Viro
2025-09-08  3:57         ` Al Viro
2025-09-08  4:50           ` NeilBrown
2025-09-08  5:19             ` Al Viro
2025-09-08  6:25               ` NeilBrown
2025-09-08  9:05                 ` Al Viro
2025-09-10  2:45                   ` NeilBrown
2025-09-10  7:24                     ` Al Viro
2025-09-10 22:52                       ` NeilBrown
2025-09-12  5:49                       ` ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) Al Viro
2025-09-12  8:23                         ` Miklos Szeredi
2025-09-12 18:29                           ` Al Viro
2025-09-12 19:22                             ` Miklos Szeredi
2025-09-12 20:36                               ` Al Viro
2025-09-12 20:50                                 ` Al Viro
2025-09-13  3:36                             ` NeilBrown
2025-09-13  5:07                               ` Al Viro
2025-09-13  5:50                                 ` NeilBrown
2025-09-14 19:01                                 ` Miklos Szeredi
2025-09-14 19:50                                   ` Al Viro
2025-09-14 20:05                                     ` Miklos Szeredi
2025-09-15  8:54                                       ` Bernd Schubert
2025-09-12 18:55                         ` Al Viro
2025-09-12 18:59                           ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro
2025-09-12 18:59                             ` [PATCH 2/9] 9p: simplify v9fs_vfs_atomic_open() Al Viro
2025-09-12 18:59                             ` [PATCH 3/9] 9p: simplify v9fs_vfs_atomic_open_dotl() Al Viro
2025-09-12 18:59                             ` [PATCH 4/9] simplify cifs_atomic_open() Al Viro
2025-09-12 18:59                             ` [PATCH 5/9] simplify vboxsf_dir_atomic_open() Al Viro
2025-09-12 18:59                             ` [PATCH 6/9] simplify nfs_atomic_open_v23() Al Viro
2025-09-12 18:59                             ` [PATCH 7/9] simplify fuse_atomic_open() Al Viro
2025-09-12 18:59                             ` [PATCH 8/9] simplify gfs2_atomic_open() Al Viro
2025-09-12 18:59                             ` [PATCH 9/9] slightly simplify nfs_atomic_open() Al Viro
2025-09-12 22:23                             ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Linus Torvalds
2025-09-13  3:34                             ` NeilBrown
2025-09-13 21:28                   ` [RFC] a possible way of reducing the PITA of ->d_name audits Al Viro
2025-09-14  1:05                     ` NeilBrown
2025-09-14  1:37                       ` Al Viro
2025-09-14  5:56                         ` Al Viro
2025-09-14 23:07                           ` NeilBrown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).