* [RFC] a possible way of reducing the PITA of ->d_name audits
@ 2025-09-07 20:32 Al Viro
2025-09-07 21:51 ` Linus Torvalds
0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2025-09-07 20:32 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, Jan Kara, NeilBrown
I would like to discuss a flagday change to calling conventions
of ->create()/->unlink()/->rename()/etc. - all directory methods.
It would have no impact on the code generation, it would involve
quite a bit of localized churn and it would allow to deal with catching
->d_name races on compiler level.
A bit of background:
Unlike file->f_path, dentry->d_name *can* change for a live
dentry. We had quite a few bugs where it had been used unsafely and
new ones almost certainly will crop up.
A part of a problem is that locking rules are fairly
convoluted; they do cover most of the uses in filesystem code these days
(->d_revalidate() was the last major source of headache), but they are
not fun to verify. I've no better solution than a code audit once in a while,
and every time it's really unpleasant.
It starts with verifying that ->d_name contents of a live dentry
can only be changed by __d_move(). The next step is tracing the callers
of that and verifying several predicates regarding the locks held
by callers. Next we need to go over the places that access ->d_name
and check that locking in those does give sufficient exclusion.
That's about 800 locations to look through right now, and examining those
(and the call chains leading to them) is very distinctly Not Fun(tm).
It used to be worse; reducing the number of places that needed
to be examined is something I'd been doing for quite a while (debugfs
bunch was the latest mostly taken out). The problem is, most of the
remaining stuff is genuinely needed - foofs_unlink() *does* need to
know which directory entry to remove, etc.
The typical part of that audit goes like that:
* foo_do_something(whatever, dentry, flags) uses dentry->d_name.{name,len}.
* we need to examine all callchains that might lead to it and prove
that locking environment in each of those is sufficient to stabilize
these values. Note that locking environments may vary between the
callchains, with their intersection being too weak.
* one callchain is
call of ->i_op->symlink() hits foo_symlink()
foo_symlink() calls foo_mknod()
foo_mknod() call foo_create_object()
foo_create_object() calls foo_do_something()
with dentry in the last one coming from the third argument of ->symlink().
All callers of ->symlink() guarantee the sufficient locking environment
for stability of that argument - both its ->d_name and ->d_parent are
not going to change under us (that, BTW, is a separate audit, thankfully
a much smaller one). Therefore we know that this callchain is OK.
Lather, rinse, repeat...
None of that is all that complicated (well, unless you are looking
at something with obscene call chains *cough*ceph*cough*). The trouble
is, there are literally hundreds of places to examine, and that needs to
be multiplied by the number of callchains. Doing that manually is bloody
painful; "AI" (s)tools are worse than useless in that area - verifying the
output is actually _harder_ than doing the whole thing manually.
We need some annotations ("this dentry is guaranteed to be stable"),
some way to verify their correctness and helpers that would give access
to members in question (both ->d_name and ->d_parent), with some way to
check that they are only used for stable ones. IMO that's a good fit for
type system. And AFAICS, the C type system, weak as it is, can be used
for that.
Suppose we introduce something like
struct stable_dentry {
struct dentry *__touch_that_and_suffer;
};
static inline struct stable_dentry claim_stability(struct dentry *dentry)
{
return (struct stable_dentry){dentry};
}
static inline struct dentry *unwrap_dentry(struct stable_dentry v)
{
/* this would better be the only place using that identifier */
return v.__touch_that_and_suffer;
}
static const struct qstr *dentry_name(struct stable_dentry v)
{
return &unwrap_dentry(v)->d_name;
}
static struct dentry *dentry_parent(struct stable_dentry v)
{
return unwrap_dentry(v)->d_parent;
}
Those things get passed as argument and returned in the same way
dentry pointers are, AFAICS on all ABIs we care about. And they can
serve as annotations - to pass such object (by value) instead of passing
a dentry reference == claim that dentry in question is stable and will
remain such for duration of call.
So we can do a series of patches along the lines of
* switch ->symlink() to use of struct stable_dentry
* replace
static int ramfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
struct dentry *dentry, const char *symname)
{
with
static int ramfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
struct stable_dentry child, const char *symname)
{
struct dentry *dentry = unwrap_dentry(child);
and similar for all instances
* replace the caller(s) (there's only one, in this case - in vfs_symlink())
so that
error = dir->i_op->symlink(idmap, dir, dentry, oldname);
becomes
error = dir->i_op->symlink(idmap, dir, claim_stability(dentry), oldname);
possibly with a followup that switches vfs_symlink() to stable_dentry as well
(in which case claim_stability() shifts to the callers).
* Add to D/f/porting.rst:
**mandatory**
->symlink() takes struct stable_dentry now; if you are affected, replace
the third argument with struct stable_dentry and use unwrap_dentry()
to obtain the dentry reference from it. If your ->symlink() instance
happens to be called directly, wrap the argument into claim_stability()
at the call site (and check that you *do* have sufficient locking
environment there - if you didn't, it's a bug right there and you'd
need to fix it first, so that it could be backported without dragging
the calling conventions change along).
That's the flagday part; it would be fairly mechanical, with very easy
way to conform for anything outside of mainline tree.
Once that is done, we can eliminate the direct uses of ->d_name
- dentry_name(child) would do it; might make sense to propagate
stable_dentry down into helpers - depends upon the filesystem.
What we get out of that is a much smaller set of places to audit, all
easily catchable by grep.
1) claim_stability() calls. There are _far_ fewer of those,
and each does need to be verified anyway.
2) places where we use ->d_name in filesystems. Hopefully to
be much reduced. Something that e.g. grabs ->d_lock and looks at ->d_name
is safe, and that ->d_lock would typically be taken within a few lines
before the use.
3) places that simulate claim_stability; catchable by grep -
that can be defeated by preprocessor-level obfuscation, but anything
like _that_ would be a confession of malicious intent. And sparse would
not be hard to teach catching those anyway.
Everything else would be caught by compiler.
IMO getting rid of that headache (several days of unpleasant mechanical
work every time) would be very tempting. OTOH, it is a flagday change
and such calling conventions changes risk the bikeshedding fest from hell...
Comments?
^ permalink raw reply [flat|nested] 43+ messages in thread* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-07 20:32 [RFC] a possible way of reducing the PITA of ->d_name audits Al Viro @ 2025-09-07 21:51 ` Linus Torvalds 2025-09-08 0:06 ` Al Viro 0 siblings, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2025-09-07 21:51 UTC (permalink / raw) To: Al Viro; +Cc: Christian Brauner, linux-fsdevel, Jan Kara, NeilBrown On Sun, 7 Sept 2025 at 13:32, Al Viro <viro@zeniv.linux.org.uk> wrote: > > I would like to discuss a flagday change to calling conventions > of ->create()/->unlink()/->rename()/etc. - all directory methods. > It would have no impact on the code generation, it would involve > quite a bit of localized churn and it would allow to deal with catching > ->d_name races on compiler level. Can you make this more concrete by actually sending an example patch. Well, two patches: first the patch for the "claim_stability" helper type and functions, and then a separate patch for converting _one_ of the users (eg 'symlink'). Because I have a hard time visualizing just how noisy the result would be (and whether it would be legible end result). And I do wonder if it might not be simpler to have a model where filesystems always get a stable dentry name - either because we hold the parent lock on a VFS level (fairly common, I think), or because we pass a separate copy to the filesystem You did that with the d_revalidate() callback, and I think that was a clear success. Can we extend on *that* pattern, perhaps? Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-07 21:51 ` Linus Torvalds @ 2025-09-08 0:06 ` Al Viro 2025-09-08 0:47 ` Linus Torvalds 0 siblings, 1 reply; 43+ messages in thread From: Al Viro @ 2025-09-08 0:06 UTC (permalink / raw) To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, Jan Kara, NeilBrown On Sun, Sep 07, 2025 at 02:51:18PM -0700, Linus Torvalds wrote: > On Sun, 7 Sept 2025 at 13:32, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > I would like to discuss a flagday change to calling conventions > > of ->create()/->unlink()/->rename()/etc. - all directory methods. > > It would have no impact on the code generation, it would involve > > quite a bit of localized churn and it would allow to deal with catching > > ->d_name races on compiler level. > > Can you make this more concrete by actually sending an example patch. > > Well, two patches: first the patch for the "claim_stability" helper > type and functions, and then a separate patch for converting _one_ of > the users (eg 'symlink'). > > Because I have a hard time visualizing just how noisy the result would > be (and whether it would be legible end result). Will do... BTW, for a non-obvious example of the things that show up in process: ceph_wait_on_conflict_unlink() is actually guaranteed that its argument is stable. The way it copies ->d_name locally (with a good reason - there's a loop using that for comparisons, so it's a valid optimization) would otherwise be seriously racy. > And I do wonder if it might not be simpler to have a model where > filesystems always get a stable dentry name - either because we hold > the parent lock on a VFS level (fairly common, I think), or because we > pass a separate copy to the filesystem > > You did that with the d_revalidate() callback, and I think that was a > clear success. Can we extend on *that* pattern, perhaps? Umm... For one thing, there's something wrong with passing two arguments that always differ by constant offset (and type, of course)... More serious reason is that in quite a few cases we want both dentry and stable name accessible for a helper called from those (ceph_wait_on_conflict_unlink(), for example). And for those we either have to pass both dentry and const struct qstr * or we still have to trace their call chains every time we do that audit. Passing struct stable_dentry instead avoids that... ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-08 0:06 ` Al Viro @ 2025-09-08 0:47 ` Linus Torvalds 2025-09-08 2:51 ` Al Viro 0 siblings, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2025-09-08 0:47 UTC (permalink / raw) To: Al Viro; +Cc: Christian Brauner, linux-fsdevel, Jan Kara, NeilBrown On Sun, 7 Sept 2025 at 17:06, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > You did that with the d_revalidate() callback, and I think that was a > > clear success. Can we extend on *that* pattern, perhaps? > > Umm... For one thing, there's something wrong with passing two arguments > that always differ by constant offset (and type, of course)... I would expect that you *only* do this for the functions where the name isn't stable (ie it's called without the parent locked). So rmdir/mknod/link/etc would continue to just pass the dentry, because the name is stable and filesystems using dentry->d_name is perfectly fine. Only the cases where we pass a dentry otherwise, would we do that separate 'struct qname' - and so then the name would *not* be 'dentry->d_name', but a copy. Exactly like d_revalidate. Of course, d_revalidate() already *had* that separate copy (created for the lookup), and when that isn't true you would need to waste time and effort in creating it. Ie we'd end up using take_dentry_name_snapshot(). Would that be horribly bad? Don't a lot of routines already have the parent locked - all the normal functions where we actually _modify_ the directory - so the name is stable. No? Then, the other thing we could do is just mark "struct qstr d_name" as 'const' in struct dentry, and then the (very very few) places that actually modify the name will have to use an alias to do the modification. Wouldn't that deal with at least a majority of the worries of people playing games? This is where you go "Oh, Linus, you sweet summer child", and shake your head. You've been walking through the call-chains, I haven't. Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-08 0:47 ` Linus Torvalds @ 2025-09-08 2:51 ` Al Viro 2025-09-08 3:57 ` Al Viro 0 siblings, 1 reply; 43+ messages in thread From: Al Viro @ 2025-09-08 2:51 UTC (permalink / raw) To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, Jan Kara, NeilBrown On Sun, Sep 07, 2025 at 05:47:31PM -0700, Linus Torvalds wrote: > On Sun, 7 Sept 2025 at 17:06, Al Viro <viro@zeniv.linux.org.uk> wrote: > I would expect that you *only* do this for the functions where the > name isn't stable (ie it's called without the parent locked). > > So rmdir/mknod/link/etc would continue to just pass the dentry, > because the name is stable and filesystems using dentry->d_name is > perfectly fine. Absolute majority of those ~800 hits *are* in those or in functions called only from them. > Ie we'd end up using take_dentry_name_snapshot(). > > Would that be horribly bad? In some cases we'll have to (trace shite, mostly - racy, but not "the sky is falling" stuff), but that (and any real crap) drowns in the false positives. > Don't a lot of routines already have the parent locked - all the > normal functions where we actually _modify_ the directory - so the > name is stable. No? Yes, it is. > Then, the other thing we could do is just mark "struct qstr d_name" as > 'const' in struct dentry, and then the (very very few) places that > actually modify the name will have to use an alias to do the > modification. Yes, and I have that (that's what I mentioned as the first and easy part of those audits - see #work.qstr, doing exactly that). > Wouldn't that deal with at least a majority of the worries of people > playing games? > > This is where you go "Oh, Linus, you sweet summer child", and shake > your head. You've been walking through the call-chains, I haven't. The problem is that a regression (e.g. somebody suddenly starting to read ->d_name in their ->open() or ->setattr(), or... - we had such cases) is always possible and the way the things are it's fucking hard to spot. Most of the uses *are* done to stable dentries; it's just that we have no way to tell which ones are like that. That's exactly the reason why I'm thinking about such annotations. The only way to catch regressions is to grep, go over the list and manually verify that all of those places still are reachable only from the "safe" methods. Worse, comparing the grep output from the previous cycle (modulo line number shifts, etc.) doesn't help - if you have one in a function 3 levels down from a method and its caller grows an extra call site, you would see nothing in the diff anywhere near the places where ->d_name is used or any of the directory methods. I'm not so worried about anyone malicious - honest fuckups happen and AFAICS all of them so far had been of that sort. I would really like to be able to find such fuckups without ~8--12 hours of digging; needs breaks, too, since it's monotonous enough to cause something similar to highway hypnosis, IME ;-/ ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-08 2:51 ` Al Viro @ 2025-09-08 3:57 ` Al Viro 2025-09-08 4:50 ` NeilBrown 0 siblings, 1 reply; 43+ messages in thread From: Al Viro @ 2025-09-08 3:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, Jan Kara, NeilBrown On Mon, Sep 08, 2025 at 03:51:35AM +0100, Al Viro wrote: > Most of the uses *are* done to stable dentries; it's just that we have no > way to tell which ones are like that. Random example to get the taste of that joy: static void xfs_dentry_to_name( struct xfs_name *namep, struct dentry *dentry) { namep->name = dentry->d_name.name; namep->len = dentry->d_name.len; namep->type = XFS_DIR3_FT_UNKNOWN; } OK, fetches from ->d_name. Callers: xfs_cleanup_inode() xfs_generic_create() xfs_vn_mknod() == xfs_dir_inode_operations.mknod == xfs_dir_ci_inode_operations.mknod xfs_vn_create() == xfs_dir_inode_operations.create == xfs_dir_ci_inode_operations.create xfs_vn_mkdir() == xfs_dir_inode_operations.mkdir == xfs_dir_ci_inode_operations.mkdir xfs_vn_tmpfile() # WTF? xfs_vn_symlink() == xfs_dir_inode_operations.symlink == xfs_dir_ci_inode_operations.symlink xfs_vn_lookup() == xfs_dir_inode_operations.lookup xfs_vn_ci_lookup() == xfs_dir_ci_inode_operations.lookup xfs_vn_unlink() == xfs_dir_inode_operations.unlink == xfs_dir_inode_operations.rmdir == xfs_dir_ci_inode_operations.unlink == xfs_dir_ci_inode_operations.rmdir (+ checking that in all cases dentry has come from the method argument) WTF is going on with xfs_vn_tmpfile()? It doesn't *have* any useful name... looking... Aha. vfs_generic_create(_, _, _, _, _, p) only calls xfs_cleanup_inode() in case when p is NULL; xfs_vn_tmpfile() is called only as ->tmpfile(), and the only caller of that is this: file->f_path.mnt = parentpath->mnt; file->f_path.dentry = child; mode = vfs_prepare_mode(idmap, dir, mode, mode, mode); error = dir->i_op->tmpfile(idmap, dir, file, mode); so the method never gets called with NULL as the 3rd argument. Safe... And that's just one example - two grep hits. Right next to them, static int xfs_dentry_mode_to_name( struct xfs_name *namep, struct dentry *dentry, int mode) { namep->name = dentry->d_name.name; namep->len = dentry->d_name.len; namep->type = xfs_mode_to_ftype(mode); if (unlikely(namep->type == XFS_DIR3_FT_UNKNOWN)) return -EFSCORRUPTED; return 0; } Callers: xfs_generic_create() same callers as above, but this time it's *not* conditional: ------------------------------------------------------------------ /* Verify mode is valid also for tmpfile case */ error = xfs_dentry_mode_to_name(&name, dentry, args.mode); if (unlikely(error)) goto out_free_acl; ------------------------------------------------------------------ presumably it's "we may fetch shite for tmpfile, but in that case we won't use that shite". xfs_vn_link() == xfs_dir_inode_operations.link == xfs_dir_ci_inode_operations.link xfs_vn_symlink() seen above, same dentry as above xfs_vn_rename() # for odentry == xfs_dir_inode_operations.rename == xfs_dir_ci_inode_operations.rename xfs_vn_rename() # for ndentry seen above That's not all for xfs, though - there's also error = xfs_inode_init_security(inode, dir, &dentry->d_name); in the same xfs_generic_create() - and also called for tmpfile case, AFAISC. Which is quite likely a bug - ->d_name is stable there, all right, but at that stage it's "/"; what selinux (the only thing that cares about the basename of object being created) would do to that is an interesting question, might depend upon the policy. Non-tmpfile callers are OK, as seen above. Another one: error = xfs_inode_init_security(inode, dir, &dentry->d_name); in xfs_vn_symlink(), safe per above. Another: if (dentry->d_name.len >= MAXNAMELEN) return ERR_PTR(-ENAMETOOLONG); in xfs_vn_lookup() and xfs_vn_ci_lookup(). Safe. ... and finally there's this, in all its foul glory: DECLARE_EVENT_CLASS(xrep_dentry_class, TP_PROTO(struct xfs_mount *mp, const struct dentry *dentry), TP_ARGS(mp, dentry), TP_STRUCT__entry( __field(dev_t, dev) __field(unsigned int, flags) __field(unsigned long, ino) __field(bool, positive) __field(unsigned long, parent_ino) __field(unsigned int, namelen) __dynamic_array(char, name, dentry->d_name.len) ), TP_fast_assign( __entry->dev = mp->m_super->s_dev; __entry->flags = dentry->d_flags; __entry->positive = d_is_positive(dentry); if (dentry->d_parent && d_inode(dentry->d_parent)) __entry->parent_ino = d_inode(dentry->d_parent)->i_ino; else __entry->parent_ino = -1UL; __entry->ino = d_inode(dentry) ? d_inode(dentry)->i_ino : 0; __entry->namelen = dentry->d_name.len; memcpy(__get_str(name), dentry->d_name.name, dentry->d_name.len); ), TP_printk("dev %d:%d flags 0x%x positive? %d parent_ino 0x%lx ino 0x%lx name '%.*s'", MAJOR(__entry->dev), MINOR(__entry->dev), __entry->flags, __entry->positive, __entry->parent_ino, __entry->ino, __entry->namelen, __get_str(name)) ); #define DEFINE_REPAIR_DENTRY_EVENT(name) \ DEFINE_EVENT(xrep_dentry_class, name, \ TP_PROTO(struct xfs_mount *mp, const struct dentry *dentry), \ TP_ARGS(mp, dentry)) DEFINE_REPAIR_DENTRY_EVENT(xrep_adoption_check_child); DEFINE_REPAIR_DENTRY_EVENT(xrep_adoption_invalidate_child); DEFINE_REPAIR_DENTRY_EVENT(xrep_dirtree_delete_child); used by trace_xrep_adoption_check_child(sc->mp, d_child); in xrep_adoption_check_dcache(), two calls of trace_xrep_adoption_invalidate_child(sc->mp, d_child); in xrep_adoption_zap_dcache() and trace_xrep_dirtree_delete_child(dp->i_mount, child_dentry); in xrep_dirtree_purge_dentry() The last 3 are *not* stable - fuck knows if they can happen in parallel with lookups from other threads (those can end up moving dentries on sufficiently buggered filesystem), but IMO these deserve take_dentry_name_snapshot() treatment - if tracepoint is active, that is. The rest all get stable dentries; I would really prefer to have that checked by compiler, with sufficient annotations given to it. In this case - struct stable_dentry arguments for lookup/create/mkdir/mknod/symlink/unlink/rmdir, rename and link as part of calling conventions change + stable_dentry as argument of xfs_generic_create(), xfs_dentry_mode_to_name(), xfs_dentry_to_name() and xfs_cleanup_inode() + claim of stability in xfs_vn_tmpfile(): STATIC int xfs_vn_tmpfile( struct mnt_idmap *idmap, struct inode *dir, struct file *file, umode_t mode) { int err = xfs_generic_create(idmap, dir, claim_stability(file->f_path.dentry), mode, 0, file); // at this point in ->tmpfile() dentry is nameless and negative; // nothing can move it until we get to finish_open_...() return finish_open_simple(file, err); } That way xfs hits will be down to that claim_stability() and the obscenity in trace.h - until the users of the latter get wrapped into something that would take snapshots and pass those instead of messing with ->d_name. Considering the fun quoted above, not having to repeat that digging is something I'd count as a win... ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-08 3:57 ` Al Viro @ 2025-09-08 4:50 ` NeilBrown 2025-09-08 5:19 ` Al Viro 0 siblings, 1 reply; 43+ messages in thread From: NeilBrown @ 2025-09-08 4:50 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara On Mon, 08 Sep 2025, Al Viro wrote: > That way xfs hits will be down to that claim_stability() and the obscenity in > trace.h - until the users of the latter get wrapped into something that would > take snapshots and pass those instead of messing with ->d_name. Considering > the fun quoted above, not having to repeat that digging is something I'd > count as a win... > What would you think of providing an accessor function and insisting everyone use it - and have some sort of lockdep_assert_held() to that function so that developers who test their code will see these problem? Then a simple grep can find any unapproved uses. NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-08 4:50 ` NeilBrown @ 2025-09-08 5:19 ` Al Viro 2025-09-08 6:25 ` NeilBrown 0 siblings, 1 reply; 43+ messages in thread From: Al Viro @ 2025-09-08 5:19 UTC (permalink / raw) To: NeilBrown; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara On Mon, Sep 08, 2025 at 02:50:10PM +1000, NeilBrown wrote: > On Mon, 08 Sep 2025, Al Viro wrote: > > That way xfs hits will be down to that claim_stability() and the obscenity in > > trace.h - until the users of the latter get wrapped into something that would > > take snapshots and pass those instead of messing with ->d_name. Considering > > the fun quoted above, not having to repeat that digging is something I'd > > count as a win... > > > > What would you think of providing an accessor function and insisting > everyone use it - and have some sort of lockdep_assert_held() to that > function so that developers who test their code will see these problem? > > Then a simple grep can find any unapproved uses. Really? Consider ->link(). Both arguments *are* stable, but the reasons are not just different - they don't even intersect. Old: known to be a regular file, held locked. The former guarantees that it can't be moved by d_splice_alias(), the latter prevents that being done by vfs_rename(), which also locks the objects being moved. New: has been looked up after its parent had been locked. Note that _after_ is important here - you can't just blindly fetch ->d_parent and check if its inode is locked (not to mention anything else, you'd need to check it being non-NULL, do it under rcu_read_lock(), et sodding cetera - ->d_parent stability rules are not that different from ->d_name ones). And this "everyone use it" is not going to fly - you still have places where it's done simply under ->d_lock. Or ->d_lock on known parent - either would suffice. Besides, there's "which primitive do I use here?" - with the annotation approach it's as simple as "if I have it as stable_dentry, just use stable_dentry_name(), otherwise think hard - it may be tricky". I don't believe that lockdep is an answer here - annotations (and these *are* annotations - no code generation changes at all) give better coverage, and bitrot tends to happen in rarely taken failure exits. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-08 5:19 ` Al Viro @ 2025-09-08 6:25 ` NeilBrown 2025-09-08 9:05 ` Al Viro 0 siblings, 1 reply; 43+ messages in thread From: NeilBrown @ 2025-09-08 6:25 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara On Mon, 08 Sep 2025, Al Viro wrote: > On Mon, Sep 08, 2025 at 02:50:10PM +1000, NeilBrown wrote: > > On Mon, 08 Sep 2025, Al Viro wrote: > > > That way xfs hits will be down to that claim_stability() and the obscenity in > > > trace.h - until the users of the latter get wrapped into something that would > > > take snapshots and pass those instead of messing with ->d_name. Considering > > > the fun quoted above, not having to repeat that digging is something I'd > > > count as a win... > > > > > > > What would you think of providing an accessor function and insisting > > everyone use it - and have some sort of lockdep_assert_held() to that > > function so that developers who test their code will see these problem? > > > > Then a simple grep can find any unapproved uses. > > Really? Consider ->link(). Both arguments *are* stable, but the reasons > are not just different - they don't even intersect. Might the locking rules be too complex? Are they even documented? As you know I want to change directory locking so that a ->d_flags bit locks a dentry in much the same way that locking the parent directory currently does. I had wondered why vfs_link() locks the inode being linked and thought it was only to protect ->i_nlink. If it is needed to protect against rename too, that could usefully be documented - or we could use the same ->d_flags bit to ensure that lock. I guess I'm a bit concerned that your goal here is to transition from "lots of auditing" to "much less auditing" and I would rather it was "no auditing needed" - one day you won't want to (or be able to) audit any more. Fudging some type-state with C may well be useful but I suspect it is at most part of a solution. Simplification, documentation, run-time checks might also be important parts. As the type-state flag-day is a big thing, maybe it shouldn't be first. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-08 6:25 ` NeilBrown @ 2025-09-08 9:05 ` Al Viro 2025-09-10 2:45 ` NeilBrown 2025-09-13 21:28 ` [RFC] a possible way of reducing the PITA of ->d_name audits Al Viro 0 siblings, 2 replies; 43+ messages in thread From: Al Viro @ 2025-09-08 9:05 UTC (permalink / raw) To: NeilBrown; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara On Mon, Sep 08, 2025 at 04:25:26PM +1000, NeilBrown wrote: > Might the locking rules be too complex? Are they even documented? For ->d_name and ->d_parent? Very poorly. That's the absolute worst part of struct dentry locking rules. _That_ is the place where attempts to write a coherent documentation stall every time, often diverting into yet another side story from hell. ->d_revalidate() calling convention change came from one of those; at least that got dealt with for good now - and took a couple of years in the making, what with getting sidetracked to other stuff. Keep in mind that there are weird filesystems that manage to get away with playing very odd games. Or do not manage, really - apparmorfs is the most recent weird thing. Locking of their own, *interspersed* between the directory locks. With ->mkdir() and ->rmdir() both unlocking and relocking the parent. At least apparmor folks have finally admitted that there's a problem... I don't believe that any of us can get away with imposing changes that would break configfs (another weird horror). Or apparmor. Or autofs, for that matter... Anyway, the current rules are: * only __d_move() ever changes name/parent of live dentry (there's d_mark_tmpfile(), but that's done to a dentry that is invisible to anyone other than caller - it flips name from "/" to "#<inumber>" just before attaching the inode to it; part of ->tmpfile()) * any change happens under rename_lock. * any change happens under ->d_lock on everything involved (parent(s) included). * move from one parent to another happens only under ->s_vfs_rename_mutex. * exclusive lock on directory is enough to prevent moves to or from that directory. * only positive dentries ever get moved/renamed * d_splice_alias() may move an existing directory dentry, but no non-directory is going to be touched by it. * with a couple of exceptions (told you it's messy) d_move() and d_exchange() are called in locking conditions equivalent to vfs_rename() - ->s_vfs_rename_mutex if cross-directory, parent(s) exclusive, etc. Exceptions are vfat_lookup() and exfat_lookup() - both would be in trouble if ->link() was supported there. What it boils down to for filesystems is * any dentry passed to directory-modifying operation has stable ->d_name and ->d_parent. That also applies to ->atomic_open() even without O_CREAT in flags, except for the mess with directories in some cases (without O_CREAT if you are given an in-lookup dentry and it turns out to be a directory, from that point on you have no promise that it won't be reparented by d_splice_alias(); whether it can actually happens depends upon the filesystem, but instances tend to just call finish_no_open() in that case and bugger off). * any dentry passed to ->lookup() has stable ->d_name/->d_parent until it had been passed to d_splice_alias() (which is normally the last time we look at it). * ->d_revalidate() and ->d_compare() get stable name passed as argument; they should not look at dentry->d_name at all. > As you know I want to change directory locking so that a ->d_flags bit > locks a dentry in much the same way that locking the parent directory > currently does. > > I had wondered why vfs_link() locks the inode being linked and thought it was > only to protect ->i_nlink. If it is needed to protect against rename > too, that could usefully be documented - or we could use the same > ->d_flags bit to ensure that lock. We could. FWIW, I *like* the notion of "dentry that is held in place and won't go away/get renamed/dropped until we say so". That's the major reason why I'm interested in your patchset, actually. > I guess I'm a bit concerned that your goal here is to transition from > "lots of auditing" to "much less auditing" and I would rather it was "no > auditing needed" - one day you won't want to (or be able to) audit any > more. > > Fudging some type-state with C may well be useful but I suspect it is at > most part of a solution. Simplification, documentation, run-time checks > might also be important parts. As the type-state flag-day is a big > thing, maybe it shouldn't be first. All of that requires being able to answer questions about what's there in the existing filesystems. Which is pretty much the same problem as those audits, obviously. And static annotations are way easier to reason about. Anyway, I'm about to fall asleep (it's nearly 5am here); I've put an initial sketch of infrastructure and of symlink calling conventions change to #experimenta.stable_dentry, but it's very raw at the moment - and lacks the followups that would make use of that stuff. I'll continue tomorrow (right now I've got ->create() half-done as well), but by now one thing I'm certain about is that it will be trivial to reorder wrt your stuff - in either direction. So that worry can be discarded - change might or might not be a good idea, but as flagdays go it's trivial. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-08 9:05 ` Al Viro @ 2025-09-10 2:45 ` NeilBrown 2025-09-10 7:24 ` Al Viro 2025-09-13 21:28 ` [RFC] a possible way of reducing the PITA of ->d_name audits Al Viro 1 sibling, 1 reply; 43+ messages in thread From: NeilBrown @ 2025-09-10 2:45 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara On Mon, 08 Sep 2025, Al Viro wrote: > On Mon, Sep 08, 2025 at 04:25:26PM +1000, NeilBrown wrote: > > > Might the locking rules be too complex? Are they even documented? > > For ->d_name and ->d_parent? Very poorly. That's the absolute worst > part of struct dentry locking rules. _That_ is the place where attempts > to write a coherent documentation stall every time, often diverting > into yet another side story from hell. ->d_revalidate() calling convention > change came from one of those; at least that got dealt with for good > now - and took a couple of years in the making, what with getting sidetracked > to other stuff. > > Keep in mind that there are weird filesystems that manage to get away with > playing very odd games. Or do not manage, really - apparmorfs is the > most recent weird thing. Locking of their own, *interspersed* between > the directory locks. With ->mkdir() and ->rmdir() both unlocking and > relocking the parent. At least apparmor folks have finally admitted that > there's a problem... > > I don't believe that any of us can get away with imposing changes that > would break configfs (another weird horror). Or apparmor. Or autofs, > for that matter... > > Anyway, the current rules are: > * only __d_move() ever changes name/parent of live dentry (there's > d_mark_tmpfile(), but that's done to a dentry that is invisible to anyone > other than caller - it flips name from "/" to "#<inumber>" just before > attaching the inode to it; part of ->tmpfile()) > * any change happens under rename_lock. > * any change happens under ->d_lock on everything involved (parent(s) > included). > * move from one parent to another happens only under ->s_vfs_rename_mutex. > * exclusive lock on directory is enough to prevent moves to or from > that directory. > * only positive dentries ever get moved/renamed > * d_splice_alias() may move an existing directory dentry, but no > non-directory is going to be touched by it. > * with a couple of exceptions (told you it's messy) d_move() and d_exchange() > are called in locking conditions equivalent to vfs_rename() - ->s_vfs_rename_mutex > if cross-directory, parent(s) exclusive, etc. Exceptions are vfat_lookup() > and exfat_lookup() - both would be in trouble if ->link() was supported there. > > What it boils down to for filesystems is > * any dentry passed to directory-modifying operation has stable ->d_name > and ->d_parent. That also applies to ->atomic_open() even without O_CREAT in > flags, except for the mess with directories in some cases (without O_CREAT > if you are given an in-lookup dentry and it turns out to be a directory, > from that point on you have no promise that it won't be reparented by d_splice_alias(); > whether it can actually happens depends upon the filesystem, but instances tend to > just call finish_no_open() in that case and bugger off). > * any dentry passed to ->lookup() has stable ->d_name/->d_parent until it > had been passed to d_splice_alias() (which is normally the last time we look at > it). > * ->d_revalidate() and ->d_compare() get stable name passed as argument; > they should not look at dentry->d_name at all. So if we were to provide kdoc documentation for claim_stability() it could say something like: It is only safe to claim_stability() for a dentry if: - ->d_parent->d_inode is locked exclusively This is guaranteed for target dentries passed to directory-modifying inode_operations, but not e.g. the old_dentry passed to ->link. It is also guaranteed for dentry passed to ->atomic_open when create it set. - ->d_inode is locked exclusively. This is guaranteed for the old_dentry passed to ->link. - DCACHE_PAR_LOOKUP is set. Dentries passed to ->lookup will either have this flag or the parent will be locked exclusively - dentry is negative and a shared lock is held on parent inode. This is guaranteed for dentries passed to ->atomic_open when create is NOT set. - the ->d_lock is held on the dentry or its ->d_parent All uses of claim_stability() should be preceded by a comment identifying which condition is claimed to be met. > > > As you know I want to change directory locking so that a ->d_flags bit > > locks a dentry in much the same way that locking the parent directory > > currently does. > > > > I had wondered why vfs_link() locks the inode being linked and thought it was > > only to protect ->i_nlink. If it is needed to protect against rename > > too, that could usefully be documented - or we could use the same > > ->d_flags bit to ensure that lock. > > We could. FWIW, I *like* the notion of "dentry that is held in place and won't > go away/get renamed/dropped until we say so". That's the major reason why I'm > interested in your patchset, actually. > > > I guess I'm a bit concerned that your goal here is to transition from > > "lots of auditing" to "much less auditing" and I would rather it was "no > > auditing needed" - one day you won't want to (or be able to) audit any > > more. > > > > Fudging some type-state with C may well be useful but I suspect it is at > > most part of a solution. Simplification, documentation, run-time checks > > might also be important parts. As the type-state flag-day is a big > > thing, maybe it shouldn't be first. > > All of that requires being able to answer questions about what's there in > the existing filesystems. Which is pretty much the same problem as > those audits, obviously. And static annotations are way easier to > reason about. And they are good places to encourage people to put comments. I'm liking this idea more because it provides a focus for documenting a non-trivial dependency. > > Anyway, I'm about to fall asleep (it's nearly 5am here); I've put an > initial sketch of infrastructure and of symlink calling conventions change > to #experimenta.stable_dentry, but it's very raw at the moment - and lacks > the followups that would make use of that stuff. I'll continue tomorrow > (right now I've got ->create() half-done as well), but by now one thing > I'm certain about is that it will be trivial to reorder wrt your stuff - > in either direction. So that worry can be discarded - change might or > might not be a good idea, but as flagdays go it's trivial. > One thing I don't like is the name "unwrap_dentry()". It says what is done rather than what it means or what the purpose is. Maybe "access_dentry()" (a bit like rcu_access_pointer()). Maybe "dentry_of()" - then we would want to call stable dentries "stable_foo" or similar. So: static int afs_symlink(struct mnt_idmap *idmap, struct inode *dir, struct stable_dentry stable_child, const char *content) { struct dentry *dentry = dentry_of(stable_child); Passing around a struct that contains a single pointer feels strange and anything we can do to give it a more natural feel and make it easy to read would likely be a win. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-10 2:45 ` NeilBrown @ 2025-09-10 7:24 ` Al Viro 2025-09-10 22:52 ` NeilBrown 2025-09-12 5:49 ` ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) Al Viro 0 siblings, 2 replies; 43+ messages in thread From: Al Viro @ 2025-09-10 7:24 UTC (permalink / raw) To: NeilBrown; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara On Wed, Sep 10, 2025 at 12:45:41PM +1000, NeilBrown wrote: > - dentry is negative and a shared lock is held on parent inode. > This is guaranteed for dentries passed to ->atomic_open when create > is NOT set. Umm... The same goes for tmpfile dentry while it's still negative (nobody else could make it positive - it's only reachable via the parent's list of children and those who traverse such will ignore anything negative unhashed they find there. > One thing I don't like is the name "unwrap_dentry()". It says what is > done rather than what it means or what the purpose is. > Maybe "access_dentry()" (a bit like rcu_access_pointer()). > Maybe "dentry_of()" - then we would want to call stable dentries > "stable_foo" or similar. So: > > static int afs_symlink(struct mnt_idmap *idmap, struct inode *dir, > struct stable_dentry stable_child, const char *content) That has too much of over-the-top hungarian notation feel for my taste, TBH... Note that these unwrap_dentry() are very likely to move into helpers - if some function is always called with unwrapped_dentry(something) as an argument, great, that's probably a candidate for struct stable_dentry. I'll hold onto the current variant for now... ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-10 7:24 ` Al Viro @ 2025-09-10 22:52 ` NeilBrown 2025-09-12 5:49 ` ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) Al Viro 1 sibling, 0 replies; 43+ messages in thread From: NeilBrown @ 2025-09-10 22:52 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara On Wed, 10 Sep 2025, Al Viro wrote: > On Wed, Sep 10, 2025 at 12:45:41PM +1000, NeilBrown wrote: > > > - dentry is negative and a shared lock is held on parent inode. > > This is guaranteed for dentries passed to ->atomic_open when create > > is NOT set. > > Umm... The same goes for tmpfile dentry while it's still negative (nobody > else could make it positive - it's only reachable via the parent's list > of children and those who traverse such will ignore anything negative unhashed > they find there. > > > One thing I don't like is the name "unwrap_dentry()". It says what is > > done rather than what it means or what the purpose is. > > Maybe "access_dentry()" (a bit like rcu_access_pointer()). > > Maybe "dentry_of()" - then we would want to call stable dentries > > "stable_foo" or similar. So: > > > > static int afs_symlink(struct mnt_idmap *idmap, struct inode *dir, > > struct stable_dentry stable_child, const char *content) > > That has too much of over-the-top hungarian notation feel for my taste, TBH... > > Note that these unwrap_dentry() are very likely to move into helpers - if some > function is always called with unwrapped_dentry(something) as an argument, > great, that's probably a candidate for struct stable_dentry. > > I'll hold onto the current variant for now... > Another idea - maybe we don't need "unwrap_dentry()" at all. Just have struct stable_dentry { struct dentry *dentry; }; and code can use child.dentry. There is no concern about people misusing unwrap_dentry() so there is no need to be able to grep for it. So we don't need it. It is only uses of claim_stability() and of ->d_name and ->d_parent that we might want to monitor. So I imagine that one day all accesses to ->d_name and ->d_parent will be centralised either dentry_name() and dentry_parent() or in fs/*.c code - and then we can rename them to ->_private_d_name and ->_private_d_parent or whatever to discourage unwrapped access. But accessing the dentry in a stable_dentry is safe and uninteresting so there is no need to wrap it. NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) 2025-09-10 7:24 ` Al Viro 2025-09-10 22:52 ` NeilBrown @ 2025-09-12 5:49 ` Al Viro 2025-09-12 8:23 ` Miklos Szeredi 2025-09-12 18:55 ` Al Viro 1 sibling, 2 replies; 43+ messages in thread From: Al Viro @ 2025-09-12 5:49 UTC (permalink / raw) To: NeilBrown Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara, Miklos Szeredi On Wed, Sep 10, 2025 at 08:24:23AM +0100, Al Viro wrote: > Note that these unwrap_dentry() are very likely to move into helpers - if some > function is always called with unwrapped_dentry(something) as an argument, > great, that's probably a candidate for struct stable_dentry. > > I'll hold onto the current variant for now... BTW, fun fallout from that experiment once I've got to ->atomic_open() - things get nicer if we teach finish_no_open() to accept ERR_PTR() for dentry: int finish_no_open(struct file *file, struct dentry *dentry) { if (IS_ERR(dentry)) return PTR_ERR(dentry); file->f_path.dentry = dentry; return 0; } For example, we get int nfs_atomic_open_v23(struct inode *dir, struct dentry *dentry, struct file *file, unsigned int open_flags, umode_t mode) { struct dentry *res; /* Same as look+open from lookup_open(), but with different O_TRUNC * handling. */ int error = 0; if (dentry->d_name.len > NFS_SERVER(dir)->namelen) return -ENAMETOOLONG; if (open_flags & O_CREAT) { file->f_mode |= FMODE_CREATED; error = nfs_do_create(dir, dentry, mode, open_flags); if (error) return error; return finish_open(file, dentry, NULL); } if (d_in_lookup(dentry)) { /* The only flags nfs_lookup considers are * LOOKUP_EXCL and LOOKUP_RENAME_TARGET, and * we want those to be zero so the lookup isn't skipped. */ res = nfs_lookup(dir, dentry, 0); } return finish_no_open(file, res); } and in cifs !O_CREAT case folds into if (!(oflags & O_CREAT)) { /* * Check for hashed negative dentry. We have already revalidated * the dentry and it is fine. No need to perform another lookup. */ if (!d_in_lookup(direntry)) return -ENOENT; return finish_no_open(cifs_lookup(inode, direntry, 0)); } In vboxsf, and similar in 9p and fuse: if (d_in_lookup(dentry)) { struct dentry *res = vboxsf_dir_lookup(parent, dentry, 0); if (res || d_really_is_positive(dentry)) return finish_no_open(file, res); } /* Only creates */ if (!(flags & O_CREAT)) return finish_no_open(file, NULL); ... The thing is, in that form it's really clear that dentry stays stable if we proceed past those chunks; basically, d_splice_alias() only returns non-NULL if it's got an error or a preexisting directory alias, in which case result will be positive. And it's more compact and easier to follow in that form... While we are at it, Miklos mentioned some plans for changing ->atomic_open() calling conventions. Might be a good time to revisit that... Miklos, could you give a braindump on those plans? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) 2025-09-12 5:49 ` ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) Al Viro @ 2025-09-12 8:23 ` Miklos Szeredi 2025-09-12 18:29 ` Al Viro 2025-09-12 18:55 ` Al Viro 1 sibling, 1 reply; 43+ messages in thread From: Miklos Szeredi @ 2025-09-12 8:23 UTC (permalink / raw) To: Al Viro Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara, Bernd Schubert On Fri, 12 Sept 2025 at 07:49, Al Viro <viro@zeniv.linux.org.uk> wrote: > While we are at it, Miklos mentioned some plans for changing ->atomic_open() > calling conventions. Might be a good time to revisit that... Miklos, > could you give a braindump on those plans? [Cc: Bernd] What we want is ->atomic_open() being able to do an atomic revalidate + open (cached positive) case. This is the only case currently that can't be done with a single ATOMIC_OPEN request but needs two roundtrips to the server. The ->atomic_open() interface shouldn't need any changes, since it's already allowed to use a different dentry from the supplied one. Based on (flags & LOOKUP_OPEN) ->revalidate() needs to tell the caller that it's expecting the subsequent ->atomic_open() call to do the actual revalidation. The proposed interface for that was to add a D_REVALIDATE_ATOMIC = 2 constant to use as a return value in this case. Thanks, Miklos ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) 2025-09-12 8:23 ` Miklos Szeredi @ 2025-09-12 18:29 ` Al Viro 2025-09-12 19:22 ` Miklos Szeredi 2025-09-13 3:36 ` NeilBrown 0 siblings, 2 replies; 43+ messages in thread From: Al Viro @ 2025-09-12 18:29 UTC (permalink / raw) To: Miklos Szeredi Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara, Bernd Schubert On Fri, Sep 12, 2025 at 10:23:39AM +0200, Miklos Szeredi wrote: > On Fri, 12 Sept 2025 at 07:49, Al Viro <viro@zeniv.linux.org.uk> wrote: > > While we are at it, Miklos mentioned some plans for changing ->atomic_open() > > calling conventions. Might be a good time to revisit that... Miklos, > > could you give a braindump on those plans? > > [Cc: Bernd] > > What we want is ->atomic_open() being able to do an atomic revalidate > + open (cached positive) case. This is the only case currently that > can't be done with a single ATOMIC_OPEN request but needs two > roundtrips to the server. > > The ->atomic_open() interface shouldn't need any changes, since it's > already allowed to use a different dentry from the supplied one. > > Based on (flags & LOOKUP_OPEN) ->revalidate() needs to tell the caller > that it's expecting the subsequent ->atomic_open() call to do the > actual revalidation. The proposed interface for that was to add a > D_REVALIDATE_ATOMIC = 2 constant to use as a return value in this > case. Umm... Unless I'm misunderstanding you, that *would* change the calling conventions - dentry could bloody well be positive, couldn't it? And that changes quite a bit - without O_CREAT in flags you could get parent locked only shared and pass a positive hashed dentry attached to a directory inode to ->atomic_open(). The thing is, in that case it can be moved by d_splice_alias()... Or am I misreading you? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) 2025-09-12 18:29 ` Al Viro @ 2025-09-12 19:22 ` Miklos Szeredi 2025-09-12 20:36 ` Al Viro 2025-09-13 3:36 ` NeilBrown 1 sibling, 1 reply; 43+ messages in thread From: Miklos Szeredi @ 2025-09-12 19:22 UTC (permalink / raw) To: Al Viro Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara, Bernd Schubert On Fri, 12 Sept 2025 at 20:29, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Based on (flags & LOOKUP_OPEN) ->revalidate() needs to tell the caller > > that it's expecting the subsequent ->atomic_open() call to do the > > actual revalidation. The proposed interface for that was to add a > > D_REVALIDATE_ATOMIC = 2 constant to use as a return value in this > > case. > > Umm... Unless I'm misunderstanding you, that *would* change the > calling conventions - dentry could bloody well be positive, couldn't it? > And that changes quite a bit - without O_CREAT in flags you could get > parent locked only shared and pass a positive hashed dentry attached > to a directory inode to ->atomic_open(). The thing is, in that case it > can be moved by d_splice_alias()... You are talking about the disconnected directory case? That can't happen in this call path, since it's following a normal component from a parent directory, which by definition isn't disconnected. Just realized, that open_last_lookups() will bypass lookup_open() on cached positive anyway, so really no point in handing that inside lookup_open(). Should this be done via a new i_op->reval_open() called without inode lock, with a positive dentry and gets a name just like d_revalidate? Thanks, Miklos ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) 2025-09-12 19:22 ` Miklos Szeredi @ 2025-09-12 20:36 ` Al Viro 2025-09-12 20:50 ` Al Viro 0 siblings, 1 reply; 43+ messages in thread From: Al Viro @ 2025-09-12 20:36 UTC (permalink / raw) To: Miklos Szeredi Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara, Bernd Schubert On Fri, Sep 12, 2025 at 09:22:05PM +0200, Miklos Szeredi wrote: > You are talking about the disconnected directory case? That can't > happen in this call path, since it's following a normal component from > a parent directory, which by definition isn't disconnected. Huh? No, just a lookup in a different place with server giving you the nodeid of the directory you had in that place in dcache. d_splice_alias() has no choice other than "move the existing alias into the right place" - directory inodes *can't* have multiple hashed dentries, period. > Just realized, that open_last_lookups() will bypass lookup_open() on > cached positive anyway, so really no point in handing that inside > lookup_open(). IIRC, when the last time it came up, it was along the lines of "could we call ->atomic_open() for positive dentries as well?" and I think it had been about FUSE. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) 2025-09-12 20:36 ` Al Viro @ 2025-09-12 20:50 ` Al Viro 0 siblings, 0 replies; 43+ messages in thread From: Al Viro @ 2025-09-12 20:50 UTC (permalink / raw) To: Miklos Szeredi Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara, Bernd Schubert On Fri, Sep 12, 2025 at 09:36:18PM +0100, Al Viro wrote: > On Fri, Sep 12, 2025 at 09:22:05PM +0200, Miklos Szeredi wrote: > > > You are talking about the disconnected directory case? That can't > > happen in this call path, since it's following a normal component from > > a parent directory, which by definition isn't disconnected. > > Huh? No, just a lookup in a different place with server giving you > the nodeid of the directory you had in that place in dcache. > d_splice_alias() has no choice other than "move the existing alias > into the right place" - directory inodes *can't* have multiple > hashed dentries, period. > > > Just realized, that open_last_lookups() will bypass lookup_open() on > > cached positive anyway, so really no point in handing that inside > > lookup_open(). > > IIRC, when the last time it came up, it was along the lines of "could > we call ->atomic_open() for positive dentries as well?" and I think > it had been about FUSE. The last iteration of those threads I've seen was in October 2023; has anything changed since then? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) 2025-09-12 18:29 ` Al Viro 2025-09-12 19:22 ` Miklos Szeredi @ 2025-09-13 3:36 ` NeilBrown 2025-09-13 5:07 ` Al Viro 1 sibling, 1 reply; 43+ messages in thread From: NeilBrown @ 2025-09-13 3:36 UTC (permalink / raw) To: Al Viro Cc: Miklos Szeredi, Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara, Bernd Schubert On Sat, 13 Sep 2025, Al Viro wrote: > On Fri, Sep 12, 2025 at 10:23:39AM +0200, Miklos Szeredi wrote: > > On Fri, 12 Sept 2025 at 07:49, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > While we are at it, Miklos mentioned some plans for changing ->atomic_open() > > > calling conventions. Might be a good time to revisit that... Miklos, > > > could you give a braindump on those plans? > > > > [Cc: Bernd] > > > > What we want is ->atomic_open() being able to do an atomic revalidate > > + open (cached positive) case. This is the only case currently that > > can't be done with a single ATOMIC_OPEN request but needs two > > roundtrips to the server. > > > > The ->atomic_open() interface shouldn't need any changes, since it's > > already allowed to use a different dentry from the supplied one. > > > > Based on (flags & LOOKUP_OPEN) ->revalidate() needs to tell the caller > > that it's expecting the subsequent ->atomic_open() call to do the > > actual revalidation. The proposed interface for that was to add a > > D_REVALIDATE_ATOMIC = 2 constant to use as a return value in this > > case. > > Umm... Unless I'm misunderstanding you, that *would* change the > calling conventions - dentry could bloody well be positive, couldn't it? > And that changes quite a bit - without O_CREAT in flags you could get > parent locked only shared and pass a positive hashed dentry attached > to a directory inode to ->atomic_open(). The thing is, in that case it > can be moved by d_splice_alias()... Once we get per-dentry locking for dirops this will cease to be a problem. The dentry would be locked exclusively whether we create or not and the lock would prevent the d_splice_alias() rename. NeilBrown > > Or am I misreading you? > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) 2025-09-13 3:36 ` NeilBrown @ 2025-09-13 5:07 ` Al Viro 2025-09-13 5:50 ` NeilBrown 2025-09-14 19:01 ` Miklos Szeredi 0 siblings, 2 replies; 43+ messages in thread From: Al Viro @ 2025-09-13 5:07 UTC (permalink / raw) To: NeilBrown Cc: Miklos Szeredi, Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara, Bernd Schubert On Sat, Sep 13, 2025 at 01:36:49PM +1000, NeilBrown wrote: > > Umm... Unless I'm misunderstanding you, that *would* change the > > calling conventions - dentry could bloody well be positive, couldn't it? > > And that changes quite a bit - without O_CREAT in flags you could get > > parent locked only shared and pass a positive hashed dentry attached > > to a directory inode to ->atomic_open(). The thing is, in that case it > > can be moved by d_splice_alias()... > > Once we get per-dentry locking for dirops this will cease to be a > problem. The dentry would be locked exclusively whether we create or > not and the lock would prevent the d_splice_alias() rename. Umm... Interesting, but... what would happen in such case, really? You have one thread with allegedly directory dentry it had skipped verification for. It tries to combine open with revalidation, and sends such request. In the meanwhile, another thread has found the same thing in a different place - possibly because of fs corruption, possibly because the damn thing got moved on server, right after it has sent "OK, I've opened it" reply to the first thread. What would you do? Have the second thread spin/fail with some error/something else? Alternatively, move succeeds first, the lookup in the new place arrives, then revalidate+open. The first thread gets a nodeid mismatch, and...? How would that combined revalidate+open work for fuse, anyway? The former is basically a lookup - you send nodeid of parent + name, get nodeid + attributes of child. The latter goes strictly by nodeid of child and gets a 64bit number that apparently tells one opened file from another (not to be confused with fhandle). Combined request of some sort? I think we need Bernd to bring a fresh braindump on the plans in that area. And we'd damn better *NOT* make it another "this filesystem is special, the locking works for it because $REASONS (half missing from the discussion) and nobody else returns that special ->d_revalidate() return value (at the moment)" - every time something of that sort had been kludged in we ended up paying for that later. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) 2025-09-13 5:07 ` Al Viro @ 2025-09-13 5:50 ` NeilBrown 2025-09-14 19:01 ` Miklos Szeredi 1 sibling, 0 replies; 43+ messages in thread From: NeilBrown @ 2025-09-13 5:50 UTC (permalink / raw) To: Al Viro Cc: Miklos Szeredi, Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara, Bernd Schubert On Sat, 13 Sep 2025, Al Viro wrote: > On Sat, Sep 13, 2025 at 01:36:49PM +1000, NeilBrown wrote: > > > > Umm... Unless I'm misunderstanding you, that *would* change the > > > calling conventions - dentry could bloody well be positive, couldn't it? > > > And that changes quite a bit - without O_CREAT in flags you could get > > > parent locked only shared and pass a positive hashed dentry attached > > > to a directory inode to ->atomic_open(). The thing is, in that case it > > > can be moved by d_splice_alias()... > > > > Once we get per-dentry locking for dirops this will cease to be a > > problem. The dentry would be locked exclusively whether we create or > > not and the lock would prevent the d_splice_alias() rename. > > Umm... Interesting, but... what would happen in such case, really? > > You have one thread with allegedly directory dentry it had skipped > verification for. It tries to combine open with revalidation, and > sends such request. In the meanwhile, another thread has found the > same thing in a different place - possibly because of fs corruption, > possibly because the damn thing got moved on server, right after it has > sent "OK, I've opened it" reply to the first thread. What would you do? > Have the second thread spin/fail with some error/something else? If there is a race with d_splice_alias() wanting the move the directory while open_atomic has it locked, -ESTALE will be returned and the lookup will be retried with LOOKUP_REVAL which should allow the filesystem to do any needed synchronisation. If the directory has simply been moved on the server, then the revalidate in atomic_open will notice this and invalidate the old dentry and use lookup to create a new one. Importantly it will have a stable d_name so that it can do that. This "invalidate and create a new one with the same name" will require careful locking but should be achievable. NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) 2025-09-13 5:07 ` Al Viro 2025-09-13 5:50 ` NeilBrown @ 2025-09-14 19:01 ` Miklos Szeredi 2025-09-14 19:50 ` Al Viro 1 sibling, 1 reply; 43+ messages in thread From: Miklos Szeredi @ 2025-09-14 19:01 UTC (permalink / raw) To: Al Viro Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara, Bernd Schubert On Sat, 13 Sept 2025 at 07:07, Al Viro <viro@zeniv.linux.org.uk> wrote: > How would that combined revalidate+open work for fuse, anyway? The former > is basically a lookup - you send nodeid of parent + name, get nodeid + > attributes of child. The latter goes strictly by nodeid of child and > gets a 64bit number that apparently tells one opened file from another > (not to be confused with fhandle). Combined request of some sort? There are already two combined ones: FUSE_CREATE and FUSE_TMPFILE both get nodeid of parent + name and return attributes of child plus opened file. FUSE_CREATE gets invoked in the uncached or cached negative case from ->atomic_open() with inode lock for held exclusive. That leaves 2 cases: - uncached plain open: FUSE_OPEN_ATOMIC with same semantics as FUSE_CREATE, inode lock held shared - cached positive (plain or O_CREAT): FUSE_OPEN_REVAL getting nodeid of parent + name + nodeid of child and return opened file or -ESTALE, no locking required Thanks, Miklos ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) 2025-09-14 19:01 ` Miklos Szeredi @ 2025-09-14 19:50 ` Al Viro 2025-09-14 20:05 ` Miklos Szeredi 0 siblings, 1 reply; 43+ messages in thread From: Al Viro @ 2025-09-14 19:50 UTC (permalink / raw) To: Miklos Szeredi Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara, Bernd Schubert On Sun, Sep 14, 2025 at 09:01:40PM +0200, Miklos Szeredi wrote: > On Sat, 13 Sept 2025 at 07:07, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > How would that combined revalidate+open work for fuse, anyway? The former > > is basically a lookup - you send nodeid of parent + name, get nodeid + > > attributes of child. The latter goes strictly by nodeid of child and > > gets a 64bit number that apparently tells one opened file from another > > (not to be confused with fhandle). Combined request of some sort? > > There are already two combined ones: FUSE_CREATE and FUSE_TMPFILE both > get nodeid of parent + name and return attributes of child plus opened > file. FUSE_CREATE gets invoked in the uncached or cached negative > case from ->atomic_open() with inode lock for held exclusive. > > That leaves 2 cases: > > - uncached plain open: FUSE_OPEN_ATOMIC with same semantics as > FUSE_CREATE, inode lock held shared > > - cached positive (plain or O_CREAT): FUSE_OPEN_REVAL getting nodeid > of parent + name + nodeid of child and return opened file or -ESTALE, > no locking required What happens if the latter overlaps with rename? Or, for a cached directory inode, with lookup elsewhere picking the same inode as our cached (and possibly invalid, but we don't know that) dentry? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) 2025-09-14 19:50 ` Al Viro @ 2025-09-14 20:05 ` Miklos Szeredi 2025-09-15 8:54 ` Bernd Schubert 0 siblings, 1 reply; 43+ messages in thread From: Miklos Szeredi @ 2025-09-14 20:05 UTC (permalink / raw) To: Al Viro Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara, Bernd Schubert On Sun, 14 Sept 2025 at 21:50, Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Sun, Sep 14, 2025 at 09:01:40PM +0200, Miklos Szeredi wrote: > > - cached positive (plain or O_CREAT): FUSE_OPEN_REVAL getting nodeid > > of parent + name + nodeid of child and return opened file or -ESTALE, > > no locking required > > What happens if the latter overlaps with rename? Or, for a cached > directory inode, with lookup elsewhere picking the same inode as our > cached (and possibly invalid, but we don't know that) dentry? In the overlapping rename case either result is okay: if the open succeeded, that means open won the race against rename. If on the other hand open_reval lost and didn't find the expected child, then dentry is going to get invalidated and the whole thing retried ending up with open_atomic or open_creat with no further races possible. With the stale dentry getting moved by d_splice_alias() it's a slightly different story. The dentry starts out stale, but by being moved to its proper place it becomes uptodate again. So before invalidating it, it would make sense to check whether the parent and the name still matches the one we are looking up. But that race happens with plain ->d_revalidate() as well AFAICS, so that wouldn't be a new thing. Thanks, Miklos ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) 2025-09-14 20:05 ` Miklos Szeredi @ 2025-09-15 8:54 ` Bernd Schubert 0 siblings, 0 replies; 43+ messages in thread From: Bernd Schubert @ 2025-09-15 8:54 UTC (permalink / raw) To: Miklos Szeredi, Al Viro Cc: NeilBrown, Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara, Horst Birthelmer On 9/14/25 22:05, Miklos Szeredi wrote: > On Sun, 14 Sept 2025 at 21:50, Al Viro <viro@zeniv.linux.org.uk> wrote: >> >> On Sun, Sep 14, 2025 at 09:01:40PM +0200, Miklos Szeredi wrote: > >>> - cached positive (plain or O_CREAT): FUSE_OPEN_REVAL getting nodeid >>> of parent + name + nodeid of child and return opened file or -ESTALE, >>> no locking required >> >> What happens if the latter overlaps with rename? Or, for a cached >> directory inode, with lookup elsewhere picking the same inode as our >> cached (and possibly invalid, but we don't know that) dentry? > > In the overlapping rename case either result is okay: if the open > succeeded, that means open won the race against rename. If on the > other hand open_reval lost and didn't find the expected child, then > dentry is going to get invalidated and the whole thing retried ending > up with open_atomic or open_creat with no further races possible. > > With the stale dentry getting moved by d_splice_alias() it's a > slightly different story. The dentry starts out stale, but by being > moved to its proper place it becomes uptodate again. So before > invalidating it, it would make sense to check whether the parent and > the name still matches the one we are looking up. But that race > happens with plain ->d_revalidate() as well AFAICS, so that wouldn't > be a new thing. Thanks Miklos for bringing this up again! I'm going to try to update the patches this week. One issue with open-atomic and open-revalidate was testing - Horst (in CC) had started to create fuse specific xfstests for that. Thanks, Bernd ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) 2025-09-12 5:49 ` ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) Al Viro 2025-09-12 8:23 ` Miklos Szeredi @ 2025-09-12 18:55 ` Al Viro 2025-09-12 18:59 ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro 1 sibling, 1 reply; 43+ messages in thread From: Al Viro @ 2025-09-12 18:55 UTC (permalink / raw) To: NeilBrown Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara, Miklos Szeredi On Fri, Sep 12, 2025 at 06:49:07AM +0100, Al Viro wrote: > On Wed, Sep 10, 2025 at 08:24:23AM +0100, Al Viro wrote: > > > Note that these unwrap_dentry() are very likely to move into helpers - if some > > function is always called with unwrapped_dentry(something) as an argument, > > great, that's probably a candidate for struct stable_dentry. > > > > I'll hold onto the current variant for now... > > BTW, fun fallout from that experiment once I've got to ->atomic_open() - things > get nicer if we teach finish_no_open() to accept ERR_PTR() for dentry: See git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.finish_no_open Patches in followups. Shortlog: allow finish_no_open(file, ERR_PTR(-E...)) 9p: simplify v9fs_vfs_atomic_open() 9p: simplify v9fs_vfs_atomic_open_dotl() simplify cifs_atomic_open() simplify vboxsf_dir_atomic_open() simplify nfs_atomic_open_v23() simplify fuse_atomic_open() simplify gfs2_atomic_open() slightly simplify nfs_atomic_open() Diffstat: fs/9p/vfs_inode.c | 34 ++++++++++++---------------------- fs/9p/vfs_inode_dotl.c | 15 +++++---------- fs/fuse/dir.c | 21 +++++++-------------- fs/gfs2/inode.c | 26 +++++++++----------------- fs/nfs/dir.c | 18 +++++------------- fs/open.c | 10 ++++++---- fs/smb/client/dir.c | 8 +------- fs/vboxsf/dir.c | 25 +++++++++---------------- 8 files changed, 54 insertions(+), 103 deletions(-) ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) 2025-09-12 18:55 ` Al Viro @ 2025-09-12 18:59 ` Al Viro 2025-09-12 18:59 ` [PATCH 2/9] 9p: simplify v9fs_vfs_atomic_open() Al Viro ` (9 more replies) 0 siblings, 10 replies; 43+ messages in thread From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw) To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos ... allowing any ->lookup() return value to be passed to it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/open.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/fs/open.c b/fs/open.c index 9655158c3885..4890b13461c7 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1059,18 +1059,20 @@ EXPORT_SYMBOL(finish_open); * finish_no_open - finish ->atomic_open() without opening the file * * @file: file pointer - * @dentry: dentry or NULL (as returned from ->lookup()) + * @dentry: dentry, ERR_PTR(-E...) or NULL (as returned from ->lookup()) * - * This can be used to set the result of a successful lookup in ->atomic_open(). + * This can be used to set the result of a lookup in ->atomic_open(). * * NB: unlike finish_open() this function does consume the dentry reference and * the caller need not dput() it. * - * Returns "0" which must be the return value of ->atomic_open() after having - * called this function. + * Returns 0 or -E..., which must be the return value of ->atomic_open() after + * having called this function. */ int finish_no_open(struct file *file, struct dentry *dentry) { + if (IS_ERR(dentry)) + return PTR_ERR(dentry); file->f_path.dentry = dentry; return 0; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 2/9] 9p: simplify v9fs_vfs_atomic_open() 2025-09-12 18:59 ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro @ 2025-09-12 18:59 ` Al Viro 2025-09-12 18:59 ` [PATCH 3/9] 9p: simplify v9fs_vfs_atomic_open_dotl() Al Viro ` (8 subsequent siblings) 9 siblings, 0 replies; 43+ messages in thread From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw) To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos if v9fs_vfs_lookup() returns a preexisting alias, it is guaranteed to be positive. IOW, in that case we will immediately return finish_no_open(), leaving only the case res == NULL past that point. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/9p/vfs_inode.c | 34 ++++++++++++---------------------- 1 file changed, 12 insertions(+), 22 deletions(-) diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c index 399d455d50d6..d0c77ec31b1d 100644 --- a/fs/9p/vfs_inode.c +++ b/fs/9p/vfs_inode.c @@ -768,22 +768,18 @@ v9fs_vfs_atomic_open(struct inode *dir, struct dentry *dentry, struct v9fs_inode __maybe_unused *v9inode; struct v9fs_session_info *v9ses; struct p9_fid *fid; - struct dentry *res = NULL; struct inode *inode; int p9_omode; if (d_in_lookup(dentry)) { - res = v9fs_vfs_lookup(dir, dentry, 0); - if (IS_ERR(res)) - return PTR_ERR(res); - - if (res) - dentry = res; + struct dentry *res = v9fs_vfs_lookup(dir, dentry, 0); + if (res || d_really_is_positive(dentry)) + return finish_no_open(file, res); } /* Only creates */ - if (!(flags & O_CREAT) || d_really_is_positive(dentry)) - return finish_no_open(file, res); + if (!(flags & O_CREAT)) + return finish_no_open(file, NULL); v9ses = v9fs_inode2v9ses(dir); perm = unixmode2p9mode(v9ses, mode); @@ -795,17 +791,17 @@ v9fs_vfs_atomic_open(struct inode *dir, struct dentry *dentry, "write-only file with writeback enabled, creating w/ O_RDWR\n"); } fid = v9fs_create(v9ses, dir, dentry, NULL, perm, p9_omode); - if (IS_ERR(fid)) { - err = PTR_ERR(fid); - goto error; - } + if (IS_ERR(fid)) + return PTR_ERR(fid); v9fs_invalidate_inode_attr(dir); inode = d_inode(dentry); v9inode = V9FS_I(inode); err = finish_open(file, dentry, generic_file_open); - if (err) - goto error; + if (unlikely(err)) { + p9_fid_put(fid); + return err; + } file->private_data = fid; #ifdef CONFIG_9P_FSCACHE @@ -818,13 +814,7 @@ v9fs_vfs_atomic_open(struct inode *dir, struct dentry *dentry, v9fs_open_fid_add(inode, &fid); file->f_mode |= FMODE_CREATED; -out: - dput(res); - return err; - -error: - p9_fid_put(fid); - goto out; + return 0; } /** -- 2.47.2 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 3/9] 9p: simplify v9fs_vfs_atomic_open_dotl() 2025-09-12 18:59 ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro 2025-09-12 18:59 ` [PATCH 2/9] 9p: simplify v9fs_vfs_atomic_open() Al Viro @ 2025-09-12 18:59 ` Al Viro 2025-09-12 18:59 ` [PATCH 4/9] simplify cifs_atomic_open() Al Viro ` (7 subsequent siblings) 9 siblings, 0 replies; 43+ messages in thread From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw) To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos again, preexisting aliases will always be positive Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/9p/vfs_inode_dotl.c | 15 +++++---------- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c index 5b5fda617b80..be297e335468 100644 --- a/fs/9p/vfs_inode_dotl.c +++ b/fs/9p/vfs_inode_dotl.c @@ -238,20 +238,16 @@ v9fs_vfs_atomic_open_dotl(struct inode *dir, struct dentry *dentry, struct p9_fid *dfid = NULL, *ofid = NULL; struct v9fs_session_info *v9ses; struct posix_acl *pacl = NULL, *dacl = NULL; - struct dentry *res = NULL; if (d_in_lookup(dentry)) { - res = v9fs_vfs_lookup(dir, dentry, 0); - if (IS_ERR(res)) - return PTR_ERR(res); - - if (res) - dentry = res; + struct dentry *res = v9fs_vfs_lookup(dir, dentry, 0); + if (res || d_really_is_positive(dentry)) + return finish_no_open(file, res); } /* Only creates */ - if (!(flags & O_CREAT) || d_really_is_positive(dentry)) - return finish_no_open(file, res); + if (!(flags & O_CREAT)) + return finish_no_open(file, NULL); v9ses = v9fs_inode2v9ses(dir); @@ -337,7 +333,6 @@ v9fs_vfs_atomic_open_dotl(struct inode *dir, struct dentry *dentry, p9_fid_put(ofid); p9_fid_put(fid); v9fs_put_acl(dacl, pacl); - dput(res); return err; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 4/9] simplify cifs_atomic_open() 2025-09-12 18:59 ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro 2025-09-12 18:59 ` [PATCH 2/9] 9p: simplify v9fs_vfs_atomic_open() Al Viro 2025-09-12 18:59 ` [PATCH 3/9] 9p: simplify v9fs_vfs_atomic_open_dotl() Al Viro @ 2025-09-12 18:59 ` Al Viro 2025-09-12 18:59 ` [PATCH 5/9] simplify vboxsf_dir_atomic_open() Al Viro ` (6 subsequent siblings) 9 siblings, 0 replies; 43+ messages in thread From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw) To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos now that finish_no_open() does the right thing if it's given ERR_PTR() as dentry... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/smb/client/dir.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c index 5223edf6d11a..47710aa13822 100644 --- a/fs/smb/client/dir.c +++ b/fs/smb/client/dir.c @@ -484,8 +484,6 @@ cifs_atomic_open(struct inode *inode, struct dentry *direntry, * in network traffic in the other paths. */ if (!(oflags & O_CREAT)) { - struct dentry *res; - /* * Check for hashed negative dentry. We have already revalidated * the dentry and it is fine. No need to perform another lookup. @@ -493,11 +491,7 @@ cifs_atomic_open(struct inode *inode, struct dentry *direntry, if (!d_in_lookup(direntry)) return -ENOENT; - res = cifs_lookup(inode, direntry, 0); - if (IS_ERR(res)) - return PTR_ERR(res); - - return finish_no_open(file, res); + return finish_no_open(file, cifs_lookup(inode, direntry, 0)); } xid = get_xid(); -- 2.47.2 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 5/9] simplify vboxsf_dir_atomic_open() 2025-09-12 18:59 ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro ` (2 preceding siblings ...) 2025-09-12 18:59 ` [PATCH 4/9] simplify cifs_atomic_open() Al Viro @ 2025-09-12 18:59 ` Al Viro 2025-09-12 18:59 ` [PATCH 6/9] simplify nfs_atomic_open_v23() Al Viro ` (5 subsequent siblings) 9 siblings, 0 replies; 43+ messages in thread From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw) To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos similar to 9p et.al. --- fs/vboxsf/dir.c | 25 +++++++++---------------- 1 file changed, 9 insertions(+), 16 deletions(-) diff --git a/fs/vboxsf/dir.c b/fs/vboxsf/dir.c index 770e29ec3557..42bedc4ec7af 100644 --- a/fs/vboxsf/dir.c +++ b/fs/vboxsf/dir.c @@ -315,46 +315,39 @@ static int vboxsf_dir_atomic_open(struct inode *parent, struct dentry *dentry, { struct vboxsf_sbi *sbi = VBOXSF_SBI(parent->i_sb); struct vboxsf_handle *sf_handle; - struct dentry *res = NULL; u64 handle; int err; if (d_in_lookup(dentry)) { - res = vboxsf_dir_lookup(parent, dentry, 0); - if (IS_ERR(res)) - return PTR_ERR(res); - - if (res) - dentry = res; + struct dentry *res = vboxsf_dir_lookup(parent, dentry, 0); + if (res || d_really_is_positive(dentry)) + return finish_no_open(file, res); } /* Only creates */ - if (!(flags & O_CREAT) || d_really_is_positive(dentry)) - return finish_no_open(file, res); + if (!(flags & O_CREAT)) + return finish_no_open(file, NULL); err = vboxsf_dir_create(parent, dentry, mode, false, flags & O_EXCL, &handle); if (err) - goto out; + return err; sf_handle = vboxsf_create_sf_handle(d_inode(dentry), handle, SHFL_CF_ACCESS_READWRITE); if (IS_ERR(sf_handle)) { vboxsf_close(sbi->root, handle); - err = PTR_ERR(sf_handle); - goto out; + return PTR_ERR(sf_handle); } err = finish_open(file, dentry, generic_file_open); if (err) { /* This also closes the handle passed to vboxsf_create_sf_handle() */ vboxsf_release_sf_handle(d_inode(dentry), sf_handle); - goto out; + return err; } file->private_data = sf_handle; file->f_mode |= FMODE_CREATED; -out: - dput(res); - return err; + return 0; } static int vboxsf_dir_unlink(struct inode *parent, struct dentry *dentry) -- 2.47.2 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 6/9] simplify nfs_atomic_open_v23() 2025-09-12 18:59 ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro ` (3 preceding siblings ...) 2025-09-12 18:59 ` [PATCH 5/9] simplify vboxsf_dir_atomic_open() Al Viro @ 2025-09-12 18:59 ` Al Viro 2025-09-12 18:59 ` [PATCH 7/9] simplify fuse_atomic_open() Al Viro ` (4 subsequent siblings) 9 siblings, 0 replies; 43+ messages in thread From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw) To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos 1) finish_no_open() takes ERR_PTR() as dentry now. 2) caller of ->atomic_open() will call d_lookup_done() itself, no need to do it here. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/nfs/dir.c | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index d81217923936..c8dd1d0b8d85 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -2260,7 +2260,7 @@ int nfs_atomic_open_v23(struct inode *dir, struct dentry *dentry, struct file *file, unsigned int open_flags, umode_t mode) { - + struct dentry *res = NULL; /* Same as look+open from lookup_open(), but with different O_TRUNC * handling. */ @@ -2275,21 +2275,15 @@ int nfs_atomic_open_v23(struct inode *dir, struct dentry *dentry, if (error) return error; return finish_open(file, dentry, NULL); - } else if (d_in_lookup(dentry)) { + } + if (d_in_lookup(dentry)) { /* The only flags nfs_lookup considers are * LOOKUP_EXCL and LOOKUP_RENAME_TARGET, and * we want those to be zero so the lookup isn't skipped. */ - struct dentry *res = nfs_lookup(dir, dentry, 0); - - d_lookup_done(dentry); - if (unlikely(res)) { - if (IS_ERR(res)) - return PTR_ERR(res); - return finish_no_open(file, res); - } + res = nfs_lookup(dir, dentry, 0); } - return finish_no_open(file, NULL); + return finish_no_open(file, res); } EXPORT_SYMBOL_GPL(nfs_atomic_open_v23); -- 2.47.2 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 7/9] simplify fuse_atomic_open() 2025-09-12 18:59 ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro ` (4 preceding siblings ...) 2025-09-12 18:59 ` [PATCH 6/9] simplify nfs_atomic_open_v23() Al Viro @ 2025-09-12 18:59 ` Al Viro 2025-09-12 18:59 ` [PATCH 8/9] simplify gfs2_atomic_open() Al Viro ` (3 subsequent siblings) 9 siblings, 0 replies; 43+ messages in thread From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw) To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/fuse/dir.c | 21 +++++++-------------- 1 file changed, 7 insertions(+), 14 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 2d817d7cab26..d3076bfddb89 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -739,22 +739,18 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry, int err; struct mnt_idmap *idmap = file_mnt_idmap(file); struct fuse_conn *fc = get_fuse_conn(dir); - struct dentry *res = NULL; if (fuse_is_bad(dir)) return -EIO; if (d_in_lookup(entry)) { - res = fuse_lookup(dir, entry, 0); - if (IS_ERR(res)) - return PTR_ERR(res); - - if (res) - entry = res; + struct dentry *res = fuse_lookup(dir, entry, 0); + if (res || d_really_is_positive(entry)) + return finish_no_open(file, res); } - if (!(flags & O_CREAT) || d_really_is_positive(entry)) - goto no_open; + if (!(flags & O_CREAT)) + return finish_no_open(file, NULL); /* Only creates */ file->f_mode |= FMODE_CREATED; @@ -768,16 +764,13 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry, goto mknod; } else if (err == -EEXIST) fuse_invalidate_entry(entry); -out_dput: - dput(res); return err; mknod: err = fuse_mknod(idmap, dir, entry, mode, 0); if (err) - goto out_dput; -no_open: - return finish_no_open(file, res); + return err; + return finish_no_open(file, NULL); } /* -- 2.47.2 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 8/9] simplify gfs2_atomic_open() 2025-09-12 18:59 ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro ` (5 preceding siblings ...) 2025-09-12 18:59 ` [PATCH 7/9] simplify fuse_atomic_open() Al Viro @ 2025-09-12 18:59 ` Al Viro 2025-09-12 18:59 ` [PATCH 9/9] slightly simplify nfs_atomic_open() Al Viro ` (2 subsequent siblings) 9 siblings, 0 replies; 43+ messages in thread From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw) To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos the difference from 9p et.al. is that on gfs2 the lookup side might end up opening the file. That's what the FMODE_OPENED check there is about - and it might actually be seen with finish_open() having failed, if it fails late enough. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/gfs2/inode.c | 26 +++++++++----------------- 1 file changed, 9 insertions(+), 17 deletions(-) diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c index 8760e7e20c9d..8a7ed80d9f2d 100644 --- a/fs/gfs2/inode.c +++ b/fs/gfs2/inode.c @@ -1368,27 +1368,19 @@ static int gfs2_atomic_open(struct inode *dir, struct dentry *dentry, struct file *file, unsigned flags, umode_t mode) { - struct dentry *d; bool excl = !!(flags & O_EXCL); - if (!d_in_lookup(dentry)) - goto skip_lookup; - - d = __gfs2_lookup(dir, dentry, file); - if (IS_ERR(d)) - return PTR_ERR(d); - if (d != NULL) - dentry = d; - if (d_really_is_positive(dentry)) { - if (!(file->f_mode & FMODE_OPENED)) + if (d_in_lookup(dentry)) { + struct dentry *d = __gfs2_lookup(dir, dentry, file); + if (file->f_mode & FMODE_OPENED) { + if (IS_ERR(d)) + return PTR_ERR(d); + dput(d); + return excl && (flags & O_CREAT) ? -EEXIST : 0; + } + if (d || d_really_is_positive(dentry)) return finish_no_open(file, d); - dput(d); - return excl && (flags & O_CREAT) ? -EEXIST : 0; } - - BUG_ON(d != NULL); - -skip_lookup: if (!(flags & O_CREAT)) return -ENOENT; -- 2.47.2 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 9/9] slightly simplify nfs_atomic_open() 2025-09-12 18:59 ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro ` (6 preceding siblings ...) 2025-09-12 18:59 ` [PATCH 8/9] simplify gfs2_atomic_open() Al Viro @ 2025-09-12 18:59 ` Al Viro 2025-09-12 22:23 ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Linus Torvalds 2025-09-13 3:34 ` NeilBrown 9 siblings, 0 replies; 43+ messages in thread From: Al Viro @ 2025-09-12 18:59 UTC (permalink / raw) To: neil; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/nfs/dir.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index c8dd1d0b8d85..5f7d9be6f022 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -2198,8 +2198,6 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry, else dput(dentry); } - if (IS_ERR(res)) - return PTR_ERR(res); return finish_no_open(file, res); } EXPORT_SYMBOL_GPL(nfs_atomic_open); -- 2.47.2 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) 2025-09-12 18:59 ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro ` (7 preceding siblings ...) 2025-09-12 18:59 ` [PATCH 9/9] slightly simplify nfs_atomic_open() Al Viro @ 2025-09-12 22:23 ` Linus Torvalds 2025-09-13 3:34 ` NeilBrown 9 siblings, 0 replies; 43+ messages in thread From: Linus Torvalds @ 2025-09-12 22:23 UTC (permalink / raw) To: Al Viro; +Cc: neil, linux-fsdevel, brauner, jack, miklos On Fri, 12 Sept 2025 at 11:59, Al Viro <viro@zeniv.linux.org.uk> wrote: > > ... allowing any ->lookup() return value to be passed to it. Ack, the whole series looks like a nice cleanup. Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) 2025-09-12 18:59 ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro ` (8 preceding siblings ...) 2025-09-12 22:23 ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Linus Torvalds @ 2025-09-13 3:34 ` NeilBrown 9 siblings, 0 replies; 43+ messages in thread From: NeilBrown @ 2025-09-13 3:34 UTC (permalink / raw) To: Al Viro; +Cc: torvalds, linux-fsdevel, brauner, jack, miklos On Sat, 13 Sep 2025, Al Viro wrote: > ... allowing any ->lookup() return value to be passed to it. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > fs/open.c | 10 ++++++---- > 1 file changed, 6 insertions(+), 4 deletions(-) > > diff --git a/fs/open.c b/fs/open.c > index 9655158c3885..4890b13461c7 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -1059,18 +1059,20 @@ EXPORT_SYMBOL(finish_open); > * finish_no_open - finish ->atomic_open() without opening the file > * > * @file: file pointer > - * @dentry: dentry or NULL (as returned from ->lookup()) > + * @dentry: dentry, ERR_PTR(-E...) or NULL (as returned from ->lookup()) > * > - * This can be used to set the result of a successful lookup in ->atomic_open(). > + * This can be used to set the result of a lookup in ->atomic_open(). This is my favourite part of the series - removing the word "successful". It makes the API more general. My only other thought is that if (d_in_lookup(dentry)) { struct dentry *res = dir->i_op->lookup(dir, dentry, 0); if (res || d_really_is_positive(dentry)) return finish_no_open(file, res); } is a common pattern. It would be nice if we could factor it out into a help but I cannot think of a clean way to do that. You can add Reviewed-by: NeilBrown <neil@brown.name> if you like. Thanks, NeilBrown > * > * NB: unlike finish_open() this function does consume the dentry reference and > * the caller need not dput() it. > * > - * Returns "0" which must be the return value of ->atomic_open() after having > - * called this function. > + * Returns 0 or -E..., which must be the return value of ->atomic_open() after > + * having called this function. > */ > int finish_no_open(struct file *file, struct dentry *dentry) > { > + if (IS_ERR(dentry)) > + return PTR_ERR(dentry); > file->f_path.dentry = dentry; > return 0; > } > -- > 2.47.2 > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-08 9:05 ` Al Viro 2025-09-10 2:45 ` NeilBrown @ 2025-09-13 21:28 ` Al Viro 2025-09-14 1:05 ` NeilBrown 1 sibling, 1 reply; 43+ messages in thread From: Al Viro @ 2025-09-13 21:28 UTC (permalink / raw) To: NeilBrown; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara On Mon, Sep 08, 2025 at 10:05:57AM +0100, Al Viro wrote: > > Fudging some type-state with C may well be useful but I suspect it is at > > most part of a solution. Simplification, documentation, run-time checks > > might also be important parts. As the type-state flag-day is a big > > thing, maybe it shouldn't be first. > > All of that requires being able to answer questions about what's there in > the existing filesystems. Which is pretty much the same problem as > those audits, obviously. And static annotations are way easier to > reason about. Speaking of annoyances, d_exact_alias() is gone, and good riddance, but there's another fun issue in the same area - environment for d_splice_alias() call *and* for one of those d_drop()-just-in-case. The call tree is still the same: _nfs4_open_and_get_state() <- _nfs4_do_open() <- nfs4_do_open() <- nfs4_atomic_open() == nfs_rpc_ops:open_context <- nfs_atomic_open() == ->atomic_open <- nfs4_file_open() == ->open <- nfs4_proc_create() == nfs_rpc_ops:create <- nfs_do_create() <- nfs_create() == ->create <- nfs_atomic_open_v23(), with O_CREAT == ->atomic_open # won't reach nfs4 stuff? ->create() and ->atomic_open() have the parent held at least shared; ->open() does not, but the chunk in question is hit only if dentry is negative, which won't happen in case of ->open(). Additional complication comes from the possibility for _nfs4_open_and_get_state() to fail after that d_splice_alias(). In that case we have _nfs4_do_open() return an error; its caller is inside a do-while loop in nfs4_do_open() and I think we can't end up going around the loop after such late failure (the only error possible after that is -EACCES/-NFS4ERR_ACCESS and that's not one of those that can lead to more iterations. However, looking at that late failure, that's the only call of nfs4_opendata_access(), and that function seems to expect the possibility of state->inode being a directory; can that really happen? Because if it can, we have a problem: alias = d_splice_alias(igrab(state->inode), dentry); /* d_splice_alias() can't fail here - it's a non-directory */ if (alias) { dput(ctx->dentry); ctx->dentry = dentry = alias; } very much *can* fail if it's reached with state->inode being a directory - we can get ERR_PTR() out of that d_splice_alias() and that will oops at if (d_inode(dentry) == state->inode) nfs_inode_attach_open_context(ctx); shortly afterwards (incidentally, what is that check about? It can only fail in case of nfs4_file_open(); should we have open(2) succeed in such situation?) Sigh... ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-13 21:28 ` [RFC] a possible way of reducing the PITA of ->d_name audits Al Viro @ 2025-09-14 1:05 ` NeilBrown 2025-09-14 1:37 ` Al Viro 0 siblings, 1 reply; 43+ messages in thread From: NeilBrown @ 2025-09-14 1:05 UTC (permalink / raw) To: Al Viro, Trond Myklebust Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara On Sun, 14 Sep 2025, Al Viro wrote: > On Mon, Sep 08, 2025 at 10:05:57AM +0100, Al Viro wrote: > > > > Fudging some type-state with C may well be useful but I suspect it is at > > > most part of a solution. Simplification, documentation, run-time checks > > > might also be important parts. As the type-state flag-day is a big > > > thing, maybe it shouldn't be first. > > > > All of that requires being able to answer questions about what's there in > > the existing filesystems. Which is pretty much the same problem as > > those audits, obviously. And static annotations are way easier to > > reason about. > > Speaking of annoyances, d_exact_alias() is gone, and good riddance, but there's > another fun issue in the same area - environment for d_splice_alias() call *and* > for one of those d_drop()-just-in-case. > > The call tree is still the same: > _nfs4_open_and_get_state() > <- _nfs4_do_open() > <- nfs4_do_open() > <- nfs4_atomic_open() > == nfs_rpc_ops:open_context > <- nfs_atomic_open() > == ->atomic_open > <- nfs4_file_open() > == ->open > <- nfs4_proc_create() > == nfs_rpc_ops:create > <- nfs_do_create() > <- nfs_create() > == ->create > <- nfs_atomic_open_v23(), with O_CREAT > == ->atomic_open > # won't reach nfs4 stuff? > > ->create() and ->atomic_open() have the parent held at least shared; > ->open() does not, but the chunk in question is hit only if dentry > is negative, which won't happen in case of ->open(). > > Additional complication comes from the possibility for _nfs4_open_and_get_state() > to fail after that d_splice_alias(). In that case we have _nfs4_do_open() > return an error; its caller is inside a do-while loop in nfs4_do_open() and > I think we can't end up going around the loop after such late failure (the > only error possible after that is -EACCES/-NFS4ERR_ACCESS and that's not one > of those that can lead to more iterations. > > However, looking at that late failure, that's the only call of > nfs4_opendata_access(), and that function seems to expect the possibility > of state->inode being a directory; can that really happen? No that cannot happen. In the NFSv4 protocol, OPEN is only used for files. Directories use the same model as v3 where you pass a filehandle to READDIR without establishing and "open" state. Why do you think nfs4_opendata_access() expects the possibilty of a directory? The purpose of the function is to handle to Posix weirdness around being able to exec() a file that you cannot read(). Over NFS the client needs to be able to read a file in order to execute it, so NFS blurs the two permissions together. This function is check which permission was asked for and then checking access permissions to make sure it is allowed. i.e. the server might let the client read the file storing the application, but the client might not let the application be run. > > Because if it can, we have a problem: > alias = d_splice_alias(igrab(state->inode), dentry); > /* d_splice_alias() can't fail here - it's a non-directory */ > if (alias) { > dput(ctx->dentry); > ctx->dentry = dentry = alias; > } > very much *can* fail if it's reached with state->inode being a directory - > we can get ERR_PTR() out of that d_splice_alias() and that will oops at > > if (d_inode(dentry) == state->inode) > nfs_inode_attach_open_context(ctx); > shortly afterwards (incidentally, what is that check about? It can only > fail in case of nfs4_file_open(); should we have open(2) succeed in such > situation?) I don't know what is going on here. Based on the commit that introduced this code Commit: c45ffdd26961 ("NFSv4: Close another NFSv4 recovery race") there is presumably some race that can cause the test to fail. Maybe Trond (cc:ed) could help explain. NeilBrown > > Sigh... > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-14 1:05 ` NeilBrown @ 2025-09-14 1:37 ` Al Viro 2025-09-14 5:56 ` Al Viro 0 siblings, 1 reply; 43+ messages in thread From: Al Viro @ 2025-09-14 1:37 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara On Sun, Sep 14, 2025 at 11:05:08AM +1000, NeilBrown wrote: > READDIR without establishing and "open" state. > > Why do you think nfs4_opendata_access() expects the possibilty of a > directory? static int nfs4_opendata_access(const struct cred *cred, struct nfs4_opendata *opendata, struct nfs4_state *state, fmode_t fmode) { struct nfs_access_entry cache; u32 mask, flags; /* access call failed or for some reason the server doesn't * support any access modes -- defer access call until later */ if (opendata->o_res.access_supported == 0) return 0; mask = 0; if (fmode & FMODE_EXEC) { /* ONLY check for exec rights */ if (S_ISDIR(state->inode->i_mode)) ^^^^^^^ How else would you describe this? mask = NFS4_ACCESS_LOOKUP; else mask = NFS4_ACCESS_EXECUTE; > > if (d_inode(dentry) == state->inode) > > nfs_inode_attach_open_context(ctx); > > shortly afterwards (incidentally, what is that check about? It can only > > fail in case of nfs4_file_open(); should we have open(2) succeed in such > > situation?) > > I don't know what is going on here. > Based on the commit that introduced this code > > Commit: c45ffdd26961 ("NFSv4: Close another NFSv4 recovery race") > > there is presumably some race that can cause the test to fail. > Maybe Trond (cc:ed) could help explain. AFAICS, it can happen if you are there from nfs4_file_open(), hit _nfs4_opendata_to_nfs4_state(opendata), find ->rpc_done to be true in there, hit nfs4_opendata_find_nfs4_state(), have it call nfs4_opendata_get_inode() and run into a server without NFS_CAP_ATOMIC_OPEN_V1. Then you get ->o_arg.claim set to NFS4_OPEN_CLAIM_NULL and hit this: inode = nfs_fhget(data->dir->d_sb, &data->o_res.fh, &data->f_attr); finding not the same inode as your dentry has attached to it. So the test might end up not being true, at least from my reading of that code. What I don't understand is the reasons for not failing immediately with EOPENSTALE in that case. TBH, I would be a lot more comfortable if the "attach inode to dentry" logics in there had been taken several levels up the call chains - analysis would be much easier that way... ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-14 1:37 ` Al Viro @ 2025-09-14 5:56 ` Al Viro 2025-09-14 23:07 ` NeilBrown 0 siblings, 1 reply; 43+ messages in thread From: Al Viro @ 2025-09-14 5:56 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara On Sun, Sep 14, 2025 at 02:37:30AM +0100, Al Viro wrote: > AFAICS, it can happen if you are there from nfs4_file_open(), hit > _nfs4_opendata_to_nfs4_state(opendata), find ->rpc_done to be true > in there, hit nfs4_opendata_find_nfs4_state(), have it call > nfs4_opendata_get_inode() and run into a server without > NFS_CAP_ATOMIC_OPEN_V1. Then you get ->o_arg.claim set to > NFS4_OPEN_CLAIM_NULL and hit this: > inode = nfs_fhget(data->dir->d_sb, &data->o_res.fh, > &data->f_attr); > finding not the same inode as your dentry has attached to it. > > So the test might end up not being true, at least from my reading of > that code. > > What I don't understand is the reasons for not failing immediately > with EOPENSTALE in that case. > > TBH, I would be a lot more comfortable if the "attach inode to dentry" > logics in there had been taken several levels up the call chains - analysis > would be much easier that way... BTW, that's one place where your scheme with locking dentry might cause latency problems - two opens on the same cached dentry could be sent in parallel, but if you hold it against renames, etc., you might end up with those two roundtrips serialized against each other... ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC] a possible way of reducing the PITA of ->d_name audits 2025-09-14 5:56 ` Al Viro @ 2025-09-14 23:07 ` NeilBrown 0 siblings, 0 replies; 43+ messages in thread From: NeilBrown @ 2025-09-14 23:07 UTC (permalink / raw) To: Al Viro Cc: Trond Myklebust, Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara On Sun, 14 Sep 2025, Al Viro wrote: > On Sun, Sep 14, 2025 at 02:37:30AM +0100, Al Viro wrote: > > > AFAICS, it can happen if you are there from nfs4_file_open(), hit > > _nfs4_opendata_to_nfs4_state(opendata), find ->rpc_done to be true > > in there, hit nfs4_opendata_find_nfs4_state(), have it call > > nfs4_opendata_get_inode() and run into a server without > > NFS_CAP_ATOMIC_OPEN_V1. Then you get ->o_arg.claim set to > > NFS4_OPEN_CLAIM_NULL and hit this: > > inode = nfs_fhget(data->dir->d_sb, &data->o_res.fh, > > &data->f_attr); > > finding not the same inode as your dentry has attached to it. > > > > So the test might end up not being true, at least from my reading of > > that code. > > > > What I don't understand is the reasons for not failing immediately > > with EOPENSTALE in that case. > > > > TBH, I would be a lot more comfortable if the "attach inode to dentry" > > logics in there had been taken several levels up the call chains - analysis > > would be much easier that way... > > BTW, that's one place where your scheme with locking dentry might cause > latency problems - two opens on the same cached dentry could be sent > in parallel, but if you hold it against renames, etc., you might end up > with those two roundtrips serialized against each other... > That is certainly something we need to explore when the time comes, though at the moment I'm mostly focuses on trying to land APIs to centralise the locking so that we can easily see what locking is actually being done and can change it all in one place. Currently all calls to ->atomic_open are exclusive on the dentry, either because O_CREATE means the directory is locked exclusively or because d_alloc_parallel() locked the dentry with DCACHE_PAR_LOOKUP. ...except a cached-negative (which was accepted by d_revalidate) without O_CREATE. I would rather we didn't call ->atomic_open in that case. If the filesystem wants us to (which NFS does) they can fail ->d_revalidate for LOOKUP_OPEN. I have a patch to do that for NFS so we can drop the d_alloc_parallel() from nfs_atomic_open(). I think we need that case to be exclusive anyway. You achieved that for NFS by using d_alloc_parallel(). All other atomic_open functions simply call finish_no_open() or return -ENOENT when given a non-d_in_lookup() dentry and are not asked to CREATE. They could all be simplified if we avoided the atomic_open call in that case. The proposal is to add a case when O_CREATE isn't requested and the dentry is no longer d_in_lookup(). Probably that needs exclusive access to the dentry anyway? Either way it wouldn't be a regression. I wonder if maybe we should transition from offering ->atomic_open() to offering ->lookup_open() which is called instead of the current lookup_open() (though with locking moved inside the function). So for a LAST_NORM open the filesystem can take complete control after the parent has been found. The VFS can export a bunch of appropriate helpers which the filesystem can mix and match. This seems cleaner than having a ->lookup() which bypasses LOOKUP_OPEN requests and a ->d_revalidate which does the same. NeilBrown ^ permalink raw reply [flat|nested] 43+ messages in thread
end of thread, other threads:[~2025-09-15 8:54 UTC | newest] Thread overview: 43+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-09-07 20:32 [RFC] a possible way of reducing the PITA of ->d_name audits Al Viro 2025-09-07 21:51 ` Linus Torvalds 2025-09-08 0:06 ` Al Viro 2025-09-08 0:47 ` Linus Torvalds 2025-09-08 2:51 ` Al Viro 2025-09-08 3:57 ` Al Viro 2025-09-08 4:50 ` NeilBrown 2025-09-08 5:19 ` Al Viro 2025-09-08 6:25 ` NeilBrown 2025-09-08 9:05 ` Al Viro 2025-09-10 2:45 ` NeilBrown 2025-09-10 7:24 ` Al Viro 2025-09-10 22:52 ` NeilBrown 2025-09-12 5:49 ` ->atomic_open() fun (was Re: [RFC] a possible way of reducing the PITA of ->d_name audits) Al Viro 2025-09-12 8:23 ` Miklos Szeredi 2025-09-12 18:29 ` Al Viro 2025-09-12 19:22 ` Miklos Szeredi 2025-09-12 20:36 ` Al Viro 2025-09-12 20:50 ` Al Viro 2025-09-13 3:36 ` NeilBrown 2025-09-13 5:07 ` Al Viro 2025-09-13 5:50 ` NeilBrown 2025-09-14 19:01 ` Miklos Szeredi 2025-09-14 19:50 ` Al Viro 2025-09-14 20:05 ` Miklos Szeredi 2025-09-15 8:54 ` Bernd Schubert 2025-09-12 18:55 ` Al Viro 2025-09-12 18:59 ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Al Viro 2025-09-12 18:59 ` [PATCH 2/9] 9p: simplify v9fs_vfs_atomic_open() Al Viro 2025-09-12 18:59 ` [PATCH 3/9] 9p: simplify v9fs_vfs_atomic_open_dotl() Al Viro 2025-09-12 18:59 ` [PATCH 4/9] simplify cifs_atomic_open() Al Viro 2025-09-12 18:59 ` [PATCH 5/9] simplify vboxsf_dir_atomic_open() Al Viro 2025-09-12 18:59 ` [PATCH 6/9] simplify nfs_atomic_open_v23() Al Viro 2025-09-12 18:59 ` [PATCH 7/9] simplify fuse_atomic_open() Al Viro 2025-09-12 18:59 ` [PATCH 8/9] simplify gfs2_atomic_open() Al Viro 2025-09-12 18:59 ` [PATCH 9/9] slightly simplify nfs_atomic_open() Al Viro 2025-09-12 22:23 ` [PATCH 1/9] allow finish_no_open(file, ERR_PTR(-E...)) Linus Torvalds 2025-09-13 3:34 ` NeilBrown 2025-09-13 21:28 ` [RFC] a possible way of reducing the PITA of ->d_name audits Al Viro 2025-09-14 1:05 ` NeilBrown 2025-09-14 1:37 ` Al Viro 2025-09-14 5:56 ` Al Viro 2025-09-14 23:07 ` NeilBrown
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).