Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* [PATCH 36/53] cephfs: remove d_alloc from CEPH_MDS_OP_LOOKUPNAME handling in ceph_fill_trace()
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

When performing a get_name export_operation, ceph sends a LOOKUPNAME op
to the server.  When it gets a reply it tries to look up the name
locally and if the name exists in the dcache with the wrong inode, it
discards the result and tries again.

If it doesn't find the name in the dcache it will allocate a new dentry
and never make any use of it.  The dentry is never instantiated and is
assigned to ->r_dentry which is then freed by post-op cleanup.

As this is a waste, and as there is a plan to remove d_alloc(), this
code is discarded.

Also try_lookup_noperm() is used in place of full_name_hash() and
d_lookup(), and QSTR_LEN() is used to initialise dname.

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/ceph/inode.c | 29 +++++++----------------------
 1 file changed, 7 insertions(+), 22 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 59f9f6948bb2..0982fbda2a82 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -15,6 +15,7 @@
 #include <linux/sort.h>
 #include <linux/iversion.h>
 #include <linux/fscrypt.h>
+#include <linux/namei.h>
 
 #include "super.h"
 #include "mds_client.h"
@@ -1623,33 +1624,17 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 				ceph_fname_free_buffer(parent_dir, &oname);
 				goto done;
 			}
-			dname.name = oname.name;
-			dname.len = oname.len;
-			dname.hash = full_name_hash(parent, dname.name, dname.len);
+			dname = QSTR_LEN(oname.name, oname.len);
 			tvino.ino = le64_to_cpu(rinfo->targeti.in->ino);
 			tvino.snap = le64_to_cpu(rinfo->targeti.in->snapid);
 retry_lookup:
-			dn = d_lookup(parent, &dname);
+			dn = try_lookup_noperm(&dname, parent);
 			doutc(cl, "d_lookup on parent=%p name=%.*s got %p\n",
 			      parent, dname.len, dname.name, dn);
-
-			if (!dn) {
-				dn = d_alloc(parent, &dname);
-				doutc(cl, "d_alloc %p '%.*s' = %p\n", parent,
-				      dname.len, dname.name, dn);
-				if (!dn) {
-					dput(parent);
-					ceph_fname_free_buffer(parent_dir, &oname);
-					err = -ENOMEM;
-					goto done;
-				}
-				if (is_nokey) {
-					spin_lock(&dn->d_lock);
-					dn->d_flags |= DCACHE_NOKEY_NAME;
-					spin_unlock(&dn->d_lock);
-				}
-				err = 0;
-			} else if (d_really_is_positive(dn) &&
+			if (IS_ERR(dn))
+				/* should be impossible */
+				dn = NULL;
+			if (dn && d_really_is_positive(dn) &&
 				   (ceph_ino(d_inode(dn)) != tvino.ino ||
 				    ceph_snap(d_inode(dn)) != tvino.snap)) {
 				doutc(cl, " dn %p points to wrong inode %p\n",
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 37/53] cephfs: Use d_alloc_noblock() in ceph_readdir_prepopulate()
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

cephfs uses the results of readdir to prime the dcache.  Using d_alloc()
is no longer safe, even with an exclusive lock on the parent, as
d_alloc_parallel() will be allowed to run unlocked.  The safe interface
is d_alloc_noblock().  In the rare case that this blocks because there
is a concurrent lookup for the same name there is little cost in not
completing the allocating in the directory code.

It it still possible to create an inode at this point so we do that even
when there is no dentry.

So change to use d_alloc_noblock() and handle -EWOULDBLOCK.  Also use
QSTR_LEN() to initialise dname, and try_lookup_noperm instead of
full_name_hash() and d_lookup().

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/ceph/inode.c | 37 ++++++++++++++++++++-----------------
 1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 0982fbda2a82..8557b207d337 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2011,9 +2011,7 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 		struct ceph_mds_reply_dir_entry *rde = rinfo->dir_entries + i;
 		struct ceph_vino tvino;
 
-		dname.name = rde->name;
-		dname.len = rde->name_len;
-		dname.hash = full_name_hash(parent, dname.name, dname.len);
+		dname = QSTR_LEN(rde->name, rde->name_len);
 
 		tvino.ino = le64_to_cpu(rde->inode.in->ino);
 		tvino.snap = le64_to_cpu(rde->inode.in->snapid);
@@ -2029,20 +2027,24 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 		}
 
 retry_lookup:
-		dn = d_lookup(parent, &dname);
+		dn = try_lookup_noperm(&dname, parent);
 		doutc(cl, "d_lookup on parent=%p name=%.*s got %p\n",
 		      parent, dname.len, dname.name, dn);
-
-		if (!dn) {
-			dn = d_alloc(parent, &dname);
-			doutc(cl, "d_alloc %p '%.*s' = %p\n", parent,
+		if (IS_ERR(dn)) {
+			err = PTR_ERR(dn);
+			goto out;
+		} else if (!dn) {
+			dn = d_alloc_noblock(parent, &dname);
+			doutc(cl, "d_alloc_noblock %p '%.*s' = %p\n", parent,
 			      dname.len, dname.name, dn);
-			if (!dn) {
-				doutc(cl, "d_alloc badness\n");
-				err = -ENOMEM;
+			if (dn == ERR_PTR(-EWOULDBLOCK)) {
+				/* Just handle the inode info */
+				dn = NULL;
+			} else if (IS_ERR(dn)) {
+				doutc(cl, "d_alloc_noblock badness\n");
+				err = PTR_ERR(dn);
 				goto out;
-			}
-			if (rde->is_nokey) {
+			} else if (rde->is_nokey) {
 				spin_lock(&dn->d_lock);
 				dn->d_flags |= DCACHE_NOKEY_NAME;
 				spin_unlock(&dn->d_lock);
@@ -2069,7 +2071,7 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 		}
 
 		/* inode */
-		if (d_really_is_positive(dn)) {
+		if (dn && d_really_is_positive(dn)) {
 			in = d_inode(dn);
 		} else {
 			in = ceph_get_inode(parent->d_sb, tvino, NULL);
@@ -2087,21 +2089,22 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 		if (ret < 0) {
 			pr_err_client(cl, "badness on %p %llx.%llx\n", in,
 				      ceph_vinop(in));
-			if (d_really_is_negative(dn)) {
+			if (!dn || d_really_is_negative(dn)) {
 				if (inode_state_read_once(in) & I_NEW) {
 					ihold(in);
 					discard_new_inode(in);
 				}
 				iput(in);
 			}
-			d_drop(dn);
+			if (dn)
+				d_drop(dn);
 			err = ret;
 			goto next_item;
 		}
 		if (inode_state_read_once(in) & I_NEW)
 			unlock_new_inode(in);
 
-		if (d_really_is_negative(dn)) {
+		if (d_in_lookup(dn) || d_really_is_negative(dn)) {
 			if (ceph_security_xattr_deadlock(in)) {
 				doutc(cl, " skip splicing dn %p to inode %p"
 				      " (security xattr deadlock)\n", dn, in);
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Sean Christopherson @ 2026-03-13  0:36 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Fuad Tabba, kvm, linux-doc, linux-kernel, linux-kselftest,
	linux-trace-kernel, x86, aik, andrew.jones, binbin.wu, bp,
	brauner, chao.p.peng, chao.p.peng, chenhuacai, corbet,
	dave.hansen, david, hpa, ira.weiny, jgg, jmattson, jroedel,
	jthoughton, maobibo, mathieu.desnoyers, maz, mhiramat,
	michael.roth, mingo, mlevitsk, oupton, pankaj.gupta, pbonzini,
	prsampat, qperret, ricarkol, rick.p.edgecombe, rientjes, rostedt,
	shivankg, shuah, steven.price, tglx, vannapurve, vbabka, willy,
	wyihan, yan.y.zhao
In-Reply-To: <CAEvNRgFUc+9xCoN9Yo5NThHrvbccWAhPwp9nNM2fvx7QqrcJsg@mail.gmail.com>

On Thu, Mar 12, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Thu, Mar 12, 2026, Fuad Tabba wrote:
> >> Hi Ackerley,
> >>
> >> Before getting into the UAPI semantics, thank you for all the heavy
> >> lifting you've done here. Figuring out how to make it all work across
> >> the different platforms is not easy :)
> >>
> >> <snip>
> >>
> >> > The policy definitions below provide more details:
> >
> > Please drop "CONTENT_POLICY" from the KVM documentation.  From KVM's perspective,
> > these are not "policy", they are purely properties of the underlying memory.
> > Userspace will likely use the attributes to implement policy of some kind, but
> > KVM straight up doesn't care.
> 
> Policy might have been the wrong word. I think this is a property of the
> conversion process/request, not a property of the memory like how
> shared/private is a property of the memory?
> 
> I'll have to find another word to describe this enum of

Or just don't?  I'm 100% serious, because unless we carve out a field _just_ for
these two flags, they're eventually going to get mixed with other stuff.  At that
point, having a precisely named enum container just gets in the way.

> I see you dropped any documentation to do with testing.

Yes.

> I meant to document it (at least something about the unspecified case) so it
> can be relied on in selftests, with the understanding (already specified
> elsewhere in Documentation/virt/kvm/api.rst) that nothing about
> KVM_X86_SW_PROTECTED_VM is to be relied on in production, and can be changed
> anytime. What do you think?

KVM_X86_SW_PROTECTED_VM should self-report like all other VM types, and shouldn't
support anything that isn't documented as possible.  I.e. we shouldn't allow
ZERO on shared=>private "for testing".

What I do think we should do is scribble memory on conversions without ZERO or
PRIVATE, probably guarded by a Kconfig or maybe a module param, to do a best
effort enforcement of the ABI, i.e. to try and prevent userspace from depending
on uarch/vendor specific behavior.

^ permalink raw reply

* Re: [PATCH RFC 00/53] lift lookup out of exclive lock for dir ops
From: NeilBrown @ 2026-03-13  0:18 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Masami Hiramatsu,
	Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko, Tyler Hicks,
	Andreas Gruenbacher, Richard Weinberger, Anton Ivanov,
	Johannes Berg, Jeremy Kerr, Ard Biesheuvel, linux-fsdevel,
	linux-nfs, linux-xfs, linux-unionfs, coda, linux-mm, linux-afs,
	linux-cifs, linux-ext4, linux-kernel, linux-trace-kernel,
	ceph-devel, ecryptfs, gfs2, linux-um, linux-efi
In-Reply-To: <20260312193847.28c32a2c@gandalf.local.home>

On Fri, 13 Mar 2026, Steven Rostedt wrote:
> On Fri, 13 Mar 2026 08:11:47 +1100
> NeilBrown <neilb@ownmail.net> wrote:
> 
> > *[PATCH 26/53] smb/client: don't unhashed and rehash to prevent new
> > *[PATCH 27/53] smb/client: use d_splice_alias() in atomic_open
> >  [PATCH 28/53] smb/client: Use d_alloc_noblock() in
> > *[PATCH 29/53] exfat: simplify exfat_lookup()
> > *[PATCH 30/53] configfs: remove d_add() calls before
> >  [PATCH 31/53] configfs: stop using d_add().
> > *[PATCH 32/53] ext4: move dcache modifying code out of __ext4_link()
> > *[PATCH 33/53] ext4: use on-stack dentries in
> 
> >  [PATCH 34/53] tracefs: stop using d_add().
> 
> Hmm, another reason I hate being Cc'd on every patch of a patch bomb where
> I only need to look at one (and maybe the first) patch.

I could try to refine my tooling, but you can't please all the people
all the time...  I wonder how many people would be bothered if only the
cover-letter was sent to everyone, and the patches only went to lkml -
to be fetched from lore if not subscribed.

You would probably need to look at 02/53

https://github.com/neilbrown/linux/commit/aebdc6545eb18e5b6a7d41320f30d752996b3c6c

to have the context to understand 34/53

> 
> For some reason, I'm missing several patches, and this is one of them :-p

They don't seem to have made it to lore.kernel.org either.  Maybe I'm
being rate-limited somewhere.

https://github.com/neilbrown/linux/commit/77074c04a94176d6b2b2caf44dd84f0788a420c4

Thanks,
NeilBrown

> 
> -- Steve
> 
> 
> >  [PATCH 35/53] cephfs: stop using d_add().
> > *[PATCH 36/53] cephfs: remove d_alloc from CEPH_MDS_OP_LOOKUPNAME
> >  [PATCH 37/53] cephfs: Use d_alloc_noblock() in
> >  [PATCH 38/53] cephfs: Don't d_drop() before d_splice_alias()
> >  [PATCH 39/53] ecryptfs: stop using d_add().
> >  [PATCH 40/53] gfs2: stop using d_add().
> 


^ permalink raw reply

* Re: [PATCH RFC 00/53] lift lookup out of exclive lock for dir ops
From: NeilBrown @ 2026-03-13  0:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Jeff Layton,
	Trond Myklebust, Anna Schumaker, Carlos Maiolino, Miklos Szeredi,
	Amir Goldstein, Jan Harkes, Hugh Dickins, Baolin Wang,
	David Howells, Marc Dionne, Steve French, Namjae Jeon,
	Sungjong Seo, Yuezhang Mo, Andreas Hindborg, Breno Leitao,
	Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel,
	linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <CAHk-=wh92deXvH5iXCo9mThXCBYt-jRcVu=z4kiH-f3+wZQOHA@mail.gmail.com>

On Fri, 13 Mar 2026, Linus Torvalds wrote:
> On Thu, 12 Mar 2026 at 14:44, NeilBrown <neilb@ownmail.net> wrote:
> >
> > This patch set progresses my effort to improve concurrency of
> > directory operations and specifically to allow concurrent updates
> > in a given directory.
> 
> I only got about half the patches, but the ones I did get didn't raise
> my hackles.
> 
> HOWEVER.
> 
> This is very much a "absolutely requires ACKs from Al" series. Al?

Yes, I'm looking forward to Al's thoughts

> 
> Also, because I only got about half the patches, and there's 53 of
> them total, I'd really like to see a git branch for something like
> this. It makes it easier to review for me, and I suspect it makes it
> easier for some of the test robots too.

github.com/neilbrown/linux.git branch pdirops

But if you have only time for one patch, 52/53 is the one to look at.

Thanks,
NeilBrown

> 
> But again - this needs Al to look at it. Iirc he had some fundamental
> concern with the last version - hopefully now fixed, but ...
> 
>                  Linus
> 
> 


^ permalink raw reply

* Re: [PATCH RFC 00/53] lift lookup out of exclive lock for dir ops
From: Linus Torvalds @ 2026-03-12 23:46 UTC (permalink / raw)
  To: NeilBrown
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Jeff Layton,
	Trond Myklebust, Anna Schumaker, Carlos Maiolino, Miklos Szeredi,
	Amir Goldstein, Jan Harkes, Hugh Dickins, Baolin Wang,
	David Howells, Marc Dionne, Steve French, Namjae Jeon,
	Sungjong Seo, Yuezhang Mo, Andreas Hindborg, Breno Leitao,
	Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel,
	linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

On Thu, 12 Mar 2026 at 14:44, NeilBrown <neilb@ownmail.net> wrote:
>
> This patch set progresses my effort to improve concurrency of
> directory operations and specifically to allow concurrent updates
> in a given directory.

I only got about half the patches, but the ones I did get didn't raise
my hackles.

HOWEVER.

This is very much a "absolutely requires ACKs from Al" series. Al?

Also, because I only got about half the patches, and there's 53 of
them total, I'd really like to see a git branch for something like
this. It makes it easier to review for me, and I suspect it makes it
easier for some of the test robots too.

But again - this needs Al to look at it. Iirc he had some fundamental
concern with the last version - hopefully now fixed, but ...

                 Linus

^ permalink raw reply

* Re: [PATCH] tracing: Generate undef symbols allowlist for simple_ring_buffer
From: Nathan Chancellor @ 2026-03-12 23:51 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: maz, rostedt, arnd, linux-trace-kernel, kvmarm, kernel-team
In-Reply-To: <20260312182010.111013-1-vdonnefort@google.com>

Hi Vincent,

On Thu, Mar 12, 2026 at 06:20:10PM +0000, Vincent Donnefort wrote:
> Compiler and tooling-generated symbols are difficult to maintain
> across all supported architectures. Make the allowlist more robust by
> replacing the harcoded list with a mechanism that automatically detects
> these symbols.
> 
> This mechanism generates a C function designed to trigger common
> compiler-inserted symbols.

This certainly seems more robust.

> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
> 
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index beb15936829d..3b427b76434a 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -136,17 +136,37 @@ obj-$(CONFIG_TRACE_REMOTE_TEST) += remote_test.o
>  # simple_ring_buffer is used by the pKVM hypervisor which does not have access
>  # to all kernel symbols. Fail the build if forbidden symbols are found.
>  #
> -UNDEFINED_ALLOWLIST := memset alt_cb_patch_nops __x86 __ubsan __asan __kasan __gcov __aeabi_unwind
> -UNDEFINED_ALLOWLIST += __stack_chk_fail stackleak_track_stack __ref_stack __sanitizer llvm_gcda llvm_gcov
> -UNDEFINED_ALLOWLIST += .TOC\. __clear_pages_unrolled __memmove copy_page warn_slowpath_fmt
> -UNDEFINED_ALLOWLIST += ftrace_likely_update __hwasan_load __hwasan_store __hwasan_tag_memory
> -UNDEFINED_ALLOWLIST += warn_bogus_irq_restore __stack_chk_guard
> -UNDEFINED_ALLOWLIST := $(addprefix -e , $(UNDEFINED_ALLOWLIST))
> +# undefsyms_base generates a set of compiler and tooling-generated symbols that can
> +# safely be ignored for simple_ring_buffer.
> +#
> +$(obj)/undefsyms_base.c: FORCE
> +	$(Q)echo '#include <asm/page.h>' > $@
> +	$(Q)echo '#include <asm/local.h>' >> $@
> +	$(Q)echo 'static char page[PAGE_SIZE] __aligned(PAGE_SIZE);' >> $@
> +	$(Q)echo 'void undefsyms_base(int n);' >> $@
> +	$(Q)echo 'void undefsyms_base(int n) {' >> $@
> +	$(Q)echo '	char buffer[256] = { 0 };' >> $@
> +	$(Q)echo '	u32 u = 0;' >> $@
> +	$(Q)echo '	memset((char * volatile)page, 8, PAGE_SIZE);' >> $@
> +	$(Q)echo '	memset((char * volatile)buffer, 8, sizeof(buffer));' >> $@
> +	$(Q)echo '	cmpxchg((u32 * volatile)&u, 0, 8);' >> $@
> +	$(Q)echo '	WARN_ON(n == 0xdeadbeef);' >> $@
> +	$(Q)echo '}' >> $@

This should use filechk, otherwise undefsyms_base.c will be regenerated
every build, resulting in undefsyms_base.o being rebuilt every time.

  $ make -skj"$(nproc)" ARCH=x86_64 mrproper allmodconfig kernel/trace/

  $ make -skj"$(nproc)" ARCH=x86_64 V=2 kernel/trace/
  ...
    CC      kernel/trace/undefsyms_base.o - due to: kernel/trace/undefsyms_base.c
    NM      kernel/trace/simple_ring_buffer.o - due to target missing

filechk_undefsyms_base = {                                              \
	echo '$(pound)include <asm/page.h>';                            \
	echo '$(pound)include <asm/local.h>';                           \
	echo 'static char page[PAGE_SIZE] __aligned(PAGE_SIZE);';       \
	echo 'void undefsyms_base(int n);';                             \
	echo 'void undefsyms_base(int n) {';                            \
	echo '	char buffer[256] = { 0 };';                             \
	echo '	u32 u = 0;';                                            \
	echo '	memset((char * volatile)page, 8, PAGE_SIZE);';          \
	echo '	memset((char * volatile)buffer, 8, sizeof(buffer));';   \
	echo '	cmpxchg((u32 * volatile)&u, 0, 8);';                    \
	echo '	WARN_ON(n == 0xdeadbeef);';                             \
	echo '}';                                                       \
	}

$(obj)/undefsyms_base.c: FORCE
	$(call filechk,undefsyms_base)

  $ make -skj"$(nproc)" ARCH=x86_64 mrproper allmodconfig kernel/trace/

  $ make -skj"$(nproc)" ARCH=x86_64 V=2 kernel/trace/
    GEN     Makefile - due to target is PHONY
    DESCEND objtool
    CALL    scripts/checksyscalls.sh - due to target is PHONY
    INSTALL libsubcmd_headers
    NM      kernel/trace/simple_ring_buffer.o - due to target missing

> +clean-files += undefsyms_base.c
> +targets += undefsyms_base.c

I don't think this targets addition is necessary.

> +$(obj)/undefsyms_base.o: $(obj)/undefsyms_base.c
> +
> +extra-y += undefsyms_base.o

I think this should be

  targets += undefsyms_base.o

as extra-y is deprecated per Documentation/kbuild/makefiles.rst.

> +UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sanitizer __tsan __ubsan __x86_indirect_thunk \
> +		       $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')

With an allmodconfig + ThinLTO build, I still see:

  $ cat allmod.config
  CONFIG_GCOV_KERNEL=n
  CONFIG_KASAN=n
  CONFIG_LTO_CLANG_THIN=y

  $ make -skj"$(nproc)" ARCH=x86_64 KCONFIG_ALLCONFIG=1 LLVM=1 mrproper allmodconfig kernel/trace/
  Unexpected symbols in kernel/trace/simple_ring_buffer.o:
                   U __fortify_panic
                   U __write_overflow_field
                   U simple_ring_buffer_commit
                   U simple_ring_buffer_enable_tracing
                   U simple_ring_buffer_init
                   U simple_ring_buffer_reserve
                   U simple_ring_buffer_reset
                   U simple_ring_buffer_swap_reader_page
                   U simple_ring_buffer_unload

Something like:

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 48c415a0c7e4..0f9a6ce9abd9 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -142,13 +142,15 @@ obj-$(CONFIG_TRACE_REMOTE_TEST) += remote_test.o
 filechk_undefsyms_base = {                                              \
 	echo '$(pound)include <asm/page.h>';                            \
 	echo '$(pound)include <asm/local.h>';                           \
+	echo '$(pound)include <linux/string.h>';                        \
 	echo 'static char page[PAGE_SIZE] __aligned(PAGE_SIZE);';       \
-	echo 'void undefsyms_base(int n);';                             \
-	echo 'void undefsyms_base(int n) {';                            \
+	echo 'void undefsyms_base(int n, void *ptr);';                  \
+	echo 'void undefsyms_base(int n, void *ptr) {';                 \
 	echo '	char buffer[256] = { 0 };';                             \
 	echo '	u32 u = 0;';                                            \
 	echo '	memset((char * volatile)page, 8, PAGE_SIZE);';          \
 	echo '	memset((char * volatile)buffer, 8, sizeof(buffer));';   \
+	echo '	memcpy((void* volatile)ptr, buffer, sizeof(buffer));';  \
 	echo '	cmpxchg((u32 * volatile)&u, 0, 8);';                    \
 	echo '	WARN_ON(n == 0xdeadbeef);';                             \
 	echo '}';                                                       \
--

cures the first two. The simple_ring_buffer symbols are very odd...

  $ llvm-nm kernel/trace/simple_ring_buffer.o | grep simple_ring_buffer
  ---------------- d __UNIQUE_ID_addressable_simple_ring_buffer_commit_845
  ---------------- d __UNIQUE_ID_addressable_simple_ring_buffer_enable_tracing_849
  ---------------- d __UNIQUE_ID_addressable_simple_ring_buffer_init_847
  ---------------- d __UNIQUE_ID_addressable_simple_ring_buffer_reserve_841
  ---------------- d __UNIQUE_ID_addressable_simple_ring_buffer_reset_846
  ---------------- d __UNIQUE_ID_addressable_simple_ring_buffer_swap_reader_page_837
  ---------------- d __UNIQUE_ID_addressable_simple_ring_buffer_unload_848
  ---------------- t __export_symbol_simple_ring_buffer_commit
  ---------------- t __export_symbol_simple_ring_buffer_enable_tracing
  ---------------- t __export_symbol_simple_ring_buffer_init
  ---------------- t __export_symbol_simple_ring_buffer_reserve
  ---------------- t __export_symbol_simple_ring_buffer_reset
  ---------------- t __export_symbol_simple_ring_buffer_swap_reader_page
  ---------------- t __export_symbol_simple_ring_buffer_unload
  ---------------- T simple_ring_buffer_commit
                   U simple_ring_buffer_commit
  ---------------- T simple_ring_buffer_enable_tracing
                   U simple_ring_buffer_enable_tracing
                   U simple_ring_buffer_init
  ---------------- T simple_ring_buffer_init
  ---------------- T simple_ring_buffer_init_mm
  ---------------- T simple_ring_buffer_reserve
                   U simple_ring_buffer_reserve
  ---------------- T simple_ring_buffer_reset
                   U simple_ring_buffer_reset
                   U simple_ring_buffer_swap_reader_page
  ---------------- T simple_ring_buffer_swap_reader_page
                   U simple_ring_buffer_unload
  ---------------- T simple_ring_buffer_unload
  ---------------- T simple_ring_buffer_unload_mm

This is LLVM IR bitcode at this stage, which could be messing things up.

  $ file kernel/trace/simple_ring_buffer.o
  kernel/trace/simple_ring_buffer.o: LLVM IR bitcode

Maybe not worth thinking about too much and just adding it to the
allowlist manually?

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 0f9a6ce9abd9..cb1ec50a8386 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -166,6 +166,7 @@ $(obj)/undefsyms_base.o: $(obj)/undefsyms_base.c
 targets += undefsyms_base.o
 
 UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sanitizer __tsan __ubsan __x86_indirect_thunk \
+		       simple_ring_buffer \
 		       $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')
 
 quiet_cmd_check_undefined = NM      $<
--

Cheers,
Nathan

^ permalink raw reply related

* Re: [PATCH RFC 00/53] lift lookup out of exclive lock for dir ops
From: Steven Rostedt @ 2026-03-12 23:38 UTC (permalink / raw)
  To: NeilBrown
  Cc: NeilBrown, Linus Torvalds, Alexander Viro, Christian Brauner,
	Jan Kara, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Carlos Maiolino, Miklos Szeredi, Amir Goldstein, Jan Harkes,
	Hugh Dickins, Baolin Wang, David Howells, Marc Dionne,
	Steve French, Namjae Jeon, Sungjong Seo, Yuezhang Mo,
	Andreas Hindborg, Breno Leitao, Theodore Ts'o, Andreas Dilger,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel,
	linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

On Fri, 13 Mar 2026 08:11:47 +1100
NeilBrown <neilb@ownmail.net> wrote:

> *[PATCH 26/53] smb/client: don't unhashed and rehash to prevent new
> *[PATCH 27/53] smb/client: use d_splice_alias() in atomic_open
>  [PATCH 28/53] smb/client: Use d_alloc_noblock() in
> *[PATCH 29/53] exfat: simplify exfat_lookup()
> *[PATCH 30/53] configfs: remove d_add() calls before
>  [PATCH 31/53] configfs: stop using d_add().
> *[PATCH 32/53] ext4: move dcache modifying code out of __ext4_link()
> *[PATCH 33/53] ext4: use on-stack dentries in

>  [PATCH 34/53] tracefs: stop using d_add().

Hmm, another reason I hate being Cc'd on every patch of a patch bomb where
I only need to look at one (and maybe the first) patch.

For some reason, I'm missing several patches, and this is one of them :-p

-- Steve


>  [PATCH 35/53] cephfs: stop using d_add().
> *[PATCH 36/53] cephfs: remove d_alloc from CEPH_MDS_OP_LOOKUPNAME
>  [PATCH 37/53] cephfs: Use d_alloc_noblock() in
>  [PATCH 38/53] cephfs: Don't d_drop() before d_splice_alias()
>  [PATCH 39/53] ecryptfs: stop using d_add().
>  [PATCH 40/53] gfs2: stop using d_add().

^ permalink raw reply

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Ackerley Tng @ 2026-03-12 21:59 UTC (permalink / raw)
  To: Sean Christopherson, Fuad Tabba
  Cc: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86, aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, shivankg, shuah,
	steven.price, tglx, vannapurve, vbabka, willy, wyihan, yan.y.zhao
In-Reply-To: <abLfWHf89TxWqeGZ@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Thu, Mar 12, 2026, Fuad Tabba wrote:
>> Hi Ackerley,
>>
>> Before getting into the UAPI semantics, thank you for all the heavy
>> lifting you've done here. Figuring out how to make it all work across
>> the different platforms is not easy :)
>>
>> <snip>
>>
>> > The policy definitions below provide more details:
>
> Please drop "CONTENT_POLICY" from the KVM documentation.  From KVM's perspective,
> these are not "policy", they are purely properties of the underlying memory.
> Userspace will likely use the attributes to implement policy of some kind, but
> KVM straight up doesn't care.

Policy might have been the wrong word. I think this is a property of the
conversion process/request, not a property of the memory like how
shared/private is a property of the memory?

I'll have to find another word to describe this enum of

* KVM_SET_MEMORY_ATTRIBUTES2_ZERO
* KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE

>
>> > ``KVM_SET_MEMORY_ATTRIBUTES2_CONTENT_POLICY_ZERO`` (default)
>
> The default behavior absolutely cannot be something that's not supported on
> every conversion type.
>
>> >
>> >   On a private to shared conversion, the host will read zeros from the
>> >   converted memory on the next fault after successful return of the
>> >   KVM_SET_MEMORY_ATTRIBUTES2 ioctl.
>> >
>> >   This is not supported (-EOPNOTSUPP) for a shared to private
>> >   conversion. While some CoCo implementations do zero memory contents
>> >   such that the guest reads zeros after conversion, the guest is not
>> >   expected to trust host-provided zeroing, hence as a UAPI policy, KVM
>> >   does not make any such guarantees.
>>
>> The rationale for not supporting this in the UAPI isn't quite right
>> and I think that the prohibition should be removed. It's true that the
>> guest is not expected to trust host-provided zeroing. However, if the
>> VMM invokes this ioctl with the ZERO policy, the zeroing is performed
>> by the hypervisor, not by the (untrusted) host.
>
> What entity zeros the data doesn't matter as far as KVM's ABI is concerned.  That's
> a motivating favor to providing ZERO, e.g. it allow userspace to elide additional
> zeroing when it _knows_ the memory holds zeros, but that's orthogonal to KVM's
> contract with userspace.
>
>> Although pKVM handles fresh, zeroed memory provisioning via donation
>> rather than attribute conversion, stating that the UAPI cannot make
>> guarantees due to trust boundaries is incorrect. The hypervisor is
>
> We should avoid using "hypervisor", because (a) it means different things to
> different people and (b) even when there's consensus on what "hypervisor" means,
> whether or not the hypervisor is trusted varies per implementation.
>
>> need to be careful witho precisely the entity the guest trusts to enforce
>> this.
>>
>> The UAPI should define the semantics for a shared-to-private ZERO
>> conversion, even if current architectures return -EOPNOTSUPP because
>> they handle fresh memory provisioning via other mechanisms (like
>> pKVM's donation path).
>>
>> How about something like the following:
>>
>> On a shared to private conversion, the hypervisor will zero the memory
>
> Again, say _nothing_ about "the hypervisor".  _How_ or when anything happens is
> completely irrelevant, the only thing that matters here is _what_ happens.
>
>> contents before mapping it into the guest's private address space,
>> preventing the untrusted host from injecting arbitrary data into the
>> guest. If an architecture handles zeroed-provisioning via mechanisms
>> other than attribute conversion, it may return -EOPNOTSUPP.
>
> No.  I am 100% against bleeding vendor specific information into KVM's ABI for
> this.  What the vendor code does is irrelevant, the _only_ thing that matters
> here is KVM's contract with userspace.
>
> That doesn't mean pKVM guests can't rely on memory being zeroed, but that is a
> contract between pKVM and its guests, not between KVM and host userspace.
>

If pKVM's (kernel, or elsewhere) documentation says something like

  Shared to private (in addition to private to shared already specified
  in the userspace/KVM contract) conversions through guest_memfd
  specifying ZERO will have memory contents zeroed.

Would that then cover both perspectives? I see Fuad's point that pKVM
would like to provide guarantees in the shared to private direction too,
and I see Sean's point that the shared to private direction isn't a
userspace/KVM thing.

The awkward part is that we guarantee both directions for PRESERVE but
not for ZERO.

>> >   For testing purposes, the KVM_X86_SW_PROTECTED_VM testing vehicle
>> >   will support this policy and ensure zeroing for conversions in both
>> >   directions.
>> >
>> > ``KVM_SET_MEMORY_ATTRIBUTES2_CONTENT_POLICY_PRESERVE``
>> >
>> >   On private/shared conversions in both directions, memory contents
>> >   will be preserved and readable. As a concrete example, if the host
>> >   writes ``0xbeef`` to memory and converts the memory to shared, the
>> >   guest will also read ``0xbeef``, after any necessary hardware or
>> >   software provided decryption. After a reverse shared to private
>> >   conversion, the host will also read ``0xbeef``.
>>
>> I think that this example is backwards. If the host writes to memory,
>> that memory is already shared, isn't it? Converting it to shared is
>> redundant. More importantly, if memory undergoes a shared-to-private
>> conversion, the host must lose access entirely.
>
> Ya, it's messed up.
>

Omg, it is backwards!! Might have been copypasta...

>> Maybe a clearer example would reflect actual payload injection and
>> bounce buffer sharing:
>> - Shared-to-Private (Payload Injection): The host writes a payload
>> (e.g., 0xbeef) to shared memory and converts it to private. The guest
>> reads 0xbeef in its private address space. The host loses access.
>> - Private-to-Shared (Bounce Buffer): The guest writes 0xbeef to
>> private memory and converts it to shared. The host reads 0xbeef.
>>
>> >   pKVM (ARM) is the first user of this policy. Since pKVM does not
>> >   protect memory with encryption, a content policy to preserve memory
>> >   will not will not involve any decryption. The guest will be able to
>> >   read what the host wrote with full content preservation.
>>
>> This is correct, but to be precise, I think it should explicitly
>> mention Stage-2 page tables as the protection mechanism, maybe:
>
> pKVM shouldn't be mentioned in here at all.
>
> ---
> By default, KVM makes no guarantees about the in-memory values after memory is
> convert to/from shared/private.  Optionally, userspace may instruct KVM to
> ensure the contents of memory are zeroed or preserved, e.g. to enable in-place
> sharing of data, or as an optimization to avoid having to re-zero memory when
> the trusted entity guarantees the memory will be zeroed after conversion.
>

How about:

or as an optimization to avoid having to re-zero memory when userspace
could have relied on the trusted entity to guarantee the memory will be
zeroed as part of the entire conversion process.

> The behaviors supported by a given KVM instance can be queried via <cap>.  If

I started with some implementation and was questioning the value of a
CAP. It seems like there won't be anything dynamic about this?

The userspace code can check what platform it is running on, and then
decide ZERO or PRESERVE based on the platform:

If the VM is running on TDX, it would want to specify ZERO all the
time. If the VM were running on pKVM it would want to specify PRESERVE
if it wants to enable in-place sharing, and ZERO if it wants to zero the
memory.

If someday TDX supports PRESERVE, then there's room for discovery of
which algorithm to choose when running the guest. Perhaps that's when
the CAP should be introduced?

> the requested behavior is an unsupported, KVM will return -EOPNOTSUPP and
> reject the conversion request.  Note!  The "ZERO" request is only support for
> private to shared conversion!
>

Do you mean ZERO is only guaranteed for private to shared? If we say
"ZERO is only guaranteed for private to shared", then pKVM could
additionally guarantee zeroing for shared to private. If we say it is
only supported for private to shared, then should I return -EOPNOTSUPP
and therefore not allow platforms to provide other guarantees?

I think we should stick to guarantees for this

* not specified (default) = no guarantees whatsoever
* ZERO = guaranteed zero for shared to private, no guarantees for
         private to shared. Platforms can add on more guarantees.
* PRESERVE = guaranteed preseved in both directions

-EOPNOTSUPP should probably be understood as "There is no way to
guarantee this" like how TDX would return -EOPNOTSUPP for PRESERVE
(now).

> ``KVM_SET_MEMORY_ATTRIBUTES2_ZERO``
>
>   On conversion, KVM guarantees all entities that have "allowed" access to the
>   memory will read zeros.  E.g. on private to shared conversion, both trusted
>   and untrusted code will read zeros.
>
>   Zeroing is currently only supported for private-to-shared conversions, as KVM
>   in general is untrusted and thus cannot guarantee the guest (or any trusted
>   entity) will read zeros after conversion.  Note, some CoCo implementations do
>   zero memory contents such that the guest reads zeros after conversion, and
>   the guest may choose to rely on that behavior.  But that's a contract between
>   the trusted CoCo entity and the guest, not between KVM and the guest.
>
> ``KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE``
>
>   On conversion, KVM guarantees memory contents will be preserved with respect
>   to the last written unencrypted value.  As a concrete example, if the host
>   writes ``0xbeef`` to shared memory and converts the memory to private, the
>   guest will also read ``0xbeef``, even if the in-memory data is encrypted as
>   part of the conversion.  And vice versa, if the guest writes ``0xbeef`` to
>   private memory and then converts the memory to shared, the host (and guest)
>   will read ``0xbeef`` (if the memory is accessible).

Thank you for this summary :)

I see you dropped any documentation to do with testing. I meant to
document it (at least something about the unspecified case) so it can be
relied on in selftests, with the understanding (already specified
elsewhere in Documentation/virt/kvm/api.rst) that nothing about
KVM_X86_SW_PROTECTED_VM is to be relied on in production, and can be
changed anytime. What do you think?

^ permalink raw reply

* [PATCH 53/53] VFS: remove LOOKUP_SHARED
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

->lookup is now always called with a shared lock and LOOKUP_SHARED set,
so we can discard that flag and remove the code for when it wasn't set.

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/afs/dir.c           | 10 ++--------
 fs/dcache.c            | 13 +++----------
 fs/namei.c             | 10 +++++-----
 fs/xfs/xfs_iops.c      |  3 +--
 include/linux/dcache.h |  3 +--
 include/linux/namei.h  |  3 +--
 6 files changed, 13 insertions(+), 29 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index f259ca2da383..29e39aeaf654 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -938,10 +938,7 @@ static struct dentry *afs_lookup_atsys(struct inode *dir, struct dentry *dentry,
 	/* Calling d_alloc_parallel() while holding parent locked is undesirable.
 	 * We don't really need the lock any more.
 	 */
-	if (flags & LOOKUP_SHARED)
-		inode_unlock_shared(dir);
-	else
-		inode_unlock(dir);
+	inode_unlock_shared(dir);
 	for (i = 0; i < subs->nr; i++) {
 		name = subs->subs[i];
 		len = dentry->d_name.len - 4 + strlen(name);
@@ -966,10 +963,7 @@ static struct dentry *afs_lookup_atsys(struct inode *dir, struct dentry *dentry,
 	 */
 	ret = NULL;
 out_s:
-	if (flags & LOOKUP_SHARED)
-		inode_lock_shared(dir);
-	else
-		inode_lock_nested(dir, I_MUTEX_PARENT);
+	inode_lock_shared(dir);
 	afs_put_sysnames(subs);
 	kfree(buf);
 out_p:
diff --git a/fs/dcache.c b/fs/dcache.c
index f573716d1a04..2d694e14bd22 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2224,7 +2224,6 @@ EXPORT_SYMBOL(d_obtain_root);
  * @dentry: the negative dentry that was passed to the parent's lookup func
  * @inode:  the inode case-insensitive lookup has found
  * @name:   the case-exact name to be associated with the returned dentry
- * @bool:   %true if lookup was performed with LOOKUP_SHARED
  *
  * This is to avoid filling the dcache with case-insensitive names to the
  * same inode, only the actual correct case is stored in the dcache for
@@ -2237,7 +2236,7 @@ EXPORT_SYMBOL(d_obtain_root);
  * the exact case, and return the spliced entry.
  */
 struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
-			struct qstr *name, bool shared)
+			struct qstr *name)
 {
 	struct dentry *found, *res;
 
@@ -2257,19 +2256,13 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 	 * d_in_lookup() (so ->d_parent is stable) and we are near the
 	 * end ->lookup() and will shortly drop the lock anyway.
 	 */
-	if (shared)
-		inode_unlock_shared(d_inode(dentry->d_parent));
-	else
-		inode_unlock(d_inode(dentry->d_parent));
+	inode_unlock_shared(d_inode(dentry->d_parent));
 	found = d_alloc_parallel(dentry->d_parent, name);
 	if (IS_ERR(found) || !d_in_lookup(found)) {
 		iput(inode);
 		return found;
 	}
-	if (shared)
-		inode_lock_shared(d_inode(dentry->d_parent));
-	else
-		inode_lock_nested(d_inode(dentry->d_parent), I_MUTEX_PARENT);
+	inode_lock_shared(d_inode(dentry->d_parent));
 	res = d_splice_alias(inode, found);
 	if (res) {
 		d_lookup_done(found);
diff --git a/fs/namei.c b/fs/namei.c
index 3d213070a515..9e2ac3077f72 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1826,7 +1826,7 @@ static struct dentry *lookup_one_qstr(const struct qstr *name,
 	if (unlikely(IS_DEADDIR(dir)))
 		old = ERR_PTR(-ENOENT);
 	else
-		old = dir->i_op->lookup(dir, dentry, flags | LOOKUP_SHARED);
+		old = dir->i_op->lookup(dir, dentry, flags);
 	inode_unlock_shared(dir);
 	if (unlikely(old)) {
 		d_lookup_done(dentry);
@@ -1951,7 +1951,7 @@ static struct dentry *__lookup_slow(const struct qstr *name,
 			old = ERR_PTR(-ENOENT);
 		else
 			old = inode->i_op->lookup(inode, dentry,
-						  flags | LOOKUP_SHARED);
+						  flags);
 		inode_unlock_shared(inode);
 		d_lookup_done(dentry);
 		if (unlikely(old)) {
@@ -1966,14 +1966,14 @@ static noinline struct dentry *lookup_slow(const struct qstr *name,
 				  struct dentry *dir,
 				  unsigned int flags)
 {
-	return __lookup_slow(name, dir, flags | LOOKUP_SHARED, TASK_NORMAL);
+	return __lookup_slow(name, dir, flags, TASK_NORMAL);
 }
 
 static struct dentry *lookup_slow_killable(const struct qstr *name,
 					   struct dentry *dir,
 					   unsigned int flags)
 {
-	return __lookup_slow(name, dir, flags | LOOKUP_SHARED, TASK_KILLABLE);
+	return __lookup_slow(name, dir, flags, TASK_KILLABLE);
 }
 
 static inline int may_lookup(struct mnt_idmap *idmap,
@@ -4513,7 +4513,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 			res = ERR_PTR(-ENOENT);
 		else
 			res = dir_inode->i_op->lookup(dir_inode, dentry,
-						      nd->flags | LOOKUP_SHARED);
+						      nd->flags);
 		inode_unlock_shared(dir_inode);
 		d_lookup_done(dentry);
 		if (unlikely(res)) {
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 2641061ba1db..cfd1cb42a29f 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -35,7 +35,6 @@
 #include <linux/security.h>
 #include <linux/iversion.h>
 #include <linux/fiemap.h>
-#include <linux/namei.h> // for LOOKUP_SHARED
 
 /*
  * Directories have different lock order w.r.t. mmap_lock compared to regular
@@ -370,7 +369,7 @@ xfs_vn_ci_lookup(
 	/* else case-insensitive match... */
 	dname.name = ci_name.name;
 	dname.len = ci_name.len;
-	dentry = d_add_ci(dentry, VFS_I(ip), &dname, !!(flags & LOOKUP_SHARED));
+	dentry = d_add_ci(dentry, VFS_I(ip), &dname);
 	kfree(ci_name.name);
 	return dentry;
 }
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index eb1a59b6fca7..74607dbcb7f0 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -250,8 +250,7 @@ struct dentry *d_duplicate(struct dentry *dentry);
 /* weird procfs mess; *NOT* exported */
 extern struct dentry * d_splice_alias_ops(struct inode *, struct dentry *,
 					  const struct dentry_operations *);
-extern struct dentry * d_add_ci(struct dentry *, struct inode *, struct qstr *,
-				bool);
+extern struct dentry * d_add_ci(struct dentry *, struct inode *, struct qstr *);
 extern bool d_same_name(const struct dentry *dentry, const struct dentry *parent,
 			const struct qstr *name);
 extern struct dentry *d_find_any_alias(struct inode *inode);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index cb79e84c718d..643d862a7fda 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -37,9 +37,8 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
 #define LOOKUP_CREATE		BIT(17)	/* ... in object creation */
 #define LOOKUP_EXCL		BIT(18)	/* ... in target must not exist */
 #define LOOKUP_RENAME_TARGET	BIT(19)	/* ... in destination of rename() */
-#define LOOKUP_SHARED		BIT(20) /* Parent lock is held shared */
 
-/* 3 spare bits for intent */
+/* 4 spare bits for intent */
 
 /* Scoping flags for lookup. */
 #define LOOKUP_NO_SYMLINKS	BIT(24) /* No symlink crossing. */
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 52/53] VFS: lift d_alloc_parallel above inode_lock
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

d_alloc_parallel() can block waiting on a d_in_lookup() dentry
so it is important to order it consistently with other blocking locks
such as inode_lock().

Currenty d_alloc_parallel() is ordered after inode_lock(): it can be
called while the inode is locked, and so the inode cannot be locked
while a d_in_lookup() dentry is held.

This patch reverses that order.  d_alloc_parallel() must now be called
*before* locking the directory, and must not be called afterwards.  This
allows directory locking to be moved closer to the filesystem
operations, and ultimately into those operations.

lookup_one_qstr_excl() is now called without an lock held, exclusive or
otherwise, so the "_excl" is dropped - it is now lookup_one_qstr().

As a lock is taken *after* lookup, start_dirop() and start_renaming()
must ensure that if the dentry isn't d_in_lookup() that after the lock
is taken the parent is still correct and the dentry is still hashed.

lookup_one_qstr() and lookup_slow() don't need to re-check the parent as
the dentry is always d_in_lookup() so parent cannot change.

The locking in lookup_slow() is moved into __lookup_slow() immediately
before/after ->lookup, and lookup_slow() just sets the task state for
waiting.

Parent locking is removed from open_last_lookups() and performed in
lookup_open().  A shared lock is taken if ->lookup() needs to be called.
An exclusive lock is taken separately if ->create() needs to be called -
with checks that the dentry hasn't become positive.

If ->atomic_open is needed we take exclusive or shared parent lock as
appropriate and check for a positive dentry or DEAD parent.

The fsnotify_create() call is kept inside the locked region in
lookup_open().  I don't know if this is important.

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/namei.c | 239 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 154 insertions(+), 85 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index bba419f2fc53..3d213070a515 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1773,8 +1773,19 @@ static struct dentry *lookup_dcache(const struct qstr *name,
 	return dentry;
 }
 
+static inline bool inode_lock_shared_state(struct inode *inode, unsigned int state)
+{
+	if (state == TASK_KILLABLE) {
+		if (down_read_killable(&inode->i_rwsem) != 0) {
+			return false;
+		}
+	} else {
+		inode_lock_shared(inode);
+	}
+	return true;
+}
+
 /*
- * Parent directory has inode locked.
  * If Lookup_EXCL or LOOKUP_RENAME_TARGET is set
  * d_lookup_done() must be called before the dentry is dput()
  * If the dentry is not d_in_lookup():
@@ -1783,8 +1794,9 @@ static struct dentry *lookup_dcache(const struct qstr *name,
  * If it is d_in_lookup() then these conditions can only be checked by the
  * file system when carrying out the intent (create or rename).
  */
-static struct dentry *lookup_one_qstr_excl(const struct qstr *name,
-					   struct dentry *base, unsigned int flags)
+static struct dentry *lookup_one_qstr(const struct qstr *name,
+				      struct dentry *base, unsigned int flags,
+				      unsigned int state)
 {
 	struct dentry *dentry;
 	struct dentry *old;
@@ -1806,7 +1818,16 @@ static struct dentry *lookup_one_qstr_excl(const struct qstr *name,
 		/* Raced with another thread which did the lookup */
 		goto found;
 
-	old = dir->i_op->lookup(dir, dentry, flags);
+	if (!inode_lock_shared_state(dir, state)) {
+		d_lookup_done(dentry);
+		dput(dentry);
+		return ERR_PTR(-EINTR);
+	}
+	if (unlikely(IS_DEADDIR(dir)))
+		old = ERR_PTR(-ENOENT);
+	else
+		old = dir->i_op->lookup(dir, dentry, flags | LOOKUP_SHARED);
+	inode_unlock_shared(dir);
 	if (unlikely(old)) {
 		d_lookup_done(dentry);
 		dput(dentry);
@@ -1897,7 +1918,8 @@ static struct dentry *lookup_fast(struct nameidata *nd)
 /* Fast lookup failed, do it the slow way */
 static struct dentry *__lookup_slow(const struct qstr *name,
 				    struct dentry *dir,
-				    unsigned int flags)
+				    unsigned int flags,
+				    unsigned int state)
 {
 	struct dentry *dentry, *old;
 	struct inode *inode = dir->d_inode;
@@ -1920,8 +1942,17 @@ static struct dentry *__lookup_slow(const struct qstr *name,
 			dput(dentry);
 			dentry = ERR_PTR(error);
 		}
+	} else if (!inode_lock_shared_state(inode, state)) {
+		d_lookup_done(dentry);
+		dput(dentry);
+		return ERR_PTR(-EINTR);
 	} else {
-		old = inode->i_op->lookup(inode, dentry, flags);
+		if (unlikely(IS_DEADDIR(inode)))
+			old = ERR_PTR(-ENOENT);
+		else
+			old = inode->i_op->lookup(inode, dentry,
+						  flags | LOOKUP_SHARED);
+		inode_unlock_shared(inode);
 		d_lookup_done(dentry);
 		if (unlikely(old)) {
 			dput(dentry);
@@ -1935,26 +1966,14 @@ static noinline struct dentry *lookup_slow(const struct qstr *name,
 				  struct dentry *dir,
 				  unsigned int flags)
 {
-	struct inode *inode = dir->d_inode;
-	struct dentry *res;
-	inode_lock_shared(inode);
-	res = __lookup_slow(name, dir, flags | LOOKUP_SHARED);
-	inode_unlock_shared(inode);
-	return res;
+	return __lookup_slow(name, dir, flags | LOOKUP_SHARED, TASK_NORMAL);
 }
 
 static struct dentry *lookup_slow_killable(const struct qstr *name,
 					   struct dentry *dir,
 					   unsigned int flags)
 {
-	struct inode *inode = dir->d_inode;
-	struct dentry *res;
-
-	if (inode_lock_shared_killable(inode))
-		return ERR_PTR(-EINTR);
-	res = __lookup_slow(name, dir, flags | LOOKUP_SHARED);
-	inode_unlock_shared(inode);
-	return res;
+	return __lookup_slow(name, dir, flags | LOOKUP_SHARED, TASK_KILLABLE);
 }
 
 static inline int may_lookup(struct mnt_idmap *idmap,
@@ -2908,18 +2927,26 @@ static struct dentry *__start_dirop(struct dentry *parent, struct qstr *name,
 	struct dentry *dentry;
 	struct inode *dir = d_inode(parent);
 
-	if (state == TASK_KILLABLE) {
-		int ret = down_write_killable_nested(&dir->i_rwsem,
-						     I_MUTEX_PARENT);
-		if (ret)
-			return ERR_PTR(ret);
-	} else {
-		inode_lock_nested(dir, I_MUTEX_PARENT);
-	}
-	dentry = lookup_one_qstr_excl(name, parent, lookup_flags);
-	if (IS_ERR(dentry))
+	while(1) {
+		dentry = lookup_one_qstr(name, parent, lookup_flags, state);
+		if (IS_ERR(dentry))
+			return dentry;
+		if (state == TASK_KILLABLE) {
+			if (down_write_killable_nested(&dir->i_rwsem, I_MUTEX_PARENT) != 0) {
+				d_lookup_done(dentry);
+				dput(dentry);
+				return ERR_PTR(-EINTR);
+			}
+		} else {
+			inode_lock_nested(dir, I_MUTEX_PARENT);
+		}
+		if (d_in_lookup(dentry) ||
+		    (!d_unhashed(dentry) && dentry->d_parent == parent))
+			return dentry;
 		inode_unlock(dir);
-	return dentry;
+		d_lookup_done(dentry);
+		dput(dentry);
+	}
 }
 
 /**
@@ -3830,26 +3857,37 @@ __start_renaming(struct renamedata *rd, int lookup_flags,
 	if (rd->flags & RENAME_NOREPLACE)
 		target_flags |= LOOKUP_EXCL;
 
-	trap = lock_rename(rd->old_parent, rd->new_parent);
-	if (IS_ERR(trap))
-		return PTR_ERR(trap);
-
-	d1 = lookup_one_qstr_excl(old_last, rd->old_parent,
-				  lookup_flags);
+retry:
+	d1 = lookup_one_qstr(old_last, rd->old_parent,
+			     lookup_flags, TASK_NORMAL);
 	err = PTR_ERR(d1);
 	if (IS_ERR(d1))
-		goto out_unlock;
+		goto out_err;
 
-	d2 = lookup_one_qstr_excl(new_last, rd->new_parent,
-				  lookup_flags | target_flags);
+	d2 = lookup_one_qstr(new_last, rd->new_parent,
+			     lookup_flags | target_flags, TASK_NORMAL);
 	err = PTR_ERR(d2);
 	if (IS_ERR(d2))
 		goto out_dput_d1;
 
+	trap = lock_rename(rd->old_parent, rd->new_parent);
+	err = PTR_ERR(trap);
+	if (IS_ERR(trap))
+		goto out_unlock;
+
+	if (unlikely((!d_in_lookup(d1) && d_unhashed(d1)) || d1->d_parent != rd->old_parent ||
+		     (!d_in_lookup(d2) && d_unhashed(d2)) || d2->d_parent != rd->new_parent)) {
+		unlock_rename(rd->old_parent, rd->new_parent);
+		d_lookup_done(d1); dput(d1);
+		d_lookup_done(d2); dput(d2);
+		dput(trap);
+		goto retry;
+	}
+
 	if (d1 == trap) {
 		/* source is an ancestor of target */
 		err = -EINVAL;
-		goto out_dput_d2;
+		goto out_unlock;
 	}
 
 	if (d2 == trap) {
@@ -3858,7 +3896,7 @@ __start_renaming(struct renamedata *rd, int lookup_flags,
 			err = -EINVAL;
 		else
 			err = -ENOTEMPTY;
-		goto out_dput_d2;
+		goto out_unlock;
 	}
 
 	rd->old_dentry = d1;
@@ -3866,14 +3904,14 @@ __start_renaming(struct renamedata *rd, int lookup_flags,
 	dget(rd->old_parent);
 	return 0;
 
-out_dput_d2:
+out_unlock:
+	unlock_rename(rd->old_parent, rd->new_parent);
 	d_lookup_done(d2);
 	dput(d2);
 out_dput_d1:
 	d_lookup_done(d1);
 	dput(d1);
-out_unlock:
-	unlock_rename(rd->old_parent, rd->new_parent);
+out_err:
 	return err;
 }
 
@@ -3927,10 +3965,22 @@ __start_renaming_dentry(struct renamedata *rd, int lookup_flags,
 	if (rd->flags & RENAME_NOREPLACE)
 		target_flags |= LOOKUP_EXCL;
 
-	/* Already have the dentry - need to be sure to lock the correct parent */
+retry:
+	d2 = lookup_one_qstr(new_last, rd->new_parent,
+			     lookup_flags | target_flags, TASK_NORMAL);
+	err = PTR_ERR(d2);
+	if (IS_ERR(d2))
+		goto out_unlock;
+
+	/*
+	 * Already have the old_dentry - need to be sure to lock
+	 * the correct parent
+	 */
 	trap = lock_rename_child(old_dentry, rd->new_parent);
+	err = PTR_ERR(trap);
 	if (IS_ERR(trap))
-		return PTR_ERR(trap);
+		goto out_dput_d2;
+
 	if (d_unhashed(old_dentry) ||
 	    (rd->old_parent && rd->old_parent != old_dentry->d_parent)) {
 		/* dentry was removed, or moved and explicit parent requested */
@@ -3938,16 +3988,19 @@ __start_renaming_dentry(struct renamedata *rd, int lookup_flags,
 		goto out_unlock;
 	}
 
-	d2 = lookup_one_qstr_excl(new_last, rd->new_parent,
-				  lookup_flags | target_flags);
-	err = PTR_ERR(d2);
-	if (IS_ERR(d2))
-		goto out_unlock;
+	if (unlikely((!d_in_lookup(d2) && d_unhashed(d2)) ||
+		     d2->d_parent != rd->new_parent)) {
+		/* d2 was moved/removed before lock - repeat lookup */
+		unlock_rename(old_dentry->d_parent, rd->new_parent);
+		d_lookup_done(d2); dput(d2);
+		dput(trap);
+		goto retry;
+	}
 
 	if (old_dentry == trap) {
 		/* source is an ancestor of target */
 		err = -EINVAL;
-		goto out_dput_d2;
+		goto out_unlock;
 	}
 
 	if (d2 == trap) {
@@ -3956,7 +4009,7 @@ __start_renaming_dentry(struct renamedata *rd, int lookup_flags,
 			err = -EINVAL;
 		else
 			err = -ENOTEMPTY;
-		goto out_dput_d2;
+		goto out_unlock;
 	}
 
 	rd->old_dentry = dget(old_dentry);
@@ -3964,11 +4017,11 @@ __start_renaming_dentry(struct renamedata *rd, int lookup_flags,
 	rd->old_parent = dget(old_dentry->d_parent);
 	return 0;
 
+out_unlock:
+	unlock_rename(old_dentry->d_parent, rd->new_parent);
 out_dput_d2:
 	d_lookup_done(d2);
 	dput(d2);
-out_unlock:
-	unlock_rename(old_dentry->d_parent, rd->new_parent);
 	return err;
 }
 
@@ -4319,8 +4372,19 @@ static struct dentry *atomic_open(const struct path *path, struct dentry *dentry
 
 	file->__f_path.dentry = DENTRY_NOT_SET;
 	file->__f_path.mnt = path->mnt;
-	error = dir->i_op->atomic_open(dir, dentry, file,
-				       open_to_namei_flags(open_flag), mode);
+
+	if (open_flag & O_CREAT)
+		inode_lock(dir);
+	else
+		inode_lock_shared(dir);
+	if (dentry->d_inode)
+		error = finish_no_open(file, NULL);
+	else if (unlikely(IS_DEADDIR(dir)))
+		error = -ENOENT;
+	else
+		error = dir->i_op->atomic_open(dir, dentry, file,
+					       open_to_namei_flags(open_flag),
+					       mode);
 	d_lookup_done(dentry);
 	if (!error) {
 		if (file->f_mode & FMODE_OPENED) {
@@ -4339,6 +4403,13 @@ static struct dentry *atomic_open(const struct path *path, struct dentry *dentry
 				error = -ENOENT;
 		}
 	}
+	if (!error && (file->f_mode & FMODE_CREATED))
+		fsnotify_create(dir, dentry);
+	if (open_flag & O_CREAT)
+		inode_unlock(dir);
+	else
+		inode_unlock_shared(dir);
+
 	if (error) {
 		dput(dentry);
 		dentry = ERR_PTR(error);
@@ -4372,10 +4443,6 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	struct dentry *dentry;
 	int error, create_error = 0;
 	umode_t mode = op->mode;
-	unsigned int shared_flag = (op->open_flag & O_CREAT) ? 0 : LOOKUP_SHARED;
-
-	if (unlikely(IS_DEADDIR(dir_inode)))
-		return ERR_PTR(-ENOENT);
 
 	file->f_mode &= ~FMODE_CREATED;
 	dentry = d_lookup(dir, &nd->last);
@@ -4420,7 +4487,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	if (open_flag & O_CREAT) {
 		if (open_flag & O_EXCL)
 			open_flag &= ~O_TRUNC;
-		mode = vfs_prepare_mode(idmap, dir->d_inode, mode, mode, mode);
+		mode = vfs_prepare_mode(idmap, dir_inode, mode, mode, mode);
 		if (likely(got_write))
 			create_error = may_o_create(idmap, &nd->path,
 						    dentry, mode);
@@ -4439,8 +4506,15 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	}
 
 	if (d_in_lookup(dentry)) {
-		struct dentry *res = dir_inode->i_op->lookup(dir_inode, dentry,
-							     nd->flags | shared_flag);
+		struct dentry *res;
+
+		inode_lock_shared(dir_inode);
+		if (IS_DEADDIR(dir_inode))
+			res = ERR_PTR(-ENOENT);
+		else
+			res = dir_inode->i_op->lookup(dir_inode, dentry,
+						      nd->flags | LOOKUP_SHARED);
+		inode_unlock_shared(dir_inode);
 		d_lookup_done(dentry);
 		if (unlikely(res)) {
 			if (IS_ERR(res)) {
@@ -4459,15 +4533,22 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 		if (error)
 			goto out_dput;
 
-		file->f_mode |= FMODE_CREATED;
-		audit_inode_child(dir_inode, dentry, AUDIT_TYPE_CHILD_CREATE);
-		if (!dir_inode->i_op->create) {
-			error = -EACCES;
-			goto out_dput;
-		}
+		inode_lock(dir_inode);
+		if (!dentry->d_inode && !unlikely(IS_DEADDIR(dir_inode))) {
+			file->f_mode |= FMODE_CREATED;
+			audit_inode_child(dir_inode, dentry, AUDIT_TYPE_CHILD_CREATE);
+			if (!dir_inode->i_op->create) {
+				error = -EACCES;
+				goto out_dput;
+			}
 
-		error = dir_inode->i_op->create(idmap, dir_inode, dentry,
-						mode, open_flag & O_EXCL);
+			error = dir_inode->i_op->create(idmap, dir_inode, dentry,
+							mode, open_flag & O_EXCL);
+			if (!error)
+				fsnotify_create(dir_inode, dentry);
+		} else if (!dentry->d_inode)
+			error = -ENOENT;
+		inode_unlock(dir_inode);
 		if (error)
 			goto out_dput;
 	}
@@ -4522,7 +4603,6 @@ static const char *open_last_lookups(struct nameidata *nd,
 		   struct file *file, const struct open_flags *op)
 {
 	struct delegated_inode delegated_inode = { };
-	struct dentry *dir = nd->path.dentry;
 	int open_flag = op->open_flag;
 	bool got_write = false;
 	struct dentry *dentry;
@@ -4562,22 +4642,11 @@ static const char *open_last_lookups(struct nameidata *nd,
 		 * dropping this one anyway.
 		 */
 	}
-	if (open_flag & O_CREAT)
-		inode_lock(dir->d_inode);
-	else
-		inode_lock_shared(dir->d_inode);
 	dentry = lookup_open(nd, file, op, got_write, &delegated_inode);
 	if (!IS_ERR(dentry)) {
-		if (file->f_mode & FMODE_CREATED)
-			fsnotify_create(dir->d_inode, dentry);
 		if (file->f_mode & FMODE_OPENED)
 			fsnotify_open(file);
 	}
-	if (open_flag & O_CREAT)
-		inode_unlock(dir->d_inode);
-	else
-		inode_unlock_shared(dir->d_inode);
-
 	if (got_write)
 		mnt_drop_write(nd->path.mnt);
 
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 51/53] VFS: use d_alloc_parallel() in lookup_one_qstr_excl().
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

lookup_one_qstr_excl() is used for lookups prior to directory
modifications (other than open(O_CREATE)), whether create, remove,
or rename.

A future patch will lift lookup out of the i_rwsem lock that protects
the directory during these operations (only taking a shared lock if the
target name is not yet in the dcache).

To prepare for this change and particularly to allow lookup to
always be done outside the parent i_rwsem, change lookup_one_qstr_excl()
to use d_alloc_parallel().

For the target of create and rename some filesystems skip the
preliminary lookup and combine it with the main operation.  This is only
safe if the operation has exclusive access to the dentry.  Currently
this is guaranteed by an exclusive lock on the directory.
d_alloc_parallel() provides alternate exclusive access (in the case
where the name isn't in the dcache and ->lookup will be called).

As a result of this change, ->lookup is now only ever called with a
d_in_lookup() dentry.  Consequently we can remove the d_in_lookup()
check from d_add_ci() which is only used in ->lookup.

If LOOKUP_EXCL or LOOKUP_RENAME_TARGET is passed, the caller must ensure
d_lookup_done() is called at an appropriate time, and must not assume
that it can test for positive or negative dentries without confirming
that the dentry is no longer d_in_lookup() - unless it is filesystem
code acting on itself and *knows* that ->lookup() always completes the
lookup (currently true for all filesystems other than NFS).
This is all handled in start_creating() and end_dirop() and friends.

Note that as lookup_one_qstr_excl() is called with an exclusive lock on
the directory, d_alloc_parallel() cannot race with another thread and
cannot return a non-in-lookup dentry.  However that is expected to
change so that case is handled with this patch.

Signed-off-by: NeilBrown <neil@brown.name>
---
 Documentation/filesystems/porting.rst | 14 ++++++++++
 fs/dcache.c                           | 16 +++--------
 fs/namei.c                            | 38 ++++++++++++++++++++-------
 3 files changed, 46 insertions(+), 22 deletions(-)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 7e83bd3c5a12..5ddc5ecfcc64 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1403,3 +1403,17 @@ you really don't want it.
 lookup_one() and lookup_noperm() are no longer available.  Use
 start_creating() or similar instead.
 
+
+---
+
+**mandatory**
+
+All start_creating and start_renaming functions may return a
+d_in_lookup() dentry if passed "O_CREATE|O_EXCL" or "O_RENAME_TARGET".
+end_dirop() calls the necessary d_lookup_done().  If the caller
+*knows* which filesystem is being used, it may know that this is not
+possible.  Otherwise it must be careful testing if the dentry is
+positive or negative as the lookup may not have been performed yet.
+
+inode_operations.lookup() is now only ever called with a d_in_lookup()
+dentry.
diff --git a/fs/dcache.c b/fs/dcache.c
index abb96ad8e015..f573716d1a04 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2261,18 +2261,10 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 		inode_unlock_shared(d_inode(dentry->d_parent));
 	else
 		inode_unlock(d_inode(dentry->d_parent));
-	if (d_in_lookup(dentry)) {
-		found = d_alloc_parallel(dentry->d_parent, name);
-		if (IS_ERR(found) || !d_in_lookup(found)) {
-			iput(inode);
-			return found;
-		}
-	} else {
-		found = d_alloc(dentry->d_parent, name);
-		if (!found) {
-			iput(inode);
-			return ERR_PTR(-ENOMEM);
-		}
+	found = d_alloc_parallel(dentry->d_parent, name);
+	if (IS_ERR(found) || !d_in_lookup(found)) {
+		iput(inode);
+		return found;
 	}
 	if (shared)
 		inode_lock_shared(d_inode(dentry->d_parent));
diff --git a/fs/namei.c b/fs/namei.c
index cb80490a869f..bba419f2fc53 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1774,13 +1774,14 @@ static struct dentry *lookup_dcache(const struct qstr *name,
 }
 
 /*
- * Parent directory has inode locked exclusive.  This is one
- * and only case when ->lookup() gets called on non in-lookup
- * dentries - as the matter of fact, this only gets called
- * when directory is guaranteed to have no in-lookup children
- * at all.
- * Will return -ENOENT if name isn't found and LOOKUP_CREATE wasn't passed.
- * Will return -EEXIST if name is found and LOOKUP_EXCL was passed.
+ * Parent directory has inode locked.
+ * If Lookup_EXCL or LOOKUP_RENAME_TARGET is set
+ * d_lookup_done() must be called before the dentry is dput()
+ * If the dentry is not d_in_lookup():
+ *   Will return -ENOENT if name isn't found and LOOKUP_CREATE wasn't passed.
+ *   Will return -EEXIST if name is found and LOOKUP_EXCL was passed.
+ * If it is d_in_lookup() then these conditions can only be checked by the
+ * file system when carrying out the intent (create or rename).
  */
 static struct dentry *lookup_one_qstr_excl(const struct qstr *name,
 					   struct dentry *base, unsigned int flags)
@@ -1798,18 +1799,27 @@ static struct dentry *lookup_one_qstr_excl(const struct qstr *name,
 	if (unlikely(IS_DEADDIR(dir)))
 		return ERR_PTR(-ENOENT);
 
-	dentry = d_alloc(base, name);
-	if (unlikely(!dentry))
-		return ERR_PTR(-ENOMEM);
+	dentry = d_alloc_parallel(base, name);
+	if (unlikely(IS_ERR(dentry)))
+		return dentry;
+	if (unlikely(!d_in_lookup(dentry)))
+		/* Raced with another thread which did the lookup */
+		goto found;
 
 	old = dir->i_op->lookup(dir, dentry, flags);
 	if (unlikely(old)) {
+		d_lookup_done(dentry);
 		dput(dentry);
 		dentry = old;
 	}
 found:
 	if (IS_ERR(dentry))
 		return dentry;
+	if (d_in_lookup(dentry))
+		/* We cannot check for errors - the caller will have to
+		 * wait for any create-etc attempt to get relevant errors.
+		 */
+		return dentry;
 	if (d_is_negative(dentry) && !(flags & LOOKUP_CREATE)) {
 		dput(dentry);
 		return ERR_PTR(-ENOENT);
@@ -2921,6 +2931,8 @@ static struct dentry *__start_dirop(struct dentry *parent, struct qstr *name,
  * The lookup is performed and necessary locks are taken so that, on success,
  * the returned dentry can be operated on safely.
  * The qstr must already have the hash value calculated.
+ * The dentry may be d_in_lookup() if %LOOKUP_EXCL or %LOOKUP_RENAME_TARGET
+ * is given, depending on the filesystem.
  *
  * Returns: a locked dentry, or an error.
  *
@@ -2942,6 +2954,7 @@ void end_dirop(struct dentry *de)
 {
 	if (!IS_ERR(de)) {
 		inode_unlock(de->d_parent->d_inode);
+		d_lookup_done(de);
 		dput(de);
 	}
 }
@@ -3854,8 +3867,10 @@ __start_renaming(struct renamedata *rd, int lookup_flags,
 	return 0;
 
 out_dput_d2:
+	d_lookup_done(d2);
 	dput(d2);
 out_dput_d1:
+	d_lookup_done(d1);
 	dput(d1);
 out_unlock:
 	unlock_rename(rd->old_parent, rd->new_parent);
@@ -3950,6 +3965,7 @@ __start_renaming_dentry(struct renamedata *rd, int lookup_flags,
 	return 0;
 
 out_dput_d2:
+	d_lookup_done(d2);
 	dput(d2);
 out_unlock:
 	unlock_rename(old_dentry->d_parent, rd->new_parent);
@@ -4059,6 +4075,8 @@ EXPORT_SYMBOL(start_renaming_two_dentries);
 
 void end_renaming(struct renamedata *rd)
 {
+	d_lookup_done(rd->old_dentry);
+	d_lookup_done(rd->new_dentry);
 	unlock_rename(rd->old_parent, rd->new_parent);
 	dput(rd->old_dentry);
 	dput(rd->new_dentry);
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 50/53] VFS: remove lookup_one() and lookup_noperm()
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

These are no longer used, so remove them.

Signed-off-by: NeilBrown <neil@brown.name>
---
 Documentation/filesystems/porting.rst |  7 +++
 fs/ecryptfs/inode.c                   |  2 +-
 fs/namei.c                            | 61 ++-------------------------
 include/linux/namei.h                 |  2 -
 4 files changed, 12 insertions(+), 60 deletions(-)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 154a38cd7801..7e83bd3c5a12 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1396,3 +1396,10 @@ and d_alloc_parallel() or d_alloc_noblock() when standard interfaces can be used
 d_rehash() is gone. It should never be needed.  Only unhash a dentry if
 you really don't want it.
 
+---
+
+** mandatory**
+
+lookup_one() and lookup_noperm() are no longer available.  Use
+start_creating() or similar instead.
+
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index beb9e2c8b8b3..a7a596d51d67 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -414,7 +414,7 @@ static struct dentry *ecryptfs_lookup(struct inode *ecryptfs_dir_inode,
 
 	lower_dentry = lookup_noperm_unlocked(&qname, lower_dir_dentry);
 	if (IS_ERR(lower_dentry)) {
-		ecryptfs_printk(KERN_DEBUG, "%s: lookup_noperm() returned "
+		ecryptfs_printk(KERN_DEBUG, "%s: lookup_noperm_unlocked() returned "
 				"[%ld] on lower_dentry = [%s]\n", __func__,
 				PTR_ERR(lower_dentry),
 				qname.name);
diff --git a/fs/namei.c b/fs/namei.c
index eed388ee8a30..cb80490a869f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3148,59 +3148,6 @@ struct dentry *try_lookup_noperm(struct qstr *name, struct dentry *base)
 }
 EXPORT_SYMBOL(try_lookup_noperm);
 
-/**
- * lookup_noperm - filesystem helper to lookup single pathname component
- * @name:	qstr storing pathname component to lookup
- * @base:	base directory to lookup from
- *
- * Note that this routine is purely a helper for filesystem usage and should
- * not be called by generic code.  It does no permission checking.
- *
- * The caller must hold base->i_rwsem.
- */
-struct dentry *lookup_noperm(struct qstr *name, struct dentry *base)
-{
-	struct dentry *dentry;
-	int err;
-
-	WARN_ON_ONCE(!inode_is_locked(base->d_inode));
-
-	err = lookup_noperm_common(name, base);
-	if (err)
-		return ERR_PTR(err);
-
-	dentry = lookup_dcache(name, base, 0);
-	return dentry ? dentry : __lookup_slow(name, base, 0);
-}
-EXPORT_SYMBOL(lookup_noperm);
-
-/**
- * lookup_one - lookup single pathname component
- * @idmap:	idmap of the mount the lookup is performed from
- * @name:	qstr holding pathname component to lookup
- * @base:	base directory to lookup from
- *
- * This can be used for in-kernel filesystem clients such as file servers.
- *
- * The caller must hold base->i_rwsem.
- */
-struct dentry *lookup_one(struct mnt_idmap *idmap, struct qstr *name,
-			  struct dentry *base)
-{
-	struct dentry *dentry;
-	int err;
-
-	WARN_ON_ONCE(!inode_is_locked(base->d_inode));
-
-	err = lookup_one_common(idmap, name, base);
-	if (err)
-		return ERR_PTR(err);
-
-	dentry = lookup_dcache(name, base, 0);
-	return dentry ? dentry : __lookup_slow(name, base, 0);
-}
-EXPORT_SYMBOL(lookup_one);
-
 /**
  * lookup_one_unlocked - lookup single pathname component
  * @idmap:	idmap of the mount the lookup is performed from
@@ -3209,8 +3156,8 @@ EXPORT_SYMBOL(lookup_one);
  *
  * This can be used for in-kernel filesystem clients such as file servers.
  *
- * Unlike lookup_one, it should be called without the parent
- * i_rwsem held, and will take the i_rwsem itself if necessary.
+ * It should be called without the parent i_rwsem held, and will take
+ * the i_rwsem itself if necessary.
  *
  * Returns: - A dentry, possibly negative, or
  *	    - same errors as try_lookup_noperm() or
@@ -3322,8 +3269,8 @@ EXPORT_SYMBOL(lookup_one_positive_unlocked);
  * Note that this routine is purely a helper for filesystem usage and should
  * not be called by generic code. It does no permission checking.
  *
- * Unlike lookup_noperm(), it should be called without the parent
- * i_rwsem held, and will take the i_rwsem itself if necessary.
+ * This should be called without the parent i_rwsem held, and will take
+ * the i_rwsem itself if necessary.
  *
  * Unlike try_lookup_noperm() it *does* revalidate the dentry if it already
  * existed.
diff --git a/include/linux/namei.h b/include/linux/namei.h
index b3346a513d8f..cb79e84c718d 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -74,10 +74,8 @@ int vfs_path_lookup(struct dentry *, struct vfsmount *, const char *,
 		    unsigned int, struct path *);
 
 extern struct dentry *try_lookup_noperm(struct qstr *, struct dentry *);
-extern struct dentry *lookup_noperm(struct qstr *, struct dentry *);
 extern struct dentry *lookup_noperm_unlocked(struct qstr *, struct dentry *);
 extern struct dentry *lookup_noperm_positive_unlocked(struct qstr *, struct dentry *);
-struct dentry *lookup_one(struct mnt_idmap *, struct qstr *, struct dentry *);
 struct dentry *lookup_one_unlocked(struct mnt_idmap *idmap,
 				   struct qstr *name, struct dentry *base);
 struct dentry *lookup_one_positive_unlocked(struct mnt_idmap *idmap,
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 49/53] VFS: remove d_rehash()
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

d_rehash() is no longer used.  Is existence implies that it might be
safe to unhash and rehash ad dentry, and with proposed locking changes
that will no longer be the case.
So remove it.

Signed-off-by: NeilBrown <neil@brown.name>
---
 Documentation/filesystems/porting.rst |  7 +++++++
 fs/dcache.c                           | 15 ---------------
 include/linux/dcache.h                |  5 -----
 3 files changed, 7 insertions(+), 20 deletions(-)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 4712403fd98e..154a38cd7801 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1389,3 +1389,10 @@ from an internal table when needed.
 d_alloc() is no longer exported as its use can be racy.  Use d_alloc_name()
 when object creation is controlled separately from standard filesystem interface,
 and d_alloc_parallel() or d_alloc_noblock() when standard interfaces can be used.
+
+---
+**mandatory**
+
+d_rehash() is gone. It should never be needed.  Only unhash a dentry if
+you really don't want it.
+
diff --git a/fs/dcache.c b/fs/dcache.c
index 4ebbbcc5aec4..abb96ad8e015 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2627,21 +2627,6 @@ static void __d_rehash(struct dentry *entry)
 	hlist_bl_unlock(b);
 }
 
-/**
- * d_rehash - add an entry back to the hash
- * @entry: dentry to add to the hash
- *
- * Adds a dentry to the hash according to its name.
- */
- 
-void d_rehash(struct dentry * entry)
-{
-	spin_lock(&entry->d_lock);
-	__d_rehash(entry);
-	spin_unlock(&entry->d_lock);
-}
-EXPORT_SYMBOL(d_rehash);
-
 #define PAR_LOOKUP_WQ_BITS	8
 #define PAR_LOOKUP_WQS (1 << PAR_LOOKUP_WQ_BITS)
 static wait_queue_head_t par_wait_table[PAR_LOOKUP_WQS] __cacheline_aligned;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 31b4a831ecdb..eb1a59b6fca7 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -277,11 +277,6 @@ extern struct dentry *d_find_alias_rcu(struct inode *);
 /* test whether we have any submounts in a subdir tree */
 extern int path_has_submounts(const struct path *);
 
-/*
- * This adds the entry to the hash queues.
- */
-extern void d_rehash(struct dentry *);
- 
 /* used for rename() and baskets */
 extern void d_move(struct dentry *, struct dentry *);
 extern void d_exchange(struct dentry *, struct dentry *);
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 48/53] VFS: remove d_add()
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

d_add() has been supplanted by d_splice_alias(), d_make_persistent() and
others.  It is no longer used and can be discarded

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/dcache.c            | 19 -------------------
 include/linux/dcache.h |  2 --
 2 files changed, 21 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 23f04fa05778..4ebbbcc5aec4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2979,25 +2979,6 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode,
 		spin_unlock(&inode->i_lock);
 }
 
-/**
- * d_add - add dentry to hash queues
- * @entry: dentry to add
- * @inode: The inode to attach to this dentry
- *
- * This adds the entry to the hash queues and initializes @inode.
- * The entry was actually filled in earlier during d_alloc().
- */
-
-void d_add(struct dentry *entry, struct inode *inode)
-{
-	if (inode) {
-		security_d_instantiate(entry, inode);
-		spin_lock(&inode->i_lock);
-	}
-	__d_add(entry, inode, NULL);
-}
-EXPORT_SYMBOL(d_add);
-
 struct dentry *d_make_persistent(struct dentry *dentry, struct inode *inode)
 {
 	WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 18242f9598dc..31b4a831ecdb 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -282,8 +282,6 @@ extern int path_has_submounts(const struct path *);
  */
 extern void d_rehash(struct dentry *);
  
-extern void d_add(struct dentry *, struct inode *);
-
 /* used for rename() and baskets */
 extern void d_move(struct dentry *, struct dentry *);
 extern void d_exchange(struct dentry *, struct dentry *);
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 47/53] VFS: make d_alloc() local to VFS.
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

d_alloc() is not useful.  d_alloc_name() is a better interface for
those cases where it is safe to allocate a dentry without
synchronisation with the VFS, and d_alloc_parallel() or
d_alloc_noblock() shoudl be used when synchronisation is needed.

Signed-off-by: NeilBrown <neil@brown.name>
---
 Documentation/filesystems/porting.rst | 8 ++++++++
 fs/dcache.c                           | 1 -
 fs/internal.h                         | 1 +
 include/linux/dcache.h                | 1 -
 4 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 6a507c508ccf..4712403fd98e 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1381,3 +1381,11 @@ longer available.  Use start_renaming() or similar.
 
 d_alloc_parallel() no longer requires a waitqueue_head.  It uses one
 from an internal table when needed.
+
+---
+
+**mandatory**
+
+d_alloc() is no longer exported as its use can be racy.  Use d_alloc_name()
+when object creation is controlled separately from standard filesystem interface,
+and d_alloc_parallel() or d_alloc_noblock() when standard interfaces can be used.
diff --git a/fs/dcache.c b/fs/dcache.c
index 9a6139013367..23f04fa05778 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1830,7 +1830,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 
 	return dentry;
 }
-EXPORT_SYMBOL(d_alloc);
 
 /**
  * d_duplicate: duplicate a dentry for combined atomic operation
diff --git a/fs/internal.h b/fs/internal.h
index cbc384a1aa09..9c637e2d18ef 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -223,6 +223,7 @@ bool sync_lazytime(struct inode *inode);
 /*
  * dcache.c
  */
+struct dentry *d_alloc(struct dentry * parent, const struct qstr *name);
 extern int d_set_mounted(struct dentry *dentry);
 extern long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc);
 extern struct dentry *d_alloc_cursor(struct dentry *);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b12577ddfbb..18242f9598dc 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -242,7 +242,6 @@ extern void d_drop(struct dentry *dentry);
 extern void d_delete(struct dentry *);
 
 /* allocate/de-allocate */
-extern struct dentry * d_alloc(struct dentry *, const struct qstr *);
 extern struct dentry * d_alloc_anon(struct super_block *);
 extern struct dentry * d_alloc_parallel(struct dentry *, const struct qstr *);
 extern struct dentry * d_alloc_noblock(struct dentry *, struct qstr *);
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 46/53] Remove references to d_add() in documentation and comments.
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

It is planned to remove d_add(), so remove all references in
documentation and comments.

Signed-off-by: NeilBrown <neil@brown.name>
---
 Documentation/filesystems/nfs/exporting.rst | 10 ++--------
 Documentation/filesystems/vfs.rst           |  4 ++--
 fs/afs/dir.c                                |  5 +++--
 fs/dcache.c                                 |  2 +-
 fs/ocfs2/namei.c                            |  2 +-
 fs/xfs/xfs_iops.c                           |  6 +++---
 6 files changed, 12 insertions(+), 17 deletions(-)

diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index a01d9b9b5bc3..ccaacdc72576 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -101,14 +101,8 @@ Filesystem Issues
 For a filesystem to be exportable it must:
 
    1. provide the filehandle fragment routines described below.
-   2. make sure that d_splice_alias is used rather than d_add
-      when ->lookup finds an inode for a given parent and name.
-
-      If inode is NULL, d_splice_alias(inode, dentry) is equivalent to::
-
-		d_add(dentry, inode), NULL
-
-      Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)
+   2. Use d_splice_alias() when ->lookup finds an inode for a given 
+      parent and name.
 
       Typically the ->lookup routine will simply end with a::
 
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index d8df0a84cdba..26dec777ca5c 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -460,7 +460,7 @@ otherwise noted.
 ``lookup``
 	called when the VFS needs to look up an inode in a parent
 	directory.  The name to look for is found in the dentry.  This
-	method must call d_add() to insert the found inode into the
+	method must call d_splice_alias() to insert the found inode into the
 	dentry.  The "i_count" field in the inode structure should be
 	incremented.  If the named inode does not exist a NULL inode
 	should be inserted into the dentry (this is called a negative
@@ -1433,7 +1433,7 @@ manipulate dentries:
 	d_iput() method is called).  If there are other references, then
 	d_drop() is called instead
 
-``d_add``
+``d_splice_alias``
 	add a dentry to its parents hash list and then calls
 	d_instantiate()
 
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index b5c593f50079..f259ca2da383 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -960,8 +960,9 @@ static struct dentry *afs_lookup_atsys(struct inode *dir, struct dentry *dentry,
 		dput(ret);
 	}
 
-	/* We don't want to d_add() the @sys dentry here as we don't want to
-	 * the cached dentry to hide changes to the sysnames list.
+	/* We don't want to d_splice_alias() the @sys dentry here as we
+	 * don't want to the cached dentry to hide changes to the
+	 * sysnames list.
 	 */
 	ret = NULL;
 out_s:
diff --git a/fs/dcache.c b/fs/dcache.c
index c48337d95f9a..9a6139013367 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3323,7 +3323,7 @@ struct dentry *d_splice_alias_ops(struct inode *inode, struct dentry *dentry,
  * @dentry must be negative and may be in-lookup or unhashed or hashed.
  *
  * If inode is a directory and has an IS_ROOT alias, then d_move that in
- * place of the given dentry and return it, else simply d_add the inode
+ * place of the given dentry and return it, else simply __d_add the inode
  * to the dentry and return NULL.
  *
  * If a non-IS_ROOT directory is found, the filesystem is corrupt, and
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 268b79339a51..0d3116142bd7 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -172,7 +172,7 @@ static struct dentry *ocfs2_lookup(struct inode *dir, struct dentry *dentry,
 		ocfs2_dentry_attach_gen(dentry);
 
 bail_unlock:
-	/* Don't drop the cluster lock until *after* the d_add --
+	/* Don't drop the cluster lock until *after* the d_splice_alias --
 	 * unlink on another node will message us to remove that
 	 * dentry under this lock so otherwise we can race this with
 	 * the downconvert thread and have a stale dentry. */
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ec19d3ec7cf0..2641061ba1db 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -356,9 +356,9 @@ xfs_vn_ci_lookup(
 		if (unlikely(error != -ENOENT))
 			return ERR_PTR(error);
 		/*
-		 * call d_add(dentry, NULL) here when d_drop_negative_children
-		 * is called in xfs_vn_mknod (ie. allow negative dentries
-		 * with CI filesystems).
+		 * call d_splice_alias(NULL, dentry) here when
+		 * d_drop_negative_children is called in xfs_vn_mknod
+		 * (ie.  allow negative dentries with CI filesystems).
 		 */
 		return NULL;
 	}
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 45/53] efivarfs: use d_alloc_name()
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

efivarfs() is one of the few remaining users of d_alloc().
Other similar filesystems use d_alloc_name() in the same circumstances.
Now that d_alloc_name() supports ->d_hash (providing that it never
fails), change efivarfs to use that.

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/efivarfs/super.c | 26 +++-----------------------
 1 file changed, 3 insertions(+), 23 deletions(-)

diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
index 1c5224cf183e..232d9757804c 100644
--- a/fs/efivarfs/super.c
+++ b/fs/efivarfs/super.c
@@ -189,26 +189,6 @@ static const struct dentry_operations efivarfs_d_ops = {
 	.d_hash = efivarfs_d_hash,
 };
 
-static struct dentry *efivarfs_alloc_dentry(struct dentry *parent, char *name)
-{
-	struct dentry *d;
-	struct qstr q;
-	int err;
-
-	q.name = name;
-	q.len = strlen(name);
-
-	err = efivarfs_d_hash(parent, &q);
-	if (err)
-		return ERR_PTR(err);
-
-	d = d_alloc(parent, &q);
-	if (d)
-		return d;
-
-	return ERR_PTR(-ENOMEM);
-}
-
 bool efivarfs_variable_is_present(efi_char16_t *variable_name,
 				  efi_guid_t *vendor, void *data)
 {
@@ -263,9 +243,9 @@ static int efivarfs_create_dentry(struct super_block *sb, efi_char16_t *name16,
 	memcpy(entry->var.VariableName, name16, name_size);
 	memcpy(&(entry->var.VendorGuid), &vendor, sizeof(efi_guid_t));
 
-	dentry = efivarfs_alloc_dentry(root, name);
-	if (IS_ERR(dentry)) {
-		err = PTR_ERR(dentry);
+	dentry = d_alloc_name(root, name);
+	if (!dentry) {
+		err = -ENOMEM;
 		goto fail_inode;
 	}
 
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 44/53] hostfs: don't d_drop() before d_splice_alias() in hostfs_mkdir()
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

hostfs_mkdir() uses d_drop() and d_splice_alias() to ensure it has the
right dentry after a mkdir.
d_drop() is no longer needed here and will cause problem for future
changes to directory locking.  So remove the d_drop().

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/hostfs/hostfs_kern.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index abe86d72d9ef..f737f99710d5 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -700,7 +700,6 @@ static struct dentry *hostfs_mkdir(struct mnt_idmap *idmap, struct inode *ino,
 		dentry = ERR_PTR(err);
 	} else {
 		inode = hostfs_iget(dentry->d_sb, file);
-		d_drop(dentry);
 		dentry = d_splice_alias(inode, dentry);
 	}
 	__putname(file);
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 43/53] fuse: Use d_alloc_noblock() in fuse_direntplus_link()
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

fuse uses the results of readdir to prime the dcache.  Using
d_alloc_parallel() can block if there is a concurrent lookup.  Blocking
in that case is pointless as the lookup will add info to the dcache and
there is no value in the readdir waiting to see if it should add the
info too.

Also this call to d_alloc_parallel() is made while the parent
directory is locked.  A proposed change to locking will lock the parent
later, after d_alloc_parallel().  This means it won't be safe to wait in
d_alloc_parallel() while holding the directory lock.

So change to use d_alloc_noblock(), and use try_lookup_noperm() rather
than full_name_hash and d_lookup.

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/fuse/readdir.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index f588252891af..400a1a24f659 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -12,6 +12,7 @@
 #include <linux/posix_acl.h>
 #include <linux/pagemap.h>
 #include <linux/highmem.h>
+#include <linux/namei.h>
 
 static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
 {
@@ -192,14 +193,18 @@ static int fuse_direntplus_link(struct file *file,
 	fc = get_fuse_conn(dir);
 	epoch = atomic_read(&fc->epoch);
 
-	name.hash = full_name_hash(parent, name.name, name.len);
-	dentry = d_lookup(parent, &name);
+	dentry = try_lookup_noperm(&name, parent);
 	if (!dentry) {
 retry:
-		dentry = d_alloc_parallel(parent, &name);
-		if (IS_ERR(dentry))
-			return PTR_ERR(dentry);
+		dentry = d_alloc_noblock(parent, &name);
+	}
+	if (IS_ERR(dentry)) {
+		if (PTR_ERR(dentry) == -EWOULDBLOCK)
+			/* harmless */
+			return 0;
+		return PTR_ERR(dentry);
 	}
+
 	if (!d_in_lookup(dentry)) {
 		struct fuse_inode *fi;
 		inode = d_inode(dentry);
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 42/53] fuse: don't d_drop() before d_splice_alias()
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

create_new_entry() is used to finalise the creation of various
objects (mknod, mkdir, symlink etc).

It currently uses d_drop() which will be a problem for a proposed
new locking scheme.

d_splice_alias() now works on hashed dentries so the d_drop() isn't
needed.  Drop it.

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/fuse/dir.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 7ac6b232ef12..a659877b520a 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1020,7 +1020,6 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
 	}
 	kfree(forget);
 
-	d_drop(entry);
 	d = d_splice_alias(inode, entry);
 	if (IS_ERR(d))
 		return d;
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 41/53] libfs: stop using d_add().
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

"Best practice" is to use d_splice_alias() at the end of a ->lookup
function.  d_add() often works and is not incorrect in libfs, as the
inode is always NULL, but as it is planned to remove d_add(), change to
use d_splice_alias().

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/libfs.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 63b4fb082435..75f44341f98b 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -79,8 +79,7 @@ struct dentry *simple_lookup(struct inode *dir, struct dentry *dentry, unsigned
 	if (IS_ENABLED(CONFIG_UNICODE) && IS_CASEFOLDED(dir))
 		return NULL;
 
-	d_add(dentry, NULL);
-	return NULL;
+	return d_splice_alias(NULL, dentry);
 }
 EXPORT_SYMBOL(simple_lookup);
 
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 40/53] gfs2: stop using d_add().
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

"Best practice" is to use d_splice_alias() at the end of a ->lookup
function.  d_add() often works and is not incorrect in gfs2, as the
inode is always NULL, but as it is planned to remove d_add(), change to
use d_splice_alias().

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/gfs2/inode.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 8344040ecaf7..9997fbc1084c 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -988,10 +988,9 @@ static struct dentry *__gfs2_lookup(struct inode *dir, struct dentry *dentry,
 	int error;
 
 	inode = gfs2_lookupi(dir, &dentry->d_name, 0);
-	if (inode == NULL) {
-		d_add(dentry, NULL);
-		return NULL;
-	}
+	if (inode == NULL)
+		return d_splice_alias(NULL, dentry);
+
 	if (IS_ERR(inode))
 		return ERR_CAST(inode);
 
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 39/53] ecryptfs: stop using d_add().
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

"Best practice" is to use d_splice_alias() at the end of a ->lookup
function.  d_add() often works and is not incorrect in ecryptfs, but as
it is planned to remove d_add(), change to use d_splice_alias().

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/ecryptfs/inode.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 8ab014db3e03..beb9e2c8b8b3 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -352,8 +352,7 @@ static struct dentry *ecryptfs_lookup_interpose(struct dentry *dentry,
 
 	if (!lower_inode) {
 		/* We want to add because we couldn't find in lower */
-		d_add(dentry, NULL);
-		return NULL;
+		return d_splice_alias(NULL, dentry);
 	}
 	inode = __ecryptfs_get_inode(lower_inode, dentry->d_sb);
 	if (IS_ERR(inode)) {
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

* [PATCH 38/53] cephfs: Don't d_drop() before d_splice_alias()
From: NeilBrown @ 2026-03-12 21:12 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christian Brauner, Jan Kara,
	Jeff Layton, Trond Myklebust, Anna Schumaker, Carlos Maiolino,
	Miklos Szeredi, Amir Goldstein, Jan Harkes, Hugh Dickins,
	Baolin Wang, David Howells, Marc Dionne, Steve French,
	Namjae Jeon, Sungjong Seo, Yuezhang Mo, Andreas Hindborg,
	Breno Leitao, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Tyler Hicks, Andreas Gruenbacher, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Jeremy Kerr, Ard Biesheuvel
  Cc: linux-fsdevel, linux-nfs, linux-xfs, linux-unionfs, coda,
	linux-mm, linux-afs, linux-cifs, linux-ext4, linux-kernel,
	linux-trace-kernel, ceph-devel, ecryptfs, gfs2, linux-um,
	linux-efi
In-Reply-To: <20260312214330.3885211-1-neilb@ownmail.net>

From: NeilBrown <neil@brown.name>

In two places ceph drops a dentry and then calls d_splice_alias().
The d_drop() is no longer needed before d_splice_alias() and will
cause problems for proposed changes to locking.

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/ceph/file.c  | 2 --
 fs/ceph/inode.c | 3 ---
 2 files changed, 5 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 66bbf6d517a9..c40d129bbd03 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -751,8 +751,6 @@ static int ceph_finish_async_create(struct inode *dir, struct inode *inode,
 			unlock_new_inode(inode);
 		}
 		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
-			if (!d_unhashed(dentry))
-				d_drop(dentry);
 			dn = d_splice_alias(inode, dentry);
 			WARN_ON_ONCE(dn && dn != dentry);
 		}
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 8557b207d337..32bac5cac8c4 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1517,9 +1517,6 @@ static int splice_dentry(struct dentry **pdn, struct inode *in)
 		}
 	}
 
-	/* dn must be unhashed */
-	if (!d_unhashed(dn))
-		d_drop(dn);
 	realdn = d_splice_alias(in, dn);
 	if (IS_ERR(realdn)) {
 		pr_err_client(cl, "error %ld %p inode %p ino %llx.%llx\n",
-- 
2.50.0.107.gf914562f5916.dirty


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox