linux-fsdevel.vger.kernel.org archive mirror
* [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file
@ 2024-05-11  2:27 Yafang Shao
  2024-05-11  2:53 ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Yafang Shao @ 2024-05-11  2:27 UTC (permalink / raw)
  To: viro, brauner, jack
  Cc: linux-fsdevel, Yafang Shao, Wangkai, Colin Walters, Waiman Long,
	Linus Torvalds

Our applications, built on Elasticsearch[0], frequently create and delete
files. These applications operate within containers, some with a memory
limit exceeding 100GB. Over prolonged periods, the accumulation of negative
dentries within these containers can amount to tens of gigabytes.

Upon container exit, directories are deleted. However, due to the numerous
associated dentries, this process can be time-consuming. Our users have
expressed frustration with this prolonged exit duration, which constitutes
our first issue.

Simultaneously, other processes may attempt to access the parent directory
of the Elasticsearch directories. Since the task responsible for deleting
the dentries holds the inode lock, processes attempting directory lookup
experience significant delays. This issue, our second problem, is easily
demonstrated:

  - Task 1 generates negative dentries: 
  $ pwd
  ~/test
  $ mkdir es && cd es/ && ./create_and_delete_files.sh

  [ After generating tens of GB dentries ]  

  $ cd ~/test && rm -rf es

  [ It will take a long duration to finish ]

  - Task 2 attempts to lookup the 'test/' directory
  $ pwd
  ~/test
  $ ls

  The 'ls' command in Task 2 experiences prolonged execution as Task 1
  is deleting the dentries.
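
For reference, the kind of churn that create_and_delete_files.sh drives
could look roughly like the loop below (a sketch only - the real script is
not part of this report, and the file names are made up). Under the current
behaviour, every unlink() leaves a negative dentry behind in the directory:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* assumes the "es" directory from the steps above already exists */
        for (long i = 0; i < 10000000; i++) {
            char name[64];

            snprintf(name, sizeof(name), "es/file-%09ld", i);
            int fd = open(name, O_CREAT | O_WRONLY, 0600);

            if (fd < 0)
                return 1;
            close(fd);
            /* delete it right away; the name stays cached as a negative dentry */
            unlink(name);
        }
        return 0;
    }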

We've devised a solution to address both issues by deleting the associated
dentry when removing a file. Interestingly, we've noted that a similar
patch was proposed years ago[1], although it was rejected, citing the
absence of tangible issues caused by negative dentries. Given our current
challenges, we're resubmitting the proposal. All relevant stakeholders from
the previous discussions have been included for reference.

[0]. https://github.com/elastic/elasticsearch
[1]. https://patchwork.kernel.org/project/linux-fsdevel/patch/1502099673-31620-1-git-send-email-wangkai86@huawei.com

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Wangkai <wangkai86@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Colin Walters <walters@verbum.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/dcache.c            | 4 ++++
 include/linux/dcache.h | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 71a8e943a0fa..4b97f60f0e64 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -701,6 +701,9 @@ static inline bool retain_dentry(struct dentry *dentry, bool locked)
 	if (unlikely(d_flags & DCACHE_DONTCACHE))
 		return false;
 
+	if (unlikely(dentry->d_flags & DCACHE_FILE_DELETED))
+		return false;
+
 	// At this point it looks like we ought to keep it.  We also might
 	// need to do something - put it on LRU if it wasn't there already
 	// and mark it referenced if it was on LRU, but not marked yet.
@@ -2392,6 +2395,7 @@ void d_delete(struct dentry * dentry)
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&inode->i_lock);
 	}
+	dentry->d_flags |= DCACHE_FILE_DELETED;
 }
 EXPORT_SYMBOL(d_delete);
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index bf53e3894aae..55a69682918c 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -210,7 +210,7 @@ struct dentry_operations {
 
 #define DCACHE_NOKEY_NAME		BIT(25) /* Encrypted name encoded without key */
 #define DCACHE_OP_REAL			BIT(26)
-
+#define DCACHE_FILE_DELETED		BIT(27) /* File is deleted */
 #define DCACHE_PAR_LOOKUP		BIT(28) /* being looked up (with parent locked shared) */
 #define DCACHE_DENTRY_CURSOR		BIT(29)
 #define DCACHE_NORCU			BIT(30) /* No RCU delay for freeing */
-- 
2.39.1



* Re: [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file
  2024-05-11  2:27 [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file Yafang Shao
@ 2024-05-11  2:53 ` Linus Torvalds
  2024-05-11  3:35   ` Yafang Shao
  2024-05-11  3:36   ` [RFC PATCH] fs: dcache: " Al Viro
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-05-11  2:53 UTC (permalink / raw)
  To: Yafang Shao
  Cc: viro, brauner, jack, linux-fsdevel, Wangkai, Colin Walters,
	Waiman Long

On Fri, 10 May 2024 at 19:28, Yafang Shao <laoar.shao@gmail.com> wrote:
>
> We've devised a solution to address both issues by deleting associated
> dentry when removing a file.

This patch is buggy. You are modifying d_flags outside the locked region.

So at a minimum, the DCACHE_FILE_DELETED bit setting would need to
just go into the

        if (dentry->d_lockref.count == 1) {

side of the conditional, since the other side of that conditional
already unhashes the dentry which makes this all moot anyway.
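
(For concreteness - and going only by the d_delete() snippets visible in
this thread - the placement being talked about would look roughly like the
following. This is an untested sketch, not the actual fix.)

        spin_lock(&inode->i_lock);
        spin_lock(&dentry->d_lock);
        /*
         * Are we the only user?
         */
        if (dentry->d_lockref.count == 1) {
                /* still under d_lock, so the flag update would be safe here */
                dentry->d_flags |= DCACHE_FILE_DELETED;
                dentry->d_flags &= ~DCACHE_CANT_MOUNT;
                dentry_unlink_inode(dentry);
        } else {
                __d_drop(dentry);
                spin_unlock(&dentry->d_lock);
                spin_unlock(&inode->i_lock);
        }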

That said, I think it's buggy in another way too: what if somebody
else looks up the dentry before it actually gets unhashed? Then you
have another ref to it, and the dentry might live long enough that it
then gets re-used for a newly created file (which is why we have those
negative dentries in the first place).

So you'd have to clear the DCACHE_FILE_DELETED if the dentry is then
made live by a file creation or rename or whatever.

So that d_flags thing is actually pretty complicated.

But since you made all this unconditional anyway, I think having a new
dentry flag is unnecessary in the first place, and I suspect you are
better off just unhashing the dentry unconditionally instead.

IOW, I think the simpler patch is likely just something like this:

  --- a/fs/dcache.c
  +++ b/fs/dcache.c
  @@ -2381,6 +2381,7 @@ void d_delete(struct dentry * dentry)

        spin_lock(&inode->i_lock);
        spin_lock(&dentry->d_lock);
  +     __d_drop(dentry);
        /*
         * Are we the only user?
         */
  @@ -2388,7 +2389,6 @@ void d_delete(struct dentry * dentry)
                dentry->d_flags &= ~DCACHE_CANT_MOUNT;
                dentry_unlink_inode(dentry);
        } else {
  -             __d_drop(dentry);
                spin_unlock(&dentry->d_lock);
                spin_unlock(&inode->i_lock);
        }

although I think Al needs to ACK this, and I suspect that unhashing
the dentry also makes that

                dentry->d_flags &= ~DCACHE_CANT_MOUNT;

pointless (because the dentry won't be reused, so DCACHE_CANT_MOUNT
just won't matter).

I do worry that there are loads that actually love our current
behavior, but maybe it's worth doing the simple unconditional "make
d_delete() always unhash" and only worry about whether that causes
performance problems for people who commonly create a new file in its
place when we get such a report.

IOW, the more complex thing might be to actually take other behavior
into account (eg "do we have so many negative dentries that we really
don't want to create new ones").

Al - can you please step in and tell us what else I've missed, and why
my suggested version of the patch is also broken garbage?

             Linus


* Re: [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file
  2024-05-11  2:53 ` Linus Torvalds
@ 2024-05-11  3:35   ` Yafang Shao
  2024-05-11  4:54     ` Waiman Long
  2024-05-15  2:18     ` [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file Yafang Shao
  2024-05-11  3:36   ` [RFC PATCH] fs: dcache: " Al Viro
  1 sibling, 2 replies; 43+ messages in thread
From: Yafang Shao @ 2024-05-11  3:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: viro, brauner, jack, linux-fsdevel, Wangkai, Colin Walters,
	Waiman Long

On Sat, May 11, 2024 at 10:54 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 10 May 2024 at 19:28, Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > We've devised a solution to address both issues by deleting associated
> > dentry when removing a file.
>
> This patch is buggy. You are modifying d_flags outside the locked region.
>
> So at a minimum, the DCACHE_FILE_DELETED bit setting would need to
> just go into the
>
>         if (dentry->d_lockref.count == 1) {
>
> side of the conditional, since the other side of that conditional
> already unhashes the dentry which makes this all moot anyway.
>
> That said, I think it's buggy in another way too: what if somebody
> else looks up the dentry before it actually gets unhashed? Then you
> have another ref to it, and the dentry might live long enough that it
> then gets re-used for a newly created file (which is why we have those
> negative dentries in the first place).
>
> So you'd have to clear the DCACHE_FILE_DELETED if the dentry is then
> made live by a file creation or rename or whatever.
>
> So that d_flags thing is actually pretty complicated.
>
> But since you made all this unconditional anyway, I think having a new
> dentry flag is unnecessary in the first place, and I suspect you are
> better off just unhashing the dentry unconditionally instead.
>
> IOW, I think the simpler patch is likely just something like this:

It's simpler. I had contemplated handling it that way, but lacked the
knowledge and courage to proceed, hence I opted for the d_flags
solution.
I'll conduct tests on the revised change. Appreciate your suggestion.

>
>   --- a/fs/dcache.c
>   +++ b/fs/dcache.c
>   @@ -2381,6 +2381,7 @@ void d_delete(struct dentry * dentry)
>
>         spin_lock(&inode->i_lock);
>         spin_lock(&dentry->d_lock);
>   +     __d_drop(dentry);
>         /*
>          * Are we the only user?
>          */
>   @@ -2388,7 +2389,6 @@ void d_delete(struct dentry * dentry)
>                 dentry->d_flags &= ~DCACHE_CANT_MOUNT;
>                 dentry_unlink_inode(dentry);
>         } else {
>   -             __d_drop(dentry);
>                 spin_unlock(&dentry->d_lock);
>                 spin_unlock(&inode->i_lock);
>         }
>
> although I think Al needs to ACK this, and I suspect that unhashing
> the dentry also makes that
>
>                 dentry->d_flags &= ~DCACHE_CANT_MOUNT;
>
> pointless (because the dentry won't be reused, so DCACHE_CANT_MOUNT
> just won't matter).
>
> I do worry that there are loads that actually love our current
> behavior, but maybe it's worth doing the simple unconditional "make
> d_delete() always unhash" and only worry about whether that causes
> performance problems for people who commonly create a new file in its
> place when we get such a report.
>
> IOW, the more complex thing might be to actually take other behavior
> into account (eg "do we have so many negative dentries that we really
> don't want to create new ones").

This poses a substantial challenge. Despite recurrent discussions
within the community about improving negative dentry handling, there
hasn't been a consensus on how to address it.

>
> Al - can you please step in and tell us what else I've missed, and why
> my suggested version of the patch is also broken garbage?
>
>              Linus


-- 
Regards
Yafang


* Re: [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file
  2024-05-11  2:53 ` Linus Torvalds
  2024-05-11  3:35   ` Yafang Shao
@ 2024-05-11  3:36   ` Al Viro
  2024-05-11  3:51     ` Yafang Shao
                       ` (2 more replies)
  1 sibling, 3 replies; 43+ messages in thread
From: Al Viro @ 2024-05-11  3:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yafang Shao, brauner, jack, linux-fsdevel, Wangkai, Colin Walters,
	Waiman Long

On Fri, May 10, 2024 at 07:53:49PM -0700, Linus Torvalds wrote:

> although I think Al needs to ACK this, and I suspect that unhashing
> the dentry also makes that
> 
>                 dentry->d_flags &= ~DCACHE_CANT_MOUNT;
> 
> pointless (because the dentry won't be reused, so DCACHE_CANT_MOUNT
> just won't matter).
> 
> I do worry that there are loads that actually love our current
> behavior, but maybe it's worth doing the simple unconditional "make
> d_delete() always unhash" and only worry about whether that causes
> performance problems for people who commonly create a new file in its
> place when we get such a report.
> 
> IOW, the more complex thing might be to actually take other behavior
> into account (eg "do we have so many negative dentries that we really
> don't want to create new ones").
> 
> Al - can you please step in and tell us what else I've missed, and why
> my suggested version of the patch is also broken garbage?

Need to RTFS and think for a while; I think it should be OK, but I'll need
to dig through the tree to tell if there's anything nasty for e.g. network
filesystems.

Said that, I seriously suspect that there are loads where it would become
painful.  unlink() + creat() is _not_ a rare sequence, and this would
shove an extra negative lookup into each of those.

I would like to see the details on the original poster's setup.  Note that
successful rmdir() evicts all children, so that it would seem that their
arseloads of negative dentries come from a bunch of unlinks in surviving
directories.

It would be interesting to see if using something like
	mkdir graveyard
	rename victim over there, then unlink in new place
	rename next victim over there, then unlink in new place
	....
	rmdir graveyard
would change the situation with memory pressure - it would trigger
eviction of all those negatives at controlled point(s) (rmdir).
I'm not saying that it's a good mitigation, but it would get more
details on that memory pressure.
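
A user-space sketch of that graveyard sequence, with made-up paths and
helper names, just to make it concrete (assuming the victims all live in
one surviving directory):

    #include <limits.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* move each victim into a temporary graveyard, unlink it there, then
       rmdir the graveyard so all the accumulated negatives get evicted
       at one controlled point */
    int unlink_via_graveyard(const char *dir, const char **victims, int n)
    {
        char grave[PATH_MAX], moved[PATH_MAX];

        snprintf(grave, sizeof(grave), "%s/graveyard", dir);
        if (mkdir(grave, 0700))
            return -1;
        for (int i = 0; i < n; i++) {
            snprintf(moved, sizeof(moved), "%s/victim-%d", grave, i);
            if (rename(victims[i], moved) || unlink(moved))
                return -1;
        }
        return rmdir(grave);    /* evicts the children, negatives included */
    }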


* Re: [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file
  2024-05-11  3:36   ` [RFC PATCH] fs: dcache: " Al Viro
@ 2024-05-11  3:51     ` Yafang Shao
  2024-05-11  5:18     ` Al Viro
  2024-05-11  5:32     ` Linus Torvalds
  2 siblings, 0 replies; 43+ messages in thread
From: Yafang Shao @ 2024-05-11  3:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, brauner, jack, linux-fsdevel, Wangkai,
	Colin Walters, Waiman Long

On Sat, May 11, 2024 at 11:36 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, May 10, 2024 at 07:53:49PM -0700, Linus Torvalds wrote:
>
> > although I think Al needs to ACK this, and I suspect that unhashing
> > the dentry also makes that
> >
> >                 dentry->d_flags &= ~DCACHE_CANT_MOUNT;
> >
> > pointless (because the dentry won't be reused, so DCACHE_CANT_MOUNT
> > just won't matter).
> >
> > I do worry that there are loads that actually love our current
> > behavior, but maybe it's worth doing the simple unconditional "make
> > d_delete() always unhash" and only worry about whether that causes
> > performance problems for people who commonly create a new file in its
> > place when we get such a report.
> >
> > IOW, the more complex thing might be to actually take other behavior
> > into account (eg "do we have so many negative dentries that we really
> > don't want to create new ones").
> >
> > Al - can you please step in and tell us what else I've missed, and why
> > my suggested version of the patch is also broken garbage?
>
> Need to RTFS and think for a while; I think it should be OK, but I'll need
> to dig through the tree to tell if there's anything nasty for e.g. network
> filesystems.
>
> Said that, I seriously suspect that there are loads where it would become
> painful.  unlink() + creat() is _not_ a rare sequence, and this would
> shove an extra negative lookup into each of those.
>
> I would like to see the details on original posters' setup.  Note that
> successful rmdir() evicts all children, so that it would seem that their
> arseloads of negative dentries come from a bunch of unlinks in surviving
> directories.

Right, the directories are fixed (they survive and are never removed). We've
engaged in discussions with ES users about changing the directory layout, but
it would entail a significant adjustment for them.

>
> It would be interesting to see if using something like
>         mkdir graveyard
>         rename victim over there, then unlink in new place
>         rename next victim over there, then unlink in new place
>         ....
>         rmdir graveyard
> would change the situation with memory pressure - it would trigger
> eviction of all those negatives at controlled point(s) (rmdir).
> I'm not saying that it's a good mitigation, but it would get more
> details on that memory pressure.


-- 
Regards
Yafang


* Re: [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file
  2024-05-11  3:35   ` Yafang Shao
@ 2024-05-11  4:54     ` Waiman Long
  2024-05-11 15:58       ` Matthew Wilcox
  2024-05-15  2:18     ` [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file Yafang Shao
  1 sibling, 1 reply; 43+ messages in thread
From: Waiman Long @ 2024-05-11  4:54 UTC (permalink / raw)
  To: Yafang Shao, Linus Torvalds
  Cc: viro, brauner, jack, linux-fsdevel, Wangkai, Colin Walters

On 5/10/24 23:35, Yafang Shao wrote:
>> pointless (because the dentry won't be reused, so DCACHE_CANT_MOUNT
>> just won't matter).
>>
>> I do worry that there are loads that actually love our current
>> behavior, but maybe it's worth doing the simple unconditional "make
>> d_delete() always unhash" and only worry about whether that causes
>> performance problems for people who commonly create a new file in its
>> place when we get such a report.
>>
>> IOW, the more complex thing might be to actually take other behavior
>> into account (eg "do we have so many negative dentries that we really
>> don't want to create new ones").
> This poses a substantial challenge. Despite recurrent discussions
> within the community about improving negative dentry over and over,
> there hasn't been a consensus on how to address it.

I had suggested in the past to have a sysctl parameter to set a
threshold on how many negative dentries (as a percentage of total system
memory) are regarded as too many, and then activate some mechanism to
control dentry creation. The hard part is discarding the oldest negative
dentries, as we can have many memcg LRU lists to look at and it can be a
time-consuming process. We could theoretically start removing dentries
when removing files once 50% of the threshold is reached. At 100%, we
start discarding old negative dentries. At 150%, we stop negative dentry
creation altogether.
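
A rough sketch of that staging, in plain C with made-up names, purely to
illustrate the 50%/100%/150% levels (not an actual proposal):

    enum neg_dentry_action {
        NEG_DENTRY_OK,          /* below 50% of threshold: do nothing      */
        NEG_DENTRY_PRUNE,       /* >= 50%: delete dentries on file removal */
        NEG_DENTRY_RECLAIM,     /* >= 100%: also discard old negatives     */
        NEG_DENTRY_STOP,        /* >= 150%: stop creating new negatives    */
    };

    enum neg_dentry_action neg_dentry_policy(unsigned long nr_neg,
                                             unsigned long threshold)
    {
        if (nr_neg >= threshold + threshold / 2)
            return NEG_DENTRY_STOP;
        if (nr_neg >= threshold)
            return NEG_DENTRY_RECLAIM;
        if (nr_neg >= threshold / 2)
            return NEG_DENTRY_PRUNE;
        return NEG_DENTRY_OK;
    }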

My 2 cents.

Cheers,
Longman



* Re: [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file
  2024-05-11  3:36   ` [RFC PATCH] fs: dcache: " Al Viro
  2024-05-11  3:51     ` Yafang Shao
@ 2024-05-11  5:18     ` Al Viro
  2024-05-11  5:32     ` Linus Torvalds
  2 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2024-05-11  5:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yafang Shao, brauner, jack, linux-fsdevel, Wangkai, Colin Walters,
	Waiman Long

On Sat, May 11, 2024 at 04:36:19AM +0100, Al Viro wrote:

> Said that, I seriously suspect that there are loads where it would become
> painful.  unlink() + creat() is _not_ a rare sequence, and this would
> shove an extra negative lookup into each of those.
> 
> I would like to see the details on original posters' setup.  Note that
> successful rmdir() evicts all children, so that it would seem that their
> arseloads of negative dentries come from a bunch of unlinks in surviving
> directories.
> 
> It would be interesting to see if using something like
> 	mkdir graveyard
> 	rename victim over there, then unlink in new place
> 	rename next victim over there, then unlink in new place
> 	....
> 	rmdir graveyard
> would change the situation with memory pressure - it would trigger
> eviction of all those negatives at controlled point(s) (rmdir).
> I'm not saying that it's a good mitigation, but it would get more
> details on that memory pressure.

BTW, how about adding to do_vfs_ioctl() something like
	case FS_IOC_FORGET:
		shrink_dcache_parent(file->f_path.dentry);
		return 0;
possibly with an option for dropping only negatives?

Even in the minimal form it would allow userland to force eviction
in a given directory tree, without disrupting things anywhere else.
That's a trivial (completely untested) patch, really -

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 1d5abfdf0f22..342bb71cf76c 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -878,6 +878,12 @@ static int do_vfs_ioctl(struct file *filp, unsigned int fd,
 	case FS_IOC_GETFSSYSFSPATH:
 		return ioctl_get_fs_sysfs_path(filp, argp);
 
+	case FS_IOC_FORGET:
+		if (arg != 0)	// only 0 for now
+			break;
+		shrink_dcache_parent(filp->f_path.dentry);
+		return 0;
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			return file_ioctl(filp, cmd, argp);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 45e4e64fd664..143129510289 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -222,6 +222,7 @@ struct fsxattr {
 
 #define	FS_IOC_GETFLAGS			_IOR('f', 1, long)
 #define	FS_IOC_SETFLAGS			_IOW('f', 2, long)
+#define	FS_IOC_FORGET			_IOW('f', 255, int)
 #define	FS_IOC_GETVERSION		_IOR('v', 1, long)
 #define	FS_IOC_SETVERSION		_IOW('v', 2, long)
 #define FS_IOC_FIEMAP			_IOWR('f', 11, struct fiemap)
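
The user-space side would then be nothing more than something like the
following (hypothetical, of course - FS_IOC_FORGET above is only a sketch
and the definition is not in any uapi header):

    #include <fcntl.h>
    #include <linux/ioctl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* mirrors the proposed definition above */
    #define FS_IOC_FORGET   _IOW('f', 255, int)

    /* ask the kernel to drop cached dentries under the given directory */
    int forget_dentries(const char *dir)
    {
        int fd = open(dir, O_RDONLY | O_DIRECTORY);
        int ret;

        if (fd < 0)
            return -1;
        ret = ioctl(fd, FS_IOC_FORGET, 0UL);    /* only arg == 0 for now */
        close(fd);
        return ret;
    }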


* Re: [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file
  2024-05-11  3:36   ` [RFC PATCH] fs: dcache: " Al Viro
  2024-05-11  3:51     ` Yafang Shao
  2024-05-11  5:18     ` Al Viro
@ 2024-05-11  5:32     ` Linus Torvalds
  2 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-05-11  5:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Yafang Shao, brauner, jack, linux-fsdevel, Wangkai, Colin Walters,
	Waiman Long

On Fri, 10 May 2024 at 20:36, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Said that, I seriously suspect that there are loads where it would become
> painful.  unlink() + creat() is _not_ a rare sequence, and this would
> shove an extra negative lookup into each of those.

The other side of this is that the "lots of negative dentries after
people have removed files" is definitely something that has come up
before as a latency problem.

So I do wonder if we've just been too worried about not changing the
status quo, and maybe the whole "make unlink() turn a positive dentry
into a negative one" was always a mis-optimization.

I do agree that we might have to do something more complicated, but I
think that before we just _assume_ we have to do that, maybe we should
just do the simple and stupid thing.

Because while "unlink and create" is most definitely a very real
pattern, maybe it's not really _so_ common that we should care about
it as a primary issue?

The reason to keep the negative dentry around is to avoid the
unnecessary "->lookup()" followed by "->create()" directory
operations.

But network filesystems - where that is _particularly_ expensive - end
up having done the whole ->atomic_open thing anyway, so maybe that
reason isn't as big of a deal as it used to be?

And I don't particularly like your "give people a flush ioctl" either.
There are security concerns with that one - both for timing and for
just basically messing with performance.

At the very least, I think it should require write permissions on the
directory you are flushing, but I really think we should instead aim
to be better about this in the first place.

Anyway, in how many loads is that "unlink -> create" pattern more than
a very occasional thing? Which is why I think maybe we're just
overthinking this and being too timid when saying "we should count
negative dentries or something".

                 Linus


* Re: [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file
  2024-05-11  4:54     ` Waiman Long
@ 2024-05-11 15:58       ` Matthew Wilcox
  2024-05-11 16:07         ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2024-05-11 15:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Yafang Shao, Linus Torvalds, viro, brauner, jack, linux-fsdevel,
	Wangkai, Colin Walters

On Sat, May 11, 2024 at 12:54:18AM -0400, Waiman Long wrote:
> I had suggested in the past to have a sysctl parameter to set a threshold on
> how many negative dentries (as a percentage of total system memory) are

Yes, but that's obviously bogus.  System memory is completely
uncorrelated with the optimum number of negative dentries to keep
around for workload performance.

Some multiple of "number of positive dentries" might make sense.



* Re: [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file
  2024-05-11 15:58       ` Matthew Wilcox
@ 2024-05-11 16:07         ` Linus Torvalds
  2024-05-11 16:13           ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-05-11 16:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Waiman Long, Yafang Shao, viro, brauner, jack, linux-fsdevel,
	Wangkai, Colin Walters

On Sat, 11 May 2024 at 08:59, Matthew Wilcox <willy@infradead.org> wrote:
>
> Some multiple of "number of positive dentries" might make sense.

The thing is, it would probably have to be per parent-dentry to be
useful, since that is typically what causes latency concerns: not
"global negative dentries", but "negative dentries that have to be
flushed for this parent as it is deleted".

And you'd probably need to make the child list be some kind of LRU
list to make this all effective.

Which is all likely rather expensive in both CPU time (stats tend to
cause lots of dirty caches) and in memory (the dentry would grow for
both stats and for the d_children list - it would almost certainly
have to change from a 'struct hlist_head' to a 'struct list_head' so
that you can put things either at the beginning or at the end).

I dunno. I haven't looked at just *how* nasty it would be. Maybe it
wouldn't be as bad as I suspect it would be.

Now, the other option might be to just make the latency concerns
smaller. It's not like removing negative dentries is very costly per
se. I think the issue has always been the dcache_lock, not the work to
remove the dentries themselves.

                Linus


* Re: [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file
  2024-05-11 16:07         ` Linus Torvalds
@ 2024-05-11 16:13           ` Linus Torvalds
  2024-05-11 18:05             ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-05-11 16:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Waiman Long, Yafang Shao, viro, brauner, jack, linux-fsdevel,
	Wangkai, Colin Walters

On Sat, 11 May 2024 at 09:07, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Now, the other option might be to just make the latency concerns
> smaller. It's not like removing negative dentries is very costly per
> se. I think the issue has always been the dcache_lock, not the work to
> remove the dentries themselves.

Actually, going back to re-read this particular report, at least this
time it was the inode lock of the parent, not the dcache_lock.

But the point ends up being the same - lots of negative dentries
aren't necessarily a problem in themselves, because the common
operation that matters is the hash lookup, which scales fairly well.

They mainly tend to become a problem when they hold up something else.

So better batching, or maybe just walking the negative child dentry
list after having marked the parent dead and then released the lock,
might also be the solution.

                      Linus


* Re: [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file
  2024-05-11 16:13           ` Linus Torvalds
@ 2024-05-11 18:05             ` Linus Torvalds
  2024-05-11 18:26               ` [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()' Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-05-11 18:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Waiman Long, Yafang Shao, viro, brauner, jack, linux-fsdevel,
	Wangkai, Colin Walters

On Sat, 11 May 2024 at 09:13, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So better batching, or maybe just walking the negative child dentry
> list after having marked the parent dead and then released the lock,
> might also be the solution.

IOW, maybe the solution is something as simple as this UNTESTED patch instead?

  --- a/fs/namei.c
  +++ b/fs/namei.c
  @@ -4207,16 +4207,19 @@ int vfs_rmdir(struct mnt_idmap *idmap,
struct inode *dir,
        if (error)
                goto out;

  -     shrink_dcache_parent(dentry);
        dentry->d_inode->i_flags |= S_DEAD;
        dont_mount(dentry);
        detach_mounts(dentry);
  +     inode_unlock(dentry->d_inode);
  +
  +     shrink_dcache_parent(dentry);
  +     dput(dentry);
  +     d_delete_notify(dir, dentry);
  +     return 0;

   out:
        inode_unlock(dentry->d_inode);
        dput(dentry);
  -     if (!error)
  -             d_delete_notify(dir, dentry);
        return error;
   }
   EXPORT_SYMBOL(vfs_rmdir);

where that "shrink_dcache_parent()" will still be quite expensive if
the directory has a ton of negative dentries, but because we now free
them after we've marked the dentry dead and released the inode, nobody
much cares any more?

Note (as usual) that this is untested, and maybe I haven't thought
everything through and I might be missing some important detail.

But I think shrinking the children later should be just fine - there's
nothing people can *do* with them. At worst they are reachable for
some lookup that doesn't take the locks, but that will just result in
a negative dentry which is all good, since they cannot become positive
dentries any more at this point.

So I think the inode lock is irrelevant for negative dentries, and
shrinking them outside the lock feels like a very natural thing to do.

Again: this is more of a "brainstorming patch" than an actual
suggestion. Yafang - it might be worth testing for your load, but
please do so knowing that it might have some consistency issues, so
don't test it on any production machinery, please ;)

Can anybody point out why I'm being silly, and the above change is
completely broken garbage? Please use small words to point out my
mental deficiencies to make sure I get it.

Just to clarify: this obviously will *not* speed up the actual rmdir
itself. *ALL* it does is to avoid holding the lock over the
potentially long cleanup operation. So the latency of the rmdir is
still the same, but now it should no longer matter for anything else.

            Linus


* [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-11 18:05             ` Linus Torvalds
@ 2024-05-11 18:26               ` Linus Torvalds
  2024-05-11 18:42                 ` Linus Torvalds
  2024-05-11 19:24                 ` [PATCH] " Al Viro
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-05-11 18:26 UTC (permalink / raw)
  To: torvalds
  Cc: brauner, jack, laoar.shao, linux-fsdevel, longman, viro, walters,
	wangkai86, willy

Yafang Shao reports that he has seen loads that generate billions of
negative dentries in a directory, which then when the directory is
removed causes excessive latencies for other users because the dentry
shrinking is done under the directory inode lock.

There seems to be no actual reason for holding the inode lock any more
by the time we get rid of the now uninteresting negative dentries, and
it's an effect of just code layout (the shared error path).

Reorganize the code trivially to just have a separate success path,
which simplifies the code (since 'd_delete_notify()' is only called in
the success path anyway) and makes it trivial to just move the dentry
shrinking outside the inode lock.

Reported-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/all/20240511022729.35144-1-laoar.shao@gmail.com/
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Ok, this is the same patch, just with a commit message.  And I actually
am running a kernel with that patch, so it compiles and boots.

Very limited testing: I created a directory with ten million negative
dentries, and then did a "rmdir" in one terminal window while doing a
"ls" inside that directory in another to kind of reproduce (on a smaller
scale) what Yafang was reporting. 

The "ls" was not affected at all.  But honestly, this was *ONE* single
trivial test, so it's almost completely worthless. 

 fs/namei.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 28e62238346e..474b1ee3266d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4217,16 +4217,19 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		goto out;
 
-	shrink_dcache_parent(dentry);
 	dentry->d_inode->i_flags |= S_DEAD;
 	dont_mount(dentry);
 	detach_mounts(dentry);
+	inode_unlock(dentry->d_inode);
+
+	shrink_dcache_parent(dentry);
+	dput(dentry);
+	d_delete_notify(dir, dentry);
+	return 0;
 
 out:
 	inode_unlock(dentry->d_inode);
 	dput(dentry);
-	if (!error)
-		d_delete_notify(dir, dentry);
 	return error;
 }
 EXPORT_SYMBOL(vfs_rmdir);
-- 
2.44.0.330.g4d18c88175



* Re: [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-11 18:26               ` [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()' Linus Torvalds
@ 2024-05-11 18:42                 ` Linus Torvalds
  2024-05-11 19:28                   ` Al Viro
  2024-05-11 20:02                   ` [PATCH v2] " Linus Torvalds
  2024-05-11 19:24                 ` [PATCH] " Al Viro
  1 sibling, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-05-11 18:42 UTC (permalink / raw)
  To: torvalds
  Cc: brauner, jack, laoar.shao, linux-fsdevel, longman, viro, walters,
	wangkai86, willy

On Sat, 11 May 2024 at 11:29, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Reorganize the code trivially to just have a separate success path,
> which simplifies the code (since 'd_delete_notify()' is only called in
> the success path anyway) and makes it trivial to just move the dentry
> shrinking outside the inode lock.

Bah.

I think this might need more work.

The *caller* of vfs_rmdir() also holds a lock, ie we have do_rmdir() doing

        inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
        dentry = lookup_one_qstr_excl(&last, path.dentry, lookup_flags);
        ...
        error = vfs_rmdir(mnt_idmap(path.mnt), path.dentry->d_inode, dentry);
        dput(dentry);
        inode_unlock(path.dentry->d_inode);

so we have another level of locking going on, and my patch only moved
the dcache pruning outside the lock of the directory we're removing
(not outside the lock of the directory that contains the removed
directory).

And that outside lock is the much more important one, I bet.

So I still think this approach may be the right one, but that patch of
mine didn't go far enough.

Sadly, while do_rmdir() itself is trivial to fix up to do this, we
have several other users of vfs_rmdir() (ecryptfs, devtmpfs, overlayfs
in addition to nfsd and ksmbd), so the more complete patch would be
noticeably bigger.

My bad.

                  Linus


* Re: [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-11 18:26               ` [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()' Linus Torvalds
  2024-05-11 18:42                 ` Linus Torvalds
@ 2024-05-11 19:24                 ` Al Viro
  1 sibling, 0 replies; 43+ messages in thread
From: Al Viro @ 2024-05-11 19:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: brauner, jack, laoar.shao, linux-fsdevel, longman, walters,
	wangkai86, willy

On Sat, May 11, 2024 at 11:26:26AM -0700, Linus Torvalds wrote:
> Yafang Shao reports that he has seen loads that generate billions of
> negative dentries in a directory, which then when the directory is
> removed causes excessive latencies for other users because the dentry
> shrinking is done under the directory inode lock.
> 
> There seems to be no actual reason for holding the inode lock any more
> by the time we get rid of the now uninteresting negative dentries, and
> it's an effect of just code layout (the shared error path).
> 
> Reorganize the code trivially to just have a separate success path,
> which simplifies the code (since 'd_delete_notify()' is only called in
> the success path anyway) and makes it trivial to just move the dentry
> shrinking outside the inode lock.

Wait, I thought he had confirmed that in the setup in question directories
were *not* removed, hadn't he?

I've no objections against that patch, but if the above is accurate it's
not going to help...


* Re: [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-11 18:42                 ` Linus Torvalds
@ 2024-05-11 19:28                   ` Al Viro
  2024-05-11 19:55                     ` Linus Torvalds
  2024-05-12 15:45                     ` James Bottomley
  2024-05-11 20:02                   ` [PATCH v2] " Linus Torvalds
  1 sibling, 2 replies; 43+ messages in thread
From: Al Viro @ 2024-05-11 19:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: brauner, jack, laoar.shao, linux-fsdevel, longman, walters,
	wangkai86, willy

On Sat, May 11, 2024 at 11:42:34AM -0700, Linus Torvalds wrote:

> so we have another level of locking going on, and my patch only moved
> the dcache pruning outside the lock of the directory we're removing
> (not outside the lock of the directory that contains the removed
> directory).
> 
> And that outside lock is the much more important one, I bet.

... and _that_ is where taking d_delete outside of the lock might
take an unpleasant analysis of a lot of code.

We have places where we assume that holding the parent locked
will prevent ->d_inode changes of children.  It might be possible
to get rid of that, but it will take a non-trivial amount of work.

In any case, I think the original poster said that parent directories
were not removed, so I doubt that rmdir() behaviour is relevant for
their load.


* Re: [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-11 19:28                   ` Al Viro
@ 2024-05-11 19:55                     ` Linus Torvalds
  2024-05-11 20:31                       ` Al Viro
  2024-05-12 15:45                     ` James Bottomley
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-05-11 19:55 UTC (permalink / raw)
  To: Al Viro
  Cc: brauner, jack, laoar.shao, linux-fsdevel, longman, walters,
	wangkai86, willy

[-- Attachment #1: Type: text/plain, Size: 2457 bytes --]

On Sat, 11 May 2024 at 12:28, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sat, May 11, 2024 at 11:42:34AM -0700, Linus Torvalds wrote:
> >
> > And that outside lock is the much more important one, I bet.
>
> ... and _that_ is where taking d_delete outside of the lock might
> take an unpleasant analysis of a lot of code.

Hmm. It really shouldn't matter. There can only be negative children
of the now deleted directory, so there are no actual effects on
inodes.

It only affects the d_child list, which is protected by d_lock (and
can be modified outside of the inode lock anyway due to memory
pressure).

What am I missing?

> In any case, I think the original poster said that parent directories
> were not removed, so I doubt that rmdir() behaviour is relevant for
> their load.

I don't see that at all. The load was a "rm -rf" of a directory tree,
and all of that was successful as far as I can see from the report.

The issue was that an unrelated process just looking at the directory
(either one - I clearly tested the wrong one) would be held up by the
directory lock while the pruning was going on.

And yes, the pruning can take a couple of seconds with "just" a few
million negative dentries. The negative dentries obviously don't even
have to be the result of a 'delete' - the easy way to see this is to
do a lot of bogus lookups.

Attached is my excessively stupid test-case in case you want to play
around with it:

    [dirtest]$ time ./a.out dir ; time rmdir dir

    real 0m12.592s
    user 0m1.153s
    sys 0m11.412s

    real 0m1.892s
    user 0m0.001s
    sys 0m1.528s

so you can see how it takes almost two seconds to then flush those
negative dentries - even when there were no 'unlink()' calls at all,
just failed lookups.

It's maybe instructive to do the same on tmpfs, which has

    /*
     * Retaining negative dentries for an in-memory filesystem just wastes
     * memory and lookup time: arrange for them to be deleted immediately.
     */
    int always_delete_dentry(const struct dentry *dentry)
    {
        return 1;
    }

and so if you do the same test on /tmp, the results are very different:

    [dirtest]$ time ./a.out /tmp/sillydir ; time rmdir /tmp/sillydir

    real 0m8.129s
    user 0m1.164s
    sys 0m6.592s

    real 0m0.001s
    user 0m0.000s
    sys 0m0.001s

so it does show very different patterns and you can test the whole
"what happens without negative dentries" case.

                 Linus

[-- Attachment #2: main.c --]
[-- Type: text/x-csrc, Size: 575 bytes --]

#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>

#define FATAL(x) do { if (x) die(#x); } while (0)

static void die(const char *s)
{
	fprintf(stderr, "%s: %s\n", s, strerror(errno));
	exit(1);
}

int main(int argc, char ** argv)
{
	char *dirname = argv[1];

	FATAL(argc < 2);
	FATAL(mkdir(dirname, 0700));
	for (int i = 0; i < 10000000; i++) {
		char name[128];
		snprintf(name, sizeof(name), "%s/name-%09d", dirname, i);
		/* the open is expected to fail with ENOENT - each failed
		   lookup leaves a negative dentry behind */
		FATAL(open(name, O_RDONLY) >= 0);
	}
	return 0;
}


* [PATCH v2] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-11 18:42                 ` Linus Torvalds
  2024-05-11 19:28                   ` Al Viro
@ 2024-05-11 20:02                   ` Linus Torvalds
  2024-05-12  3:06                     ` Yafang Shao
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-05-11 20:02 UTC (permalink / raw)
  To: torvalds
  Cc: brauner, jack, laoar.shao, linux-fsdevel, longman, viro, walters,
	wangkai86, willy

Yafang Shao reports that he has seen loads that generate billions of
negative dentries in a directory, which then when the directory is
removed causes excessive latencies for other users because the dentry
shrinking is done under the directory inode lock.

There seems to be no actual reason for holding the inode lock any more
by the time we get rid of the now uninteresting negative dentries, and
it's an effect of the calling convention.

Split the 'vfs_rmdir()' function into two separate phases:

 - 'vfs_rmdir_raw()' does the actual main rmdir heavy lifting

 - 'vfs_rmdir_cleanup()' needs to be run by the caller after a
   successful raw call, after the caller has dropped the inode locks.

We leave the 'vfs_rmdir()' function around, since it has multiple
callers, and only convert the main system call path to the new two-phase
model.  The other uses will be left as an exercise for the reader for
when people find they care.

[ Side note: I think the 'dget()/dput()' pair in vfs_rmdir_raw() is
  superfluous, since callers always have to have a dentry reference over
  the call anyway. That's a separate issue.    - Linus ]

Reported-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/all/20240511022729.35144-1-laoar.shao@gmail.com/
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Second version - this time doing the dentry pruning even later, after
releasing the parent inode lock too. 

I did the same amount of "extensive testing" on this one as the previous
one.  IOW, little-to-none. 

 fs/namei.c | 61 ++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 41 insertions(+), 20 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 28e62238346e..15b4ff6ed1e5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4176,21 +4176,7 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
 	return do_mkdirat(AT_FDCWD, getname(pathname), mode);
 }
 
-/**
- * vfs_rmdir - remove directory
- * @idmap:	idmap of the mount the inode was found from
- * @dir:	inode of @dentry
- * @dentry:	pointer to dentry of the base directory
- *
- * Remove a directory.
- *
- * If the inode has been found through an idmapped mount the idmap of
- * the vfsmount must be passed through @idmap. This function will then take
- * care to map the inode according to @idmap before checking permissions.
- * On non-idmapped mounts or if permission checking is to be performed on the
- * raw inode simply pass @nop_mnt_idmap.
- */
-int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
+static int vfs_rmdir_raw(struct mnt_idmap *idmap, struct inode *dir,
 		     struct dentry *dentry)
 {
 	int error = may_delete(idmap, dir, dentry, 1);
@@ -4217,18 +4203,43 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		goto out;
 
-	shrink_dcache_parent(dentry);
 	dentry->d_inode->i_flags |= S_DEAD;
 	dont_mount(dentry);
 	detach_mounts(dentry);
-
 out:
 	inode_unlock(dentry->d_inode);
 	dput(dentry);
-	if (!error)
-		d_delete_notify(dir, dentry);
 	return error;
 }
+
+static inline void vfs_rmdir_cleanup(struct inode *dir, struct dentry *dentry)
+{
+	shrink_dcache_parent(dentry);
+	d_delete_notify(dir, dentry);
+}
+
+/**
+ * vfs_rmdir - remove directory
+ * @idmap:	idmap of the mount the inode was found from
+ * @dir:	inode of @dentry
+ * @dentry:	pointer to dentry of the base directory
+ *
+ * Remove a directory.
+ *
+ * If the inode has been found through an idmapped mount the idmap of
+ * the vfsmount must be passed through @idmap. This function will then take
+ * care to map the inode according to @idmap before checking permissions.
+ * On non-idmapped mounts or if permission checking is to be performed on the
+ * raw inode simply pass @nop_mnt_idmap.
+ */
+int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
+		     struct dentry *dentry)
+{
+	int retval = vfs_rmdir_raw(idmap, dir, dentry);
+	if (!retval)
+		vfs_rmdir_cleanup(dir, dentry);
+	return retval;
+}
 EXPORT_SYMBOL(vfs_rmdir);
 
 int do_rmdir(int dfd, struct filename *name)
@@ -4272,7 +4283,17 @@ int do_rmdir(int dfd, struct filename *name)
 	error = security_path_rmdir(&path, dentry);
 	if (error)
 		goto exit4;
-	error = vfs_rmdir(mnt_idmap(path.mnt), path.dentry->d_inode, dentry);
+	error = vfs_rmdir_raw(mnt_idmap(path.mnt), path.dentry->d_inode, dentry);
+	if (error)
+		goto exit4;
+	inode_unlock(path.dentry->d_inode);
+	mnt_drop_write(path.mnt);
+	vfs_rmdir_cleanup(path.dentry->d_inode, dentry);
+	dput(dentry);
+	path_put(&path);
+	putname(name);
+	return 0;
+
 exit4:
 	dput(dentry);
 exit3:
-- 
2.44.0.330.g4d18c88175



* Re: [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-11 19:55                     ` Linus Torvalds
@ 2024-05-11 20:31                       ` Al Viro
  2024-05-11 21:17                         ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2024-05-11 20:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: brauner, jack, laoar.shao, linux-fsdevel, longman, walters,
	wangkai86, willy

On Sat, May 11, 2024 at 12:55:29PM -0700, Linus Torvalds wrote:
> On Sat, 11 May 2024 at 12:28, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Sat, May 11, 2024 at 11:42:34AM -0700, Linus Torvalds wrote:
> > >
> > > And that outside lock is the much more important one, I bet.
> >
> > ... and _that_ is where taking d_delete outside of the lock might
> > take an unpleasant analysis of a lot of code.
> 
> Hmm. It really shouldn't matter. There can only be negative children
> of the now deleted directory, so there are no actual effects on
> inodes.
> 
> It only affects the d_child list, which is protected by d_lock (and
> can be modified outside of the inode lock anyway due to memory
> pressure).
> 
> What am I missing?

fsnotify and related fun, basically.  I'll need to redo the analysis,
but IIRC there had been places where correctness had been guaranteed
by the fact that this had been serialized by the lock on parent.

No idea how easy it would be to adapt to that change; I'll check,
but it'll take a few days, I'm afraid.


* Re: [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-11 20:31                       ` Al Viro
@ 2024-05-11 21:17                         ` Al Viro
  0 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2024-05-11 21:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: brauner, jack, laoar.shao, linux-fsdevel, longman, walters,
	wangkai86, willy

On Sat, May 11, 2024 at 09:31:43PM +0100, Al Viro wrote:
> On Sat, May 11, 2024 at 12:55:29PM -0700, Linus Torvalds wrote:
> > On Sat, 11 May 2024 at 12:28, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > On Sat, May 11, 2024 at 11:42:34AM -0700, Linus Torvalds wrote:
> > > >
> > > > And that outside lock is the much more important one, I bet.
> > >
> > > ... and _that_ is where taking d_delete outside of the lock might
> > > take an unpleasant analysis of a lot of code.
> > 
> > Hmm. It really shouldn't matter. There can only be negative children
> > of the now deleted directory, so there are no actual effects on
> > inodes.
> > 
> > It only affects the d_child list, which is protected by d_lock (and
> > can be modified outside of the inode lock anyway due to memory
> > pressure).
> > 
> > What am I missing?
> 
> fsnotify and related fun, basically.  I'll need to redo the analysis,
> but IIRC there had been places where correctness had been guaranteed
> by the fact that this had been serialized by the lock on parent.

As an aside, I'd really love to see d_rehash() gone - the really old nest
of users is gone (used to be in nfs), but there's still a weird corner case
in exfat + a bunch in AFS.  Life would be simpler if those had been gone -
many correctness proofs around dcache have unpleasant warts dealing with
that crap.  Not relevant in this case, but it makes analysis harder in
general...


* Re: [PATCH v2] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-11 20:02                   ` [PATCH v2] " Linus Torvalds
@ 2024-05-12  3:06                     ` Yafang Shao
  2024-05-12  3:30                       ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Yafang Shao @ 2024-05-12  3:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: brauner, jack, linux-fsdevel, longman, viro, walters, wangkai86,
	willy

On Sun, May 12, 2024 at 4:03 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Yafang Shao reports that he has seen loads that generate billions of
> negative dentries in a directory, which then when the directory is
> removed causes excessive latencies for other users because the dentry
> shrinking is done under the directory inode lock.
>
> There seems to be no actual reason for holding the inode lock any more
> by the time we get rid of the now uninteresting negative dentries, and
> it's an effect of the calling convention.
>
> Split the 'vfs_rmdir()' function into two separate phases:
>
>  - 'vfs_rmdir_raw()' does the actual main rmdir heavy lifting
>
>  - 'vfs_rmdir_cleanup()' needs to be run by the caller after a
>    successful raw call, after the caller has dropped the inode locks.
>
> We leave the 'vfs_rmdir()' function around, since it has multiple
> callers, and only convert the main system call path to the new two-phase
> model.  The other uses will be left as an exercise for the reader for
> when people find they care.
>
> [ Side note: I think the 'dget()/dput()' pair in vfs_rmdir_raw() is
>   superfluous, since callers always have to have a dentry reference over
>   the call anyway. That's a separate issue.    - Linus ]
>
> Reported-by: Yafang Shao <laoar.shao@gmail.com>
> Link: https://lore.kernel.org/all/20240511022729.35144-1-laoar.shao@gmail.com/
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

This could resolve the secondary concern.
Tested-by: Yafang Shao <laoar.shao@gmail.com>

Might it be feasible to execute the vfs_rmdir_cleanup() within a
kworker? Such an approach could potentially mitigate the initial
concern as well.

> ---
>
> Second version - this time doing the dentry pruning even later, after
> releasing the parent inode lock too.
>
> I did the same amount of "extensive testing" on this one as the previous
> one.  IOW, little-to-none.
>
>  fs/namei.c | 61 ++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 41 insertions(+), 20 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 28e62238346e..15b4ff6ed1e5 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -4176,21 +4176,7 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
>         return do_mkdirat(AT_FDCWD, getname(pathname), mode);
>  }
>
> -/**
> - * vfs_rmdir - remove directory
> - * @idmap:     idmap of the mount the inode was found from
> - * @dir:       inode of @dentry
> - * @dentry:    pointer to dentry of the base directory
> - *
> - * Remove a directory.
> - *
> - * If the inode has been found through an idmapped mount the idmap of
> - * the vfsmount must be passed through @idmap. This function will then take
> - * care to map the inode according to @idmap before checking permissions.
> - * On non-idmapped mounts or if permission checking is to be performed on the
> - * raw inode simply pass @nop_mnt_idmap.
> - */
> -int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> +static int vfs_rmdir_raw(struct mnt_idmap *idmap, struct inode *dir,
>                      struct dentry *dentry)
>  {
>         int error = may_delete(idmap, dir, dentry, 1);
> @@ -4217,18 +4203,43 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
>         if (error)
>                 goto out;
>
> -       shrink_dcache_parent(dentry);
>         dentry->d_inode->i_flags |= S_DEAD;
>         dont_mount(dentry);
>         detach_mounts(dentry);
> -
>  out:
>         inode_unlock(dentry->d_inode);
>         dput(dentry);
> -       if (!error)
> -               d_delete_notify(dir, dentry);
>         return error;
>  }
> +
> +static inline void vfs_rmdir_cleanup(struct inode *dir, struct dentry *dentry)
> +{
> +       shrink_dcache_parent(dentry);
> +       d_delete_notify(dir, dentry);
> +}
> +
> +/**
> + * vfs_rmdir - remove directory
> + * @idmap:     idmap of the mount the inode was found from
> + * @dir:       inode of @dentry
> + * @dentry:    pointer to dentry of the base directory
> + *
> + * Remove a directory.
> + *
> + * If the inode has been found through an idmapped mount the idmap of
> + * the vfsmount must be passed through @idmap. This function will then take
> + * care to map the inode according to @idmap before checking permissions.
> + * On non-idmapped mounts or if permission checking is to be performed on the
> + * raw inode simply pass @nop_mnt_idmap.
> + */
> +int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> +                    struct dentry *dentry)
> +{
> +       int retval = vfs_rmdir_raw(idmap, dir, dentry);
> +       if (!retval)
> +               vfs_rmdir_cleanup(dir, dentry);
> +       return retval;
> +}
>  EXPORT_SYMBOL(vfs_rmdir);
>
>  int do_rmdir(int dfd, struct filename *name)
> @@ -4272,7 +4283,17 @@ int do_rmdir(int dfd, struct filename *name)
>         error = security_path_rmdir(&path, dentry);
>         if (error)
>                 goto exit4;
> -       error = vfs_rmdir(mnt_idmap(path.mnt), path.dentry->d_inode, dentry);
> +       error = vfs_rmdir_raw(mnt_idmap(path.mnt), path.dentry->d_inode, dentry);
> +       if (error)
> +               goto exit4;
> +       inode_unlock(path.dentry->d_inode);
> +       mnt_drop_write(path.mnt);
> +       vfs_rmdir_cleanup(path.dentry->d_inode, dentry);
> +       dput(dentry);
> +       path_put(&path);
> +       putname(name);
> +       return 0;
> +
>  exit4:
>         dput(dentry);
>  exit3:
> --
> 2.44.0.330.g4d18c88175
>


-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-12  3:06                     ` Yafang Shao
@ 2024-05-12  3:30                       ` Al Viro
  2024-05-12  3:36                         ` Yafang Shao
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2024-05-12  3:30 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Linus Torvalds, brauner, jack, linux-fsdevel, longman, walters,
	wangkai86, willy

On Sun, May 12, 2024 at 11:06:07AM +0800, Yafang Shao wrote:
> This could resolve the secondary concern.
> Tested-by: Yafang Shao <laoar.shao@gmail.com>
> 
> Might it be feasible to execute vfs_rmdir_cleanup() within a
> kworker? Such an approach could potentially mitigate the initial
> concern as well.

	I'm honestly not sure I understood you correctly; in case I
have and you really want to make that asynchronous, the answer's "we
can't do that".  What's more, we can not even delay that past the call
of mnt_drop_write() in do_rmdir().

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-12  3:30                       ` Al Viro
@ 2024-05-12  3:36                         ` Yafang Shao
  0 siblings, 0 replies; 43+ messages in thread
From: Yafang Shao @ 2024-05-12  3:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, brauner, jack, linux-fsdevel, longman, walters,
	wangkai86, willy

On Sun, May 12, 2024 at 11:30 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sun, May 12, 2024 at 11:06:07AM +0800, Yafang Shao wrote:
> > This could resolve the secondary concern.
> > Tested-by: Yafang Shao <laoar.shao@gmail.com>
> >
> > Might it be feasible to execute vfs_rmdir_cleanup() within a
> > kworker? Such an approach could potentially mitigate the initial
> > concern as well.
>
>         I'm honestly not sure I understood you correctly; in case I
> have and you really want to make that asynchronous,

Exactly, that's precisely my point.

> the answer's "we
> can't do that".  What's more, we can not even delay that past the call
> of mnt_drop_write() in do_rmdir().

Thanks for your explanation.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-11 19:28                   ` Al Viro
  2024-05-11 19:55                     ` Linus Torvalds
@ 2024-05-12 15:45                     ` James Bottomley
  2024-05-12 16:16                       ` Al Viro
  1 sibling, 1 reply; 43+ messages in thread
From: James Bottomley @ 2024-05-12 15:45 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds
  Cc: brauner, jack, laoar.shao, linux-fsdevel, longman, walters,
	wangkai86, willy

On Sat, 2024-05-11 at 20:28 +0100, Al Viro wrote:
> On Sat, May 11, 2024 at 11:42:34AM -0700, Linus Torvalds wrote:
> 
> > so we have another level of locking going on, and my patch only
> > moved
> > the dcache pruning outside the lock of the directory we're removing
> > (not outside the lock of the directory that contains the removed
> > directory).
> > 
> > And that outside lock is the much more important one, I bet.
> 
> ... and _that_ is where taking d_delete outside of the lock might
> take an unpleasant analysis of a lot of code.

Couldn't you obviate this by doing it from a workqueue?  Even if the
directory is recreated, the chances are most of the negative dentries
that were under it will still exist and be removable by the time the
workqueue runs.

James


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-12 15:45                     ` James Bottomley
@ 2024-05-12 16:16                       ` Al Viro
  2024-05-12 19:59                         ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2024-05-12 16:16 UTC (permalink / raw)
  To: James Bottomley
  Cc: Linus Torvalds, brauner, jack, laoar.shao, linux-fsdevel, longman,
	walters, wangkai86, willy

On Sun, May 12, 2024 at 11:45:44AM -0400, James Bottomley wrote:
> On Sat, 2024-05-11 at 20:28 +0100, Al Viro wrote:
> > On Sat, May 11, 2024 at 11:42:34AM -0700, Linus Torvalds wrote:
> > 
> > > so we have another level of locking going on, and my patch only
> > > moved
> > > the dcache pruning outside the lock of the directory we're removing
> > > (not outside the lock of the directory that contains the removed
> > > directory).
> > > 
> > > And that outside lock is the much more important one, I bet.
> > 
> > ... and _that_ is where taking d_delete outside of the lock might
> > take an unpleasant analysis of a lot of code.
> 
> Couldn't you obviate this by doing it from a workqueue?  Even if the
> directory is recreated, the chances are most of the negative dentries
> that were under it will still exist and be removable by the time the
> workqueue runs.

Eviction in general - sure.
shrink_dcache_parent() in particular... not really - you'd need to keep
the dentry pinned for that, and that'd cause all kinds of fun for umount.
d_delete() - even worse (you don't want dcache lookups to find that
sucker after rmdir(2) returned success to userland).

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-12 16:16                       ` Al Viro
@ 2024-05-12 19:59                         ` Linus Torvalds
  2024-05-12 20:29                           ` Linus Torvalds
  2024-05-13  5:31                           ` Al Viro
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-05-12 19:59 UTC (permalink / raw)
  To: Al Viro
  Cc: James Bottomley, brauner, jack, laoar.shao, linux-fsdevel,
	longman, walters, wangkai86, willy

On Sun, 12 May 2024 at 09:16, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Eviction in general - sure
> shrink_dcache_parent() in particular... not really - you'd need to keep
> dentry pinned for that and that'd cause all kinds of fun for umount
> d_delete() - even worse (you don't want dcache lookups to find that
> sucker after rmdir(2) returned success to userland).

All of those dentries *should* be negative dentries, so I really don't
see the worry.

For example, any child dentries will obviously still elevate the
dentry count on the parent, so I really think we could actually skip
the dentry shrink *entirely* and it wouldn't really have any semantic
effect.

IOW, I really think the "shrink dentries on rmdir" is a "let's get rid
of pointless children that will never be used" than a semantic issue -
cleaning things up and releasing memory, rather than about behavior.

We will have already marked the inode dead, and called d_delete() on
the directory:

        dentry->d_inode->i_flags |= S_DEAD;
        dont_mount(dentry);
        detach_mounts(dentry);
        ...
        if (!error)
                d_delete_notify(dir, dentry);  // does d_delete(dentry)

so the removed directory entry itself will either have turned into a
negative dentry or have been unhashed (if there are other users).

So the children are already unreachable through that name, and can
only be reached through somebody who still has the directory open. And
I do not see how "rmdir()" can *possibly* have any valid semantic
effect on any user that has that directory as its PWD, so I claim that
the dentries that exist at this point must already not be relevant
from a semantic standpoint.

So Al, this is my argument: the only dentry that *matters* is the
dentry of the removed directory itself, and that's the one that sees
the "d_delete()" (and all the noise to make sure you can't do new
lookups and can't mount on top of it).

The dentries *underneath* it cannot matter for semantics, and can
happily be pruned - or not pruned - at any later date.

Can you address *THIS* issue?

                  Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-12 19:59                         ` Linus Torvalds
@ 2024-05-12 20:29                           ` Linus Torvalds
  2024-05-13  5:31                           ` Al Viro
  1 sibling, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-05-12 20:29 UTC (permalink / raw)
  To: Al Viro
  Cc: James Bottomley, brauner, jack, laoar.shao, linux-fsdevel,
	longman, walters, wangkai86, willy

On Sun, 12 May 2024 at 12:59, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So the children are already unreachable through that name, and can
> only be reached through somebody who still has the directory open. And
> I do not see how "rmdir()" can *possibly* have any valid semantic
> effect on any user that has that directory as its PWD, so I claim that
> the dentries that exist at this point must already not be relevant
> from a semantic standpoint.

Side note: our IS_DEADDIR() checks seem a bit inconsistent.

That's actually what should catch some of the "I'm in a directory that
has been removed" cases, but we only have that check in lookup_open(), in
lookup_one_qstr_excl(), and in __lookup_slow().

I wonder if we should check it in lookup_dcache() too?

Because yes, that's a difference that rmdir makes, in how it sets
S_DEAD, and this seems to be an area where existing cached dentries
are handled differently.
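
A rough sketch of what that extra check in lookup_dcache() might look
like, assuming its current shape in fs/namei.c (illustrative only, not a
tested patch):

static struct dentry *lookup_dcache(const struct qstr *name,
				    struct dentry *dir,
				    unsigned int flags)
{
	struct dentry *dentry = d_lookup(dir, name);

	/*
	 * Hypothetical extra check, mirroring what lookup_open() does:
	 * refuse to hand back a cached child once the parent inode has
	 * been marked S_DEAD by rmdir().
	 */
	if (dentry && unlikely(IS_DEADDIR(d_inode(dir)))) {
		d_invalidate(dentry);
		dput(dentry);
		return ERR_PTR(-ENOENT);
	}

	/* ... the existing ->d_revalidate() handling would stay as-is ... */

	return dentry;
}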

                Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-12 19:59                         ` Linus Torvalds
  2024-05-12 20:29                           ` Linus Torvalds
@ 2024-05-13  5:31                           ` Al Viro
  2024-05-13 15:58                             ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Al Viro @ 2024-05-13  5:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: James Bottomley, brauner, jack, laoar.shao, linux-fsdevel,
	longman, walters, wangkai86, willy

On Sun, May 12, 2024 at 12:59:57PM -0700, Linus Torvalds wrote:
> so the removed directory entry itself will either have turned into a
> negative dentry or have been unhashed (if there are other users).
> 
> So the children are already unreachable through that name, and can
> only be reached through somebody who still has the directory open. And
> I do not see how "rmdir()" can *possibly* have any valid semantic
> effect on any user that has that directory as its PWD, so I claim that
> the dentries that exist at this point must already not be relevant
> from a semantic standpoint.
> 
> So Al, this is my argument: the only dentry that *matters* is the
> dentry of the removed directory itself, and that's the one that sees
> the "d_delete()" (and all the noise to make sure you can't do new
> lookups and can't mount on top of it).

Recall what d_delete() will do if you have other references.
That's why we want shrink_dcache_parent() *before* d_delete().

BTW, example of the reasons why d_delete() without directory being locked
is painful: simple_positive() is currently called either under ->d_lock
on dentry in question or under ->i_rwsem on the parent.  Either is enough
to stabilize it.  Get d_delete() happening without parent locked and
all callers of simple_positive() must take ->d_lock.

This one is not hard to adjust, but we need to find all such places.
Currently positivity of hashed dentry can change only with parent
held exclusive.  It's going to take some digging...
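
For reference, the helper itself is trivial - all the stability comes
from the caller's locking - so the adjustment each such caller would
need looks roughly like this (positive_careful() is a made-up name,
purely to show the pattern):

/* include/linux/fs.h, roughly: positive and still hashed */
static inline bool simple_positive(const struct dentry *dentry)
{
	return d_really_is_positive(dentry) && !d_unhashed(dentry);
}

/*
 * Today a caller may rely on holding the parent's ->i_rwsem instead of
 * ->d_lock.  If d_delete() could flip positivity without the parent
 * held, every such caller would have to do something like:
 */
static bool positive_careful(struct dentry *dentry)
{
	bool ret;

	spin_lock(&dentry->d_lock);
	ret = simple_positive(dentry);
	spin_unlock(&dentry->d_lock);
	return ret;
}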

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-13  5:31                           ` Al Viro
@ 2024-05-13 15:58                             ` Linus Torvalds
  2024-05-13 16:33                               ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-05-13 15:58 UTC (permalink / raw)
  To: Al Viro
  Cc: James Bottomley, brauner, jack, laoar.shao, linux-fsdevel,
	longman, walters, wangkai86, willy

On Sun, 12 May 2024 at 22:31, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Recall what d_delete() will do if you have other references.
> That's why we want shrink_dcache_parent() *before* d_delete().

Ahh. Yes. So with the shrinking done after as my patch, that means
that now the directory that gets removed is always unhashed.

> BTW, example of the reasons why d_delete() without directory being locked
> is painful

Oh, I would never suggest moving the d_delete() of the directory being
removed itself outside the lock.

I agree 100% on that side.

That said, maybe it's ok to just always unhash the directory when
doing the vfs_rmdir(). I certainly think it's better than unhashing
all the negative dentries like the original patch did.

Ho humm. Let me think about this some more.

We *could* strive for a hybrid approach, where we handle the common
case ("not a ton of child dentries") differently, and just get rid of
them synchronously, and handle the "millions of children" case by
unhashing the directory and dealing with shrinking the children async.
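
Purely as a sketch of that hybrid, riffing on the vfs_rmdir_cleanup()
helper from the earlier patch - the threshold helper and the deferred
path below are made up, not existing interfaces:

static inline void vfs_rmdir_cleanup(struct inode *dir, struct dentry *dentry)
{
	if (likely(!dentry_has_many_children(dentry)))	/* made-up heuristic */
		shrink_dcache_parent(dentry);	/* common case: prune now */
	else
		queue_child_shrink(dentry);	/* made up: prune the children
						 * later, asynchronously */
	d_delete_notify(dir, dentry);
}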

But now it's the merge window, and I have 60+ pull requests in my
inbox, so I will have to go back to work on that.

                  Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-13 15:58                             ` Linus Torvalds
@ 2024-05-13 16:33                               ` Al Viro
  2024-05-13 16:44                                 ` Linus Torvalds
  2024-05-23  7:18                                 ` Dave Chinner
  0 siblings, 2 replies; 43+ messages in thread
From: Al Viro @ 2024-05-13 16:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: James Bottomley, brauner, jack, laoar.shao, linux-fsdevel,
	longman, walters, wangkai86, willy

On Mon, May 13, 2024 at 08:58:33AM -0700, Linus Torvalds wrote:
 
> We *could* strive for a hybrid approach, where we handle the common
> case ("not a ton of child dentries") differently, and just get rid of
> them synchronously, and handle the "millions of children" case by
> unhashing the directory and dealing with shrinking the children async.

try_to_shrink_children()?  Doable, and not even that hard to do, but
as for shrinking async...  We can easily move it out of the inode_lock
on the parent, but doing that really async would either need to be
tied into e.g. the remount r/o logic, or we'd get userland regressions.

I mean, "I have an opened unlinked file, can't remount r/o" is one
thing, but "I've done rm -rf ~luser, can't remount r/o for a while"
when all luser's processes had been killed and nothing is holding
any of that opened... ouch.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-13 16:33                               ` Al Viro
@ 2024-05-13 16:44                                 ` Linus Torvalds
  2024-05-23  7:18                                 ` Dave Chinner
  1 sibling, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-05-13 16:44 UTC (permalink / raw)
  To: Al Viro
  Cc: James Bottomley, brauner, jack, laoar.shao, linux-fsdevel,
	longman, walters, wangkai86, willy

On Mon, 13 May 2024 at 09:33, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> try_to_shrink_children()?  Doable, and not even that hard to do, but
> as for shrinking async...  We can easily move it out of inode_lock
> on parent, but doing that really async would either need to be
> tied into e.g. remount r/o logics or we'd get userland regressions.

Let's aim to just get it out from under the inode lock first, and then
look at anything more exciting later, perhaps?

                   Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file
  2024-05-11  3:35   ` Yafang Shao
  2024-05-11  4:54     ` Waiman Long
@ 2024-05-15  2:18     ` Yafang Shao
  2024-05-15  2:36       ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Yafang Shao @ 2024-05-15  2:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: viro, brauner, jack, linux-fsdevel, Wangkai, Colin Walters,
	Waiman Long

On Sat, May 11, 2024 at 11:35 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sat, May 11, 2024 at 10:54 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Fri, 10 May 2024 at 19:28, Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > We've devised a solution to address both issues by deleting associated
> > > dentry when removing a file.
> >
> > This patch is buggy. You are modifying d_flags outside the locked region.
> >
> > So at a minimum, the DCACHE_FILE_DELETED bit setting would need to
> > just go into the
> >
> >         if (dentry->d_lockref.count == 1) {
> >
> > side of the conditional, since the other side of that conditional
> > already unhashes the dentry which makes this all moot anyway.
> >
> > That said, I think it's buggy in another way too: what if somebody
> > else looks up the dentry before it actually gets unhashed? Then you
> > have another ref to it, and the dentry might live long enough that it
> > then gets re-used for a newly created file (which is why we have those
> > negative dentries in the first place).
> >
> > So you'd have to clear the DCACHE_FILE_DELETED if the dentry is then
> > made live by a file creation or rename or whatever.
> >
> > So that d_flags thing is actually pretty complicated.
> >
> > But since you made all this unconditional anyway, I think having a new
> > dentry flag is unnecessary in the first place, and I suspect you are
> > better off just unhashing the dentry unconditionally instead.
> >
> > IOW, I think the simpler patch is likely just something like this:
>
> It's simpler. I had contemplated handling it that way, but lacked the
> knowledge and courage to proceed, hence I opted for the d_flags
> solution.
> I'll conduct tests on the revised change. I appreciate your suggestion.
>

We have successfully applied a hotfix to a subset of our production
servers, totaling several thousand. The hotfix is as follows:

diff --git a/fs/dcache.c b/fs/dcache.c
index 52e6d5f..30eb733 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2557,14 +2557,14 @@ void d_delete(struct dentry * dentry)

        spin_lock(&inode->i_lock);
        spin_lock(&dentry->d_lock);
+       __d_drop(dentry);
+
        /*
         * Are we the only user?
         */
        if (dentry->d_lockref.count == 1) {
-               dentry->d_flags &= ~DCACHE_CANT_MOUNT;
                dentry_unlink_inode(dentry);
        } else {
-               __d_drop(dentry);
                spin_unlock(&dentry->d_lock);
                spin_unlock(&inode->i_lock);
        }

So far, it has been functioning well without any regressions. We are
planning to roll this update out to our entire fleet, which consists
of hundreds of thousands of servers.

I believe this change is still necessary. Would you prefer to commit
it directly, or should I send an official patch?

If the "unlink-create" issue is a concern, perhaps we can address it
by adding a /sys/kernel/debug/vfs/delete_file_legacy entry?
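
Just to illustrate what I have in mind, a rough sketch (the knob name is
the one suggested above; nothing like this exists today):

#include <linux/debugfs.h>
#include <linux/init.h>

/* Hypothetical switch to fall back to the old d_delete() behavior. */
static bool delete_file_legacy;

static int __init vfs_delete_debugfs_init(void)
{
	struct dentry *dir = debugfs_create_dir("vfs", NULL);

	debugfs_create_bool("delete_file_legacy", 0644, dir,
			    &delete_file_legacy);
	return 0;
}
late_initcall(vfs_delete_debugfs_init);

/*
 * d_delete() would then check the knob, e.g.:
 *
 *	if (READ_ONCE(delete_file_legacy))
 *		... keep the "turn into a negative dentry" path ...
 *	else
 *		__d_drop(dentry);
 */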

> >
> >   --- a/fs/dcache.c
> >   +++ b/fs/dcache.c
> >   @@ -2381,6 +2381,7 @@ void d_delete(struct dentry * dentry)
> >
> >         spin_lock(&inode->i_lock);
> >         spin_lock(&dentry->d_lock);
> >   +     __d_drop(dentry);
> >         /*
> >          * Are we the only user?
> >          */
> >   @@ -2388,7 +2389,6 @@ void d_delete(struct dentry * dentry)
> >                 dentry->d_flags &= ~DCACHE_CANT_MOUNT;
> >                 dentry_unlink_inode(dentry);
> >         } else {
> >   -             __d_drop(dentry);
> >                 spin_unlock(&dentry->d_lock);
> >                 spin_unlock(&inode->i_lock);
> >         }
> >
> > although I think Al needs to ACK this, and I suspect that unhashing
> > the dentry also makes that
> >
> >                 dentry->d_flags &= ~DCACHE_CANT_MOUNT;
> >
> > pointless (because the dentry won't be reused, so DCACHE_CANT_MOUNT
> > just won't matter).
> >
> > I do worry that there are loads that actually love our current
> > behavior, but maybe it's worth doing the simple unconditional "make
> > d_delete() always unhash" and only worry about whether that causes
> > performance problems for people who commonly create a new file in its
> > place when we get such a report.
> >
> > IOW, the more complex thing might be to actually take other behavior
> > into account (eg "do we have so many negative dentries that we really
> > don't want to create new ones").
>
> This poses a substantial challenge. Despite recurrent discussions
> within the community about improving negative dentry handling,
> there hasn't been a consensus on how to address it.
>
> >
> > Al - can you please step in and tell us what else I've missed, and why
> > my suggested version of the patch is also broken garbage?
> >
> >              Linus
>
>
> --
> Regards
> Yafang



--
Regards
Yafang

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file
  2024-05-15  2:18     ` [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file Yafang Shao
@ 2024-05-15  2:36       ` Linus Torvalds
  2024-05-15  9:17         ` [PATCH] vfs: " Yafang Shao
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-05-15  2:36 UTC (permalink / raw)
  To: Yafang Shao
  Cc: viro, brauner, jack, linux-fsdevel, Wangkai, Colin Walters,
	Waiman Long

On Tue, 14 May 2024 at 19:19, Yafang Shao <laoar.shao@gmail.com> wrote:
>
> I believe this change is still necessary. Would you prefer to commit
> it directly, or should I send an official patch?

Sending an official patch just for people to try out sounds good
regardless, but I'd really like some real performance testing on other
loads too before just taking it.

It *may* be acceptable, and it's certainly simple. It's also almost
certainly safe from a correctness angle.

But it might regress performance on other important loads even if it
fixes an issue on your load.

That's why we had the whole discussion about alternatives.

I'm still of the opinion that we should probably *try* this simple
approach, but I really would hope we could have some test tree that
people run a lot of benchmarks on.

Does anybody do filesystem benchmarking on the mm tree? Or do we have
some good tree to just give it a good testing? It feels a bit
excessive to put it in my development tree just to get some
performance testing coverage, but maybe that's what we have to do..

                 Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH] vfs: Delete the associated dentry when deleting a file
  2024-05-15  2:36       ` Linus Torvalds
@ 2024-05-15  9:17         ` Yafang Shao
  2024-05-15 16:05           ` Linus Torvalds
  2024-05-22  8:11           ` kernel test robot
  0 siblings, 2 replies; 43+ messages in thread
From: Yafang Shao @ 2024-05-15  9:17 UTC (permalink / raw)
  To: torvalds
  Cc: brauner, jack, laoar.shao, linux-fsdevel, longman, viro, walters,
	wangkai86, linux-mm, linux-kernel, Matthew Wilcox

Our applications, built on Elasticsearch[0], frequently create and delete
files. These applications operate within containers, some with a memory
limit exceeding 100GB. Over prolonged periods, the accumulation of negative
dentries within these containers can amount to tens of gigabytes.

Upon container exit, directories are deleted. However, due to the numerous
associated dentries, this process can be time-consuming. Our users have
expressed frustration with this prolonged exit duration, which constitutes
our first issue.

Simultaneously, other processes may attempt to access the parent directory
of the Elasticsearch directories. Since the task responsible for deleting
the dentries holds the inode lock, processes attempting directory lookup
experience significant delays. This issue, our second problem, is easily
demonstrated:

  - Task 1 generates negative dentries:
  $ pwd
  ~/test
  $ mkdir es && cd es/ && ./create_and_delete_files.sh

  [ After generating tens of GB dentries ]

  $ cd ~/test && rm -rf es

  [ It will take a long duration to finish ]

  - Task 2 attempts to lookup the 'test/' directory
  $ pwd
  ~/test
  $ ls

  The 'ls' command in Task 2 experiences prolonged execution as Task 1
  is deleting the dentries.

We've devised a solution to address both issues by deleting associated
dentry when removing a file. Interestingly, we've noted that a similar
patch was proposed years ago[1], although it was rejected citing the
absence of tangible issues caused by negative dentries. Given our current
challenges, we're resubmitting the proposal. All relevant stakeholders from
previous discussions have been included for reference.

Some alternative solutions are also under discussion[2][3], such as
shrinking child dentries outside of the parent inode lock or even
asynchronously shrinking child dentries. However, given the straightforward
nature of the current solution, I believe this approach is still necessary.

[0]. https://github.com/elastic/elasticsearch
[1]. https://patchwork.kernel.org/project/linux-fsdevel/patch/1502099673-31620-1-git-send-email-wangkai86@huawei.com
[2]. https://lore.kernel.org/linux-fsdevel/20240511200240.6354-2-torvalds@linux-foundation.org/
[3]. https://lore.kernel.org/linux-fsdevel/CAHk-=wjEMf8Du4UFzxuToGDnF3yLaMcrYeyNAaH1NJWa6fwcNQ@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Wangkai <wangkai86@huawei.com>
Cc: Colin Walters <walters@verbum.org>
---
 fs/dcache.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 71a8e943a0fa..2ffdb98e9166 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2360,19 +2360,17 @@ EXPORT_SYMBOL(d_hash_and_lookup);
  * - unhash this dentry and free it.
  *
  * Usually, we want to just turn this into
- * a negative dentry, but if anybody else is
- * currently using the dentry or the inode
- * we can't do that and we fall back on removing
- * it from the hash queues and waiting for
- * it to be deleted later when it has no users
+ * a negative dentry, but certain workloads can
+ * generate a large number of negative dentries.
+ * Therefore, it would be better to simply
+ * unhash it.
  */
  
 /**
  * d_delete - delete a dentry
  * @dentry: The dentry to delete
  *
- * Turn the dentry into a negative dentry if possible, otherwise
- * remove it from the hash queues so it can be deleted later
+ * Remove the dentry from the hash queues so it can be deleted later.
  */
  
 void d_delete(struct dentry * dentry)
@@ -2381,14 +2379,14 @@ void d_delete(struct dentry * dentry)
 
 	spin_lock(&inode->i_lock);
 	spin_lock(&dentry->d_lock);
+	__d_drop(dentry);
+
 	/*
 	 * Are we the only user?
 	 */
 	if (dentry->d_lockref.count == 1) {
-		dentry->d_flags &= ~DCACHE_CANT_MOUNT;
 		dentry_unlink_inode(dentry);
 	} else {
-		__d_drop(dentry);
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&inode->i_lock);
 	}
-- 
2.30.1 (Apple Git-130)


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] vfs: Delete the associated dentry when deleting a file
  2024-05-15  9:17         ` [PATCH] vfs: " Yafang Shao
@ 2024-05-15 16:05           ` Linus Torvalds
  2024-05-16 13:44             ` Oliver Sang
  2024-05-22  8:51             ` Oliver Sang
  2024-05-22  8:11           ` kernel test robot
  1 sibling, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-05-15 16:05 UTC (permalink / raw)
  To: Yafang Shao, kernel test robot
  Cc: brauner, jack, linux-fsdevel, longman, viro, walters, wangkai86,
	linux-mm, linux-kernel, Matthew Wilcox

Oliver,
 is there any chance you could run this through the test robot
performance suite? The original full patch at

    https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/

and it would be interesting if the test robot could see if the patch
makes any difference on any other loads?

Thanks,
                     Linus

On Wed, 15 May 2024 at 02:17, Yafang Shao <laoar.shao@gmail.com> wrote:
>
> Our applications, built on Elasticsearch[0], frequently create and delete
> files. These applications operate within containers, some with a memory
> limit exceeding 100GB. Over prolonged periods, the accumulation of negative
> dentries within these containers can amount to tens of gigabytes.
>
> Upon container exit, directories are deleted. However, due to the numerous
> associated dentries, this process can be time-consuming. Our users have
> expressed frustration with this prolonged exit duration, which constitutes
> our first issue.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] vfs: Delete the associated dentry when deleting a file
  2024-05-15 16:05           ` Linus Torvalds
@ 2024-05-16 13:44             ` Oliver Sang
  2024-05-22  8:51             ` Oliver Sang
  1 sibling, 0 replies; 43+ messages in thread
From: Oliver Sang @ 2024-05-16 13:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yafang Shao, brauner, jack, linux-fsdevel, longman, viro, walters,
	wangkai86, linux-mm, linux-kernel, Matthew Wilcox, oliver.sang

hi, Linus,

On Wed, May 15, 2024 at 09:05:24AM -0700, Linus Torvalds wrote:
> Oliver,
>  is there any chance you could run this through the test robot
> performance suite? The original full patch at
> 
>     https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/
> 
> and it would be interesting if the test robot could see if the patch
> makes any difference on any other loads?

Sure, it's our great pleasure. Thanks a lot for involving us.

The tests have been started, but due to resource constraints it may take
2-3 days to run enough of them.

will keep you guys updated, Thanks!

> 
> Thanks,
>                      Linus
> 
> On Wed, 15 May 2024 at 02:17, Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Our applications, built on Elasticsearch[0], frequently create and delete
> > files. These applications operate within containers, some with a memory
> > limit exceeding 100GB. Over prolonged periods, the accumulation of negative
> > dentries within these containers can amount to tens of gigabytes.
> >
> > Upon container exit, directories are deleted. However, due to the numerous
> > associated dentries, this process can be time-consuming. Our users have
> > expressed frustration with this prolonged exit duration, which constitutes
> > our first issue.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] vfs: Delete the associated dentry when deleting a file
  2024-05-15  9:17         ` [PATCH] vfs: " Yafang Shao
  2024-05-15 16:05           ` Linus Torvalds
@ 2024-05-22  8:11           ` kernel test robot
  2024-05-22 16:00             ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: kernel test robot @ 2024-05-22  8:11 UTC (permalink / raw)
  To: Yafang Shao
  Cc: oe-lkp, lkp, Linus Torvalds, Al Viro, Christian Brauner, Jan Kara,
	Waiman Long, Matthew Wilcox, Wangkai, Colin Walters,
	linux-fsdevel, ying.huang, feng.tang, fengwei.yin, laoar.shao,
	linux-mm, linux-kernel, oliver.sang


Hello,

kernel test robot noticed a 6.7% improvement of stress-ng.touch.ops_per_sec on:


commit: 3681ce364442ce2ec7c7fbe90ad0aca651db475b ("[PATCH] vfs: Delete the associated dentry when deleting a file")
url: https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/vfs-Delete-the-associated-dentry-when-deleting-a-file/20240516-144340
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 3c999d1ae3c75991902a1a7dad0cb62c2a3008b4
patch link: https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/
patch subject: [PATCH] vfs: Delete the associated dentry when deleting a file

testcase: stress-ng
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

	nr_threads: 100%
	disk: 1HDD
	testtime: 60s
	fs: ext4
	test: touch
	cpufreq_governor: performance



Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240522/202405221518.ecea2810-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-13/performance/1HDD/ext4/x86_64-rhel-8.3/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/touch/stress-ng/60s

commit: 
  3c999d1ae3 ("Merge tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")
  3681ce3644 ("vfs: Delete the associated dentry when deleting a file")

3c999d1ae3c75991 3681ce364442ce2ec7c7fbe90ad 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
 9.884e+08           -49.1%  5.027e+08 ±  4%  cpuidle..time
     23.46 ±  2%     -38.3%      14.47 ±  5%  iostat.cpu.idle
     75.71           +11.7%      84.60        iostat.cpu.system
    620100 ±  5%     -24.0%     471157 ±  6%  numa-numastat.node0.local_node
    650333 ±  3%     -22.3%     505122 ±  5%  numa-numastat.node0.numa_hit
    736690 ±  3%     -21.7%     577182 ±  4%  numa-numastat.node1.local_node
    772759 ±  3%     -21.1%     609537 ±  3%  numa-numastat.node1.numa_hit
     23.42           -38.1%      14.50 ±  5%  vmstat.cpu.id
     54.39 ±  3%     +13.4%      61.67 ±  2%  vmstat.procs.r
     75507 ±  6%     +24.6%      94113 ±  5%  vmstat.system.cs
    177863           +14.2%     203176        vmstat.system.in
     21.30 ±  2%      -9.5       11.76 ±  7%  mpstat.cpu.all.idle%
      0.37 ±  2%      +0.1        0.42 ±  2%  mpstat.cpu.all.irq%
      0.14            -0.0        0.11 ±  2%  mpstat.cpu.all.soft%
     77.37            +9.4       86.79        mpstat.cpu.all.sys%
      0.81            +0.1        0.92 ±  2%  mpstat.cpu.all.usr%
     68.24           -11.5%      60.40        stress-ng.time.elapsed_time
     68.24           -11.5%      60.40        stress-ng.time.elapsed_time.max
    175478            +5.3%     184765        stress-ng.time.involuntary_context_switches
      5038           +12.4%       5663        stress-ng.time.percent_of_cpu_this_job_got
    259551 ±  2%      +6.7%     277010        stress-ng.touch.ops_per_sec
    454143 ±  2%      -9.2%     412504 ±  2%  meminfo.Active
    445748 ±  2%      -9.3%     404154 ±  2%  meminfo.Active(anon)
   1714332 ±  2%     -92.8%     123761 ±  2%  meminfo.KReclaimable
   7111721           -23.7%    5423125        meminfo.Memused
   1714332 ±  2%     -92.8%     123761 ±  2%  meminfo.SReclaimable
    291963           -35.6%     188122        meminfo.SUnreclaim
   2006296           -84.5%     311883        meminfo.Slab
   7289014           -24.0%    5539852        meminfo.max_used_kB
    875347 ±  3%     -91.3%      75880 ± 30%  numa-meminfo.node0.KReclaimable
    875347 ±  3%     -91.3%      75880 ± 30%  numa-meminfo.node0.SReclaimable
    161663 ±  7%     -31.3%     111087 ± 12%  numa-meminfo.node0.SUnreclaim
   1037010 ±  3%     -82.0%     186968 ± 16%  numa-meminfo.node0.Slab
    838678 ±  3%     -94.3%      48016 ± 50%  numa-meminfo.node1.KReclaimable
    838678 ±  3%     -94.3%      48016 ± 50%  numa-meminfo.node1.SReclaimable
    130259 ±  8%     -40.8%      77055 ± 18%  numa-meminfo.node1.SUnreclaim
    968937 ±  3%     -87.1%     125072 ± 25%  numa-meminfo.node1.Slab
    218872 ±  3%     -91.3%      18961 ± 31%  numa-vmstat.node0.nr_slab_reclaimable
     40425 ±  7%     -31.3%      27772 ± 12%  numa-vmstat.node0.nr_slab_unreclaimable
    649973 ±  3%     -22.3%     504715 ±  5%  numa-vmstat.node0.numa_hit
    619740 ±  5%     -24.0%     470750 ±  6%  numa-vmstat.node0.numa_local
    209697 ±  3%     -94.3%      12019 ± 50%  numa-vmstat.node1.nr_slab_reclaimable
     32567 ±  8%     -40.8%      19264 ± 18%  numa-vmstat.node1.nr_slab_unreclaimable
    771696 ±  3%     -21.1%     608506 ±  3%  numa-vmstat.node1.numa_hit
    735626 ±  3%     -21.7%     576154 ±  4%  numa-vmstat.node1.numa_local
    111469 ±  2%      -9.3%     101103 ±  2%  proc-vmstat.nr_active_anon
    446.10           -36.6%     283.05        proc-vmstat.nr_dirtied
     18812            +2.5%      19287        proc-vmstat.nr_kernel_stack
    428165 ±  2%     -92.8%      30970 ±  2%  proc-vmstat.nr_slab_reclaimable
     72954           -35.5%      47032        proc-vmstat.nr_slab_unreclaimable
    111469 ±  2%      -9.3%     101103 ±  2%  proc-vmstat.nr_zone_active_anon
   1424982           -21.7%    1116423        proc-vmstat.numa_hit
   1358679           -22.7%    1050103 ±  2%  proc-vmstat.numa_local
   4261989 ±  3%     -14.9%    3625822 ±  3%  proc-vmstat.pgalloc_normal
    399826            -4.6%     381396        proc-vmstat.pgfault
   3996977 ±  3%     -15.9%    3359964 ±  3%  proc-vmstat.pgfree
     13500            -5.9%      12708        proc-vmstat.pgpgout
      0.13 ±  3%      -0.0        0.08 ±  4%  perf-profile.children.cycles-pp.d_lookup
      0.12 ±  4%      -0.0        0.08        perf-profile.children.cycles-pp.__d_lookup
      0.11 ±  2%      +0.0        0.13 ±  3%  perf-profile.children.cycles-pp.irq_exit_rcu
      0.06 ±  7%      +0.0        0.09 ±  2%  perf-profile.children.cycles-pp.__slab_free
      0.02 ±100%      +0.0        0.06 ±  8%  perf-profile.children.cycles-pp.__memcg_slab_free_hook
      0.07 ±  7%      +0.0        0.12 ±  4%  perf-profile.children.cycles-pp.run_ksoftirqd
      0.09 ±  5%      +0.0        0.14 ±  4%  perf-profile.children.cycles-pp.smpboot_thread_fn
      0.18 ±  3%      +0.1        0.24 ±  3%  perf-profile.children.cycles-pp.kmem_cache_free
      0.10 ±  7%      +0.1        0.16 ±  7%  perf-profile.children.cycles-pp.kthread
      0.10 ±  7%      +0.1        0.16 ±  7%  perf-profile.children.cycles-pp.ret_from_fork
      0.10 ±  7%      +0.1        0.16 ±  7%  perf-profile.children.cycles-pp.ret_from_fork_asm
      0.14 ±  4%      +0.1        0.20 ±  3%  perf-profile.children.cycles-pp.rcu_do_batch
      0.15 ±  3%      +0.1        0.21 ±  2%  perf-profile.children.cycles-pp.rcu_core
      0.18 ±  3%      +0.1        0.24 ±  2%  perf-profile.children.cycles-pp.handle_softirqs
      0.00            +0.1        0.06 ±  7%  perf-profile.children.cycles-pp.list_lru_del
      0.82 ±  2%      +0.1        0.89 ±  3%  perf-profile.children.cycles-pp._raw_spin_lock
      0.01 ±238%      +0.1        0.08 ±  4%  perf-profile.children.cycles-pp.__call_rcu_common
      0.00            +0.1        0.08 ±  3%  perf-profile.children.cycles-pp.d_lru_del
      0.00            +0.2        0.17 ±  3%  perf-profile.children.cycles-pp.__dentry_kill
      0.33 ± 11%      +0.2        0.50 ±  6%  perf-profile.children.cycles-pp.dput
      0.09 ±  4%      -0.0        0.05 ± 22%  perf-profile.self.cycles-pp.__d_lookup
      0.06 ±  5%      +0.0        0.09 ±  5%  perf-profile.self.cycles-pp.__slab_free
      0.74 ±  2%      +0.1        0.79        perf-profile.self.cycles-pp._raw_spin_lock
      2.55 ±  4%     -40.6%       1.51 ±  5%  perf-stat.i.MPKI
 1.219e+10           +12.8%  1.376e+10        perf-stat.i.branch-instructions
      0.27 ±  2%      -0.1        0.21 ±  2%  perf-stat.i.branch-miss-rate%
  22693772           +14.3%   25941855        perf-stat.i.branch-misses
     39.43            -4.6       34.83        perf-stat.i.cache-miss-rate%
  95834044 ±  3%     +10.9%  1.063e+08 ±  4%  perf-stat.i.cache-misses
 2.658e+08 ±  2%     +14.8%  3.051e+08 ±  4%  perf-stat.i.cache-references
     77622 ±  6%     +25.5%      97395 ±  5%  perf-stat.i.context-switches
      2.82            +3.3%       2.91        perf-stat.i.cpi
 1.821e+11           +12.5%  2.049e+11        perf-stat.i.cpu-cycles
      9159 ±  3%     +22.9%      11257 ±  3%  perf-stat.i.cpu-migrations
 6.223e+10           +12.8%  7.019e+10        perf-stat.i.instructions
      0.36            -4.8%       0.34        perf-stat.i.ipc
      1.19 ±  9%     +27.6%       1.51 ±  5%  perf-stat.i.metric.K/sec
      4716            +6.3%       5011        perf-stat.i.minor-faults
      4716            +6.3%       5011        perf-stat.i.page-faults
     35.93            -1.2       34.77        perf-stat.overall.cache-miss-rate%
 1.214e+10           +11.5%  1.353e+10        perf-stat.ps.branch-instructions
  22276184           +13.3%   25240906        perf-stat.ps.branch-misses
  95133798 ±  3%      +9.8%  1.045e+08 ±  4%  perf-stat.ps.cache-misses
 2.648e+08 ±  2%     +13.5%  3.004e+08 ±  4%  perf-stat.ps.cache-references
     77612 ±  6%     +23.8%      96101 ±  5%  perf-stat.ps.context-switches
 1.813e+11           +11.1%  2.015e+11        perf-stat.ps.cpu-cycles
      9093 ±  3%     +21.5%      11047 ±  2%  perf-stat.ps.cpu-migrations
 6.192e+10           +11.5%  6.901e+10        perf-stat.ps.instructions
      4639            +5.6%       4899        perf-stat.ps.minor-faults
      4640            +5.6%       4899        perf-stat.ps.page-faults




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] vfs: Delete the associated dentry when deleting a file
  2024-05-15 16:05           ` Linus Torvalds
  2024-05-16 13:44             ` Oliver Sang
@ 2024-05-22  8:51             ` Oliver Sang
  2024-05-23  2:21               ` Yafang Shao
  1 sibling, 1 reply; 43+ messages in thread
From: Oliver Sang @ 2024-05-22  8:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yafang Shao, brauner, jack, linux-fsdevel, longman, viro, walters,
	wangkai86, linux-mm, linux-kernel, Matthew Wilcox, ying.huang,
	feng.tang, fengwei.yin, philip.li, yujie.liu, oliver.sang


hi, Linus, hi, Yafang Shao,


On Wed, May 15, 2024 at 09:05:24AM -0700, Linus Torvalds wrote:
> Oliver,
>  is there any chance you could run this through the test robot
> performance suite? The original full patch at
> 
>     https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/
> 
> and it would be interesting if the test robot could see if the patch
> makes any difference on any other loads?
> 

we just reported a stress-ng performance improvement by this patch [1]

test robot applied this patch upon
  3c999d1ae3 ("Merge tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")

filesystem is not our team's major domain, so we just made some limited review
of the results, and decided to send out the report FYI.

At the first stage, we decided to check the below categories of tests as a priority:

stress-ng filesystem
filebench mailserver
reaim fileserver

we also pick sysbench-fileio, blogbench into coverage.

here is a summary.

For stress-ng, besides [1] which was already reported, we got the below data
comparing this patch to 3c999d1ae3.

Either there is no significant performance change, or the change is smaller
than the noise (which would make the test robot's bisect fail), so this
information is just FYI. If you have any doubts about any of the subtests,
could you let us know so we could check further?

(also included some net test results)

      12.87 ±  6%      -0.6%      12.79        stress-ng.xattr.ops_per_sec
       6721 ±  5%      +7.5%       7224 ± 27%  stress-ng.rawdev.ops_per_sec
       9002 ±  7%      -8.7%       8217        stress-ng.dirmany.ops_per_sec
    8594743 ±  4%      -3.0%    8337417        stress-ng.rawsock.ops_per_sec
       2056 ±  3%      +2.9%       2116        stress-ng.dirdeep.ops_per_sec
       4307 ± 21%      -6.9%       4009        stress-ng.dir.ops_per_sec
     137946 ± 18%      +5.8%     145942        stress-ng.fiemap.ops_per_sec
   22413006 ±  2%      +2.5%   22982512 ±  2%  stress-ng.sockdiag.ops_per_sec
     286714 ±  2%      -3.8%     275876 ±  5%  stress-ng.udp-flood.ops_per_sec
      82904 ± 46%     -31.6%      56716        stress-ng.sctp.ops_per_sec
    9853408            -0.3%    9826387        stress-ng.ping-sock.ops_per_sec
      84667 ± 12%     -26.7%      62050 ± 17%  stress-ng.dccp.ops_per_sec
      61750 ± 25%     -24.2%      46821 ± 38%  stress-ng.open.ops_per_sec
     583443 ±  3%      -3.4%     563822        stress-ng.file-ioctl.ops_per_sec
      11919 ± 28%     -34.3%       7833        stress-ng.dentry.ops_per_sec
      18.59 ± 12%     -23.9%      14.15 ± 27%  stress-ng.swap.ops_per_sec
     246.37 ±  2%     +15.9%     285.58 ± 12%  stress-ng.aiol.ops_per_sec
       7.45            -4.8%       7.10 ±  7%  stress-ng.fallocate.ops_per_sec
     207.97 ±  7%      +5.2%     218.70        stress-ng.copy-file.ops_per_sec
      69.87 ±  7%      +5.8%      73.93 ±  5%  stress-ng.fpunch.ops_per_sec
       0.25 ± 21%     +24.0%       0.31        stress-ng.inode-flags.ops_per_sec
     849.35 ±  6%      +1.4%     861.51        stress-ng.mknod.ops_per_sec
     926144 ±  4%      -5.2%     877558        stress-ng.lease.ops_per_sec
      82924            -2.1%      81220        stress-ng.fcntl.ops_per_sec
       6.19 ±124%     -50.7%       3.05        stress-ng.chattr.ops_per_sec
     676.90 ±  4%      -1.9%     663.94 ±  5%  stress-ng.iomix.ops_per_sec
       0.93 ±  6%      +5.6%       0.98 ±  7%  stress-ng.symlink.ops_per_sec
    1703608            -3.8%    1639057 ±  3%  stress-ng.eventfd.ops_per_sec
    1735861            -0.6%    1726072        stress-ng.sockpair.ops_per_sec
      85440            -2.0%      83705        stress-ng.rawudp.ops_per_sec
       6198            +0.6%       6236        stress-ng.sockabuse.ops_per_sec
      39226            +0.0%      39234        stress-ng.sock.ops_per_sec
       1358            +0.3%       1363        stress-ng.tun.ops_per_sec
    9794021            -1.7%    9623340        stress-ng.icmp-flood.ops_per_sec
    1324728            +0.3%    1328244        stress-ng.epoll.ops_per_sec
     146150            -2.0%     143231        stress-ng.rawpkt.ops_per_sec
    6381112            -0.4%    6352696        stress-ng.udp.ops_per_sec
    1234258            +0.2%    1236738        stress-ng.sockfd.ops_per_sec
      23954            -0.1%      23932        stress-ng.sockmany.ops_per_sec
     257030            -0.1%     256860        stress-ng.netdev.ops_per_sec
    6337097            +0.1%    6341130        stress-ng.flock.ops_per_sec
     173212            -0.3%     172728        stress-ng.rename.ops_per_sec
     199.69            +0.6%     200.82        stress-ng.sync-file.ops_per_sec
     606.57            +0.8%     611.53        stress-ng.chown.ops_per_sec
     183549            -0.9%     181975        stress-ng.handle.ops_per_sec
       1299            +0.0%       1299        stress-ng.hdd.ops_per_sec
   98371066            +0.2%   98571113        stress-ng.lockofd.ops_per_sec
      25.49            -4.3%      24.39        stress-ng.ioprio.ops_per_sec
   96745191            -1.5%   95333632        stress-ng.locka.ops_per_sec
     582.35            +0.1%     582.86        stress-ng.chmod.ops_per_sec
    2075897            -2.2%    2029552        stress-ng.getdent.ops_per_sec
      60.47            -1.9%      59.34        stress-ng.metamix.ops_per_sec
      14161            -0.3%      14123        stress-ng.io.ops_per_sec
      23.98            -1.5%      23.61        stress-ng.link.ops_per_sec
      27514            +0.0%      27528        stress-ng.filename.ops_per_sec
      44955            +1.6%      45678        stress-ng.dnotify.ops_per_sec
     160.94            +0.4%     161.51        stress-ng.inotify.ops_per_sec
    2452224            +4.0%    2549607        stress-ng.lockf.ops_per_sec
       6761            +0.3%       6779        stress-ng.fsize.ops_per_sec
     775083            -1.5%     763487        stress-ng.fanotify.ops_per_sec
     309124            -4.2%     296285        stress-ng.utime.ops_per_sec
      25567            -0.1%      25530        stress-ng.dup.ops_per_sec
       1858            +0.9%       1876        stress-ng.procfs.ops_per_sec
     105804            -3.9%     101658        stress-ng.access.ops_per_sec
       1.04            -1.9%       1.02        stress-ng.chdir.ops_per_sec
      82753            -0.3%      82480        stress-ng.fstat.ops_per_sec
     681128            +3.7%     706375        stress-ng.acl.ops_per_sec
      11892            -0.1%      11875        stress-ng.bind-mount.ops_per_sec


For filebench, the results are similar, but the data is less noisy than with
stress-ng, so for most of the subtests we regard them as not being impacted
by this patch.

For reaim/sysbench-fileio/blogbench, the data is quite stable and we didn't
notice any significant performance changes; we even doubt whether they
exercise the code path changed by this patch.

So for these, we don't list the full results here.

BTW, besides filesystem tests, this patch is also piped into other performance
test categories such as net, scheduler, mm and others, and since it also goes
into our so-called hourly kernels, it will be run through the full set of other
performance test suites the test robot supports. So in the following 2-3 weeks
it's still possible for us to report other results, including regressions.

thanks

[1] https://lore.kernel.org/all/202405221518.ecea2810-oliver.sang@intel.com/


> Thanks,
>                      Linus
> 
> On Wed, 15 May 2024 at 02:17, Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Our applications, built on Elasticsearch[0], frequently create and delete
> > files. These applications operate within containers, some with a memory
> > limit exceeding 100GB. Over prolonged periods, the accumulation of negative
> > dentries within these containers can amount to tens of gigabytes.
> >
> > Upon container exit, directories are deleted. However, due to the numerous
> > associated dentries, this process can be time-consuming. Our users have
> > expressed frustration with this prolonged exit duration, which constitutes
> > our first issue.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] vfs: Delete the associated dentry when deleting a file
  2024-05-22  8:11           ` kernel test robot
@ 2024-05-22 16:00             ` Linus Torvalds
  2024-05-22 17:13               ` Matthew Wilcox
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-05-22 16:00 UTC (permalink / raw)
  To: kernel test robot
  Cc: Yafang Shao, oe-lkp, lkp, Al Viro, Christian Brauner, Jan Kara,
	Waiman Long, Matthew Wilcox, Wangkai, Colin Walters,
	linux-fsdevel, ying.huang, feng.tang, fengwei.yin, linux-mm,
	linux-kernel

On Wed, 22 May 2024 at 01:12, kernel test robot <oliver.sang@intel.com> wrote:
>
> kernel test robot noticed a 6.7% improvement of stress-ng.touch.ops_per_sec on:
>
> commit: 3681ce364442ce2ec7c7fbe90ad0aca651db475b ("[PATCH] vfs: Delete the associated dentry when deleting a file")

Ok, since everything else is at least tentatively in the noise, and
the only hard numbers we have are the ones from Yafang's Elasticsearch
load and this - both of which say that this is a good patch - I
decided to just apply this ASAP just to get more testing.

I just wanted to note very explicitly that this is very much
tentative: this will be reverted very aggressively if somebody reports
some other real-world load performance issues, and we'll have to look
at other solutions. But I just don't think we'll get much more actual
testing of this without just saying "let's try it".

Also, I ended up removing the part of the patch that stopped clearing
the DCACHE_CANT_MOUNT bit. I think it's right, but it's really
unrelated to the actual problem at hand, and there are other cleanups
- like the unnecessary dget/dput pair - in this path that could also
be looked at.

Anyway, let's see if somebody notices any issues with this. And I
think we should look at the "shrink dentries" case anyway for other
reasons, since it's easy to create a ton of negative dentries with
just lots of lookups (rather than lots of unlinking of existing
files).

Of course, if you do billions of lookups of different files that do
not exist in the same directory, I suspect you just have yourself to
blame, so the "lots of negative lookups" load doesn't sound
particularly realistic.

TLDR; I applied it for testing because we're in the merge window and
there's no reason not to try.

                 Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] vfs: Delete the associated dentry when deleting a file
  2024-05-22 16:00             ` Linus Torvalds
@ 2024-05-22 17:13               ` Matthew Wilcox
  2024-05-22 18:11                 ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2024-05-22 17:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, Yafang Shao, oe-lkp, lkp, Al Viro,
	Christian Brauner, Jan Kara, Waiman Long, Wangkai, Colin Walters,
	linux-fsdevel, ying.huang, feng.tang, fengwei.yin, linux-mm,
	linux-kernel

On Wed, May 22, 2024 at 09:00:03AM -0700, Linus Torvalds wrote:
> Of course, if you do billions of lookups of different files that do
> not exist in the same directory, I suspect you just have yourself to
> blame, so the "lots of negative lookups" load doesn't sound
> particularly realistic.

Oh no.  We have real customers that this hits and it's not even stupid.

Every time an "event" happens, they look up something like a hash in three
different directories (like $PATH or multiple -I flags to the compiler).
Order is important, so they can't just look it up in the directory that
it's most likely to exist in.  It usually fails to exist in directory A
and B, so we create dentries that say it doesn't.  And those dentries are
literally never referenced again.  Then directory C has the file they're
looking for (or it doesn't and it gets created because the process has
write access to C and not A or B).
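
In rough pseudo-C, each event boils down to something like this (the
directory and function names are purely illustrative):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>

/* Probe the search path in priority order; order matters, so we cannot
 * go straight to the directory where the file most likely lives. */
static const char *search_dirs[] = { "dirA", "dirB", "dirC" };

int open_in_search_path(const char *name)
{
	char path[PATH_MAX];
	int fd;

	for (unsigned int i = 0; i < 3; i++) {
		snprintf(path, sizeof(path), "%s/%s", search_dirs[i], name);
		fd = open(path, O_RDONLY);
		if (fd >= 0)
			return fd;	/* usually only dirC gets here */
		/* each ENOENT above leaves a negative dentry behind that
		 * will likely never be looked up again */
	}
	return -1;
}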

plan9 handles this so much better because it has that union-mount stuff
instead of search paths.  So it creates one dentry that tells it which of
those directories it actually exists in.  But we're stuck with unix-style
search paths, so life is pain.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] vfs: Delete the associated dentry when deleting a file
  2024-05-22 17:13               ` Matthew Wilcox
@ 2024-05-22 18:11                 ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-05-22 18:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: kernel test robot, Yafang Shao, oe-lkp, lkp, Al Viro,
	Christian Brauner, Jan Kara, Waiman Long, Wangkai, Colin Walters,
	linux-fsdevel, ying.huang, feng.tang, fengwei.yin, linux-mm,
	linux-kernel

On Wed, 22 May 2024 at 10:14, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, May 22, 2024 at 09:00:03AM -0700, Linus Torvalds wrote:
> > Of course, if you do billions of lookups of different files that do
> > not exist in the same directory, I suspect you just have yourself to
> > blame, so the "lots of negative lookups" load doesn't sound
> > particularly realistic.
>
> Oh no.  We have real customers that this hits and it's not even stupid.

Oh, negative dentries exist, and yes, they are a major feature on some
loads. Exactly because of the kinds of situations you describe.

In fact, that's the _point_. Things like PATH lookup require negative
dentries for good performance, because otherwise all the missed cases
would force a lookup all the way out to the filesystem.

So having thousands of negative dentries is normal and expected. And
it will grow for bigger machines and loads, of course.
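
[ A crude way to see that effect from user space (hypothetical path, rough
  timing only): after the first failed lookup the miss is cached as a
  negative dentry, so the repeats are answered from the dcache instead of
  going all the way out to the filesystem. ]

  #include <stdio.h>
  #include <sys/stat.h>
  #include <time.h>

  int main(void)
  {
          struct stat st;
          struct timespec a, b;
          int i;

          clock_gettime(CLOCK_MONOTONIC, &a);
          for (i = 0; i < 1000000; i++)
                  stat("/usr/lib/no-such-file", &st);     /* always fails with ENOENT */
          clock_gettime(CLOCK_MONOTONIC, &b);

          printf("1M failed lookups: %.3fs\n",
                 (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
          return 0;
  }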

That said, I don't think we've had people on real loads complain about
them being in the hundreds of millions like Yafang's case.

> plan9 handles this so much better because it has that union-mount stuff
> instead of search paths.  So it creates one dentry that tells it which of
> those directories it actually exists in.  But we're stuck with unix-style
> search paths, so life is pain.

I suspect you are just not as aware of the downsides of the plan9 model.

People tend to think plan9 was great. It had huge and fundamental
design mistakes.

Union mounts à la plan9 aren't hugely wonderful, and there's a reason
overlayfs does things differently (not that that is hugely wonderful
either).

             Linus

* Re: [PATCH] vfs: Delete the associated dentry when deleting a file
  2024-05-22  8:51             ` Oliver Sang
@ 2024-05-23  2:21               ` Yafang Shao
  0 siblings, 0 replies; 43+ messages in thread
From: Yafang Shao @ 2024-05-23  2:21 UTC (permalink / raw)
  To: Oliver Sang
  Cc: Linus Torvalds, brauner, jack, linux-fsdevel, longman, viro,
	walters, wangkai86, linux-mm, linux-kernel, Matthew Wilcox,
	ying.huang, feng.tang, fengwei.yin, philip.li, yujie.liu

On Wed, May 22, 2024 at 4:51 PM Oliver Sang <oliver.sang@intel.com> wrote:
>
>
> hi, Linus, hi, Yafang Shao,
>
>
> On Wed, May 15, 2024 at 09:05:24AM -0700, Linus Torvalds wrote:
> > Oliver,
> >  is there any chance you could run this through the test robot
> > performance suite? The original full patch at
> >
> >     https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/
> >
> > and it would be interesting if the test robot could see if the patch
> > makes any difference on any other loads?
> >
>
> we just reported a stress-ng performance improvement by this patch [1]

Awesome!

>
> the test robot applied this patch on top of
>   3c999d1ae3 ("Merge tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")
>
> filesystem is not our team's major domain, so we only did a limited review
> of the results, and decided to send out the report FYI.
>
> at the first stage, we decided to check the below categories of tests as a priority:
>
> stress-ng filesystem
> filebench mailserver
> reaim fileserver
>
> we also picked sysbench-fileio and blogbench into the coverage.
>
> here is a summary.
>
> for stress-ng, besides [1] which was already reported, we got the below
> data comparing this patch to 3c999d1ae3.
>
> either there is no significant performance change, or the change is smaller
> than the noise, which would make the test robot's bisect fail, so this
> information is just FYI. If you have any doubt about any of the subtests,
> could you let us know so that we can check further?
>
> (also included some net test results)
>
>       12.87 ±  6%      -0.6%      12.79        stress-ng.xattr.ops_per_sec
>        6721 ±  5%      +7.5%       7224 ± 27%  stress-ng.rawdev.ops_per_sec
>        9002 ±  7%      -8.7%       8217        stress-ng.dirmany.ops_per_sec
>     8594743 ±  4%      -3.0%    8337417        stress-ng.rawsock.ops_per_sec
>        2056 ±  3%      +2.9%       2116        stress-ng.dirdeep.ops_per_sec
>        4307 ± 21%      -6.9%       4009        stress-ng.dir.ops_per_sec
>      137946 ± 18%      +5.8%     145942        stress-ng.fiemap.ops_per_sec
>    22413006 ±  2%      +2.5%   22982512 ±  2%  stress-ng.sockdiag.ops_per_sec
>      286714 ±  2%      -3.8%     275876 ±  5%  stress-ng.udp-flood.ops_per_sec
>       82904 ± 46%     -31.6%      56716        stress-ng.sctp.ops_per_sec
>     9853408            -0.3%    9826387        stress-ng.ping-sock.ops_per_sec
>       84667 ± 12%     -26.7%      62050 ± 17%  stress-ng.dccp.ops_per_sec
>       61750 ± 25%     -24.2%      46821 ± 38%  stress-ng.open.ops_per_sec
>      583443 ±  3%      -3.4%     563822        stress-ng.file-ioctl.ops_per_sec
>       11919 ± 28%     -34.3%       7833        stress-ng.dentry.ops_per_sec
>       18.59 ± 12%     -23.9%      14.15 ± 27%  stress-ng.swap.ops_per_sec
>      246.37 ±  2%     +15.9%     285.58 ± 12%  stress-ng.aiol.ops_per_sec
>        7.45            -4.8%       7.10 ±  7%  stress-ng.fallocate.ops_per_sec
>      207.97 ±  7%      +5.2%     218.70        stress-ng.copy-file.ops_per_sec
>       69.87 ±  7%      +5.8%      73.93 ±  5%  stress-ng.fpunch.ops_per_sec
>        0.25 ± 21%     +24.0%       0.31        stress-ng.inode-flags.ops_per_sec
>      849.35 ±  6%      +1.4%     861.51        stress-ng.mknod.ops_per_sec
>      926144 ±  4%      -5.2%     877558        stress-ng.lease.ops_per_sec
>       82924            -2.1%      81220        stress-ng.fcntl.ops_per_sec
>        6.19 ±124%     -50.7%       3.05        stress-ng.chattr.ops_per_sec
>      676.90 ±  4%      -1.9%     663.94 ±  5%  stress-ng.iomix.ops_per_sec
>        0.93 ±  6%      +5.6%       0.98 ±  7%  stress-ng.symlink.ops_per_sec
>     1703608            -3.8%    1639057 ±  3%  stress-ng.eventfd.ops_per_sec
>     1735861            -0.6%    1726072        stress-ng.sockpair.ops_per_sec
>       85440            -2.0%      83705        stress-ng.rawudp.ops_per_sec
>        6198            +0.6%       6236        stress-ng.sockabuse.ops_per_sec
>       39226            +0.0%      39234        stress-ng.sock.ops_per_sec
>        1358            +0.3%       1363        stress-ng.tun.ops_per_sec
>     9794021            -1.7%    9623340        stress-ng.icmp-flood.ops_per_sec
>     1324728            +0.3%    1328244        stress-ng.epoll.ops_per_sec
>      146150            -2.0%     143231        stress-ng.rawpkt.ops_per_sec
>     6381112            -0.4%    6352696        stress-ng.udp.ops_per_sec
>     1234258            +0.2%    1236738        stress-ng.sockfd.ops_per_sec
>       23954            -0.1%      23932        stress-ng.sockmany.ops_per_sec
>      257030            -0.1%     256860        stress-ng.netdev.ops_per_sec
>     6337097            +0.1%    6341130        stress-ng.flock.ops_per_sec
>      173212            -0.3%     172728        stress-ng.rename.ops_per_sec
>      199.69            +0.6%     200.82        stress-ng.sync-file.ops_per_sec
>      606.57            +0.8%     611.53        stress-ng.chown.ops_per_sec
>      183549            -0.9%     181975        stress-ng.handle.ops_per_sec
>        1299            +0.0%       1299        stress-ng.hdd.ops_per_sec
>    98371066            +0.2%   98571113        stress-ng.lockofd.ops_per_sec
>       25.49            -4.3%      24.39        stress-ng.ioprio.ops_per_sec
>    96745191            -1.5%   95333632        stress-ng.locka.ops_per_sec
>      582.35            +0.1%     582.86        stress-ng.chmod.ops_per_sec
>     2075897            -2.2%    2029552        stress-ng.getdent.ops_per_sec
>       60.47            -1.9%      59.34        stress-ng.metamix.ops_per_sec
>       14161            -0.3%      14123        stress-ng.io.ops_per_sec
>       23.98            -1.5%      23.61        stress-ng.link.ops_per_sec
>       27514            +0.0%      27528        stress-ng.filename.ops_per_sec
>       44955            +1.6%      45678        stress-ng.dnotify.ops_per_sec
>      160.94            +0.4%     161.51        stress-ng.inotify.ops_per_sec
>     2452224            +4.0%    2549607        stress-ng.lockf.ops_per_sec
>        6761            +0.3%       6779        stress-ng.fsize.ops_per_sec
>      775083            -1.5%     763487        stress-ng.fanotify.ops_per_sec
>      309124            -4.2%     296285        stress-ng.utime.ops_per_sec
>       25567            -0.1%      25530        stress-ng.dup.ops_per_sec
>        1858            +0.9%       1876        stress-ng.procfs.ops_per_sec
>      105804            -3.9%     101658        stress-ng.access.ops_per_sec
>        1.04            -1.9%       1.02        stress-ng.chdir.ops_per_sec
>       82753            -0.3%      82480        stress-ng.fstat.ops_per_sec
>      681128            +3.7%     706375        stress-ng.acl.ops_per_sec
>       11892            -0.1%      11875        stress-ng.bind-mount.ops_per_sec
>
>
> for filebench, the results are similar, but the data is less noisy than for
> stress-ng, so for most of the subtests we regard them as not impacted by
> this patch.
>
> for reaim/sysbench-fileio/blogbench, the data are quite stable, and we didn't
> notice any significant performance changes. We even doubt whether they
> exercise the code path changed by this patch.
>
> so for these, we don't list full results here.
>
> BTW, besides filesystem tests, this patch is also piped into other performance
> test categories such as net, scheduler, mm and others, and since it also goes
> into our so-called hourly kernels, it will be run through the full set of other
> performance test suites which the test robot supports. So in the following 2-3
> weeks, it's still possible for us to report other results, including regressions.
>

That's great. Many thanks for your help.

-- 
Regards
Yafang

* Re: [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()'
  2024-05-13 16:33                               ` Al Viro
  2024-05-13 16:44                                 ` Linus Torvalds
@ 2024-05-23  7:18                                 ` Dave Chinner
  1 sibling, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2024-05-23  7:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, James Bottomley, brauner, jack, laoar.shao,
	linux-fsdevel, longman, walters, wangkai86, willy

On Mon, May 13, 2024 at 05:33:32PM +0100, Al Viro wrote:
> On Mon, May 13, 2024 at 08:58:33AM -0700, Linus Torvalds wrote:
>  
> > We *could* strive for a hybrid approach, where we handle the common
> > case ("not a ton of child dentries") differently, and just get rid of
> > them synchronously, and handle the "millions of children" case by
> > unhashing the directory and dealing with shrinking the children async.
> 
> try_to_shrink_children()?  Doable, and not even that hard to do, but
> as for shrinking async...  We can easily move it out of inode_lock
> on parent, but doing that really async would either need to be
> tied into e.g. remount r/o logics or we'd get userland regressions.
> 
> I mean, "I have an opened unlinked file, can't remount r/o" is one
> thing, but "I've done rm -rf ~luser, can't remount r/o for a while"
> when all luser's processes had been killed and nothing is holding
> any of that opened... ouch.

There is no ouch for the vast majority of users: XFS has been doing
background async inode unlink processing since 5.14 (i.e. for almost
3 years now). See commit ab23a7768739 ("xfs: per-cpu deferred inode
inactivation queues") for more of the background on this change - it
was implemented because we needed to allow the scrub code to drop
inode references from within transaction contexts, and evict()
processing could run a nested transaction which could then deadlock
the filesystem.

Hence XFS offloads the inode freeing half of the unlink operation
(i.e. the bit that happens in evict() context) to per-cpu workqueues
instead of doing the work directly in evict() context. We allow
evict() to completely tear down the VFS inode context, but don't
free it in ->destroy_inode() because we still have work to do on it.
XFS doesn't need an active VFS inode context to do the internal
metadata updates needed to free an inode, so it's trivial to defer
this work to a background context outside the VFS inode life cycle.

Hence over half the processing work of every unlink() operation on
XFS is now done in kworker threads rather than via the unlink()
syscall context.
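
[ A hand-wavy sketch of that deferral pattern, not the actual xfs_inodegc
  code: push the expensive half of the teardown to a workqueue instead of
  doing it in evict() context.  The my_* helpers are hypothetical, and a
  plain unbound workqueue stands in for XFS's per-cpu queues. ]

  #include <linux/fs.h>
  #include <linux/slab.h>
  #include <linux/workqueue.h>

  /* hypothetical fs-internal helpers */
  void my_free_inode_metadata(struct inode *inode);   /* on-disk metadata work */
  void my_free_inode_mem(struct inode *inode);        /* finally free the in-memory inode */

  struct defer_inactive {
          struct work_struct      work;
          struct inode            *inode;
  };

  static void inodegc_worker(struct work_struct *work)
  {
          struct defer_inactive *di = container_of(work, struct defer_inactive, work);

          my_free_inode_metadata(di->inode);
          my_free_inode_mem(di->inode);
          kfree(di);
  }

  /* ->destroy_inode(): the VFS state is torn down, the fs-internal work is deferred */
  static void my_destroy_inode(struct inode *inode)
  {
          struct defer_inactive *di = kmalloc(sizeof(*di), GFP_NOFS);

          if (!di) {
                  /* fall back to doing the work synchronously */
                  my_free_inode_metadata(inode);
                  my_free_inode_mem(inode);
                  return;
          }
          di->inode = inode;
          INIT_WORK(&di->work, inodegc_worker);
          queue_work(system_unbound_wq, &di->work);
  }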

Yes, that means freeze, remount r/o and unmount will block in
xfs_inodegc_stop() waiting for these async inode freeing operations
to be flushed and completed.

However, there have been very few reported issues with freeze,
remount r/o or unmount being significantly delayed - there's an
occasional report of an inode with tens of millions of extents to
free delaying an operation, but that's no change from unlink()
taking minutes to run and delaying the operation that way,
anyway....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

end of thread  [~2024-05-23  7:18 UTC | newest]

Thread overview: 43+ messages
2024-05-11  2:27 [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file Yafang Shao
2024-05-11  2:53 ` Linus Torvalds
2024-05-11  3:35   ` Yafang Shao
2024-05-11  4:54     ` Waiman Long
2024-05-11 15:58       ` Matthew Wilcox
2024-05-11 16:07         ` Linus Torvalds
2024-05-11 16:13           ` Linus Torvalds
2024-05-11 18:05             ` Linus Torvalds
2024-05-11 18:26               ` [PATCH] vfs: move dentry shrinking outside the inode lock in 'rmdir()' Linus Torvalds
2024-05-11 18:42                 ` Linus Torvalds
2024-05-11 19:28                   ` Al Viro
2024-05-11 19:55                     ` Linus Torvalds
2024-05-11 20:31                       ` Al Viro
2024-05-11 21:17                         ` Al Viro
2024-05-12 15:45                     ` James Bottomley
2024-05-12 16:16                       ` Al Viro
2024-05-12 19:59                         ` Linus Torvalds
2024-05-12 20:29                           ` Linus Torvalds
2024-05-13  5:31                           ` Al Viro
2024-05-13 15:58                             ` Linus Torvalds
2024-05-13 16:33                               ` Al Viro
2024-05-13 16:44                                 ` Linus Torvalds
2024-05-23  7:18                                 ` Dave Chinner
2024-05-11 20:02                   ` [PATCH v2] " Linus Torvalds
2024-05-12  3:06                     ` Yafang Shao
2024-05-12  3:30                       ` Al Viro
2024-05-12  3:36                         ` Yafang Shao
2024-05-11 19:24                 ` [PATCH] " Al Viro
2024-05-15  2:18     ` [RFC PATCH] fs: dcache: Delete the associated dentry when deleting a file Yafang Shao
2024-05-15  2:36       ` Linus Torvalds
2024-05-15  9:17         ` [PATCH] vfs: " Yafang Shao
2024-05-15 16:05           ` Linus Torvalds
2024-05-16 13:44             ` Oliver Sang
2024-05-22  8:51             ` Oliver Sang
2024-05-23  2:21               ` Yafang Shao
2024-05-22  8:11           ` kernel test robot
2024-05-22 16:00             ` Linus Torvalds
2024-05-22 17:13               ` Matthew Wilcox
2024-05-22 18:11                 ` Linus Torvalds
2024-05-11  3:36   ` [RFC PATCH] fs: dcache: " Al Viro
2024-05-11  3:51     ` Yafang Shao
2024-05-11  5:18     ` Al Viro
2024-05-11  5:32     ` Linus Torvalds
