[RFC PATCH 0/7] vfs: improving inode cache iteration scalability

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
@ 2024-10-02  1:33 Dave Chinner
  2024-10-02  1:33 ` [PATCH 1/7] vfs: replace invalidate_inodes() with evict_inodes() Dave Chinner
                   ` (8 more replies)
  0 siblings, 9 replies; 72+ messages in thread
From: Dave Chinner @ 2024-10-02  1:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-bcachefs, kent.overstreet, torvalds

The management of the sb->s_inodes list is a scalability limitation;
it is protected by a single lock and every inode that is
instantiated has to be added to the list. Those inodes then need to
be removed from the list when evicted from cache. Hence every inode
that moves through the VFS inode cache must take this global scope
lock twice.

This proves to be a significant limiting factor for concurrent file
access workloads that repeatedly miss the dentry cache on lookup.
Directory search and traversal workloads are particularly prone to
these issues, though on XFS we have enough concurrency capability
in file creation and unlink for the sb->s_inodes list to be a
limitation there as well.

Previous efforts to solve this problem have
largely centered around reworking the sb->s_inodes list into
something more scalable such as this longstanding patchset does:

https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@fromorbit.com/

However, a recent discussion about inode cache behaviour that arose
from the bcachefs 6.12-rc1 pull request opened a new direction for
us to explore. With both XFS and bcachefs now providing their own
per-superblock inode cache implementations, we should try to make
use of these inode caches as first class citizens.

With that new direction in mind, it became obvious that XFS could
elide the sb->s_inodes list completely - "the best part is no part"
- if iteration was not reliant on open-coded sb->s_inodes list
walks.

We already use the internal inode cache for iteration, and we have
filters for selecting specific inodes to operate on with specific
callback operations. If we had an abstraction for iterating
all VFS inodes, we can easily implement that directly on the XFS
inode cache.

This is what this patchset aims to implement.

There are two superblock iterator functions provided. The first is a
generic iterator that provides safe, reference counted inodes for
the callback to operate on. This is generally what most sb->s_inodes
iterators use, and it allows the iterator to drop locks and perform
blocking operations on the inode before moving to the next inode in
the sb->s_inodes list.

There is one quirk to this interface - INO_ITER_REFERENCE - because
fsnotify iterates the inode cache -after- evict_inodes() has been
called during superblock shutdown to evict all non-referenced
inodes. Hence it should only find referenced inodes, and it has
a check to skip unreferenced inodes. This flag does the same.

However, I suspect this is now somewhat sub-optimal because LSMs can
hold references to inodes beyond evict_inodes(), and they don't get
torn down until after fsnotify evicts the referenced inodes it
holds. However, the landlock LSM doesn't have checks for
unreferenced inodes (i.e. doesn't use INO_ITER_REFERENCE), so this
guard is not consistently applied.

I'm undecided on how best to handle this, but it does not need to be
solved for this patchset to work. fsnotify and
landlock don't need to run -after- evict_inodes(), but moving them
to before evict_inodes() mean we now do three full inode cache
iterations to evict all the inodes from the cache. That doesn't seem
like a good idea when there might be hundreds of millions of cached
inodes at unmount.

Similarly, needing the iterator to be aware that there should be no
unreferenced inodes left when they run doesn't seem like a good
idea, either. So perhaps the answer is that the iterator checks for
SB_ACTIVE (or some other similar flag) that indicates the superblock
is being torn down and so will skip zero-referenced inodes
automatically in this case. Like I said - this doesn't need to be
solved right now, it's just something to be aware of.

The second iterator is the "unsafe" iterator variant that only
provides the callback with an existence guarantee. It does this by
holding the rcu_read_lock() to guarantee that the inode is not freed
from under the callback. There are no validity checks performed on
the inode - it is entirely up to the callback to validate the inode
can be operated on safely.

Hence the "unsafe" variant is only for very specific internal uses
only. Nobody should be adding new uses of this function without
as there are very few good reasons for external access to inodes
without holding a valid reference. I have not decided whether the
unsafe callbacks should require a lockdep_assert_in_rcu_read_lock()
check in them to clearly document the context under which they are
running.

The patchset converts all the open coded iterators to use these
new iterator functions, which means the only use of sb->s_inodes
is now confined to fs/super.c (iterator API) and fs/inode.c
(add/remove API). A new superblock operation is then added to
call out from the iterators into the filesystem to allow them to run
the iteration instead of walking the sb->s_inodes list.

XFS is then converted to use this new superblock operation. I didn't
use the existing iterator function for this functionality right now
as it is currently based on radix tree tag lookups. It also uses a
batched 'lookup and lock' mechanism that complicated matters as I
developed this code. Hence I open coded a new, simpler cache walk
for testing purposes.

Now that I have stuff working and I think I have the callback API
semantics settled, batched radix tree lookups should still work to
minimise the iteration overhead. Also, we might want to tag VFS
inodes in the radix tree so that we can filter them efficiently for
traversals. This would allow us to use the existing generic inode
cache walker rather than a separate variant as this patch set
implements. This can be done as future work, though.

In terms of scalability improvements, a quick 'will it scale' test
demonstrates where the sb->s_inodes list hurts. Running a sharded,
share-nothing cold cache workload with 100,000 files per thread in
per-thread directories gives the following results on a 4-node 64p
machine with 128GB RAM.

The workloads "walk", "chmod" and "unlink" are all directory
traversal workloads that stream cold cache inodes into the cache.
There is enough memory on this test machine that these indoes are
not being reclaimed during the workload, and are being freed between
steps via drop_caches (which iterates the inode cache and so
explicitly tests the new iteration APIs!). Hence the sb->s_inodes
scalability issues aren't as bad in these tests as when memory is
tight and inodes are being reclaimed (i.e. the issues are worse in
real workloads).

The "bulkstat" workload uses the XFS bulkstat ioctl to iterate
inodes via walking the internal inode btrees. It uses
d_mark_dontcache() so it is actually tearing down each inode as soon
as it has been sampled by the bulkstat code. Hence it is doing two
sb->s_inodes list manipulations per inode and so shows scalability
issues much earlier than the other workloads.

Before:

Filesystem      Files  Threads  Create       Walk      Chmod      Unlink     Bulkstat
       xfs     400000     4      4.269      3.225      4.557      7.316      1.306
       xfs     800000     8      4.844      3.227      4.702      7.905      1.908
       xfs    1600000    16      6.286      3.296      5.592      8.838      4.392
       xfs    3200000    32      8.912      5.681      8.505     11.724      7.085
       xfs    6400000    64     15.344     11.144     14.162     18.604     15.494

Bulkstat starts to show issues at 8 threads, walk and chmod between
16 and 32 threads, and unlink is limited by internal XFS stuff.
Bulkstat is bottlenecked at about 400-450 thousand inodes/s by the
sb->s_inodes list management.

After:

Filesystem      Files  Threads  Create       Walk      Chmod      Unlink     Bulkstat
       xfs     400000     4      4.140      3.502      4.154      7.242      1.164
       xfs     800000     8      4.637      2.836      4.444      7.896      1.093
       xfs    1600000    16      5.549      3.054      5.213      8.696      1.107
       xfs    3200000    32      8.387      3.218      6.867     10.668      1.125
       xfs    6400000    64     14.112      3.953     10.365     18.620      1.270

When patched, walk shows little in way of scalability degradation
out to 64 threads, chmod is significantly improved at 32-64 threads,
and bulkstat shows perfect scalability out to 64 threads now.

I did a couple of other longer running, higher inode count tests
with bulkstat to get an idea of inode cache streaming rates - 32
million inodes scanned in 4.4 seconds at 64 threads. That's about
7.2 million inodes/s being streamed through the inode cache with the
IO rates are peaking well above 5.5GB/s (near IO bound).

Hence raw VFS inode cache throughput sees a ~17x scalability
improvement on XFS at 64 threads (and probably a -lot- more on
higher CPU count machines).  That's far better performance than I
ever got from the dlist conversion of the sb->s_inodes list in
previous patchsets, so this seems like a much better direction to be
heading for optimising the way we cache inodes.

I haven't done a lot of testing on this patchset yet - it boots and
appears to work OK for block devices, ext4 and XFS, but checking
stuff like quota on/off is still working properly on ext4 hasn't
been done yet.

What do people think of moving towards per-sb inode caching and
traversal mechanisms like this?

-Dave.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 1/7] vfs: replace invalidate_inodes() with evict_inodes()
  2024-10-02  1:33 [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Dave Chinner
@ 2024-10-02  1:33 ` Dave Chinner
  2024-10-03  7:07   ` Christoph Hellwig
  2024-10-03  9:20   ` Jan Kara
  2024-10-02  1:33 ` [PATCH 2/7] vfs: add inode iteration superblock method Dave Chinner
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 72+ messages in thread
From: Dave Chinner @ 2024-10-02  1:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-bcachefs, kent.overstreet, torvalds

From: Dave Chinner <dchinner@redhat.com>

As of commit e127b9bccdb0 ("fs: simplify invalidate_inodes"),
invalidate_inodes() is functionally identical to evict_inodes().
Replace calls to invalidate_inodes() with a call to
evict_inodes() and kill the former.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c    | 40 ----------------------------------------
 fs/internal.h |  1 -
 fs/super.c    |  2 +-
 3 files changed, 1 insertion(+), 42 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 471ae4a31549..0a53d8c34203 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -827,46 +827,6 @@ void evict_inodes(struct super_block *sb)
 }
 EXPORT_SYMBOL_GPL(evict_inodes);
 
-/**
- * invalidate_inodes	- attempt to free all inodes on a superblock
- * @sb:		superblock to operate on
- *
- * Attempts to free all inodes (including dirty inodes) for a given superblock.
- */
-void invalidate_inodes(struct super_block *sb)
-{
-	struct inode *inode, *next;
-	LIST_HEAD(dispose);
-
-again:
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		if (atomic_read(&inode->i_count)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-
-		inode->i_state |= I_FREEING;
-		inode_lru_list_del(inode);
-		spin_unlock(&inode->i_lock);
-		list_add(&inode->i_lru, &dispose);
-		if (need_resched()) {
-			spin_unlock(&sb->s_inode_list_lock);
-			cond_resched();
-			dispose_list(&dispose);
-			goto again;
-		}
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-
-	dispose_list(&dispose);
-}
-
 /*
  * Isolate the inode from the LRU in preparation for freeing it.
  *
diff --git a/fs/internal.h b/fs/internal.h
index 8c1b7acbbe8f..37749b429e80 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -207,7 +207,6 @@ bool in_group_or_capable(struct mnt_idmap *idmap,
  * fs-writeback.c
  */
 extern long get_nr_dirty_inodes(void);
-void invalidate_inodes(struct super_block *sb);
 
 /*
  * dcache.c
diff --git a/fs/super.c b/fs/super.c
index 1db230432960..a16e6a6342e0 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1417,7 +1417,7 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
 	if (!surprise)
 		sync_filesystem(sb);
 	shrink_dcache_sb(sb);
-	invalidate_inodes(sb);
+	evict_inodes(sb);
 	if (sb->s_op->shutdown)
 		sb->s_op->shutdown(sb);
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 2/7] vfs: add inode iteration superblock method
  2024-10-02  1:33 [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Dave Chinner
  2024-10-02  1:33 ` [PATCH 1/7] vfs: replace invalidate_inodes() with evict_inodes() Dave Chinner
@ 2024-10-02  1:33 ` Dave Chinner
  2024-10-03  7:12   ` Christoph Hellwig
  2024-10-04  9:53   ` kernel test robot
  2024-10-02  1:33 ` [PATCH 3/7] vfs: convert vfs inode iterators to super_iter_inodes_unsafe() Dave Chinner
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 72+ messages in thread
From: Dave Chinner @ 2024-10-02  1:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-bcachefs, kent.overstreet, torvalds

From: Dave Chinner <dchinner@redhat.com>

Add a new superblock method for iterating all cached inodes in the
inode cache.

This will be used to replace the explicit sb->s_inodes iteration,
and the caller will supply a callback function and a private data
pointer that gets passed to the callback along with each inode that
is iterated.

There are two iteration functions provided. The first is the
interface that everyone should be using - it provides an valid,
unlocked and referenced inode that any inode operation (including
blocking operations) is allowed on. The iterator infrastructure is
responsible for lifecycle management, hence the subsystem callback
only needs to implement the operation it wants to perform on all
inodes.

The second iterator interface is the unsafe variant for internal VFS
use only. It simply iterates all VFS inodes without guaranteeing
any state or taking references. This iteration is done under a RCU
read lock to ensure that the VFS inode is not freed from under
the callback. If the operation wishes to block, it must drop the
RCU context after guaranteeing that the inode will not get freed.
This unsafe iteration mechanism is needed for operations that need
tight control over the state of the inodes they need to operate on.

This mechanism allows the existing sb->s_inodes iteration models
to be maintained, allowing a generic implementation for iterating
all cached inodes on the superblock to be provided.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/internal.h      |   2 +
 fs/super.c         | 105 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  12 ++++++
 3 files changed, 119 insertions(+)

diff --git a/fs/internal.h b/fs/internal.h
index 37749b429e80..7039d13980c6 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -127,6 +127,8 @@ struct super_block *user_get_super(dev_t, bool excl);
 void put_super(struct super_block *sb);
 extern bool mount_capable(struct fs_context *);
 int sb_init_dio_done_wq(struct super_block *sb);
+void super_iter_inodes_unsafe(struct super_block *sb, ino_iter_fn iter_fn,
+		void *private_data);
 
 /*
  * Prepare superblock for changing its read-only state (i.e., either remount
diff --git a/fs/super.c b/fs/super.c
index a16e6a6342e0..20a9446d943a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -167,6 +167,111 @@ static void super_wake(struct super_block *sb, unsigned int flag)
 	wake_up_var(&sb->s_flags);
 }
 
+/**
+ * super_iter_inodes - iterate all the cached inodes on a superblock
+ * @sb: superblock to iterate
+ * @iter_fn: callback to run on every inode found.
+ *
+ * This function iterates all cached inodes on a superblock that are not in
+ * the process of being initialised or torn down. It will run @iter_fn() with
+ * a valid, referenced inode, so it is safe for the caller to do anything
+ * it wants with the inode except drop the reference the iterator holds.
+ *
+ */
+int super_iter_inodes(struct super_block *sb, ino_iter_fn iter_fn,
+		void *private_data, int flags)
+{
+	struct inode *inode, *old_inode = NULL;
+	int ret = 0;
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+
+		/*
+		 * Skip over zero refcount inode if the caller only wants
+		 * referenced inodes to be iterated.
+		 */
+		if ((flags & INO_ITER_REFERENCED) &&
+		    !atomic_read(&inode->i_count)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inode_list_lock);
+		iput(old_inode);
+
+		ret = iter_fn(inode, private_data);
+
+		old_inode = inode;
+		if (ret == INO_ITER_ABORT) {
+			ret = 0;
+			break;
+		}
+		if (ret < 0)
+			break;
+
+		cond_resched();
+		spin_lock(&sb->s_inode_list_lock);
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+	iput(old_inode);
+	return ret;
+}
+
+/**
+ * super_iter_inodes_unsafe - unsafely iterate all the inodes on a superblock
+ * @sb: superblock to iterate
+ * @iter_fn: callback to run on every inode found.
+ *
+ * This is almost certainly not the function you want. It is for internal VFS
+ * operations only. Please use super_iter_inodes() instead. If you must use
+ * this function, please add a comment explaining why it is necessary and the
+ * locking that makes it safe to use this function.
+ *
+ * This function iterates all cached inodes on a superblock that are attached to
+ * the superblock. It will pass each inode to @iter_fn unlocked and without
+ * having performed any existences checks on it.
+
+ * @iter_fn must perform all necessary state checks on the inode itself to
+ * ensure safe operation. super_iter_inodes_unsafe() only guarantees that the
+ * inode exists and won't be freed whilst the callback is running.
+ *
+ * @iter_fn must not block. It is run in an atomic context that is not allowed
+ * to sleep to provide the inode existence guarantees. If the callback needs to
+ * do blocking operations it needs to track the inode itself and defer those
+ * operations until after the iteration completes.
+ *
+ * @iter_fn must provide conditional reschedule checks itself. If rescheduling
+ * or deferred processing is needed, it must return INO_ITER_ABORT to return to
+ * the high level function to perform those operations. It can then restart the
+ * iteration again. The high level code must provide forwards progress
+ * guarantees if they are necessary.
+ *
+ */
+void super_iter_inodes_unsafe(struct super_block *sb, ino_iter_fn iter_fn,
+		void *private_data)
+{
+	struct inode *inode;
+	int ret;
+
+	rcu_read_lock();
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		ret = iter_fn(inode, private_data);
+		if (ret == INO_ITER_ABORT)
+			break;
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+	rcu_read_unlock();
+}
+
 /*
  * One thing we have to be careful of with a per-sb shrinker is that we don't
  * drop the last active reference to the superblock from within the shrinker.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index eae5b67e4a15..0a6a462c45ab 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2213,6 +2213,18 @@ enum freeze_holder {
 	FREEZE_MAY_NEST		= (1U << 2),
 };
 
+/* Inode iteration callback return values */
+#define INO_ITER_DONE		0
+#define INO_ITER_ABORT		1
+
+/* Inode iteration control flags */
+#define INO_ITER_REFERENCED	(1U << 0)
+#define INO_ITER_UNSAFE		(1U << 1)
+
+typedef int (*ino_iter_fn)(struct inode *inode, void *priv);
+int super_iter_inodes(struct super_block *sb, ino_iter_fn iter_fn,
+		void *private_data, int flags);
+
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
 	void (*destroy_inode)(struct inode *);
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 3/7] vfs: convert vfs inode iterators to super_iter_inodes_unsafe()
  2024-10-02  1:33 [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Dave Chinner
  2024-10-02  1:33 ` [PATCH 1/7] vfs: replace invalidate_inodes() with evict_inodes() Dave Chinner
  2024-10-02  1:33 ` [PATCH 2/7] vfs: add inode iteration superblock method Dave Chinner
@ 2024-10-02  1:33 ` Dave Chinner
  2024-10-03  7:14   ` Christoph Hellwig
  2024-10-04 10:55   ` kernel test robot
  2024-10-02  1:33 ` [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes() Dave Chinner
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 72+ messages in thread
From: Dave Chinner @ 2024-10-02  1:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-bcachefs, kent.overstreet, torvalds

From: Dave Chinner <dchinner@redhat.com>

Convert VFS internal superblock inode iterators that cannot use
referenced inodes to the new super_iter_inodes_unsafe() iterator.
Dquot and inode eviction require this special handling due to
special eviction handling requirements. The special
nr_blockdev_pages() statistics code needs it as well, as this is
called from si_meminfo() and so can potentially be run from
locations where arbitrary blocking is not allowed or desirable.

New cases using this iterator need careful consideration.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 block/bdev.c     | 24 +++++++++++----
 fs/inode.c       | 79 ++++++++++++++++++++++++++----------------------
 fs/quota/dquot.c | 72 ++++++++++++++++++++++++-------------------
 3 files changed, 102 insertions(+), 73 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 33f9c4605e3a..b5a362156ca1 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -472,16 +472,28 @@ void bdev_drop(struct block_device *bdev)
 	iput(BD_INODE(bdev));
 }
 
+static int bdev_pages_count(struct inode *inode, void *data)
+{
+	long	*pages = data;
+
+	*pages += inode->i_mapping->nrpages;
+	return INO_ITER_DONE;
+}
+
 long nr_blockdev_pages(void)
 {
-	struct inode *inode;
 	long ret = 0;
 
-	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list)
-		ret += inode->i_mapping->nrpages;
-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
-
+	/*
+	 * We can be called from contexts where blocking is not
+	 * desirable. The count is advisory at best, and we only
+	 * need to access the inode mapping. Hence as long as we
+	 * have an inode existence guarantee, we can safely count
+	 * the cached pages on each inode without needing reference
+	 * counted inodes.
+	 */
+	super_iter_inodes_unsafe(blockdev_superblock,
+			bdev_pages_count, &ret);
 	return ret;
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index 0a53d8c34203..3f335f78c5b2 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -761,8 +761,11 @@ static void evict(struct inode *inode)
  * Dispose-list gets a local list with local inodes in it, so it doesn't
  * need to worry about list corruption and SMP locks.
  */
-static void dispose_list(struct list_head *head)
+static bool dispose_list(struct list_head *head)
 {
+	if (list_empty(head))
+		return false;
+
 	while (!list_empty(head)) {
 		struct inode *inode;
 
@@ -772,6 +775,7 @@ static void dispose_list(struct list_head *head)
 		evict(inode);
 		cond_resched();
 	}
+	return true;
 }
 
 /**
@@ -783,47 +787,50 @@ static void dispose_list(struct list_head *head)
  * so any inode reaching zero refcount during or after that call will
  * be immediately evicted.
  */
+static int evict_inode_fn(struct inode *inode, void *data)
+{
+	struct list_head *dispose = data;
+
+	spin_lock(&inode->i_lock);
+	if (atomic_read(&inode->i_count) ||
+	    (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
+		spin_unlock(&inode->i_lock);
+		return INO_ITER_DONE;
+	}
+
+	inode->i_state |= I_FREEING;
+	inode_lru_list_del(inode);
+	spin_unlock(&inode->i_lock);
+	list_add(&inode->i_lru, dispose);
+
+	/*
+	 * If we've run long enough to need rescheduling, abort the
+	 * iteration so we can return to evict_inodes() and dispose of the
+	 * inodes before collecting more inodes to evict.
+	 */
+	if (need_resched())
+		return INO_ITER_ABORT;
+	return INO_ITER_DONE;
+}
+
 void evict_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next;
 	LIST_HEAD(dispose);
 
-again:
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-		if (atomic_read(&inode->i_count))
-			continue;
-
-		spin_lock(&inode->i_lock);
-		if (atomic_read(&inode->i_count)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-
-		inode->i_state |= I_FREEING;
-		inode_lru_list_del(inode);
-		spin_unlock(&inode->i_lock);
-		list_add(&inode->i_lru, &dispose);
-
+	do {
 		/*
-		 * We can have a ton of inodes to evict at unmount time given
-		 * enough memory, check to see if we need to go to sleep for a
-		 * bit so we don't livelock.
+		 * We do not want to take references to inodes whilst iterating
+		 * because we are trying to evict unreferenced inodes from
+		 * the cache. Hence we need to use the unsafe iteration
+		 * mechanism and do all the required inode validity checks in
+		 * evict_inode_fn() to safely queue unreferenced inodes for
+		 * eviction.
+		 *
+		 * We repeat the iteration until it doesn't find any more
+		 * inodes to dispose of.
 		 */
-		if (need_resched()) {
-			spin_unlock(&sb->s_inode_list_lock);
-			cond_resched();
-			dispose_list(&dispose);
-			goto again;
-		}
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-
-	dispose_list(&dispose);
+		super_iter_inodes_unsafe(sb, evict_inode_fn, &dispose);
+	} while (dispose_list(&dispose));
 }
 EXPORT_SYMBOL_GPL(evict_inodes);
 
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index b40410cd39af..ea0bd807fed7 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -1075,41 +1075,51 @@ static int add_dquot_ref(struct super_block *sb, int type)
 	return err;
 }
 
+struct dquot_ref_data {
+	int	type;
+	int	reserved;
+};
+
+static int remove_dquot_ref_fn(struct inode *inode, void *data)
+{
+	struct dquot_ref_data *ref = data;
+
+	spin_lock(&dq_data_lock);
+	if (!IS_NOQUOTA(inode)) {
+		struct dquot __rcu **dquots = i_dquot(inode);
+		struct dquot *dquot = srcu_dereference_check(
+			dquots[ref->type], &dquot_srcu,
+			lockdep_is_held(&dq_data_lock));
+
+#ifdef CONFIG_QUOTA_DEBUG
+		if (unlikely(inode_get_rsv_space(inode) > 0))
+			ref->reserved++;
+#endif
+		rcu_assign_pointer(dquots[ref->type], NULL);
+		if (dquot)
+			dqput(dquot);
+	}
+	spin_unlock(&dq_data_lock);
+	return INO_ITER_DONE;
+}
+
 static void remove_dquot_ref(struct super_block *sb, int type)
 {
-	struct inode *inode;
-#ifdef CONFIG_QUOTA_DEBUG
-	int reserved = 0;
-#endif
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		/*
-		 *  We have to scan also I_NEW inodes because they can already
-		 *  have quota pointer initialized. Luckily, we need to touch
-		 *  only quota pointers and these have separate locking
-		 *  (dq_data_lock).
-		 */
-		spin_lock(&dq_data_lock);
-		if (!IS_NOQUOTA(inode)) {
-			struct dquot __rcu **dquots = i_dquot(inode);
-			struct dquot *dquot = srcu_dereference_check(
-				dquots[type], &dquot_srcu,
-				lockdep_is_held(&dq_data_lock));
+	struct dquot_ref_data ref = {
+		.type = type,
+	};
 
+	/*
+	 * We have to scan I_NEW inodes because they can already
+	 * have quota pointer initialized. Luckily, we need to touch
+	 * only quota pointers and these have separate locking
+	 * (dq_data_lock) so the existence guarantee that
+	 * super_iter_inodes_unsafe() provides inodes passed to
+	 * remove_dquot_ref_fn() is sufficient for this operation.
+	 */
+	super_iter_inodes_unsafe(sb, remove_dquot_ref_fn, &ref);
 #ifdef CONFIG_QUOTA_DEBUG
-			if (unlikely(inode_get_rsv_space(inode) > 0))
-				reserved = 1;
-#endif
-			rcu_assign_pointer(dquots[type], NULL);
-			if (dquot)
-				dqput(dquot);
-		}
-		spin_unlock(&dq_data_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-#ifdef CONFIG_QUOTA_DEBUG
-	if (reserved) {
+	if (ref.reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
 			" was disabled thus quota information is probably "
 			"inconsistent. Please run quotacheck(8).\n", sb->s_id);
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-02  1:33 [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Dave Chinner
                   ` (2 preceding siblings ...)
  2024-10-02  1:33 ` [PATCH 3/7] vfs: convert vfs inode iterators to super_iter_inodes_unsafe() Dave Chinner
@ 2024-10-02  1:33 ` Dave Chinner
  2024-10-03  7:23   ` lsm sb_delete hook, was " Christoph Hellwig
  2024-10-02  1:33 ` [PATCH 5/7] vfs: add inode iteration superblock method Dave Chinner
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2024-10-02  1:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-bcachefs, kent.overstreet, torvalds

From: Dave Chinner <dchinner@redhat.com>

Convert all the remaining superblock inode iterators to use
super_iter_inodes(). These are mostly straight forward conversions
for the iterations that use references, and the bdev use cases that
didn't even validate the inode before dereferencing it are now
inherently safe.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 block/bdev.c           |  76 ++++++++--------------
 fs/drop_caches.c       |  38 ++++-------
 fs/gfs2/ops_fstype.c   |  67 ++++++-------------
 fs/notify/fsnotify.c   |  75 ++++++---------------
 fs/quota/dquot.c       |  79 +++++++++--------------
 security/landlock/fs.c | 143 ++++++++++++++---------------------------
 6 files changed, 154 insertions(+), 324 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index b5a362156ca1..5f720e12f731 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1226,56 +1226,36 @@ void bdev_mark_dead(struct block_device *bdev, bool surprise)
  */
 EXPORT_SYMBOL_GPL(bdev_mark_dead);
 
+static int sync_bdev_fn(struct inode *inode, void *data)
+{
+	struct block_device *bdev;
+	bool wait = *(bool *)data;
+
+	if (inode->i_mapping->nrpages == 0)
+		return INO_ITER_DONE;
+
+	bdev = I_BDEV(inode);
+	mutex_lock(&bdev->bd_disk->open_mutex);
+	if (!atomic_read(&bdev->bd_openers)) {
+		; /* skip */
+	} else if (wait) {
+		/*
+		 * We keep the error status of individual mapping so
+		 * that applications can catch the writeback error using
+		 * fsync(2). See filemap_fdatawait_keep_errors() for
+		 * details.
+		 */
+		filemap_fdatawait_keep_errors(inode->i_mapping);
+	} else {
+		filemap_fdatawrite(inode->i_mapping);
+	}
+	mutex_unlock(&bdev->bd_disk->open_mutex);
+	return INO_ITER_DONE;
+}
+
 void sync_bdevs(bool wait)
 {
-	struct inode *inode, *old_inode = NULL;
-
-	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
-		struct block_device *bdev;
-
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
-		    mapping->nrpages == 0) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&blockdev_superblock->s_inode_list_lock);
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		iput(old_inode);
-		old_inode = inode;
-		bdev = I_BDEV(inode);
-
-		mutex_lock(&bdev->bd_disk->open_mutex);
-		if (!atomic_read(&bdev->bd_openers)) {
-			; /* skip */
-		} else if (wait) {
-			/*
-			 * We keep the error status of individual mapping so
-			 * that applications can catch the writeback error using
-			 * fsync(2). See filemap_fdatawait_keep_errors() for
-			 * details.
-			 */
-			filemap_fdatawait_keep_errors(inode->i_mapping);
-		} else {
-			filemap_fdatawrite(inode->i_mapping);
-		}
-		mutex_unlock(&bdev->bd_disk->open_mutex);
-
-		spin_lock(&blockdev_superblock->s_inode_list_lock);
-	}
-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
-	iput(old_inode);
+	super_iter_inodes(blockdev_superblock, sync_bdev_fn, &wait, 0);
 }
 
 /*
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d45ef541d848..901cda15537f 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -16,36 +16,20 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static void drop_pagecache_sb(struct super_block *sb, void *unused)
+static int invalidate_inode_fn(struct inode *inode, void *data)
 {
-	struct inode *inode, *toput_inode = NULL;
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		/*
-		 * We must skip inodes in unusual state. We may also skip
-		 * inodes without pages but we deliberately won't in case
-		 * we need to reschedule to avoid softlockups.
-		 */
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping_empty(inode->i_mapping) && !need_resched())) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
-
+	if (!mapping_empty(inode->i_mapping))
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
-		iput(toput_inode);
-		toput_inode = inode;
+	return INO_ITER_DONE;
+}
 
-		cond_resched();
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(toput_inode);
+/*
+ * Note: it would be nice to check mapping_empty() before we get a reference on
+ * the inode in super_iter_inodes(), but that's a future optimisation.
+ */
+static void drop_pagecache_sb(struct super_block *sb, void *unused)
+{
+	super_iter_inodes(sb, invalidate_inode_fn, NULL, 0);
 }
 
 int drop_caches_sysctl_handler(const struct ctl_table *table, int write,
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index e83d293c3614..f20862614ad6 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1714,53 +1714,10 @@ static int gfs2_meta_init_fs_context(struct fs_context *fc)
 	return 0;
 }
 
-/**
- * gfs2_evict_inodes - evict inodes cooperatively
- * @sb: the superblock
- *
- * When evicting an inode with a zero link count, we are trying to upgrade the
- * inode's iopen glock from SH to EX mode in order to determine if we can
- * delete the inode.  The other nodes are supposed to evict the inode from
- * their caches if they can, and to poke the inode's inode glock if they cannot
- * do so.  Either behavior allows gfs2_upgrade_iopen_glock() to proceed
- * quickly, but if the other nodes are not cooperating, the lock upgrading
- * attempt will time out.  Since inodes are evicted sequentially, this can add
- * up quickly.
- *
- * Function evict_inodes() tries to keep the s_inode_list_lock list locked over
- * a long time, which prevents other inodes from being evicted concurrently.
- * This precludes the cooperative behavior we are looking for.  This special
- * version of evict_inodes() avoids that.
- *
- * Modeled after drop_pagecache_sb().
- */
-static void gfs2_evict_inodes(struct super_block *sb)
+/* Nothing to do because we just want to bounce the inode through iput() */
+static int gfs2_evict_inode_fn(struct inode *inode, void *data)
 {
-	struct inode *inode, *toput_inode = NULL;
-	struct gfs2_sbd *sdp = sb->s_fs_info;
-
-	set_bit(SDF_EVICTING, &sdp->sd_flags);
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) &&
-		    !need_resched()) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		atomic_inc(&inode->i_count);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
-
-		iput(toput_inode);
-		toput_inode = inode;
-
-		cond_resched();
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(toput_inode);
+	return INO_ITER_DONE;
 }
 
 static void gfs2_kill_sb(struct super_block *sb)
@@ -1779,7 +1736,23 @@ static void gfs2_kill_sb(struct super_block *sb)
 	sdp->sd_master_dir = NULL;
 	shrink_dcache_sb(sb);
 
-	gfs2_evict_inodes(sb);
+	/*
+	 * When evicting an inode with a zero link count, we are trying to
+	 * upgrade the inode's iopen glock from SH to EX mode in order to
+	 * determine if we can delete the inode.  The other nodes are supposed
+	 * to evict the inode from their caches if they can, and to poke the
+	 * inode's inode glock if they cannot do so.  Either behavior allows
+	 * gfs2_upgrade_iopen_glock() to proceed quickly, but if the other nodes
+	 * are not cooperating, the lock upgrading attempt will time out.  Since
+	 * inodes are evicted sequentially, this can add up quickly.
+	 *
+	 * evict_inodes() tries to keep the s_inode_list_lock list locked over a
+	 * long time, which prevents other inodes from being evicted
+	 * concurrently.  This precludes the cooperative behavior we are looking
+	 * for. 
+	 */
+	set_bit(SDF_EVICTING, &sdp->sd_flags);
+	super_iter_inodes(sb, gfs2_evict_inode_fn, NULL, 0);
 
 	/*
 	 * Flush and then drain the delete workqueue here (via
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 272c8a1dab3c..68c34ed94271 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -28,63 +28,14 @@ void __fsnotify_vfsmount_delete(struct vfsmount *mnt)
 	fsnotify_clear_marks_by_mount(mnt);
 }
 
-/**
- * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
- * @sb: superblock being unmounted.
- *
- * Called during unmount with no locks held, so needs to be safe against
- * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
- */
-static void fsnotify_unmount_inodes(struct super_block *sb)
+static int fsnotify_unmount_inode_fn(struct inode *inode, void *data)
 {
-	struct inode *inode, *iput_inode = NULL;
+	spin_unlock(&inode->i_lock);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		/*
-		 * We cannot __iget() an inode in state I_FREEING,
-		 * I_WILL_FREE, or I_NEW which is fine because by that point
-		 * the inode cannot have any associated watches.
-		 */
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-
-		/*
-		 * If i_count is zero, the inode cannot have any watches and
-		 * doing an __iget/iput with SB_ACTIVE clear would actually
-		 * evict all inodes with zero i_count from icache which is
-		 * unnecessarily violent and may in fact be illegal to do.
-		 * However, we should have been called /after/ evict_inodes
-		 * removed all zero refcount inodes, in any case.  Test to
-		 * be sure.
-		 */
-		if (!atomic_read(&inode->i_count)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
-
-		iput(iput_inode);
-
-		/* for each watch, send FS_UNMOUNT and then remove it */
-		fsnotify_inode(inode, FS_UNMOUNT);
-
-		fsnotify_inode_delete(inode);
-
-		iput_inode = inode;
-
-		cond_resched();
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-
-	iput(iput_inode);
+	/* for each watch, send FS_UNMOUNT and then remove it */
+	fsnotify_inode(inode, FS_UNMOUNT);
+	fsnotify_inode_delete(inode);
+	return INO_ITER_DONE;
 }
 
 void fsnotify_sb_delete(struct super_block *sb)
@@ -95,7 +46,19 @@ void fsnotify_sb_delete(struct super_block *sb)
 	if (!sbinfo)
 		return;
 
-	fsnotify_unmount_inodes(sb);
+	/*
+	 * If i_count is zero, the inode cannot have any watches and
+	 * doing an __iget/iput with SB_ACTIVE clear would actually
+	 * evict all inodes with zero i_count from icache which is
+	 * unnecessarily violent and may in fact be illegal to do.
+	 * However, we should have been called /after/ evict_inodes
+	 * removed all zero refcount inodes, in any case. Hence we use
+	 * INO_ITER_REFERENCED to ensure zero refcount inodes are filtered
+	 * properly.
+	 */
+	super_iter_inodes(sb, fsnotify_unmount_inode_fn, NULL,
+			INO_ITER_REFERENCED);
+
 	fsnotify_clear_marks_by_sb(sb);
 	/* Wait for outstanding object references from connectors */
 	wait_var_event(fsnotify_sb_watched_objects(sb),
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index ea0bd807fed7..ea9fce7acd1b 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -1017,56 +1017,40 @@ static int dqinit_needed(struct inode *inode, int type)
 	return 0;
 }
 
+struct dquot_ref_data {
+	int	type;
+	int	reserved;
+};
+
+static int add_dquot_ref_fn(struct inode *inode, void *data)
+{
+	struct dquot_ref_data *ref = data;
+	int ret;
+
+	if (!dqinit_needed(inode, ref->type))
+		return INO_ITER_DONE;
+
+#ifdef CONFIG_QUOTA_DEBUG
+	if (unlikely(inode_get_rsv_space(inode) > 0))
+		ref->reserved++;
+#endif
+	ret = __dquot_initialize(inode, ref->type);
+	if (ret < 0)
+		return ret;
+	return INO_ITER_DONE;
+}
+
 /* This routine is guarded by s_umount semaphore */
 static int add_dquot_ref(struct super_block *sb, int type)
 {
-	struct inode *inode, *old_inode = NULL;
-#ifdef CONFIG_QUOTA_DEBUG
-	int reserved = 0;
-#endif
-	int err = 0;
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    !atomic_read(&inode->i_writecount) ||
-		    !dqinit_needed(inode, type)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
-
-#ifdef CONFIG_QUOTA_DEBUG
-		if (unlikely(inode_get_rsv_space(inode) > 0))
-			reserved = 1;
-#endif
-		iput(old_inode);
-		err = __dquot_initialize(inode, type);
-		if (err) {
-			iput(inode);
-			goto out;
-		}
+	struct dquot_ref_data ref = {
+		.type = type,
+	};
+	int err;
 
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock. We cannot iput the inode now as we can be
-		 * holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		old_inode = inode;
-		cond_resched();
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(old_inode);
-out:
+	err = super_iter_inodes(sb, add_dquot_ref_fn, &ref, 0);
 #ifdef CONFIG_QUOTA_DEBUG
-	if (reserved) {
+	if (ref.reserved) {
 		quota_error(sb, "Writes happened before quota was turned on "
 			"thus quota information is probably inconsistent. "
 			"Please run quotacheck(8)");
@@ -1075,11 +1059,6 @@ static int add_dquot_ref(struct super_block *sb, int type)
 	return err;
 }
 
-struct dquot_ref_data {
-	int	type;
-	int	reserved;
-};
-
 static int remove_dquot_ref_fn(struct inode *inode, void *data)
 {
 	struct dquot_ref_data *ref = data;
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index 7d79fc8abe21..013ec4017ddd 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -1223,109 +1223,60 @@ static void hook_inode_free_security_rcu(void *inode_security)
 
 /*
  * Release the inodes used in a security policy.
- *
- * Cf. fsnotify_unmount_inodes() and invalidate_inodes()
  */
+static int release_inode_fn(struct inode *inode, void *data)
+{
+
+	rcu_read_lock();
+	object = rcu_dereference(landlock_inode(inode)->object);
+	if (!object) {
+		rcu_read_unlock();
+		return INO_ITER_DONE;
+	}
+
+	/*
+	 * If there is no concurrent release_inode() ongoing, then we
+	 * are in charge of calling iput() on this inode, otherwise we
+	 * will just wait for it to finish.
+	 */
+	spin_lock(&object->lock);
+	if (object->underobj != inode) {
+		spin_unlock(&object->lock);
+		rcu_read_unlock();
+		return INO_ITER_DONE;
+	}
+
+	object->underobj = NULL;
+	spin_unlock(&object->lock);
+	rcu_read_unlock();
+
+	/*
+	 * Because object->underobj was not NULL, release_inode() and
+	 * get_inode_object() guarantee that it is safe to reset
+	 * landlock_inode(inode)->object while it is not NULL.  It is therefore
+	 * not necessary to lock inode->i_lock.
+	 */
+	rcu_assign_pointer(landlock_inode(inode)->object, NULL);
+
+	/*
+	 * At this point, we own the ihold() reference that was originally set
+	 * up by get_inode_object() as well as the reference the inode iterator
+	 * obtained before calling us.  Therefore the following call to iput()
+	 * will not sleep nor drop the inode because there is now at least two
+	 * references to it.
+	 */
+	iput(inode);
+	return INO_ITER_DONE;
+}
+
 static void hook_sb_delete(struct super_block *const sb)
 {
-	struct inode *inode, *prev_inode = NULL;
-
 	if (!landlock_initialized)
 		return;
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		struct landlock_object *object;
+	super_iter_inodes(sb, release_inode_fn, NULL, 0);
 
-		/* Only handles referenced inodes. */
-		if (!atomic_read(&inode->i_count))
-			continue;
-
-		/*
-		 * Protects against concurrent modification of inode (e.g.
-		 * from get_inode_object()).
-		 */
-		spin_lock(&inode->i_lock);
-		/*
-		 * Checks I_FREEING and I_WILL_FREE  to protect against a race
-		 * condition when release_inode() just called iput(), which
-		 * could lead to a NULL dereference of inode->security or a
-		 * second call to iput() for the same Landlock object.  Also
-		 * checks I_NEW because such inode cannot be tied to an object.
-		 */
-		if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-
-		rcu_read_lock();
-		object = rcu_dereference(landlock_inode(inode)->object);
-		if (!object) {
-			rcu_read_unlock();
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		/* Keeps a reference to this inode until the next loop walk. */
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-
-		/*
-		 * If there is no concurrent release_inode() ongoing, then we
-		 * are in charge of calling iput() on this inode, otherwise we
-		 * will just wait for it to finish.
-		 */
-		spin_lock(&object->lock);
-		if (object->underobj == inode) {
-			object->underobj = NULL;
-			spin_unlock(&object->lock);
-			rcu_read_unlock();
-
-			/*
-			 * Because object->underobj was not NULL,
-			 * release_inode() and get_inode_object() guarantee
-			 * that it is safe to reset
-			 * landlock_inode(inode)->object while it is not NULL.
-			 * It is therefore not necessary to lock inode->i_lock.
-			 */
-			rcu_assign_pointer(landlock_inode(inode)->object, NULL);
-			/*
-			 * At this point, we own the ihold() reference that was
-			 * originally set up by get_inode_object() and the
-			 * __iget() reference that we just set in this loop
-			 * walk.  Therefore the following call to iput() will
-			 * not sleep nor drop the inode because there is now at
-			 * least two references to it.
-			 */
-			iput(inode);
-		} else {
-			spin_unlock(&object->lock);
-			rcu_read_unlock();
-		}
-
-		if (prev_inode) {
-			/*
-			 * At this point, we still own the __iget() reference
-			 * that we just set in this loop walk.  Therefore we
-			 * can drop the list lock and know that the inode won't
-			 * disappear from under us until the next loop walk.
-			 */
-			spin_unlock(&sb->s_inode_list_lock);
-			/*
-			 * We can now actually put the inode reference from the
-			 * previous loop walk, which is not needed anymore.
-			 */
-			iput(prev_inode);
-			cond_resched();
-			spin_lock(&sb->s_inode_list_lock);
-		}
-		prev_inode = inode;
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-
-	/* Puts the inode reference from the last loop walk, if any. */
-	if (prev_inode)
-		iput(prev_inode);
-	/* Waits for pending iput() in release_inode(). */
+	/* Waits for pending iput()s in release_inode(). */
 	wait_var_event(&landlock_superblock(sb)->inode_refs,
 		       !atomic_long_read(&landlock_superblock(sb)->inode_refs));
 }
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 5/7] vfs: add inode iteration superblock method
  2024-10-02  1:33 [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Dave Chinner
                   ` (3 preceding siblings ...)
  2024-10-02  1:33 ` [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes() Dave Chinner
@ 2024-10-02  1:33 ` Dave Chinner
  2024-10-03  7:24   ` Christoph Hellwig
  2024-10-02  1:33 ` [PATCH 6/7] xfs: implement sb->iter_vfs_inodes Dave Chinner
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2024-10-02  1:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-bcachefs, kent.overstreet, torvalds

From: Dave Chinner <dchinner@redhat.com>

For filesytsems that provide their own inode cache that can be
traversed, add a sueprblock method that can be used instead of
iterating the sb->s_inodes list. This allows these filesystems to
avoid having to populate the sb->s_inodes list and hence avoid the
scalability limitations that this list imposes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/super.c         | 54 +++++++++++++++++++++++++++++++---------------
 include/linux/fs.h |  4 ++++
 2 files changed, 41 insertions(+), 17 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 20a9446d943a..971ad4e996e0 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -167,6 +167,31 @@ static void super_wake(struct super_block *sb, unsigned int flag)
 	wake_up_var(&sb->s_flags);
 }
 
+bool super_iter_iget(struct inode *inode, int flags)
+{
+	bool	ret = false;
+
+	spin_lock(&inode->i_lock);
+	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))
+		goto out_unlock;
+
+	/*
+	 * Skip over zero refcount inode if the caller only wants
+	 * referenced inodes to be iterated.
+	 */
+	if ((flags & INO_ITER_REFERENCED) &&
+	    !atomic_read(&inode->i_count))
+		goto out_unlock;
+
+	__iget(inode);
+	ret = true;
+out_unlock:
+	spin_unlock(&inode->i_lock);
+	return ret;
+
+}
+EXPORT_SYMBOL_GPL(super_iter_iget);
+
 /**
  * super_iter_inodes - iterate all the cached inodes on a superblock
  * @sb: superblock to iterate
@@ -184,26 +209,15 @@ int super_iter_inodes(struct super_block *sb, ino_iter_fn iter_fn,
 	struct inode *inode, *old_inode = NULL;
 	int ret = 0;
 
+	if (sb->s_op->iter_vfs_inodes) {
+		return sb->s_op->iter_vfs_inodes(sb, iter_fn,
+				private_data, flags);
+	}
+
 	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
-			spin_unlock(&inode->i_lock);
+		if (!super_iter_iget(inode, flags))
 			continue;
-		}
-
-		/*
-		 * Skip over zero refcount inode if the caller only wants
-		 * referenced inodes to be iterated.
-		 */
-		if ((flags & INO_ITER_REFERENCED) &&
-		    !atomic_read(&inode->i_count)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inode_list_lock);
 		iput(old_inode);
 
@@ -261,6 +275,12 @@ void super_iter_inodes_unsafe(struct super_block *sb, ino_iter_fn iter_fn,
 	struct inode *inode;
 	int ret;
 
+	if (sb->s_op->iter_vfs_inodes) {
+		sb->s_op->iter_vfs_inodes(sb, iter_fn,
+				private_data, INO_ITER_UNSAFE);
+		return;
+	}
+
 	rcu_read_lock();
 	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0a6a462c45ab..8e82e3dc0618 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2224,6 +2224,7 @@ enum freeze_holder {
 typedef int (*ino_iter_fn)(struct inode *inode, void *priv);
 int super_iter_inodes(struct super_block *sb, ino_iter_fn iter_fn,
 		void *private_data, int flags);
+bool super_iter_iget(struct inode *inode, int flags);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
@@ -2258,6 +2259,9 @@ struct super_operations {
 	long (*free_cached_objects)(struct super_block *,
 				    struct shrink_control *);
 	void (*shutdown)(struct super_block *sb);
+
+	int (*iter_vfs_inodes)(struct super_block *sb, ino_iter_fn iter_fn,
+			void *private_data, int flags);
 };
 
 /*
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 6/7] xfs: implement sb->iter_vfs_inodes
  2024-10-02  1:33 [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Dave Chinner
                   ` (4 preceding siblings ...)
  2024-10-02  1:33 ` [PATCH 5/7] vfs: add inode iteration superblock method Dave Chinner
@ 2024-10-02  1:33 ` Dave Chinner
  2024-10-03  7:30   ` Christoph Hellwig
  2024-10-02  1:33 ` [PATCH 7/7] bcachefs: " Dave Chinner
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2024-10-02  1:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-bcachefs, kent.overstreet, torvalds

From: Dave Chinner <dchinner@redhat.com>

We can iterate all the in-memory VFS inodes via the xfs_icwalk()
interface, so implement the new superblock operation to walk inodes
in this way.

This removes the dependency XFS has on the sb->s_inodes list and
allows us to avoid the global lock that marshalls this list and
must be taken on every VFS inode instantiation and eviction. This
greatly improves the rate at which we can stream inodes through the
VFS inode cache.

Sharded, share-nothing cold cache workload with 100,000 files per
thread in per-thread directories.

Before:

Filesystem      Files  Threads  Create       Walk      Chmod      Unlink     Bulkstat
       xfs     400000     4      4.269      3.225      4.557      7.316      1.306
       xfs     800000     8      4.844      3.227      4.702      7.905      1.908
       xfs    1600000    16      6.286      3.296      5.592      8.838      4.392
       xfs    3200000    32      8.912      5.681      8.505     11.724      7.085
       xfs    6400000    64     15.344     11.144     14.162     18.604     15.494

After:

Filesystem      Files  Threads  Create       Walk      Chmod      Unlink     Bulkstat
       xfs     400000     4      4.140      3.502      4.154      7.242      1.164
       xfs     800000     8      4.637      2.836      4.444      7.896      1.093
       xfs    1600000    16      5.549      3.054      5.213      8.696      1.107
       xfs    3200000    32      8.387      3.218      6.867     10.668      1.125
       xfs    6400000    64     14.112      3.953     10.365     18.620      1.270

Bulkstat shows the real story here - before we start to see
scalability problems at 16 threads. Patched shows almost perfect
scalability up to 64 threads streaming inodes through the VFS cache
using I_DONTCACHE semantics.

Note: this is an initial, unoptimised implementation that could be
significantly improved and reduced in size by using a radix tree tag
filter for VFS inodes and so use the generic tag-filtered
xfs_icwalk() implementation instead of special casing it like this
patch does.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 151 +++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_icache.h |   3 +
 fs/xfs/xfs_iops.c   |   1 -
 fs/xfs/xfs_super.c  |  11 ++++
 4 files changed, 163 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index a680e5b82672..ee544556cee7 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1614,6 +1614,155 @@ xfs_blockgc_free_quota(
 			xfs_inode_dquot(ip, XFS_DQTYPE_PROJ), iwalk_flags);
 }
 
+/* VFS Inode Cache Walking Code */
+
+/* XFS inodes in these states are not visible to the VFS. */
+#define XFS_ITER_VFS_NOGRAB_IFLAGS	(XFS_INEW | \
+					 XFS_NEED_INACTIVE | \
+					 XFS_INACTIVATING | \
+					 XFS_IRECLAIMABLE | \
+					 XFS_IRECLAIM)
+/*
+ * If the inode we found is visible to the VFS inode cache, then return it to
+ * the caller.
+ *
+ * In the normal case, we need to validate the VFS inode state and take a
+ * reference to it here. We will drop that reference once the VFS inode has been
+ * processed by the ino_iter_fn.
+ *
+ * However, if the INO_ITER_UNSAFE flag is set, we do not take references to the
+ * inode - it is the ino_iter_fn's responsibility to validate the inode is still
+ * a VFS inode once we hand it to them. We do not drop references after
+ * processing these inodes; the processing function may have evicted the VFS
+ * inode from cache as part of it's processing.
+ */
+static bool
+xfs_iter_vfs_igrab(
+	struct xfs_inode	*ip,
+	int			flags)
+{
+	struct inode		*inode = VFS_I(ip);
+	bool			ret = false;
+
+	ASSERT(rcu_read_lock_held());
+
+	/* Check for stale RCU freed inode */
+	spin_lock(&ip->i_flags_lock);
+	if (!ip->i_ino)
+		goto out_unlock_noent;
+
+	if (ip->i_flags & XFS_ITER_VFS_NOGRAB_IFLAGS)
+		goto out_unlock_noent;
+
+	if ((flags & INO_ITER_UNSAFE) ||
+	    super_iter_iget(inode, flags))
+		ret = true;
+
+out_unlock_noent:
+	spin_unlock(&ip->i_flags_lock);
+	return ret;
+}
+
+/*
+ * Initial implementation of vfs inode walker. This does not use batched lookups
+ * for initial simplicity and testing, though it could use them quite
+ * efficiently for both safe and unsafe iteration contexts.
+ */
+static int
+xfs_icwalk_vfs_inodes_ag(
+	struct xfs_perag	*pag,
+	ino_iter_fn		iter_fn,
+	void			*private_data,
+	int			flags)
+{
+	struct xfs_mount	*mp = pag->pag_mount;
+	uint32_t		first_index = 0;
+	int			ret = 0;
+	int			nr_found;
+	bool			done = false;
+
+	do {
+		struct xfs_inode *ip;
+
+		rcu_read_lock();
+		nr_found = radix_tree_gang_lookup(&pag->pag_ici_root,
+				(void **)&ip, first_index, 1);
+		if (!nr_found) {
+			rcu_read_unlock();
+			break;
+		}
+
+		/*
+		 * Update the index for the next lookup. Catch
+		 * overflows into the next AG range which can occur if
+		 * we have inodes in the last block of the AG and we
+		 * are currently pointing to the last inode.
+		 */
+		first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
+		if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
+			done = true;
+
+		if (!xfs_iter_vfs_igrab(ip, flags)) {
+			rcu_read_unlock();
+			continue;
+		}
+
+		/*
+		 * If we are doing an unsafe iteration, we must continue to hold
+		 * the RCU lock across the callback to guarantee the existence
+		 * of inode. We can't hold the rcu lock for reference counted
+		 * inodes because the callback is allowed to block in that case.
+		 */
+		if (!(flags & INO_ITER_UNSAFE))
+			rcu_read_unlock();
+
+		ret = iter_fn(VFS_I(ip), private_data);
+
+		/*
+		 * We've run the callback, so we can drop the existence
+		 * guarantee we hold on the inode now.
+		 */
+		if (!(flags & INO_ITER_UNSAFE))
+			iput(VFS_I(ip));
+		else
+			rcu_read_unlock();
+
+		if (ret == INO_ITER_ABORT) {
+			ret = 0;
+			break;
+		}
+		if (ret < 0)
+			break;
+
+	} while (!done);
+
+	return ret;
+}
+
+int
+xfs_icwalk_vfs_inodes(
+	struct xfs_mount	*mp,
+	ino_iter_fn		iter_fn,
+	void			*private_data,
+	int			flags)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+	int			ret;
+
+	for_each_perag(mp, agno, pag) {
+		ret = xfs_icwalk_vfs_inodes_ag(pag, iter_fn,
+				private_data, flags);
+		if (ret == INO_ITER_ABORT) {
+			ret = 0;
+			break;
+		}
+		if (ret < 0)
+			break;
+	}
+	return ret;
+}
+
 /* XFS Inode Cache Walking Code */
 
 /*
@@ -1624,7 +1773,6 @@ xfs_blockgc_free_quota(
  */
 #define XFS_LOOKUP_BATCH	32
 
-
 /*
  * Decide if we want to grab this inode in anticipation of doing work towards
  * the goal.
@@ -1700,7 +1848,6 @@ xfs_icwalk_ag(
 		int		i;
 
 		rcu_read_lock();
-
 		nr_found = radix_tree_gang_lookup_tag(&pag->pag_ici_root,
 				(void **) batch, first_index,
 				XFS_LOOKUP_BATCH, goal);
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 905944dafbe5..c2754ea28a88 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -18,6 +18,9 @@ struct xfs_icwalk {
 	long		icw_scan_limit;
 };
 
+int xfs_icwalk_vfs_inodes(struct xfs_mount *mp, ino_iter_fn iter_fn,
+		void *private_data, int flags);
+
 /* Flags that reflect xfs_fs_eofblocks functionality. */
 #define XFS_ICWALK_FLAG_SYNC		(1U << 0) /* sync/wait mode scan */
 #define XFS_ICWALK_FLAG_UID		(1U << 1) /* filter by uid */
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ee79cf161312..5375c17ed69c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1293,7 +1293,6 @@ xfs_setup_inode(
 	inode->i_ino = ip->i_ino;
 	inode->i_state |= I_NEW;
 
-	inode_sb_list_add(inode);
 	/* make the inode look hashed for the writeback code */
 	inode_fake_hash(inode);
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index fbb3a1594c0d..a2ef1b582066 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1179,6 +1179,16 @@ xfs_fs_shutdown(
 	xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REMOVED);
 }
 
+static int
+xfs_fs_iter_vfs_inodes(
+	struct super_block	*sb,
+	ino_iter_fn		iter_fn,
+	void			*private_data,
+	int			flags)
+{
+	return xfs_icwalk_vfs_inodes(XFS_M(sb), iter_fn, private_data, flags);
+}
+
 static const struct super_operations xfs_super_operations = {
 	.alloc_inode		= xfs_fs_alloc_inode,
 	.destroy_inode		= xfs_fs_destroy_inode,
@@ -1193,6 +1203,7 @@ static const struct super_operations xfs_super_operations = {
 	.nr_cached_objects	= xfs_fs_nr_cached_objects,
 	.free_cached_objects	= xfs_fs_free_cached_objects,
 	.shutdown		= xfs_fs_shutdown,
+	.iter_vfs_inodes	= xfs_fs_iter_vfs_inodes,
 };
 
 static int
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 7/7] bcachefs: implement sb->iter_vfs_inodes
  2024-10-02  1:33 [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Dave Chinner
                   ` (5 preceding siblings ...)
  2024-10-02  1:33 ` [PATCH 6/7] xfs: implement sb->iter_vfs_inodes Dave Chinner
@ 2024-10-02  1:33 ` Dave Chinner
  2024-10-02 10:00 ` [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Christian Brauner
  2024-10-03 11:45 ` Jan Kara
  8 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2024-10-02  1:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-bcachefs, kent.overstreet, torvalds

From: Dave Chinner <dchinner@redhat.com>

Untested, probably doesn't work, just a quick hack to indicate
how this could be done with the new bcachefs inode cache.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/bcachefs/fs.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index 4a1bb07a2574..7708ec2b68c1 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -1814,6 +1814,46 @@ void bch2_evict_subvolume_inodes(struct bch_fs *c, snapshot_id_list *s)
 	darray_exit(&grabbed);
 }
 
+static int
+bch2_iter_vfs_inodes(
+        struct super_block      *sb,
+        ino_iter_fn             iter_fn,
+        void                    *private_data,
+        int                     flags)
+{
+	struct bch_inode_info *inode, *old_inode = NULL;
+	int ret = 0;
+
+	mutex_lock(&c->vfs_inodes_lock);
+	list_for_each_entry(inode, &c->vfs_inodes_list, ei_vfs_inode_list) {
+		if (!super_iter_iget(&inode->v, flags))
+			continue;
+
+		if (!(flags & INO_ITER_UNSAFE))
+			mutex_unlock(&c->vfs_inodes_lock);
+
+		ret = iter_fn(VFS_I(ip), private_data);
+		cond_resched();
+
+		if (!(flags & INO_ITER_UNSAFE)) {
+			if (old_inode)
+				iput(&old_inode->v);
+			old_inode = inode;
+			mutex_lock(&c->vfs_inodes_lock);
+		}
+
+		if (ret == INO_ITER_ABORT) {
+			ret = 0;
+			break;
+		}
+		if (ret < 0)
+			break;
+	}
+	if (old_inode)
+		iput(&old_inode->v);
+	return ret;
+}
+
 static int bch2_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct super_block *sb = dentry->d_sb;
@@ -1995,6 +2035,7 @@ static const struct super_operations bch_super_operations = {
 	.put_super	= bch2_put_super,
 	.freeze_fs	= bch2_freeze,
 	.unfreeze_fs	= bch2_unfreeze,
+	.iter_vfs_inodes = bch2_iter_vfs_inodes
 };
 
 static int bch2_set_super(struct super_block *s, void *data)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-02  1:33 [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Dave Chinner
                   ` (6 preceding siblings ...)
  2024-10-02  1:33 ` [PATCH 7/7] bcachefs: " Dave Chinner
@ 2024-10-02 10:00 ` Christian Brauner
  2024-10-02 12:34   ` Dave Chinner
  2024-10-03 11:45 ` Jan Kara
  8 siblings, 1 reply; 72+ messages in thread
From: Christian Brauner @ 2024-10-02 10:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds

On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> The management of the sb->s_inodes list is a scalability limitation;
> it is protected by a single lock and every inode that is
> instantiated has to be added to the list. Those inodes then need to
> be removed from the list when evicted from cache. Hence every inode
> that moves through the VFS inode cache must take this global scope
> lock twice.
> 
> This proves to be a significant limiting factor for concurrent file
> access workloads that repeatedly miss the dentry cache on lookup.
> Directory search and traversal workloads are particularly prone to
> these issues, though on XFS we have enough concurrency capability
> in file creation and unlink for the sb->s_inodes list to be a
> limitation there as well.
> 
> Previous efforts to solve this problem have
> largely centered around reworking the sb->s_inodes list into
> something more scalable such as this longstanding patchset does:
> 
> https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@fromorbit.com/
> 
> However, a recent discussion about inode cache behaviour that arose
> from the bcachefs 6.12-rc1 pull request opened a new direction for
> us to explore. With both XFS and bcachefs now providing their own
> per-superblock inode cache implementations, we should try to make
> use of these inode caches as first class citizens.
> 
> With that new direction in mind, it became obvious that XFS could
> elide the sb->s_inodes list completely - "the best part is no part"
> - if iteration was not reliant on open-coded sb->s_inodes list
> walks.
> 
> We already use the internal inode cache for iteration, and we have
> filters for selecting specific inodes to operate on with specific
> callback operations. If we had an abstraction for iterating
> all VFS inodes, we can easily implement that directly on the XFS
> inode cache.
> 
> This is what this patchset aims to implement.
> 
> There are two superblock iterator functions provided. The first is a
> generic iterator that provides safe, reference counted inodes for
> the callback to operate on. This is generally what most sb->s_inodes
> iterators use, and it allows the iterator to drop locks and perform
> blocking operations on the inode before moving to the next inode in
> the sb->s_inodes list.
> 
> There is one quirk to this interface - INO_ITER_REFERENCE - because
> fsnotify iterates the inode cache -after- evict_inodes() has been
> called during superblock shutdown to evict all non-referenced
> inodes. Hence it should only find referenced inodes, and it has
> a check to skip unreferenced inodes. This flag does the same.
> 
> However, I suspect this is now somewhat sub-optimal because LSMs can
> hold references to inodes beyond evict_inodes(), and they don't get
> torn down until after fsnotify evicts the referenced inodes it
> holds. However, the landlock LSM doesn't have checks for
> unreferenced inodes (i.e. doesn't use INO_ITER_REFERENCE), so this
> guard is not consistently applied.
> 
> I'm undecided on how best to handle this, but it does not need to be
> solved for this patchset to work. fsnotify and
> landlock don't need to run -after- evict_inodes(), but moving them
> to before evict_inodes() mean we now do three full inode cache
> iterations to evict all the inodes from the cache. That doesn't seem
> like a good idea when there might be hundreds of millions of cached
> inodes at unmount.
> 
> Similarly, needing the iterator to be aware that there should be no
> unreferenced inodes left when they run doesn't seem like a good
> idea, either. So perhaps the answer is that the iterator checks for
> SB_ACTIVE (or some other similar flag) that indicates the superblock
> is being torn down and so will skip zero-referenced inodes
> automatically in this case. Like I said - this doesn't need to be
> solved right now, it's just something to be aware of.
> 
> The second iterator is the "unsafe" iterator variant that only
> provides the callback with an existence guarantee. It does this by
> holding the rcu_read_lock() to guarantee that the inode is not freed
> from under the callback. There are no validity checks performed on
> the inode - it is entirely up to the callback to validate the inode
> can be operated on safely.
> 
> Hence the "unsafe" variant is only for very specific internal uses
> only. Nobody should be adding new uses of this function without
> as there are very few good reasons for external access to inodes
> without holding a valid reference. I have not decided whether the
> unsafe callbacks should require a lockdep_assert_in_rcu_read_lock()
> check in them to clearly document the context under which they are
> running.
> 
> The patchset converts all the open coded iterators to use these
> new iterator functions, which means the only use of sb->s_inodes
> is now confined to fs/super.c (iterator API) and fs/inode.c
> (add/remove API). A new superblock operation is then added to
> call out from the iterators into the filesystem to allow them to run
> the iteration instead of walking the sb->s_inodes list.
> 
> XFS is then converted to use this new superblock operation. I didn't
> use the existing iterator function for this functionality right now
> as it is currently based on radix tree tag lookups. It also uses a
> batched 'lookup and lock' mechanism that complicated matters as I
> developed this code. Hence I open coded a new, simpler cache walk
> for testing purposes.
> 
> Now that I have stuff working and I think I have the callback API
> semantics settled, batched radix tree lookups should still work to
> minimise the iteration overhead. Also, we might want to tag VFS
> inodes in the radix tree so that we can filter them efficiently for
> traversals. This would allow us to use the existing generic inode
> cache walker rather than a separate variant as this patch set
> implements. This can be done as future work, though.
> 
> In terms of scalability improvements, a quick 'will it scale' test
> demonstrates where the sb->s_inodes list hurts. Running a sharded,
> share-nothing cold cache workload with 100,000 files per thread in
> per-thread directories gives the following results on a 4-node 64p
> machine with 128GB RAM.
> 
> The workloads "walk", "chmod" and "unlink" are all directory
> traversal workloads that stream cold cache inodes into the cache.
> There is enough memory on this test machine that these indoes are
> not being reclaimed during the workload, and are being freed between
> steps via drop_caches (which iterates the inode cache and so
> explicitly tests the new iteration APIs!). Hence the sb->s_inodes
> scalability issues aren't as bad in these tests as when memory is
> tight and inodes are being reclaimed (i.e. the issues are worse in
> real workloads).
> 
> The "bulkstat" workload uses the XFS bulkstat ioctl to iterate
> inodes via walking the internal inode btrees. It uses
> d_mark_dontcache() so it is actually tearing down each inode as soon
> as it has been sampled by the bulkstat code. Hence it is doing two
> sb->s_inodes list manipulations per inode and so shows scalability
> issues much earlier than the other workloads.
> 
> Before:
> 
> Filesystem      Files  Threads  Create       Walk      Chmod      Unlink     Bulkstat
>        xfs     400000     4      4.269      3.225      4.557      7.316      1.306
>        xfs     800000     8      4.844      3.227      4.702      7.905      1.908
>        xfs    1600000    16      6.286      3.296      5.592      8.838      4.392
>        xfs    3200000    32      8.912      5.681      8.505     11.724      7.085
>        xfs    6400000    64     15.344     11.144     14.162     18.604     15.494
> 
> Bulkstat starts to show issues at 8 threads, walk and chmod between
> 16 and 32 threads, and unlink is limited by internal XFS stuff.
> Bulkstat is bottlenecked at about 400-450 thousand inodes/s by the
> sb->s_inodes list management.
> 
> After:
> 
> Filesystem      Files  Threads  Create       Walk      Chmod      Unlink     Bulkstat
>        xfs     400000     4      4.140      3.502      4.154      7.242      1.164
>        xfs     800000     8      4.637      2.836      4.444      7.896      1.093
>        xfs    1600000    16      5.549      3.054      5.213      8.696      1.107
>        xfs    3200000    32      8.387      3.218      6.867     10.668      1.125
>        xfs    6400000    64     14.112      3.953     10.365     18.620      1.270
> 
> When patched, walk shows little in way of scalability degradation
> out to 64 threads, chmod is significantly improved at 32-64 threads,
> and bulkstat shows perfect scalability out to 64 threads now.
> 
> I did a couple of other longer running, higher inode count tests
> with bulkstat to get an idea of inode cache streaming rates - 32
> million inodes scanned in 4.4 seconds at 64 threads. That's about
> 7.2 million inodes/s being streamed through the inode cache with the
> IO rates are peaking well above 5.5GB/s (near IO bound).
> 
> Hence raw VFS inode cache throughput sees a ~17x scalability
> improvement on XFS at 64 threads (and probably a -lot- more on
> higher CPU count machines).  That's far better performance than I
> ever got from the dlist conversion of the sb->s_inodes list in
> previous patchsets, so this seems like a much better direction to be
> heading for optimising the way we cache inodes.
> 
> I haven't done a lot of testing on this patchset yet - it boots and
> appears to work OK for block devices, ext4 and XFS, but checking
> stuff like quota on/off is still working properly on ext4 hasn't
> been done yet.
> 
> What do people think of moving towards per-sb inode caching and
> traversal mechanisms like this?

Patches 1-4 are great cleanups that I would like us to merge even
independent of the rest.

I don't have big conceptual issues with the series otherwise. The only
thing that makes me a bit uneasy is that we are now providing an api
that may encourage filesystems to do their own inode caching even if
they don't really have a need for it just because it's there. So really
a way that would've solved this issue generically would have been my
preference.

But the reality is that xfs has been doing that private inode cache for
a long time and reading through 5/7 and 6/7 it clearly provides value
for xfs. So I find it hard to object to adding ->iter_vfs_inodes()
(Though I would like to s/iter_vfs_inodes/iter_inodes/g).

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-02 10:00 ` [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Christian Brauner
@ 2024-10-02 12:34   ` Dave Chinner
  2024-10-02 19:29     ` Kent Overstreet
  2024-10-02 19:49     ` Linus Torvalds
  0 siblings, 2 replies; 72+ messages in thread
From: Dave Chinner @ 2024-10-02 12:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds

On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > What do people think of moving towards per-sb inode caching and
> > traversal mechanisms like this?
> 
> Patches 1-4 are great cleanups that I would like us to merge even
> independent of the rest.

Yes, they make it much easier to manage the iteration code.

> I don't have big conceptual issues with the series otherwise. The only
> thing that makes me a bit uneasy is that we are now providing an api
> that may encourage filesystems to do their own inode caching even if
> they don't really have a need for it just because it's there.  So really
> a way that would've solved this issue generically would have been my
> preference.

Well, that's the problem, isn't it? :/

There really isn't a good generic solution for global list access
and management.  The dlist stuff kinda works, but it still has
significant overhead and doesn't get rid of spinlock contention
completely because of the lack of locality between list add and
remove operations.

i.e. dlist is optimised for low contention add operations (i.e.
local to the CPU). However, removal is not a local operation - it
almsot always happens on a different CPU to the add operation.
Hence removal always pulls the list and lock away from the CPU that
"owns" them, and hence there is still contention when inodes are
streaming through memory. This causes enough overhead that dlist
operations are still very visible in CPU profiles during scalability
testing...

XFS (and now bcachefs) have their own per-sb inode cache
implementations, and hence for them the sb->s_inodes list is pure
overhead.  If we restructure the generic inode cache infrastructure
to also be per-sb (this suggestion from Linus was what lead me to
this patch set), then they will also likely not need the
sb->s_inodes list, too.

That's the longer term "generic solution" to the sb->s_inodes list
scalability problem (i.e. get rid of it!), but it's a much larger
and longer term undertaking. Once we know what that new generic
inode cache infrastructure looks like, we'll probably only want to
be converting one filesystem at a time to the new infrastucture.

We'll need infrastructure to allow alternative per-sb iteration
mechanisms for such a conversion take place - the converted
filesystems will likely call a generic ->iter_vfs_inodes()
implementation based on the per-sb inode cache infrastructure rather
than iterating sb->s_inodes. Eventually, we'll end up with that
generic method replacing the sb->s_inodes iteration, we'll end up
with only a couple of filesystems using the callout again.

> But the reality is that xfs has been doing that private inode cache for
> a long time and reading through 5/7 and 6/7 it clearly provides value
> for xfs. So I find it hard to object to adding ->iter_vfs_inodes()
> (Though I would like to s/iter_vfs_inodes/iter_inodes/g).

I named it that way because, from my filesystem centric point of
view, there is a very distinct separation between VFS and filesystem
inodes. The VFS inode (struct inode) is a subset of the filesystem
inode structure and, in XFS's case, a subset of the filesystem inode
life cycle, too.

i.e. this method should not iterate cached filesystem inodes that
exist outside the VFS inode lifecycle or VFS visibility even though
they may be present in the filesystem's internal inode cache.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-02 12:34   ` Dave Chinner
@ 2024-10-02 19:29     ` Kent Overstreet
  2024-10-02 22:23       ` Dave Chinner
  2024-10-02 19:49     ` Linus Torvalds
  1 sibling, 1 reply; 72+ messages in thread
From: Kent Overstreet @ 2024-10-02 19:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, linux-fsdevel, linux-xfs, linux-bcachefs,
	torvalds

On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > > What do people think of moving towards per-sb inode caching and
> > > traversal mechanisms like this?
> > 
> > Patches 1-4 are great cleanups that I would like us to merge even
> > independent of the rest.
> 
> Yes, they make it much easier to manage the iteration code.
> 
> > I don't have big conceptual issues with the series otherwise. The only
> > thing that makes me a bit uneasy is that we are now providing an api
> > that may encourage filesystems to do their own inode caching even if
> > they don't really have a need for it just because it's there.  So really
> > a way that would've solved this issue generically would have been my
> > preference.
> 
> Well, that's the problem, isn't it? :/
> 
> There really isn't a good generic solution for global list access
> and management.  The dlist stuff kinda works, but it still has
> significant overhead and doesn't get rid of spinlock contention
> completely because of the lack of locality between list add and
> remove operations.

There is though; I haven't posted it yet because it still needs some
work, but the concept works and performs about the same as dlock-list.

https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list

The thing that needs to be sorted before posting is that it can't shrink
the radix tree. generic-radix-tree doesn't support shrinking, and I
could add that, but then ida doesn't provide a way to query the highest
id allocated (xarray doesn't support backwards iteration).

So I'm going to try it using idr and see how that performs (idr is not
really the right data structure for this, split ida and item radix tree
is better, so might end up doing something else).

But - this approach with more work will work for the list_lru lock
contention as well.

From 32cb8103ecfacdd5ed8e1eb390221c3f8339de6f Mon Sep 17 00:00:00 2001
From: Kent Overstreet <kent.overstreet@linux.dev>
Date: Sat, 28 Sep 2024 16:22:38 -0400
Subject: [PATCH] lib/fast_list.c

A fast "list" data structure, which is actually a radix tree, with an
IDA for slot allocation and a percpu buffer on top of that.

Items cannot be added or moved to the head or tail, only added at some
(arbitrary) position and removed. The advantage is that adding, removing
and iteration is generally lockless, only hitting the lock in ida when
the percpu buffer is full or empty.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

diff --git a/include/linux/fast_list.h b/include/linux/fast_list.h
new file mode 100644
index 000000000000..7d5d8592864d
--- /dev/null
+++ b/include/linux/fast_list.h
@@ -0,0 +1,22 @@
+#ifndef _LINUX_FAST_LIST_H
+#define _LINUX_FAST_LIST_H
+
+#include <linux/generic-radix-tree.h>
+#include <linux/idr.h>
+#include <linux/percpu.h>
+
+struct fast_list_pcpu;
+
+struct fast_list {
+	GENRADIX(void *)	items;
+	struct ida		slots_allocated;;
+	struct fast_list_pcpu	*buffer;
+};
+
+int fast_list_get_idx(struct fast_list *l);
+int fast_list_add(struct fast_list *l, void *item);
+void fast_list_remove(struct fast_list *l, unsigned idx);
+void fast_list_exit(struct fast_list *l);
+int fast_list_init(struct fast_list *l);
+
+#endif /* _LINUX_FAST_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index 773adf88af41..85cf5a0d36b1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -49,7 +49,7 @@ obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
 	 bsearch.o find_bit.o llist.o lwq.o memweight.o kfifo.o \
 	 percpu-refcount.o rhashtable.o base64.o \
 	 once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
-	 generic-radix-tree.o bitmap-str.o
+	 generic-radix-tree.o bitmap-str.o fast_list.o
 obj-$(CONFIG_STRING_KUNIT_TEST) += string_kunit.o
 obj-y += string_helpers.o
 obj-$(CONFIG_STRING_HELPERS_KUNIT_TEST) += string_helpers_kunit.o
diff --git a/lib/fast_list.c b/lib/fast_list.c
new file mode 100644
index 000000000000..bbb69bb29687
--- /dev/null
+++ b/lib/fast_list.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Fast, unordered lists
+ *
+ * Supports add, remove, and iterate
+ *
+ * Underneath, they're a radix tree and an IDA, with a percpu buffer for slot
+ * allocation and freeing.
+ *
+ * This means that adding, removing, and iterating over items is lockless,
+ * except when refilling/emptying the percpu slot buffers.
+ */
+
+#include <linux/fast_list.h>
+
+struct fast_list_pcpu {
+	size_t			nr;
+	size_t			entries[31];
+};
+
+/**
+ * fast_list_get_idx - get a slot in a fast_list
+ * @l:		list to get slot in
+ *
+ * This allocates a slot in the radix tree without storing to it, so that we can
+ * take the potential memory allocation failure early and do the list add later
+ * when we can't take an allocation failure.
+ *
+ * Returns: positive integer on success, -ENOMEM on failure
+ */
+int fast_list_get_idx(struct fast_list *l)
+{
+	int idx;
+
+	preempt_disable();
+	struct fast_list_pcpu *lp = this_cpu_ptr(l->buffer);
+
+	if (unlikely(!lp->nr))
+		while (lp->nr <= ARRAY_SIZE(lp->entries) / 2) {
+			idx = ida_alloc_range(&l->slots_allocated, 1, ~0, GFP_NOWAIT|__GFP_NOWARN);
+			if (unlikely(idx < 0)) {
+				preempt_enable();
+				idx = ida_alloc_range(&l->slots_allocated, 1, ~0, GFP_KERNEL);
+				if (unlikely(idx < 0))
+					return idx;
+
+				preempt_disable();
+				lp = this_cpu_ptr(l->buffer);
+			}
+
+			if (unlikely(!genradix_ptr_alloc_inlined(&l->items, idx,
+							GFP_NOWAIT|__GFP_NOWARN))) {
+				preempt_enable();
+				if (!genradix_ptr_alloc(&l->items, idx, GFP_KERNEL)) {
+					ida_free(&l->slots_allocated, idx);
+					return -ENOMEM;
+				}
+
+				preempt_disable();
+				lp = this_cpu_ptr(l->buffer);
+			}
+
+			if (unlikely(lp->nr == ARRAY_SIZE(lp->entries)))
+				ida_free(&l->slots_allocated, idx);
+			else
+				lp->entries[lp->nr++] = idx;
+		}
+
+	idx = lp->entries[--lp->nr];
+	preempt_enable();
+
+	return idx;
+}
+
+/**
+ * fast_list_add - add an item to a fast_list
+ * @l:		list
+ * @item:	item to add
+ *
+ * Allocates a slot in the radix tree and stores to it and then returns the
+ * slot index, which must be passed to fast_list_remove().
+ *
+ * Returns: positive integer on success, -ENOMEM on failure
+ */
+int fast_list_add(struct fast_list *l, void *item)
+{
+	int idx = fast_list_get_idx(l);
+	if (idx < 0)
+		return idx;
+
+	*genradix_ptr_inlined(&l->items, idx) = item;
+	return idx;
+}
+
+/**
+ * fast_list_remove - remove an item from a fast_list
+ * @l:		list
+ * @idx:	item's slot index
+ *
+ * Zeroes out the slot in the radix tree and frees the slot for future
+ * fast_list_add() operations.
+ */
+void fast_list_remove(struct fast_list *l, unsigned idx)
+{
+	if (!idx)
+		return;
+
+	*genradix_ptr_inlined(&l->items, idx) = NULL;
+
+	preempt_disable();
+	struct fast_list_pcpu *lp = this_cpu_ptr(l->buffer);
+
+	if (unlikely(lp->nr == ARRAY_SIZE(lp->entries)))
+		while (lp->nr >= ARRAY_SIZE(lp->entries) / 2) {
+			ida_free(&l->slots_allocated, idx);
+			idx = lp->entries[--lp->nr];
+		}
+
+	lp->entries[lp->nr++] = idx;
+	preempt_enable();
+}
+
+void fast_list_exit(struct fast_list *l)
+{
+	/* XXX: warn if list isn't empty */
+	free_percpu(l->buffer);
+	ida_destroy(&l->slots_allocated);
+	genradix_free(&l->items);
+}
+
+int fast_list_init(struct fast_list *l)
+{
+	genradix_init(&l->items);
+	ida_init(&l->slots_allocated);
+	l->buffer = alloc_percpu(*l->buffer);
+	if (!l->buffer)
+		return -ENOMEM;
+	return 0;
+}

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-02 12:34   ` Dave Chinner
  2024-10-02 19:29     ` Kent Overstreet
@ 2024-10-02 19:49     ` Linus Torvalds
  2024-10-02 20:28       ` Kent Overstreet
  1 sibling, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2024-10-02 19:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, linux-fsdevel, linux-xfs, linux-bcachefs,
	kent.overstreet

On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
>
> > I don't have big conceptual issues with the series otherwise. The only
> > thing that makes me a bit uneasy is that we are now providing an api
> > that may encourage filesystems to do their own inode caching even if
> > they don't really have a need for it just because it's there.  So really
> > a way that would've solved this issue generically would have been my
> > preference.
>
> Well, that's the problem, isn't it? :/
>
> There really isn't a good generic solution for global list access
> and management.  The dlist stuff kinda works, but it still has
> significant overhead and doesn't get rid of spinlock contention
> completely because of the lack of locality between list add and
> remove operations.

I much prefer the approach taken in your patch series, to let the
filesystem own the inode list and keeping the old model as the
"default list".

In many ways, that is how *most* of the VFS layer works - it exposes
helper functions that the filesystems can use (and most do), but
doesn't force them.

Yes, the VFS layer does force some things - you can't avoid using
dentries, for example, because that's literally how the VFS layer
deals with filenames (and things like mounting etc). And honestly, the
VFS layer does a better job of filename caching than any filesystem
really can do, and with the whole UNIX mount model, filenames
fundamentally cross filesystem boundaries anyway.

But clearly the VFS layer inode list handling isn't the best it can
be, and unless we can fix that in some fundamental way (and I don't
love the "let's use crazy lists instead of a simple one" models) I do
think that just letting filesystems do their own thing if they have
something better is a good model.

That's how we deal with all the basic IO, after all. The VFS layer has
lots of support routines, but filesystems don't *have* to use things
like generic_file_read_iter() and friends.

Yes, most filesystems do use generic_file_read_iter() in some form or
other (sometimes raw, sometimes wrapped with filesystem logic),
because it fits their model, it's convenient, and it handles all the
normal stuff well, but you don't *have* to use it if you have special
needs.

Taking that approach to the inode caching sounds sane to me, and I
generally like Dave's series. It looks like an improvement to me.

              Linus

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-02 19:49     ` Linus Torvalds
@ 2024-10-02 20:28       ` Kent Overstreet
  2024-10-02 23:17         ` Dave Chinner
  0 siblings, 1 reply; 72+ messages in thread
From: Kent Overstreet @ 2024-10-02 20:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Christian Brauner, linux-fsdevel, linux-xfs,
	linux-bcachefs

On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> >
> > > I don't have big conceptual issues with the series otherwise. The only
> > > thing that makes me a bit uneasy is that we are now providing an api
> > > that may encourage filesystems to do their own inode caching even if
> > > they don't really have a need for it just because it's there.  So really
> > > a way that would've solved this issue generically would have been my
> > > preference.
> >
> > Well, that's the problem, isn't it? :/
> >
> > There really isn't a good generic solution for global list access
> > and management.  The dlist stuff kinda works, but it still has
> > significant overhead and doesn't get rid of spinlock contention
> > completely because of the lack of locality between list add and
> > remove operations.
> 
> I much prefer the approach taken in your patch series, to let the
> filesystem own the inode list and keeping the old model as the
> "default list".
> 
> In many ways, that is how *most* of the VFS layer works - it exposes
> helper functions that the filesystems can use (and most do), but
> doesn't force them.
> 
> Yes, the VFS layer does force some things - you can't avoid using
> dentries, for example, because that's literally how the VFS layer
> deals with filenames (and things like mounting etc). And honestly, the
> VFS layer does a better job of filename caching than any filesystem
> really can do, and with the whole UNIX mount model, filenames
> fundamentally cross filesystem boundaries anyway.
> 
> But clearly the VFS layer inode list handling isn't the best it can
> be, and unless we can fix that in some fundamental way (and I don't
> love the "let's use crazy lists instead of a simple one" models) I do
> think that just letting filesystems do their own thing if they have
> something better is a good model.

Well, I don't love adding more indirection and callbacks.

The underlying approach in this patchset of "just use the inode hash
table if that's available" - that I _do_ like, but this seems like
the wrong way to go about it, we're significantly adding to the amount
of special purpose "things" filesystems have to do if they want to
perform well.

Converting the standard inode hash table to an rhashtable (or more
likely, creating a new standard implementation and converting
filesystems one at a time) still needs to happen, and then the "use the
hash table for iteration" approach could use that without every
filesystem having to specialize.

Failing that, or even regardless, I think we do need either dlock-list
or fast-list. "I need some sort of generic list, but fast" is something
I've seen come up way too many times.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-02 19:29     ` Kent Overstreet
@ 2024-10-02 22:23       ` Dave Chinner
  2024-10-02 23:20         ` Kent Overstreet
  0 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2024-10-02 22:23 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christian Brauner, linux-fsdevel, linux-xfs, linux-bcachefs,
	torvalds

On Wed, Oct 02, 2024 at 03:29:10PM -0400, Kent Overstreet wrote:
> On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote:
> > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > > > What do people think of moving towards per-sb inode caching and
> > > > traversal mechanisms like this?
> > > 
> > > Patches 1-4 are great cleanups that I would like us to merge even
> > > independent of the rest.
> > 
> > Yes, they make it much easier to manage the iteration code.
> > 
> > > I don't have big conceptual issues with the series otherwise. The only
> > > thing that makes me a bit uneasy is that we are now providing an api
> > > that may encourage filesystems to do their own inode caching even if
> > > they don't really have a need for it just because it's there.  So really
> > > a way that would've solved this issue generically would have been my
> > > preference.
> > 
> > Well, that's the problem, isn't it? :/
> > 
> > There really isn't a good generic solution for global list access
> > and management.  The dlist stuff kinda works, but it still has
> > significant overhead and doesn't get rid of spinlock contention
> > completely because of the lack of locality between list add and
> > remove operations.
> 
> There is though; I haven't posted it yet because it still needs some
> work, but the concept works and performs about the same as dlock-list.
> 
> https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list
> 
> The thing that needs to be sorted before posting is that it can't shrink
> the radix tree. generic-radix-tree doesn't support shrinking, and I
> could add that, but then ida doesn't provide a way to query the highest
> id allocated (xarray doesn't support backwards iteration).

That's an interesting construct, but...

> So I'm going to try it using idr and see how that performs (idr is not
> really the right data structure for this, split ida and item radix tree
> is better, so might end up doing something else).
> 
> But - this approach with more work will work for the list_lru lock
> contention as well.

....  it isn't a generic solution because it is dependent on
blocking memory allocation succeeding for list_add() operations.

Hence this cannot do list operations under external synchronisation
constructs like spinlocks or rcu_read_lock(). It also introduces
interesting interactions with memory reclaim - what happens we have
to add an object to one of these lists from memory reclaim context?

Taking the example of list_lru, this list construct will not work
for a variety of reasons. Some of them are:

- list_lru_add() being called from list_lru_add_obj() under RCU for
  memcg aware LRUs so cannot block and must not fail.
- list_lru_add_obj() is called under spinlocks from inode_lru_add(),
  the xfs buffer and dquot caches, the workingset code from under
  the address space mapping xarray lock, etc. Again, this must not
  fail.
- list_lru_add() operations take can place in large numbers in
  memory reclaim context (e.g. dentry reclaim drops inodes which
  adds them to the inode lru). Hence memory reclaim becomes even
  more dependent on PF_MEMALLOC memory allocation making forwards
  progress.
- adding long tail list latency to what are currently O(1) fast path
  operations (e.g.  mulitple allocations tree splits for LRUs
  tracking millions of objects) is not desirable.
- LRU lists are -ordered- (it's right there in the name!) and this
  appears to be an unordered list construct.

So while I think this is an interesting idea that might be useful in
some cases, I don't think it is a viable generic scalable list
construct we can use in areas like list_lru or global list
management that run under external synchronisation mechanisms.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-02 20:28       ` Kent Overstreet
@ 2024-10-02 23:17         ` Dave Chinner
  2024-10-03  1:22           ` Kent Overstreet
  0 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2024-10-02 23:17 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, linux-xfs,
	linux-bcachefs

On Wed, Oct 02, 2024 at 04:28:35PM -0400, Kent Overstreet wrote:
> On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> > On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > >
> > > > I don't have big conceptual issues with the series otherwise. The only
> > > > thing that makes me a bit uneasy is that we are now providing an api
> > > > that may encourage filesystems to do their own inode caching even if
> > > > they don't really have a need for it just because it's there.  So really
> > > > a way that would've solved this issue generically would have been my
> > > > preference.
> > >
> > > Well, that's the problem, isn't it? :/
> > >
> > > There really isn't a good generic solution for global list access
> > > and management.  The dlist stuff kinda works, but it still has
> > > significant overhead and doesn't get rid of spinlock contention
> > > completely because of the lack of locality between list add and
> > > remove operations.
> > 
> > I much prefer the approach taken in your patch series, to let the
> > filesystem own the inode list and keeping the old model as the
> > "default list".
> > 
> > In many ways, that is how *most* of the VFS layer works - it exposes
> > helper functions that the filesystems can use (and most do), but
> > doesn't force them.
> > 
> > Yes, the VFS layer does force some things - you can't avoid using
> > dentries, for example, because that's literally how the VFS layer
> > deals with filenames (and things like mounting etc). And honestly, the
> > VFS layer does a better job of filename caching than any filesystem
> > really can do, and with the whole UNIX mount model, filenames
> > fundamentally cross filesystem boundaries anyway.
> > 
> > But clearly the VFS layer inode list handling isn't the best it can
> > be, and unless we can fix that in some fundamental way (and I don't
> > love the "let's use crazy lists instead of a simple one" models) I do
> > think that just letting filesystems do their own thing if they have
> > something better is a good model.
> 
> Well, I don't love adding more indirection and callbacks.

It's way better than open coding inode cache traversals everywhere.

The callback model is simply "call this function on every object",
and it allows implementations the freedom to decide how they are
going to run those callbacks.

For example, this abstraction allows XFS to parallelise the
traversal. We currently run the traversal across all inodes in a
single thread, but now that XFS is walking the inode cache we can
push each shard off to a workqueue and run each shard concurrently.
IOWs, we can actually make the traversal of large caches much, much
faster without changing the semantics of the operation the traversal
is trying to acheive.

We simply cannot do things like that without a new iteration model.
Abstraction is necessary to facilitate a new iteration model, and a
model that provides independent object callbacks allows scope for
concurrent processing of individual objects.

> The underlying approach in this patchset of "just use the inode hash
> table if that's available" - that I _do_ like, but this seems like
> the wrong way to go about it, we're significantly adding to the amount
> of special purpose "things" filesystems have to do if they want to
> perform well.

I've already addressed this in my response to Christian. This is a
mechanism that allows filesystems to be moved one-by-one to a new
generic cache and iteration implementation without impacting
existing code. Once we have that, scalability of the inode cache and
traversals should not be a reason for filesystems "doing their own
thing" because the generic infrastructure will be sufficient for
most filesystem implementations.

> Converting the standard inode hash table to an rhashtable (or more
> likely, creating a new standard implementation and converting
> filesystems one at a time) still needs to happen, and then the "use the
> hash table for iteration" approach could use that without every
> filesystem having to specialize.

Yes, but this still doesn't help filesystems like XFS where the
structure of the inode cache is highly optimised for the specific
on-disk and in-memory locality of inodes. We aren't going to be
converting XFS to a rhashtable based inode cache anytime soon
because it simply doesn't provide the functionality we require.
e.g. efficient lockless sequential inode number ordered traversal in
-every- inode cluster writeback operation.

> Failing that, or even regardless, I think we do need either dlock-list
> or fast-list. "I need some sort of generic list, but fast" is something
> I've seen come up way too many times.

There's nothing stopping you from using the dlist patchset for your
own purposes. It's public code - just make sure you retain the
correct attributions. :)

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-02 22:23       ` Dave Chinner
@ 2024-10-02 23:20         ` Kent Overstreet
  2024-10-03  1:41           ` Dave Chinner
  0 siblings, 1 reply; 72+ messages in thread
From: Kent Overstreet @ 2024-10-02 23:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, linux-fsdevel, linux-xfs, linux-bcachefs,
	torvalds

On Thu, Oct 03, 2024 at 08:23:44AM GMT, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 03:29:10PM -0400, Kent Overstreet wrote:
> > On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote:
> > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > > > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > > > > What do people think of moving towards per-sb inode caching and
> > > > > traversal mechanisms like this?
> > > > 
> > > > Patches 1-4 are great cleanups that I would like us to merge even
> > > > independent of the rest.
> > > 
> > > Yes, they make it much easier to manage the iteration code.
> > > 
> > > > I don't have big conceptual issues with the series otherwise. The only
> > > > thing that makes me a bit uneasy is that we are now providing an api
> > > > that may encourage filesystems to do their own inode caching even if
> > > > they don't really have a need for it just because it's there.  So really
> > > > a way that would've solved this issue generically would have been my
> > > > preference.
> > > 
> > > Well, that's the problem, isn't it? :/
> > > 
> > > There really isn't a good generic solution for global list access
> > > and management.  The dlist stuff kinda works, but it still has
> > > significant overhead and doesn't get rid of spinlock contention
> > > completely because of the lack of locality between list add and
> > > remove operations.
> > 
> > There is though; I haven't posted it yet because it still needs some
> > work, but the concept works and performs about the same as dlock-list.
> > 
> > https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list
> > 
> > The thing that needs to be sorted before posting is that it can't shrink
> > the radix tree. generic-radix-tree doesn't support shrinking, and I
> > could add that, but then ida doesn't provide a way to query the highest
> > id allocated (xarray doesn't support backwards iteration).
> 
> That's an interesting construct, but...
> 
> > So I'm going to try it using idr and see how that performs (idr is not
> > really the right data structure for this, split ida and item radix tree
> > is better, so might end up doing something else).
> > 
> > But - this approach with more work will work for the list_lru lock
> > contention as well.
> 
> ....  it isn't a generic solution because it is dependent on
> blocking memory allocation succeeding for list_add() operations.
> 
> Hence this cannot do list operations under external synchronisation
> constructs like spinlocks or rcu_read_lock(). It also introduces
> interesting interactions with memory reclaim - what happens we have
> to add an object to one of these lists from memory reclaim context?
> 
> Taking the example of list_lru, this list construct will not work
> for a variety of reasons. Some of them are:
> 
> - list_lru_add() being called from list_lru_add_obj() under RCU for
>   memcg aware LRUs so cannot block and must not fail.
> - list_lru_add_obj() is called under spinlocks from inode_lru_add(),
>   the xfs buffer and dquot caches, the workingset code from under
>   the address space mapping xarray lock, etc. Again, this must not
>   fail.
> - list_lru_add() operations take can place in large numbers in
>   memory reclaim context (e.g. dentry reclaim drops inodes which
>   adds them to the inode lru). Hence memory reclaim becomes even
>   more dependent on PF_MEMALLOC memory allocation making forwards
>   progress.
> - adding long tail list latency to what are currently O(1) fast path
>   operations (e.g.  mulitple allocations tree splits for LRUs
>   tracking millions of objects) is not desirable.
> 
> So while I think this is an interesting idea that might be useful in
> some cases, I don't think it is a viable generic scalable list
> construct we can use in areas like list_lru or global list
> management that run under external synchronisation mechanisms.

There are difficulties, but given the fundamental scalability and
locking issues with linked lists, I think this is the approach we want
if we can make it work.

A couple things that help - we've already determined that the inode LRU
can go away for most filesystems, and we can preallocate slots without
actually adding objects. Iteration will see NULLs that they skip over,
so we can't simply preallocate a slot for everything if nr_live_objects
/ nr_lru_objects is too big. But, we can certainly preallocate slots on
a given code path and then release them back to the percpu buffer if
they're not used.

> - LRU lists are -ordered- (it's right there in the name!) and this
>   appears to be an unordered list construct.

Yes, it is. But in actual practice cache replacement policy tends not to
matter nearly as much as people think; there's many papers showing real
world hit ratio of common algorithms is only a fudge factor from random
replacement - the main thing you want is an accessed bit (or counter, if
you want the analagous version of n-lru for n > 2), and we'll still have
that.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-02 23:17         ` Dave Chinner
@ 2024-10-03  1:22           ` Kent Overstreet
  2024-10-03  2:20             ` Dave Chinner
  0 siblings, 1 reply; 72+ messages in thread
From: Kent Overstreet @ 2024-10-03  1:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, linux-xfs,
	linux-bcachefs

On Thu, Oct 03, 2024 at 09:17:08AM GMT, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 04:28:35PM -0400, Kent Overstreet wrote:
> > On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> > > On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > > >
> > > > > I don't have big conceptual issues with the series otherwise. The only
> > > > > thing that makes me a bit uneasy is that we are now providing an api
> > > > > that may encourage filesystems to do their own inode caching even if
> > > > > they don't really have a need for it just because it's there.  So really
> > > > > a way that would've solved this issue generically would have been my
> > > > > preference.
> > > >
> > > > Well, that's the problem, isn't it? :/
> > > >
> > > > There really isn't a good generic solution for global list access
> > > > and management.  The dlist stuff kinda works, but it still has
> > > > significant overhead and doesn't get rid of spinlock contention
> > > > completely because of the lack of locality between list add and
> > > > remove operations.
> > > 
> > > I much prefer the approach taken in your patch series, to let the
> > > filesystem own the inode list and keeping the old model as the
> > > "default list".
> > > 
> > > In many ways, that is how *most* of the VFS layer works - it exposes
> > > helper functions that the filesystems can use (and most do), but
> > > doesn't force them.
> > > 
> > > Yes, the VFS layer does force some things - you can't avoid using
> > > dentries, for example, because that's literally how the VFS layer
> > > deals with filenames (and things like mounting etc). And honestly, the
> > > VFS layer does a better job of filename caching than any filesystem
> > > really can do, and with the whole UNIX mount model, filenames
> > > fundamentally cross filesystem boundaries anyway.
> > > 
> > > But clearly the VFS layer inode list handling isn't the best it can
> > > be, and unless we can fix that in some fundamental way (and I don't
> > > love the "let's use crazy lists instead of a simple one" models) I do
> > > think that just letting filesystems do their own thing if they have
> > > something better is a good model.
> > 
> > Well, I don't love adding more indirection and callbacks.
> 
> It's way better than open coding inode cache traversals everywhere.

Eh? You had a nice iterator for dlock-list :)

> The callback model is simply "call this function on every object",
> and it allows implementations the freedom to decide how they are
> going to run those callbacks.
> 
> For example, this abstraction allows XFS to parallelise the
> traversal. We currently run the traversal across all inodes in a
> single thread, but now that XFS is walking the inode cache we can
> push each shard off to a workqueue and run each shard concurrently.
> IOWs, we can actually make the traversal of large caches much, much
> faster without changing the semantics of the operation the traversal
> is trying to acheive.
> 
> We simply cannot do things like that without a new iteration model.
> Abstraction is necessary to facilitate a new iteration model, and a
> model that provides independent object callbacks allows scope for
> concurrent processing of individual objects.

Parallelized iteration is a slick possibility.

My concern is that we've been trying to get away from callbacks for
iteration - post spectre they're quite a bit more expensive than
external iterators, and we've generally been successful with that. 

> 
> > The underlying approach in this patchset of "just use the inode hash
> > table if that's available" - that I _do_ like, but this seems like
> > the wrong way to go about it, we're significantly adding to the amount
> > of special purpose "things" filesystems have to do if they want to
> > perform well.
> 
> I've already addressed this in my response to Christian. This is a
> mechanism that allows filesystems to be moved one-by-one to a new
> generic cache and iteration implementation without impacting
> existing code. Once we have that, scalability of the inode cache and
> traversals should not be a reason for filesystems "doing their own
> thing" because the generic infrastructure will be sufficient for
> most filesystem implementations.

Well, I'm not really seeing the need; based on my performance testing
both dlock-list and fast-list completely shift the bottleneck to the
lru_list locking - and in my testing both patchsets were about equal, to
within the margin of error.

Which is a touch surprising, given that dlock-list works similarly to
lru_list - possibly it's because you only have siblings sharing lists
vs. numa nodes for lru lists, or lru scanning is doing more cross
cpu/node accesses.

> > Converting the standard inode hash table to an rhashtable (or more
> > likely, creating a new standard implementation and converting
> > filesystems one at a time) still needs to happen, and then the "use the
> > hash table for iteration" approach could use that without every
> > filesystem having to specialize.
> 
> Yes, but this still doesn't help filesystems like XFS where the
> structure of the inode cache is highly optimised for the specific
> on-disk and in-memory locality of inodes. We aren't going to be
> converting XFS to a rhashtable based inode cache anytime soon
> because it simply doesn't provide the functionality we require.
> e.g. efficient lockless sequential inode number ordered traversal in
> -every- inode cluster writeback operation.

I was going to ask what your requirements are - I may take on the
general purpose inode rhashtable code, although since I'm still pretty
buried we'll see.

Coincidentally, just today I'm working on an issue in bcachefs where
we'd also prefer an ordered data structure to a hash table for the inode
cache - in online fsck, we need to be able to check if an inode is still
open, but checking for an inode in an interior snapshot node means we
have to do a scan and check if any of the open inodes are in a
descendent subvolume.

Radix tree doesn't work for us, since our keys are { inum, subvol } - 96
bits - but it has me considering looking at maple trees (or something
like the lockless RCU btree you were working on awhile back) - those
modern approaches should be approaching hash table performance, if
enough needs for ordered access come up.

> > Failing that, or even regardless, I think we do need either dlock-list
> > or fast-list. "I need some sort of generic list, but fast" is something
> > I've seen come up way too many times.
> 
> There's nothing stopping you from using the dlist patchset for your
> own purposes. It's public code - just make sure you retain the
> correct attributions. :)

If this patchset goes in that might be just what I do, if I don't get
around to finishing fast-list :)

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-02 23:20         ` Kent Overstreet
@ 2024-10-03  1:41           ` Dave Chinner
  2024-10-03  2:24             ` Kent Overstreet
  2024-10-03  9:17             ` Jan Kara
  0 siblings, 2 replies; 72+ messages in thread
From: Dave Chinner @ 2024-10-03  1:41 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christian Brauner, linux-fsdevel, linux-xfs, linux-bcachefs,
	torvalds

On Wed, Oct 02, 2024 at 07:20:16PM -0400, Kent Overstreet wrote:
> On Thu, Oct 03, 2024 at 08:23:44AM GMT, Dave Chinner wrote:
> > On Wed, Oct 02, 2024 at 03:29:10PM -0400, Kent Overstreet wrote:
> > > On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote:
> > > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > > > > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > > > > > What do people think of moving towards per-sb inode caching and
> > > > > > traversal mechanisms like this?
> > > > > 
> > > > > Patches 1-4 are great cleanups that I would like us to merge even
> > > > > independent of the rest.
> > > > 
> > > > Yes, they make it much easier to manage the iteration code.
> > > > 
> > > > > I don't have big conceptual issues with the series otherwise. The only
> > > > > thing that makes me a bit uneasy is that we are now providing an api
> > > > > that may encourage filesystems to do their own inode caching even if
> > > > > they don't really have a need for it just because it's there.  So really
> > > > > a way that would've solved this issue generically would have been my
> > > > > preference.
> > > > 
> > > > Well, that's the problem, isn't it? :/
> > > > 
> > > > There really isn't a good generic solution for global list access
> > > > and management.  The dlist stuff kinda works, but it still has
> > > > significant overhead and doesn't get rid of spinlock contention
> > > > completely because of the lack of locality between list add and
> > > > remove operations.
> > > 
> > > There is though; I haven't posted it yet because it still needs some
> > > work, but the concept works and performs about the same as dlock-list.
> > > 
> > > https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list
> > > 
> > > The thing that needs to be sorted before posting is that it can't shrink
> > > the radix tree. generic-radix-tree doesn't support shrinking, and I
> > > could add that, but then ida doesn't provide a way to query the highest
> > > id allocated (xarray doesn't support backwards iteration).
> > 
> > That's an interesting construct, but...
> > 
> > > So I'm going to try it using idr and see how that performs (idr is not
> > > really the right data structure for this, split ida and item radix tree
> > > is better, so might end up doing something else).
> > > 
> > > But - this approach with more work will work for the list_lru lock
> > > contention as well.
> > 
> > ....  it isn't a generic solution because it is dependent on
> > blocking memory allocation succeeding for list_add() operations.
> > 
> > Hence this cannot do list operations under external synchronisation
> > constructs like spinlocks or rcu_read_lock(). It also introduces
> > interesting interactions with memory reclaim - what happens we have
> > to add an object to one of these lists from memory reclaim context?
> > 
> > Taking the example of list_lru, this list construct will not work
> > for a variety of reasons. Some of them are:
> > 
> > - list_lru_add() being called from list_lru_add_obj() under RCU for
> >   memcg aware LRUs so cannot block and must not fail.
> > - list_lru_add_obj() is called under spinlocks from inode_lru_add(),
> >   the xfs buffer and dquot caches, the workingset code from under
> >   the address space mapping xarray lock, etc. Again, this must not
> >   fail.
> > - list_lru_add() operations take can place in large numbers in
> >   memory reclaim context (e.g. dentry reclaim drops inodes which
> >   adds them to the inode lru). Hence memory reclaim becomes even
> >   more dependent on PF_MEMALLOC memory allocation making forwards
> >   progress.
> > - adding long tail list latency to what are currently O(1) fast path
> >   operations (e.g.  mulitple allocations tree splits for LRUs
> >   tracking millions of objects) is not desirable.
> > 
> > So while I think this is an interesting idea that might be useful in
> > some cases, I don't think it is a viable generic scalable list
> > construct we can use in areas like list_lru or global list
> > management that run under external synchronisation mechanisms.
> 
> There are difficulties, but given the fundamental scalability and
> locking issues with linked lists, I think this is the approach we want
> if we can make it work.

Sure, but this is a completely different problem to what I'm trying
to address here. I want infrastructure that does not need global
lists or list_lru for inode cache maintenance at all. So talking
about how to make the lists I am trying to remove scale better is
kinda missing the point....

> A couple things that help - we've already determined that the inode LRU
> can go away for most filesystems,

We haven't determined that yet. I *think* it is possible, but there
is a really nasty inode LRU dependencies that has been driven deep
down into the mm page cache writeback code.  We have to fix that
awful layering violation before we can get rid of the inode LRU.

I *think* we can do it by requiring dirty inodes to hold an explicit
inode reference, thereby keeping the inode pinned in memory whilst
it is being tracked for writeback. That would also get rid of the
nasty hacks needed in evict() to wait on writeback to complete on
unreferenced inodes.

However, this isn't simple to do, and so getting rid of the inode
LRU is not going to happen in the near term.

> and we can preallocate slots without
> actually adding objects. Iteration will see NULLs that they skip over,
> so we can't simply preallocate a slot for everything if nr_live_objects
> / nr_lru_objects is too big. But, we can certainly preallocate slots on
> a given code path and then release them back to the percpu buffer if
> they're not used.

I'm not really that interested in spending time trying to optimise
away list_lru contention at this point in time.

It's not a performance limiting factor because inode and
dentry LRU scalability is controllable by NUMA configuration. i.e.
if you have severe list_lru lock contention on inode and dentry
caches, then either turn on Sub-NUMA Clustering in your bios,
configure your VM with more discrete nodes, or use the fake-numa=N
boot parameter to increase the number of nodes the kernel sets up.
This will increase the number of list_lru instances for NUMA aware
shrinkers and the contention will go away.

This is trivial to do and I use the "configure your VM with more
discrete nodes" method for benchmarking purposes. I've run my perf
testing VMs with 4 nodes for the past decade and the list_lru
contention has never got above the threshold of concern. There's
always been something else causing worse problems, and even with
the sb->s_inodes list out of the way, it still isn't a problem on
64-way cache-hammering workloads...

> > - LRU lists are -ordered- (it's right there in the name!) and this
> >   appears to be an unordered list construct.
> 
> Yes, it is. But in actual practice cache replacement policy tends not to
> matter nearly as much as people think; there's many papers showing real
> world hit ratio of common algorithms is only a fudge factor from random
> replacement - the main thing you want is an accessed bit (or counter, if
> you want the analagous version of n-lru for n > 2), and we'll still have
> that.

Sure.  But I can cherry-pick many papers showing exactly the opposite.
i.e. that LRU and LFU algorithms are far superior at maintaining a
working set compared to random cache shootdown, especially when
there is significant latency for cache replacement.

What matters is whether there are any behavioural regressions as a
result of changing the current algorithm. We've used quasi-LRU
working set management for so long that this is the behaviour that
people have tuned their systems and applications to work well with.
Fundamental changes to working set maintenance behaviour is not
something I'm considering doing, nor something I *want* to do.

And, really, this is way outside the scope of this patch set. It's
even outside of the scope of "remove the inode cache LRU" proposal
because that proposal is based on the existing dentry cache LRU
working set management making the inode cache LRU completely
redundant. i.e.  it's not a change of working set management
algorithms at all....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-03  1:22           ` Kent Overstreet
@ 2024-10-03  2:20             ` Dave Chinner
  2024-10-03  2:42               ` Kent Overstreet
  0 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2024-10-03  2:20 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, linux-xfs,
	linux-bcachefs

On Wed, Oct 02, 2024 at 09:22:38PM -0400, Kent Overstreet wrote:
> On Thu, Oct 03, 2024 at 09:17:08AM GMT, Dave Chinner wrote:
> > On Wed, Oct 02, 2024 at 04:28:35PM -0400, Kent Overstreet wrote:
> > > On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> > > > On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > > > >
> > > > > > I don't have big conceptual issues with the series otherwise. The only
> > > > > > thing that makes me a bit uneasy is that we are now providing an api
> > > > > > that may encourage filesystems to do their own inode caching even if
> > > > > > they don't really have a need for it just because it's there.  So really
> > > > > > a way that would've solved this issue generically would have been my
> > > > > > preference.
> > > > >
> > > > > Well, that's the problem, isn't it? :/
> > > > >
> > > > > There really isn't a good generic solution for global list access
> > > > > and management.  The dlist stuff kinda works, but it still has
> > > > > significant overhead and doesn't get rid of spinlock contention
> > > > > completely because of the lack of locality between list add and
> > > > > remove operations.
> > > > 
> > > > I much prefer the approach taken in your patch series, to let the
> > > > filesystem own the inode list and keeping the old model as the
> > > > "default list".
> > > > 
> > > > In many ways, that is how *most* of the VFS layer works - it exposes
> > > > helper functions that the filesystems can use (and most do), but
> > > > doesn't force them.
> > > > 
> > > > Yes, the VFS layer does force some things - you can't avoid using
> > > > dentries, for example, because that's literally how the VFS layer
> > > > deals with filenames (and things like mounting etc). And honestly, the
> > > > VFS layer does a better job of filename caching than any filesystem
> > > > really can do, and with the whole UNIX mount model, filenames
> > > > fundamentally cross filesystem boundaries anyway.
> > > > 
> > > > But clearly the VFS layer inode list handling isn't the best it can
> > > > be, and unless we can fix that in some fundamental way (and I don't
> > > > love the "let's use crazy lists instead of a simple one" models) I do
> > > > think that just letting filesystems do their own thing if they have
> > > > something better is a good model.
> > > 
> > > Well, I don't love adding more indirection and callbacks.
> > 
> > It's way better than open coding inode cache traversals everywhere.
> 
> Eh? You had a nice iterator for dlock-list :)

Which was painful to work with because
it maintains the existing spin lock based traversal pattern. This
was necessary because the iterator held a spinlock internally. I
really didn't like that aspect of it because it perpeutated the need
to open code the iget/iput game to allow the spinlock to be dropped
across the inode operation that needed to be performed.

i.e. adding an dlist iterator didn't clean up any of the other mess
that sb->s_inodes iteration required...

> > We simply cannot do things like that without a new iteration model.
> > Abstraction is necessary to facilitate a new iteration model, and a
> > model that provides independent object callbacks allows scope for
> > concurrent processing of individual objects.
> 
> Parallelized iteration is a slick possibility.
> 
> My concern is that we've been trying to get away from callbacks for
> iteration - post spectre they're quite a bit more expensive than
> external iterators, and we've generally been successful with that. 

So everyone keeps saying, but the old adage applies here: Penny
wise, pound foolish.

Optimising away the callbacks might bring us a few percent
performance improvement for each operation (e.g. via the dlist
iterator mechanisms) in a traversal, but that iteration is still
only single threaded. Hence the maximum processing rate is
determined by the performance of a single CPU core.

However, if we change the API to allow for parallelism at the cost
of a few percent per object operation, then a single CPU core will
not process quite as many objects as before. However, the moment we
allow multiple CPU cores to process in parallel, we acheive
processing rate improvements measured in integer multiples.

Modern CPUs have concurrency to burn.  Optimising APIs for minimum
per-operation overhead rather than for concurrent processing
implementations is the wrong direction to be taking....

> > > Converting the standard inode hash table to an rhashtable (or more
> > > likely, creating a new standard implementation and converting
> > > filesystems one at a time) still needs to happen, and then the "use the
> > > hash table for iteration" approach could use that without every
> > > filesystem having to specialize.
> > 
> > Yes, but this still doesn't help filesystems like XFS where the
> > structure of the inode cache is highly optimised for the specific
> > on-disk and in-memory locality of inodes. We aren't going to be
> > converting XFS to a rhashtable based inode cache anytime soon
> > because it simply doesn't provide the functionality we require.
> > e.g. efficient lockless sequential inode number ordered traversal in
> > -every- inode cluster writeback operation.
> 
> I was going to ask what your requirements are - I may take on the
> general purpose inode rhashtable code, although since I'm still pretty
> buried we'll see.
> 
> Coincidentally, just today I'm working on an issue in bcachefs where
> we'd also prefer an ordered data structure to a hash table for the inode
> cache - in online fsck, we need to be able to check if an inode is still
> open, but checking for an inode in an interior snapshot node means we
> have to do a scan and check if any of the open inodes are in a
> descendent subvolume.
> 
> Radix tree doesn't work for us, since our keys are { inum, subvol } - 96
> bits -

Sure it does - you just need two layers of radix trees. i.e have a
radix tree per subvol to index inodes by inum, and a per-sb radix
tree to index the subvols. With some code to propagate radix tree
bits from the inode radix tree to the subvol radix tree they then
largely work in conjunction for filtered searches.

This is -exactly- the internal inode cache structure that XFS has.
We have a per-sb radix tree indexing the allocation groups, and a
radix tree per allocation group indexing inodes by inode number.
Hence an inode lookup involves splitting the inum into agno/agino
pairs, then doing a perag lookup with the agno, and doing a perag
inode cache lookup with the agino. All of these radix tree
lookups are lockless...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-03  1:41           ` Dave Chinner
@ 2024-10-03  2:24             ` Kent Overstreet
  2024-10-03  9:17             ` Jan Kara
  1 sibling, 0 replies; 72+ messages in thread
From: Kent Overstreet @ 2024-10-03  2:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, linux-fsdevel, linux-xfs, linux-bcachefs,
	torvalds

On Thu, Oct 03, 2024 at 11:41:42AM GMT, Dave Chinner wrote:
> > A couple things that help - we've already determined that the inode LRU
> > can go away for most filesystems,
> 
> We haven't determined that yet. I *think* it is possible, but there
> is a really nasty inode LRU dependencies that has been driven deep
> down into the mm page cache writeback code.  We have to fix that
> awful layering violation before we can get rid of the inode LRU.
> 
> I *think* we can do it by requiring dirty inodes to hold an explicit
> inode reference, thereby keeping the inode pinned in memory whilst
> it is being tracked for writeback. That would also get rid of the
> nasty hacks needed in evict() to wait on writeback to complete on
> unreferenced inodes.
> 
> However, this isn't simple to do, and so getting rid of the inode
> LRU is not going to happen in the near term.

Ok.

> > and we can preallocate slots without
> > actually adding objects. Iteration will see NULLs that they skip over,
> > so we can't simply preallocate a slot for everything if nr_live_objects
> > / nr_lru_objects is too big. But, we can certainly preallocate slots on
> > a given code path and then release them back to the percpu buffer if
> > they're not used.
> 
> I'm not really that interested in spending time trying to optimise
> away list_lru contention at this point in time.

Fair, and I'm not trying to derail this one - whether dlock-list, or
fast-list, or super_iter_inodes(), we should do one of them. On current
mainline, I see lock contention bad enough to trigger bcachefs's 10
second "srcu lock held too long" warning, any of these solves the
biggest problem.

But...

> It's not a performance limiting factor because inode and
> dentry LRU scalability is controllable by NUMA configuration. i.e.
> if you have severe list_lru lock contention on inode and dentry
> caches, then either turn on Sub-NUMA Clustering in your bios,
> configure your VM with more discrete nodes, or use the fake-numa=N
> boot parameter to increase the number of nodes the kernel sets up.
> This will increase the number of list_lru instances for NUMA aware
> shrinkers and the contention will go away.

I don't buy this, asking users to change their bios (and even to know
that's a thing they should consider) is a big ask. Linked lists _suck_,
both w.r.t. locking and cache behaviour, and we need to be exploring
better options.

> > > - LRU lists are -ordered- (it's right there in the name!) and this
> > >   appears to be an unordered list construct.
> > 
> > Yes, it is. But in actual practice cache replacement policy tends not to
> > matter nearly as much as people think; there's many papers showing real
> > world hit ratio of common algorithms is only a fudge factor from random
> > replacement - the main thing you want is an accessed bit (or counter, if
> > you want the analagous version of n-lru for n > 2), and we'll still have
> > that.
> 
> Sure.  But I can cherry-pick many papers showing exactly the opposite.
> i.e. that LRU and LFU algorithms are far superior at maintaining a
> working set compared to random cache shootdown, especially when
> there is significant latency for cache replacement.

But as mentioned we won't be comparing against pure random, it'll be vs.
pure random with at least an accessed bit, preserving the multiple
generations which are the most important feature of LRU/LFU as we use
them.

> What matters is whether there are any behavioural regressions as a
> result of changing the current algorithm. We've used quasi-LRU
> working set management for so long that this is the behaviour that
> people have tuned their systems and applications to work well with.
> Fundamental changes to working set maintenance behaviour is not
> something I'm considering doing, nor something I *want* to do.

Yup, it's not going to be the easiest thing to tackle.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-03  2:20             ` Dave Chinner
@ 2024-10-03  2:42               ` Kent Overstreet
  0 siblings, 0 replies; 72+ messages in thread
From: Kent Overstreet @ 2024-10-03  2:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, linux-xfs,
	linux-bcachefs

On Thu, Oct 03, 2024 at 12:20:53PM GMT, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 09:22:38PM -0400, Kent Overstreet wrote:
> > On Thu, Oct 03, 2024 at 09:17:08AM GMT, Dave Chinner wrote:
> Which was painful to work with because
> it maintains the existing spin lock based traversal pattern. This
> was necessary because the iterator held a spinlock internally. I
> really didn't like that aspect of it because it perpeutated the need
> to open code the iget/iput game to allow the spinlock to be dropped
> across the inode operation that needed to be performed.
> 
> i.e. adding an dlist iterator didn't clean up any of the other mess
> that sb->s_inodes iteration required...

yeah, true.

that's actually something that does get cleaner with fast-list; because
we're iterating over a radix tree and our iterator is a radix tree
index, the crab-walk thing naturally goes away.

> > My concern is that we've been trying to get away from callbacks for
> > iteration - post spectre they're quite a bit more expensive than
> > external iterators, and we've generally been successful with that. 
> 
> So everyone keeps saying, but the old adage applies here: Penny
> wise, pound foolish.
> 
> Optimising away the callbacks might bring us a few percent
> performance improvement for each operation (e.g. via the dlist
> iterator mechanisms) in a traversal, but that iteration is still
> only single threaded. Hence the maximum processing rate is
> determined by the performance of a single CPU core.
> 
> However, if we change the API to allow for parallelism at the cost
> of a few percent per object operation, then a single CPU core will
> not process quite as many objects as before. However, the moment we
> allow multiple CPU cores to process in parallel, we acheive
> processing rate improvements measured in integer multiples.
> 
> Modern CPUs have concurrency to burn.  Optimising APIs for minimum
> per-operation overhead rather than for concurrent processing
> implementations is the wrong direction to be taking....

OTOH - this is all academic because none of the uses of s_inodes are
_remotely_ fastpaths. Aside from nr_blockdev_pages() it's more or less
all filesystem teardown, or similar frequency.

> > Radix tree doesn't work for us, since our keys are { inum, subvol } - 96
> > bits -
> 
> Sure it does - you just need two layers of radix trees. i.e have a
> radix tree per subvol to index inodes by inum, and a per-sb radix
> tree to index the subvols. With some code to propagate radix tree
> bits from the inode radix tree to the subvol radix tree they then
> largely work in conjunction for filtered searches.

It'd have to be the reverse - index by inum, then subvol, and then we'd
need to do bit stuffing so that a radix tree with a single element is
just a pointer to the element. But - yeah, if the current approach (not
considering the subvol when calculating the hash) becomes an issue, that
might be the way to go.

> This is -exactly- the internal inode cache structure that XFS has.
> We have a per-sb radix tree indexing the allocation groups, and a
> radix tree per allocation group indexing inodes by inode number.
> Hence an inode lookup involves splitting the inum into agno/agino
> pairs, then doing a perag lookup with the agno, and doing a perag
> inode cache lookup with the agino. All of these radix tree
> lookups are lockless...

Speaking of, I'd like to pick your brain on AGIs at some point. We've
been sketching out future scalability work in bcachefs, and I think
that's going to be one of the things we'll end up needing.

Right now the scalability limit is backpointers fsck, but that looks
fairly trivial to solve: there's no reason to run the backpointers ->
extents pass except for debug testing, we can check and repair those
references at runtime, and we can sum up backpointers in a bucket and
check them against the bucket sector counts and skip extents ->
backpointers if they match.

After that, the next scalability limitation should be the main
check_alloc_info pass, and we'll need something analagous to AGIs to
shard that and run it efficiently when the main allocation info doesn't
fit in memory - and it sounds like you have other optimizations that
leverage AGIs as well.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/7] vfs: replace invalidate_inodes() with evict_inodes()
  2024-10-02  1:33 ` [PATCH 1/7] vfs: replace invalidate_inodes() with evict_inodes() Dave Chinner
@ 2024-10-03  7:07   ` Christoph Hellwig
  2024-10-03  9:20   ` Jan Kara
  1 sibling, 0 replies; 72+ messages in thread
From: Christoph Hellwig @ 2024-10-03  7:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/7] vfs: add inode iteration superblock method
  2024-10-02  1:33 ` [PATCH 2/7] vfs: add inode iteration superblock method Dave Chinner
@ 2024-10-03  7:12   ` Christoph Hellwig
  2024-10-03 10:35     ` Dave Chinner
  2024-10-04  9:53   ` kernel test robot
  1 sibling, 1 reply; 72+ messages in thread
From: Christoph Hellwig @ 2024-10-03  7:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds

On Wed, Oct 02, 2024 at 11:33:19AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add a new superblock method for iterating all cached inodes in the
> inode cache.

The method is added later, this just adds an abstraction.

> +/**
> + * super_iter_inodes - iterate all the cached inodes on a superblock
> + * @sb: superblock to iterate
> + * @iter_fn: callback to run on every inode found.
> + *
> + * This function iterates all cached inodes on a superblock that are not in
> + * the process of being initialised or torn down. It will run @iter_fn() with
> + * a valid, referenced inode, so it is safe for the caller to do anything
> + * it wants with the inode except drop the reference the iterator holds.
> + *
> + */

Spurious empty comment line above.

> +void super_iter_inodes_unsafe(struct super_block *sb, ino_iter_fn iter_fn,
> +		void *private_data)
> +{
> +	struct inode *inode;
> +	int ret;
> +
> +	rcu_read_lock();
> +	spin_lock(&sb->s_inode_list_lock);
> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +		ret = iter_fn(inode, private_data);
> +		if (ret == INO_ITER_ABORT)
> +			break;
> +	}

Looking at the entire series, splitting the helpers for the unsafe
vs safe iteration feels a bit of an odd API design given that the
INO_ITER_REFERENCED can be passed to super_iter_inodes, but is an
internal flag pass here to the file system method.  Not sure what
the best way to do it, but maybe just make super_iter_inodes
a wrapper that calls into the method if available, or
a generic_iter_inodes_unsafe if the unsafe flag is set, else
a plain generic_iter_inodes?

> +/* Inode iteration callback return values */
> +#define INO_ITER_DONE		0
> +#define INO_ITER_ABORT		1
> +
> +/* Inode iteration control flags */
> +#define INO_ITER_REFERENCED	(1U << 0)
> +#define INO_ITER_UNSAFE		(1U << 1)

Please adjust the naming a bit to make clear these are different
namespaces, e.g. INO_ITER_RET_ and INO_ITER_F_.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 3/7] vfs: convert vfs inode iterators to super_iter_inodes_unsafe()
  2024-10-02  1:33 ` [PATCH 3/7] vfs: convert vfs inode iterators to super_iter_inodes_unsafe() Dave Chinner
@ 2024-10-03  7:14   ` Christoph Hellwig
  2024-10-03 10:45     ` Dave Chinner
  2024-10-04 10:55   ` kernel test robot
  1 sibling, 1 reply; 72+ messages in thread
From: Christoph Hellwig @ 2024-10-03  7:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds

> diff --git a/block/bdev.c b/block/bdev.c
> index 33f9c4605e3a..b5a362156ca1 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -472,16 +472,28 @@ void bdev_drop(struct block_device *bdev)
>  	iput(BD_INODE(bdev));
>  }
>  
> +static int bdev_pages_count(struct inode *inode, void *data)

This are guaranteed to operate on the bdev superblock, so just
hardcoding the s_inodes list seems fine here as it keeps the code
much simpler.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-02  1:33 ` [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes() Dave Chinner
@ 2024-10-03  7:23   ` Christoph Hellwig
  2024-10-03  7:38     ` Christoph Hellwig
  0 siblings, 1 reply; 72+ messages in thread
From: Christoph Hellwig @ 2024-10-03  7:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds, Mickaël Salaün, Jann Horn, Serge Hallyn,
	Kees Cook, linux-security-module

On Wed, Oct 02, 2024 at 11:33:21AM +1000, Dave Chinner wrote:
> --- a/security/landlock/fs.c
> +++ b/security/landlock/fs.c
> @@ -1223,109 +1223,60 @@ static void hook_inode_free_security_rcu(void *inode_security)
>  
>  /*
>   * Release the inodes used in a security policy.
> - *
> - * Cf. fsnotify_unmount_inodes() and invalidate_inodes()
>   */
> +static int release_inode_fn(struct inode *inode, void *data)

Looks like this is called from the sb_delete LSM hook, which
is only implemented by landlock, and only called from
generic_shutdown_super, separated from evict_inodes only by call
to fsnotify_sb_delete.  Why did LSM not hook into that and instead
added another iteration of the per-sb inode list?

(Note that this is not tying to get Dave to fix this, just noticed
it when reviewing this series)


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 5/7] vfs: add inode iteration superblock method
  2024-10-02  1:33 ` [PATCH 5/7] vfs: add inode iteration superblock method Dave Chinner
@ 2024-10-03  7:24   ` Christoph Hellwig
  0 siblings, 0 replies; 72+ messages in thread
From: Christoph Hellwig @ 2024-10-03  7:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds

On Wed, Oct 02, 2024 at 11:33:22AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For filesytsems that provide their own inode cache that can be

s/filesytsems/filesystems/

> traversed, add a sueprblock method that can be used instead of

s/sueprblock/superblock/

> +bool super_iter_iget(struct inode *inode, int flags)

Can you add a kerneldoc comment explaining this helper?  Including
what flags is?

> +{
> +	bool	ret = false;

Weird indentation.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 6/7] xfs: implement sb->iter_vfs_inodes
  2024-10-02  1:33 ` [PATCH 6/7] xfs: implement sb->iter_vfs_inodes Dave Chinner
@ 2024-10-03  7:30   ` Christoph Hellwig
  0 siblings, 0 replies; 72+ messages in thread
From: Christoph Hellwig @ 2024-10-03  7:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds

On Wed, Oct 02, 2024 at 11:33:23AM +1000, Dave Chinner wrote:
> Note: this is an initial, unoptimised implementation that could be
> significantly improved and reduced in size by using a radix tree tag
> filter for VFS inodes and so use the generic tag-filtered
> xfs_icwalk() implementation instead of special casing it like this
> patch does.

Looking at how much this duplicates from xfs_icwalk that would be very
nice to have.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03  7:23   ` lsm sb_delete hook, was " Christoph Hellwig
@ 2024-10-03  7:38     ` Christoph Hellwig
  2024-10-03 11:57       ` Jan Kara
  0 siblings, 1 reply; 72+ messages in thread
From: Christoph Hellwig @ 2024-10-03  7:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds, Mickaël Salaün, Jann Horn, Serge Hallyn,
	Kees Cook, linux-security-module, Jan Kara, Amir Goldstein

On Thu, Oct 03, 2024 at 12:23:41AM -0700, Christoph Hellwig wrote:
> On Wed, Oct 02, 2024 at 11:33:21AM +1000, Dave Chinner wrote:
> > --- a/security/landlock/fs.c
> > +++ b/security/landlock/fs.c
> > @@ -1223,109 +1223,60 @@ static void hook_inode_free_security_rcu(void *inode_security)
> >  
> >  /*
> >   * Release the inodes used in a security policy.
> > - *
> > - * Cf. fsnotify_unmount_inodes() and invalidate_inodes()
> >   */
> > +static int release_inode_fn(struct inode *inode, void *data)
> 
> Looks like this is called from the sb_delete LSM hook, which
> is only implemented by landlock, and only called from
> generic_shutdown_super, separated from evict_inodes only by call
> to fsnotify_sb_delete.  Why did LSM not hook into that and instead

An the main thing that fsnotify_sb_delete does is yet another inode
iteration..

Ay chance you all could get together an figure out how to get down
to a single sb inode iteration per unmount?


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-03  1:41           ` Dave Chinner
  2024-10-03  2:24             ` Kent Overstreet
@ 2024-10-03  9:17             ` Jan Kara
  2024-10-03  9:59               ` Dave Chinner
  1 sibling, 1 reply; 72+ messages in thread
From: Jan Kara @ 2024-10-03  9:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Christian Brauner, linux-fsdevel, linux-xfs,
	linux-bcachefs, torvalds

On Thu 03-10-24 11:41:42, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 07:20:16PM -0400, Kent Overstreet wrote:
> > A couple things that help - we've already determined that the inode LRU
> > can go away for most filesystems,
> 
> We haven't determined that yet. I *think* it is possible, but there
> is a really nasty inode LRU dependencies that has been driven deep
> down into the mm page cache writeback code.  We have to fix that
> awful layering violation before we can get rid of the inode LRU.
> 
> I *think* we can do it by requiring dirty inodes to hold an explicit
> inode reference, thereby keeping the inode pinned in memory whilst
> it is being tracked for writeback. That would also get rid of the
> nasty hacks needed in evict() to wait on writeback to complete on
> unreferenced inodes.
> 
> However, this isn't simple to do, and so getting rid of the inode
> LRU is not going to happen in the near term.

Yeah. I agree the way how writeback protects from inode eviction is not the
prettiest one but the problem with writeback holding normal inode reference
is that then flush worker for the device can end up deleting unlinked
inodes which was causing writeback stalls and generally unexpected lock
ordering issues for some filesystems (already forgot the details). Now this
was more that 12 years ago so maybe we could find a better solution to
those problems these days (e.g. interactions between page writeback and
page reclaim are very different these days) but I just wanted to warn there
may be nasty surprises there.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/7] vfs: replace invalidate_inodes() with evict_inodes()
  2024-10-02  1:33 ` [PATCH 1/7] vfs: replace invalidate_inodes() with evict_inodes() Dave Chinner
  2024-10-03  7:07   ` Christoph Hellwig
@ 2024-10-03  9:20   ` Jan Kara
  1 sibling, 0 replies; 72+ messages in thread
From: Jan Kara @ 2024-10-03  9:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds

On Wed 02-10-24 11:33:18, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> As of commit e127b9bccdb0 ("fs: simplify invalidate_inodes"),
> invalidate_inodes() is functionally identical to evict_inodes().
> Replace calls to invalidate_inodes() with a call to
> evict_inodes() and kill the former.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Indeed :). Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/inode.c    | 40 ----------------------------------------
>  fs/internal.h |  1 -
>  fs/super.c    |  2 +-
>  3 files changed, 1 insertion(+), 42 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 471ae4a31549..0a53d8c34203 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -827,46 +827,6 @@ void evict_inodes(struct super_block *sb)
>  }
>  EXPORT_SYMBOL_GPL(evict_inodes);
>  
> -/**
> - * invalidate_inodes	- attempt to free all inodes on a superblock
> - * @sb:		superblock to operate on
> - *
> - * Attempts to free all inodes (including dirty inodes) for a given superblock.
> - */
> -void invalidate_inodes(struct super_block *sb)
> -{
> -	struct inode *inode, *next;
> -	LIST_HEAD(dispose);
> -
> -again:
> -	spin_lock(&sb->s_inode_list_lock);
> -	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> -		spin_lock(&inode->i_lock);
> -		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
> -			spin_unlock(&inode->i_lock);
> -			continue;
> -		}
> -		if (atomic_read(&inode->i_count)) {
> -			spin_unlock(&inode->i_lock);
> -			continue;
> -		}
> -
> -		inode->i_state |= I_FREEING;
> -		inode_lru_list_del(inode);
> -		spin_unlock(&inode->i_lock);
> -		list_add(&inode->i_lru, &dispose);
> -		if (need_resched()) {
> -			spin_unlock(&sb->s_inode_list_lock);
> -			cond_resched();
> -			dispose_list(&dispose);
> -			goto again;
> -		}
> -	}
> -	spin_unlock(&sb->s_inode_list_lock);
> -
> -	dispose_list(&dispose);
> -}
> -
>  /*
>   * Isolate the inode from the LRU in preparation for freeing it.
>   *
> diff --git a/fs/internal.h b/fs/internal.h
> index 8c1b7acbbe8f..37749b429e80 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -207,7 +207,6 @@ bool in_group_or_capable(struct mnt_idmap *idmap,
>   * fs-writeback.c
>   */
>  extern long get_nr_dirty_inodes(void);
> -void invalidate_inodes(struct super_block *sb);
>  
>  /*
>   * dcache.c
> diff --git a/fs/super.c b/fs/super.c
> index 1db230432960..a16e6a6342e0 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1417,7 +1417,7 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
>  	if (!surprise)
>  		sync_filesystem(sb);
>  	shrink_dcache_sb(sb);
> -	invalidate_inodes(sb);
> +	evict_inodes(sb);
>  	if (sb->s_op->shutdown)
>  		sb->s_op->shutdown(sb);
>  
> -- 
> 2.45.2
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-03  9:17             ` Jan Kara
@ 2024-10-03  9:59               ` Dave Chinner
  0 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2024-10-03  9:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kent Overstreet, Christian Brauner, linux-fsdevel, linux-xfs,
	linux-bcachefs, torvalds

On Thu, Oct 03, 2024 at 11:17:41AM +0200, Jan Kara wrote:
> On Thu 03-10-24 11:41:42, Dave Chinner wrote:
> > On Wed, Oct 02, 2024 at 07:20:16PM -0400, Kent Overstreet wrote:
> > > A couple things that help - we've already determined that the inode LRU
> > > can go away for most filesystems,
> > 
> > We haven't determined that yet. I *think* it is possible, but there
> > is a really nasty inode LRU dependencies that has been driven deep
> > down into the mm page cache writeback code.  We have to fix that
> > awful layering violation before we can get rid of the inode LRU.
> > 
> > I *think* we can do it by requiring dirty inodes to hold an explicit
> > inode reference, thereby keeping the inode pinned in memory whilst
> > it is being tracked for writeback. That would also get rid of the
> > nasty hacks needed in evict() to wait on writeback to complete on
> > unreferenced inodes.
> > 
> > However, this isn't simple to do, and so getting rid of the inode
> > LRU is not going to happen in the near term.
> 
> Yeah. I agree the way how writeback protects from inode eviction is not the
> prettiest one but the problem with writeback holding normal inode reference
> is that then flush worker for the device can end up deleting unlinked
> inodes which was causing writeback stalls and generally unexpected lock
> ordering issues for some filesystems (already forgot the details).

Yeah, if we end up in evict() on ext4 it will can then do all sorts
of whacky stuff that involves blocking, running transactions and
doing other IO. XFS, OTOH, has been changed to defer all that crap
to background threads (the xfs_inodegc infrastructure) that runs
after the VFS thinks the inode is dead and destroyed. There are some
benefits to having the filesystem inode exist outside the VFS inode
life cycle....

> Now this
> was more that 12 years ago so maybe we could find a better solution to
> those problems these days (e.g. interactions between page writeback and
> page reclaim are very different these days) but I just wanted to warn there
> may be nasty surprises there.

I don't think the situation has improved with filesytsems like ext4.
I think they've actually gotten worse - I recently learnt that ext4
inode eviction can recurse back into the inode cache to instantiate
extended attribute inodes so they can be truncated to allow inode
eviction to make progress.

I suspect the ext4 eviction behaviour is unfixable in any reasonable
time frame, so the only solution I can come up with is to run the
iput() call from a background thread context.  (e.g. defer it to a
workqueue). That way iput_final() and eviction processing will not
interfere with other writeback operations....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/7] vfs: add inode iteration superblock method
  2024-10-03  7:12   ` Christoph Hellwig
@ 2024-10-03 10:35     ` Dave Chinner
  0 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2024-10-03 10:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds

On Thu, Oct 03, 2024 at 12:12:29AM -0700, Christoph Hellwig wrote:
> On Wed, Oct 02, 2024 at 11:33:19AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Add a new superblock method for iterating all cached inodes in the
> > inode cache.
> 
> The method is added later, this just adds an abstraction.

Ah, I forgot to remove that from the commit message when I split the
patch up....

> > +/**
> > + * super_iter_inodes - iterate all the cached inodes on a superblock
> > + * @sb: superblock to iterate
> > + * @iter_fn: callback to run on every inode found.
> > + *
> > + * This function iterates all cached inodes on a superblock that are not in
> > + * the process of being initialised or torn down. It will run @iter_fn() with
> > + * a valid, referenced inode, so it is safe for the caller to do anything
> > + * it wants with the inode except drop the reference the iterator holds.
> > + *
> > + */
> 
> Spurious empty comment line above.
> 
> > +void super_iter_inodes_unsafe(struct super_block *sb, ino_iter_fn iter_fn,
> > +		void *private_data)
> > +{
> > +	struct inode *inode;
> > +	int ret;
> > +
> > +	rcu_read_lock();
> > +	spin_lock(&sb->s_inode_list_lock);
> > +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > +		ret = iter_fn(inode, private_data);
> > +		if (ret == INO_ITER_ABORT)
> > +			break;
> > +	}
> 
> Looking at the entire series, splitting the helpers for the unsafe
> vs safe iteration feels a bit of an odd API design given that the
> INO_ITER_REFERENCED can be passed to super_iter_inodes, but is an
> internal flag pass here to the file system method.

The INO_ITER_REFERENCED flag is a hack to support the whacky
fsnotify and landlock iterators that are run after evict_inodes()
(which you already noticed...).  i.e.  there should not be any
unreferenced inodes at this point, so if any are found they should
be skipped.

I think it might be better to remove that flag and replace the
iterator implementation with some kind of SB flag and
WARN_ON_ONCE that fires if a referenced inode is found. With that,
the flags field for super_iter_inodes() can go away...

> Not sure what
> the best way to do it, but maybe just make super_iter_inodes
> a wrapper that calls into the method if available, or
> a generic_iter_inodes_unsafe if the unsafe flag is set, else
> a plain generic_iter_inodes?

Perhaps. I'll look into it.

> > +/* Inode iteration callback return values */
> > +#define INO_ITER_DONE		0
> > +#define INO_ITER_ABORT		1
> > +
> > +/* Inode iteration control flags */
> > +#define INO_ITER_REFERENCED	(1U << 0)
> > +#define INO_ITER_UNSAFE		(1U << 1)
> 
> Please adjust the naming a bit to make clear these are different
> namespaces, e.g. INO_ITER_RET_ and INO_ITER_F_.

Will do.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 3/7] vfs: convert vfs inode iterators to super_iter_inodes_unsafe()
  2024-10-03  7:14   ` Christoph Hellwig
@ 2024-10-03 10:45     ` Dave Chinner
  0 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2024-10-03 10:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds

On Thu, Oct 03, 2024 at 12:14:57AM -0700, Christoph Hellwig wrote:
> > diff --git a/block/bdev.c b/block/bdev.c
> > index 33f9c4605e3a..b5a362156ca1 100644
> > --- a/block/bdev.c
> > +++ b/block/bdev.c
> > @@ -472,16 +472,28 @@ void bdev_drop(struct block_device *bdev)
> >  	iput(BD_INODE(bdev));
> >  }
> >  
> > +static int bdev_pages_count(struct inode *inode, void *data)
> 
> This are guaranteed to operate on the bdev superblock, so just
> hardcoding the s_inodes list seems fine here as it keeps the code
> much simpler.

Maybe, but right now I want to remove all external accesses to
sb->s_inodes. This isn't performance critical, and we can revisit
how the bdev sb tracks inodes when we move to per-sb inode
caches....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-02  1:33 [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Dave Chinner
                   ` (7 preceding siblings ...)
  2024-10-02 10:00 ` [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Christian Brauner
@ 2024-10-03 11:45 ` Jan Kara
  2024-10-03 12:18   ` Christoph Hellwig
  2024-10-03 13:03   ` Dave Chinner
  8 siblings, 2 replies; 72+ messages in thread
From: Jan Kara @ 2024-10-03 11:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds

Hi Dave!

On Wed 02-10-24 11:33:17, Dave Chinner wrote:
> There are two superblock iterator functions provided. The first is a
> generic iterator that provides safe, reference counted inodes for
> the callback to operate on. This is generally what most sb->s_inodes
> iterators use, and it allows the iterator to drop locks and perform
> blocking operations on the inode before moving to the next inode in
> the sb->s_inodes list.
> 
> There is one quirk to this interface - INO_ITER_REFERENCE - because
> fsnotify iterates the inode cache -after- evict_inodes() has been
> called during superblock shutdown to evict all non-referenced
> inodes. Hence it should only find referenced inodes, and it has
> a check to skip unreferenced inodes. This flag does the same.

Overall I really like the series. A lot of duplicated code removed and
scalability improved, we don't get such deals frequently :) Regarding
INO_ITER_REFERENCE I think that after commit 1edc8eb2e9313 ("fs: call
fsnotify_sb_delete after evict_inodes") the check for 0 i_count in
fsnotify_unmount_inodes() isn't that useful anymore so I'd be actually fine
dropping it (as a separate patch please).

That being said I'd like to discuss one thing: As you have surely noticed,
some of the places iterating inodes perform additional checks on the inode
to determine whether the inode is interesting or not (e.g. the Landlock
iterator or iterators in quota code) to avoid the unnecessary iget / iput
and locking dance. The inode refcount check you've worked-around with
INO_ITER_REFERENCE is a special case of that. Have you considered option to
provide callback for the check inside the iterator?

Also maybe I'm went a *bit* overboard here with macro magic but the code
below should provide an iterator that you can use like:

	for_each_sb_inode(sb, inode, inode_eligible_check(inode)) {
		do my stuff here
	}

that will avoid any indirect calls and will magically handle all the
cleanup that needs to be done if you break / jump out of the loop or
similar. I actually find such constructs more convenient to use than your
version of the iterator because there's no need to create & pass around the
additional data structure for the iterator body, no need for special return
values to abort iteration etc.

								Honza

/* Find next inode on the inode list eligible for processing */
#define sb_inode_iter_next(sb, inode, old_inode, inode_eligible) 	\
({									\
	struct inode *ret = NULL;					\
									\
	cond_resched();							\
	spin_lock(&(sb)->s_inode_list_lock);				\
	if (!(inode))							\
		inode = list_first_entry((sb)->s_inodes, struct inode,	\
					 i_sb_list);			\
	while (1) {							\
		if (list_entry_is_head(inode, (sb)->s_inodes, i_sb_list)) { \
			spin_unlock(&(sb)->s_inode_list_lock);		\
			break;						\
		}							\
		spin_lock(&inode->i_lock);				\
		if ((inode)->i_state & (I_NEW | I_FREEING | I_WILL_FREE) || \
		    !inode_eligible) {					\
			spin_unlock(&(inode)->i_lock);			\
			continue;					\
		}							\
		__iget(inode);						\
		spin_unlock(&(inode)->i_lock);				\
		spin_unlock(&(sb)->s_inode_list_lock);			\
		iput(*old_inode);					\
		*old_inode = inode;					\
		ret = inode;						\
		break;							\
	}								\
	ret;								\
})

#define for_each_sb_inode(sb, inode, inode_eligible)			\
	for (DEFINE_FREE(old_inode, struct inode *, if (_T) iput(_T)),	\
	     inode = NULL;						\
	     inode = sb_inode_iter_next((sb), inode, &old_inode,	\
					 inode_eligible);		\
	    )

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03  7:38     ` Christoph Hellwig
@ 2024-10-03 11:57       ` Jan Kara
  2024-10-03 12:11         ` Christoph Hellwig
  2024-10-07 20:37         ` Linus Torvalds
  0 siblings, 2 replies; 72+ messages in thread
From: Jan Kara @ 2024-10-03 11:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, linux-fsdevel, linux-xfs, linux-bcachefs,
	kent.overstreet, torvalds, Mickaël Salaün, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Jan Kara,
	Amir Goldstein

On Thu 03-10-24 00:38:05, Christoph Hellwig wrote:
> On Thu, Oct 03, 2024 at 12:23:41AM -0700, Christoph Hellwig wrote:
> > On Wed, Oct 02, 2024 at 11:33:21AM +1000, Dave Chinner wrote:
> > > --- a/security/landlock/fs.c
> > > +++ b/security/landlock/fs.c
> > > @@ -1223,109 +1223,60 @@ static void hook_inode_free_security_rcu(void *inode_security)
> > >  
> > >  /*
> > >   * Release the inodes used in a security policy.
> > > - *
> > > - * Cf. fsnotify_unmount_inodes() and invalidate_inodes()
> > >   */
> > > +static int release_inode_fn(struct inode *inode, void *data)
> > 
> > Looks like this is called from the sb_delete LSM hook, which
> > is only implemented by landlock, and only called from
> > generic_shutdown_super, separated from evict_inodes only by call
> > to fsnotify_sb_delete.  Why did LSM not hook into that and instead
> 
> An the main thing that fsnotify_sb_delete does is yet another inode
> iteration..
> 
> Ay chance you all could get together an figure out how to get down
> to a single sb inode iteration per unmount?

Fair enough. If we go with the iterator variant I've suggested to Dave in
[1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
Landlocks hook_sb_delete() into a single iteration relatively easily. But
I'd wait with that convertion until this series lands.

								Honza

[1] https://lore.kernel.org/all/20241003114555.bl34fkqsja4s5tok@quack3
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 11:57       ` Jan Kara
@ 2024-10-03 12:11         ` Christoph Hellwig
  2024-10-03 12:26           ` Jan Kara
  2024-10-07 20:37         ` Linus Torvalds
  1 sibling, 1 reply; 72+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Thu, Oct 03, 2024 at 01:57:21PM +0200, Jan Kara wrote:
> Fair enough. If we go with the iterator variant I've suggested to Dave in
> [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> Landlocks hook_sb_delete() into a single iteration relatively easily. But
> I'd wait with that convertion until this series lands.

I don't see how that has anything to do with iterators or not.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-03 11:45 ` Jan Kara
@ 2024-10-03 12:18   ` Christoph Hellwig
  2024-10-03 12:46     ` Jan Kara
  2024-10-03 13:03   ` Dave Chinner
  1 sibling, 1 reply; 72+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, linux-fsdevel, linux-xfs, linux-bcachefs,
	kent.overstreet, torvalds

On Thu, Oct 03, 2024 at 01:45:55PM +0200, Jan Kara wrote:
> /* Find next inode on the inode list eligible for processing */
> #define sb_inode_iter_next(sb, inode, old_inode, inode_eligible) 	\
> ({									\
> 	struct inode *ret = NULL;					\

<snip>

> 	ret;								\
> })

How is this going to interact with calling into the file system
to do the interaction, which is kinda the point of this series?

Sure, you could have a get_next_inode-style method, but it would need
a fair amount of entirely file system specific state that needs
to be stashed away somewhere, and all options for that looks pretty
ugly.

Also even with your pre-iget callback we'd at best halve the number
of indirect calls for something that isn't exactly performance
critical.  So while it could be done that way, it feels like a
more complex and much harder to reason about version for no real
benefit.

> #define for_each_sb_inode(sb, inode, inode_eligible)			\
> 	for (DEFINE_FREE(old_inode, struct inode *, if (_T) iput(_T)),	\
> 	     inode = NULL;						\
> 	     inode = sb_inode_iter_next((sb), inode, &old_inode,	\
> 					 inode_eligible);		\
> 	    )

And while I liked:

	obj = NULL;

	while ((obj = get_next_object(foo, obj))) {
	}

style iterators, magic for_each macros that do magic cleanup are just
a nightmare to read.  Keep it simple and optimize for someone actually
having to read and understand the code, and not for saving a few lines
of code.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 12:11         ` Christoph Hellwig
@ 2024-10-03 12:26           ` Jan Kara
  2024-10-03 12:39             ` Christoph Hellwig
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Kara @ 2024-10-03 12:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Dave Chinner, linux-fsdevel, linux-xfs, linux-bcachefs,
	kent.overstreet, torvalds, Mickaël Salaün, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein

On Thu 03-10-24 05:11:11, Christoph Hellwig wrote:
> On Thu, Oct 03, 2024 at 01:57:21PM +0200, Jan Kara wrote:
> > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > I'd wait with that convertion until this series lands.
> 
> I don't see how that has anything to do with iterators or not.

Well, the patches would obviously conflict which seems pointless if we
could live with three iterations for a few years until somebody noticed :).
And with current Dave's version of iterators it will not be possible to
integrate evict_inodes() iteration with the other two without a layering
violation. Still we could go from 3 to 2 iterations.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 12:26           ` Jan Kara
@ 2024-10-03 12:39             ` Christoph Hellwig
  2024-10-03 12:56               ` Jan Kara
  0 siblings, 1 reply; 72+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Thu, Oct 03, 2024 at 02:26:57PM +0200, Jan Kara wrote:
> On Thu 03-10-24 05:11:11, Christoph Hellwig wrote:
> > On Thu, Oct 03, 2024 at 01:57:21PM +0200, Jan Kara wrote:
> > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > > I'd wait with that convertion until this series lands.
> > 
> > I don't see how that has anything to do with iterators or not.
> 
> Well, the patches would obviously conflict

Conflict with what?

> which seems pointless if we
> could live with three iterations for a few years until somebody noticed :).
> And with current Dave's version of iterators it will not be possible to
> integrate evict_inodes() iteration with the other two without a layering
> violation. Still we could go from 3 to 2 iterations.

What layering violation?

Below is quick compile tested part to do the fsnotify side and
get rid of the fsnotify iteration, which looks easily worth it.

landlock is a bit more complex because of lsm hooks, and the internal
underobj abstraction inside of landlock.  But looking at the release
inode data vs unomunt synchronization it has and the duplicate code I
think doing it this way is worth the effort even more, it'll just need 
someone who knows landlock and the lsm layering to help with the work.

diff --git a/fs/inode.c b/fs/inode.c
index 3f335f78c5b228..7d5f8a09e4d29d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -789,11 +789,23 @@ static bool dispose_list(struct list_head *head)
  */
 static int evict_inode_fn(struct inode *inode, void *data)
 {
+	struct super_block *sb = inode->i_sb;
 	struct list_head *dispose = data;
+	bool post_unmount = !(sb->s_flags & SB_ACTIVE);
 
 	spin_lock(&inode->i_lock);
-	if (atomic_read(&inode->i_count) ||
-	    (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
+	if (atomic_read(&inode->i_count)) {
+		spin_unlock(&inode->i_lock);
+
+		/* for each watch, send FS_UNMOUNT and then remove it */
+		if (post_unmount && fsnotify_sb_info(sb)) {
+			fsnotify_inode(inode, FS_UNMOUNT);
+			fsnotify_inode_delete(inode);
+		}
+		return INO_ITER_DONE;
+	}
+
+	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 		spin_unlock(&inode->i_lock);
 		return INO_ITER_DONE;
 	}
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 68c34ed9427190..cf89aa69e82c8d 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -28,16 +28,6 @@ void __fsnotify_vfsmount_delete(struct vfsmount *mnt)
 	fsnotify_clear_marks_by_mount(mnt);
 }
 
-static int fsnotify_unmount_inode_fn(struct inode *inode, void *data)
-{
-	spin_unlock(&inode->i_lock);
-
-	/* for each watch, send FS_UNMOUNT and then remove it */
-	fsnotify_inode(inode, FS_UNMOUNT);
-	fsnotify_inode_delete(inode);
-	return INO_ITER_DONE;
-}
-
 void fsnotify_sb_delete(struct super_block *sb)
 {
 	struct fsnotify_sb_info *sbinfo = fsnotify_sb_info(sb);
@@ -46,19 +36,6 @@ void fsnotify_sb_delete(struct super_block *sb)
 	if (!sbinfo)
 		return;
 
-	/*
-	 * If i_count is zero, the inode cannot have any watches and
-	 * doing an __iget/iput with SB_ACTIVE clear would actually
-	 * evict all inodes with zero i_count from icache which is
-	 * unnecessarily violent and may in fact be illegal to do.
-	 * However, we should have been called /after/ evict_inodes
-	 * removed all zero refcount inodes, in any case. Hence we use
-	 * INO_ITER_REFERENCED to ensure zero refcount inodes are filtered
-	 * properly.
-	 */
-	super_iter_inodes(sb, fsnotify_unmount_inode_fn, NULL,
-			INO_ITER_REFERENCED);
-
 	fsnotify_clear_marks_by_sb(sb);
 	/* Wait for outstanding object references from connectors */
 	wait_var_event(fsnotify_sb_watched_objects(sb),
diff --git a/fs/super.c b/fs/super.c
index 971ad4e996e0ba..88dd1703fe73db 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -167,28 +167,17 @@ static void super_wake(struct super_block *sb, unsigned int flag)
 	wake_up_var(&sb->s_flags);
 }
 
-bool super_iter_iget(struct inode *inode, int flags)
+bool super_iter_iget(struct inode *inode)
 {
-	bool	ret = false;
+	bool ret = false;
 
 	spin_lock(&inode->i_lock);
-	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))
-		goto out_unlock;
-
-	/*
-	 * Skip over zero refcount inode if the caller only wants
-	 * referenced inodes to be iterated.
-	 */
-	if ((flags & INO_ITER_REFERENCED) &&
-	    !atomic_read(&inode->i_count))
-		goto out_unlock;
-
-	__iget(inode);
-	ret = true;
-out_unlock:
+	if (!(inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
+		__iget(inode);
+		ret = true;
+	}
 	spin_unlock(&inode->i_lock);
 	return ret;
-
 }
 EXPORT_SYMBOL_GPL(super_iter_iget);
 
@@ -216,7 +205,7 @@ int super_iter_inodes(struct super_block *sb, ino_iter_fn iter_fn,
 
 	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (!super_iter_iget(inode, flags))
+		if (!super_iter_iget(inode))
 			continue;
 		spin_unlock(&sb->s_inode_list_lock);
 		iput(old_inode);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index ee544556cee728..5a174e690424fb 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1654,8 +1654,7 @@ xfs_iter_vfs_igrab(
 	if (ip->i_flags & XFS_ITER_VFS_NOGRAB_IFLAGS)
 		goto out_unlock_noent;
 
-	if ((flags & INO_ITER_UNSAFE) ||
-	    super_iter_iget(inode, flags))
+	if ((flags & INO_ITER_UNSAFE) || super_iter_iget(inode))
 		ret = true;
 
 out_unlock_noent:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2aa335228b84bf..a3c682f0d94c1b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2224,7 +2224,7 @@ enum freeze_holder {
 typedef int (*ino_iter_fn)(struct inode *inode, void *priv);
 int super_iter_inodes(struct super_block *sb, ino_iter_fn iter_fn,
 		void *private_data, int flags);
-bool super_iter_iget(struct inode *inode, int flags);
+bool super_iter_iget(struct inode *inode);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-03 12:18   ` Christoph Hellwig
@ 2024-10-03 12:46     ` Jan Kara
  2024-10-03 13:35       ` Dave Chinner
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Kara @ 2024-10-03 12:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Dave Chinner, linux-fsdevel, linux-xfs, linux-bcachefs,
	kent.overstreet, torvalds

On Thu 03-10-24 05:18:30, Christoph Hellwig wrote:
> On Thu, Oct 03, 2024 at 01:45:55PM +0200, Jan Kara wrote:
> > /* Find next inode on the inode list eligible for processing */
> > #define sb_inode_iter_next(sb, inode, old_inode, inode_eligible) 	\
> > ({									\
> > 	struct inode *ret = NULL;					\
> 
> <snip>
> 
> > 	ret;								\
> > })
> 
> How is this going to interact with calling into the file system
> to do the interaction, which is kinda the point of this series?

Yeah, I was concentrated on the VFS bits and forgot why Dave wrote this
series in the first place. So this style of iterator isn't useful for what
Dave wants to achieve. Sorry for the noise. Still the possibility to have a
callback under inode->i_lock being able to do stuff and decide whether we
should grab a reference or continue would be useful (and would allow us to
combine the three iterations on unmount into one without too much hassle).

> > #define for_each_sb_inode(sb, inode, inode_eligible)			\
> > 	for (DEFINE_FREE(old_inode, struct inode *, if (_T) iput(_T)),	\
> > 	     inode = NULL;						\
> > 	     inode = sb_inode_iter_next((sb), inode, &old_inode,	\
> > 					 inode_eligible);		\
> > 	    )
> 
> And while I liked:
> 
> 	obj = NULL;
> 
> 	while ((obj = get_next_object(foo, obj))) {
> 	}
> 
> style iterators, magic for_each macros that do magic cleanup are just
> a nightmare to read.  Keep it simple and optimize for someone actually
> having to read and understand the code, and not for saving a few lines
> of code.

Well, I agree the above is hard to read but I don't know how to write it in
a more readable way while keeping the properties of the iterator (like
auto-cleanup when you break out of the loop - which is IMO a must for a
sane iterator). Anyway, this is now mostly academic since I agree this
iterator isn't really useful for the situation here.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 12:39             ` Christoph Hellwig
@ 2024-10-03 12:56               ` Jan Kara
  2024-10-03 13:04                 ` Christoph Hellwig
  2024-10-03 13:59                 ` Dave Chinner
  0 siblings, 2 replies; 72+ messages in thread
From: Jan Kara @ 2024-10-03 12:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Dave Chinner, linux-fsdevel, linux-xfs, linux-bcachefs,
	kent.overstreet, torvalds, Mickaël Salaün, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein

On Thu 03-10-24 05:39:23, Christoph Hellwig wrote:
> On Thu, Oct 03, 2024 at 02:26:57PM +0200, Jan Kara wrote:
> > On Thu 03-10-24 05:11:11, Christoph Hellwig wrote:
> > > On Thu, Oct 03, 2024 at 01:57:21PM +0200, Jan Kara wrote:
> > > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > > > I'd wait with that convertion until this series lands.
> > > 
> > > I don't see how that has anything to do with iterators or not.
> > 
> > Well, the patches would obviously conflict
> 
> Conflict with what?

I thought you wanted the interations to be unified in current state of
code. If you meant after Dave's series, then we are in agreement.

> > which seems pointless if we
> > could live with three iterations for a few years until somebody noticed :).
> > And with current Dave's version of iterators it will not be possible to
> > integrate evict_inodes() iteration with the other two without a layering
> > violation. Still we could go from 3 to 2 iterations.
> 
> What layering violation?
> 
> Below is quick compile tested part to do the fsnotify side and
> get rid of the fsnotify iteration, which looks easily worth it.

...

> @@ -789,11 +789,23 @@ static bool dispose_list(struct list_head *head)
>   */
>  static int evict_inode_fn(struct inode *inode, void *data)
>  {
> +	struct super_block *sb = inode->i_sb;
>  	struct list_head *dispose = data;
> +	bool post_unmount = !(sb->s_flags & SB_ACTIVE);
>  
>  	spin_lock(&inode->i_lock);
> -	if (atomic_read(&inode->i_count) ||
> -	    (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
> +	if (atomic_read(&inode->i_count)) {
> +		spin_unlock(&inode->i_lock);
> +
> +		/* for each watch, send FS_UNMOUNT and then remove it */
> +		if (post_unmount && fsnotify_sb_info(sb)) {
> +			fsnotify_inode(inode, FS_UNMOUNT);
> +			fsnotify_inode_delete(inode);
> +		}

This will not work because you are in unsafe iterator holding
sb->s_inode_list_lock. To be able to call into fsnotify, you need to do the
iget / iput dance and releasing of s_inode_list_lock which does not work
when a filesystem has its own inodes iterator AFAICT... That's why I've
called it a layering violation.

									Honza

> +		return INO_ITER_DONE;
> +	}
> +
> +	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
>  		spin_unlock(&inode->i_lock);
>  		return INO_ITER_DONE;
>  	}
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-03 11:45 ` Jan Kara
  2024-10-03 12:18   ` Christoph Hellwig
@ 2024-10-03 13:03   ` Dave Chinner
  1 sibling, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2024-10-03 13:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds

On Thu, Oct 03, 2024 at 01:45:55PM +0200, Jan Kara wrote:
> Hi Dave!
> 
> On Wed 02-10-24 11:33:17, Dave Chinner wrote:
> > There are two superblock iterator functions provided. The first is a
> > generic iterator that provides safe, reference counted inodes for
> > the callback to operate on. This is generally what most sb->s_inodes
> > iterators use, and it allows the iterator to drop locks and perform
> > blocking operations on the inode before moving to the next inode in
> > the sb->s_inodes list.
> > 
> > There is one quirk to this interface - INO_ITER_REFERENCE - because
> > fsnotify iterates the inode cache -after- evict_inodes() has been
> > called during superblock shutdown to evict all non-referenced
> > inodes. Hence it should only find referenced inodes, and it has
> > a check to skip unreferenced inodes. This flag does the same.
> 
> Overall I really like the series. A lot of duplicated code removed and
> scalability improved, we don't get such deals frequently :) Regarding
> INO_ITER_REFERENCE I think that after commit 1edc8eb2e9313 ("fs: call
> fsnotify_sb_delete after evict_inodes") the check for 0 i_count in
> fsnotify_unmount_inodes() isn't that useful anymore so I'd be actually fine
> dropping it (as a separate patch please).
> 
> That being said I'd like to discuss one thing: As you have surely noticed,
> some of the places iterating inodes perform additional checks on the inode
> to determine whether the inode is interesting or not (e.g. the Landlock
> iterator or iterators in quota code) to avoid the unnecessary iget / iput
> and locking dance.

Yes, but we really don't care. None of these cases are performance
critical, and I'd much prefer that we have a consistent behaviour.

> The inode refcount check you've worked-around with
> INO_ITER_REFERENCE is a special case of that. Have you considered option to
> provide callback for the check inside the iterator?

I did. I decided that it wasn't necessary just to avoid the
occasional iget/iput. It's certainly not necessary for the
fsnotify/landlock cases where INO_ITER_REFERENCE was used because
at that point there are only landlock and fsnotify inodes left in
the cache. We're going to be doing iget/iput on all of them
anyway.

Really, subsystems should be tracking inodes they have references to
themselves, not running 'needle in haystack' searches for inodes
they hold references to. That would get rid of both the fsnotify and
landlock iterators completely...

> Also maybe I'm went a *bit* overboard here with macro magic but the code
> below should provide an iterator that you can use like:
> 
> 	for_each_sb_inode(sb, inode, inode_eligible_check(inode)) {
> 		do my stuff here
> 	}

As I explained to Kent: wrapping the existing code in a different
iterator defeats the entire purpose of the change to the iteration
code.

> that will avoid any indirect calls and will magically handle all the
> cleanup that needs to be done if you break / jump out of the loop or
> similar. I actually find such constructs more convenient to use than your
> version of the iterator because there's no need to create & pass around the
> additional data structure for the iterator body, no need for special return
> values to abort iteration etc.

I'm not introducing the callback-based iterator function to clean
the code up - I'm introducing it as infrastructure that allows the
*iteration mechanism to be completely replaced* by filesystems that
have more efficient, more scalable  inode iterators already built
in.

This change of iterator model also allows seamless transition of
indivudal filesystems to new iterator mechanisms. Macro based
iterators do not allow for different iterator implementations to
co-exist, but that's exactly what I'm trying to acheive here.
I'm not trying to clean the code up - I'm trying to lay the
ground-work for new functionality....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 12:56               ` Jan Kara
@ 2024-10-03 13:04                 ` Christoph Hellwig
  2024-10-03 13:59                 ` Dave Chinner
  1 sibling, 0 replies; 72+ messages in thread
From: Christoph Hellwig @ 2024-10-03 13:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Thu, Oct 03, 2024 at 02:56:50PM +0200, Jan Kara wrote:
> > +	if (atomic_read(&inode->i_count)) {
> > +		spin_unlock(&inode->i_lock);
> > +
> > +		/* for each watch, send FS_UNMOUNT and then remove it */
> > +		if (post_unmount && fsnotify_sb_info(sb)) {
> > +			fsnotify_inode(inode, FS_UNMOUNT);
> > +			fsnotify_inode_delete(inode);
> > +		}
> 
> This will not work because you are in unsafe iterator holding
> sb->s_inode_list_lock. To be able to call into fsnotify, you need to do the
> iget / iput dance and releasing of s_inode_list_lock which does not work
> when a filesystem has its own inodes iterator AFAICT... That's why I've
> called it a layering violation.

Ah, yes.  So we'll need to special case it some way either way.  Still
feels saner to do it in one iteration and make the inode eviction not
use the unsafe version, but maybe that's indeed better postponed until
after Dave's series.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
  2024-10-03 12:46     ` Jan Kara
@ 2024-10-03 13:35       ` Dave Chinner
  0 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2024-10-03 13:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-bcachefs,
	kent.overstreet, torvalds

On Thu, Oct 03, 2024 at 02:46:19PM +0200, Jan Kara wrote:
> On Thu 03-10-24 05:18:30, Christoph Hellwig wrote:
> > On Thu, Oct 03, 2024 at 01:45:55PM +0200, Jan Kara wrote:
> > > /* Find next inode on the inode list eligible for processing */
> > > #define sb_inode_iter_next(sb, inode, old_inode, inode_eligible) 	\
> > > ({									\
> > > 	struct inode *ret = NULL;					\
> > 
> > <snip>
> > 
> > > 	ret;								\
> > > })
> > 
> > How is this going to interact with calling into the file system
> > to do the interaction, which is kinda the point of this series?
> 
> Yeah, I was concentrated on the VFS bits and forgot why Dave wrote this
> series in the first place. So this style of iterator isn't useful for what
> Dave wants to achieve. Sorry for the noise. Still the possibility to have a
> callback under inode->i_lock being able to do stuff and decide whether we

I did that first, and turned into an utter mess the moment we step
outside the existing iterator mechanism.

I implemented a separate XFS icwalk function because having to hold
the inode->i_lock between inode lookup and the callback function
means we cannot do batched inode lookups from the radix tree.

The existing icwalk code grabs 32 inodes at a time from the radix
tree and validates them all, then runs the callback on them one at a
time, then it drops them all.

If the VFS inode callback requires the inode i_lock to be held and
be atomic with the initial state checks, then we have to nest 32
spinlocks in what is effectively a random lock order.

So I implemented a non-batched icwalk method, and it didn't get that
much cleaner. It wasn't until I dropped the inode->i_lock from the
callback API that everything cleaned up and the offload mechanism
started to make sense. And it was this change that also makes it
possible for XFs to use it's existing batched lookup mechanisms
instead of the special case implementation that I wrote for this
patch set.

IOWs, we can't hold the inode->i_lock across lookup validation to
callback if we want to provide freedom of implementation to the
filesystem specific code. We aren't really concerned about
performance of traversals, so I went with freedom of implementation
over clunky locking semantics to optimise away a couple of atomic
ops per inode for iget/iput calls.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 12:56               ` Jan Kara
  2024-10-03 13:04                 ` Christoph Hellwig
@ 2024-10-03 13:59                 ` Dave Chinner
  2024-10-03 16:17                   ` Jan Kara
  1 sibling, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2024-10-03 13:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-bcachefs,
	kent.overstreet, torvalds, Mickaël Salaün, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein

On Thu, Oct 03, 2024 at 02:56:50PM +0200, Jan Kara wrote:
> On Thu 03-10-24 05:39:23, Christoph Hellwig wrote:
> > @@ -789,11 +789,23 @@ static bool dispose_list(struct list_head *head)
> >   */
> >  static int evict_inode_fn(struct inode *inode, void *data)
> >  {
> > +	struct super_block *sb = inode->i_sb;
> >  	struct list_head *dispose = data;
> > +	bool post_unmount = !(sb->s_flags & SB_ACTIVE);
> >  
> >  	spin_lock(&inode->i_lock);
> > -	if (atomic_read(&inode->i_count) ||
> > -	    (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
> > +	if (atomic_read(&inode->i_count)) {
> > +		spin_unlock(&inode->i_lock);
> > +
> > +		/* for each watch, send FS_UNMOUNT and then remove it */
> > +		if (post_unmount && fsnotify_sb_info(sb)) {
> > +			fsnotify_inode(inode, FS_UNMOUNT);
> > +			fsnotify_inode_delete(inode);
> > +		}
> 
> This will not work because you are in unsafe iterator holding
> sb->s_inode_list_lock. To be able to call into fsnotify, you need to do the
> iget / iput dance and releasing of s_inode_list_lock which does not work
> when a filesystem has its own inodes iterator AFAICT... That's why I've
> called it a layering violation.

The whole point of the iget/iput dance is to stabilise the
s_inodes list iteration whilst it is unlocked - the actual fsnotify
calls don't need an inode reference to work correctly.

IOWs, we don't need to run the fsnotify stuff right here - we can
defer that like we do with the dispose list for all the inodes we
mark as I_FREEING here.

So if we pass a structure:

struct evict_inode_args {
	struct list_head	dispose;
	struct list_head	fsnotify;
};

If we use __iget() instead of requiring an inode state flag to keep
the inode off the LRU for the fsnotify cleanup, then the code
fragment above becomes:

	if (atomic_read(&inode->i_count)) {
		if (post_unmount && fsnotify_sb_info(sb)) {
			__iget(inode);
			inode_lru_list_del(inode);
			spin_unlock(&inode->i_lock);
			list_add(&inode->i_lru, &args->fsnotify);
		}
		return INO_ITER_DONE;
	}

And then once we return to evict_inodes(), we do this:

	while (!list_empty(args->fsnotify)) {
		struct inode *inode

		inode = list_first_entry(head, struct inode, i_lru);
                list_del_init(&inode->i_lru);

		fsnotify_inode(inode, FS_UNMOUNT);
		fsnotify_inode_delete(inode);
		iput(inode);
		cond_resched();
	}

And so now all the fsnotify cleanup is done outside the traversal in
one large batch from evict_inodes().

As for the landlock code, I think it needs to have it's own internal
tracking mechanism and not search the sb inode list for inodes that
it holds references to. LSM cleanup should be run before before we
get to tearing down the inode cache, not after....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 13:59                 ` Dave Chinner
@ 2024-10-03 16:17                   ` Jan Kara
  2024-10-04  0:46                     ` Dave Chinner
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Kara @ 2024-10-03 16:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Thu 03-10-24 23:59:51, Dave Chinner wrote:
> On Thu, Oct 03, 2024 at 02:56:50PM +0200, Jan Kara wrote:
> > On Thu 03-10-24 05:39:23, Christoph Hellwig wrote:
> > > @@ -789,11 +789,23 @@ static bool dispose_list(struct list_head *head)
> > >   */
> > >  static int evict_inode_fn(struct inode *inode, void *data)
> > >  {
> > > +	struct super_block *sb = inode->i_sb;
> > >  	struct list_head *dispose = data;
> > > +	bool post_unmount = !(sb->s_flags & SB_ACTIVE);
> > >  
> > >  	spin_lock(&inode->i_lock);
> > > -	if (atomic_read(&inode->i_count) ||
> > > -	    (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
> > > +	if (atomic_read(&inode->i_count)) {
> > > +		spin_unlock(&inode->i_lock);
> > > +
> > > +		/* for each watch, send FS_UNMOUNT and then remove it */
> > > +		if (post_unmount && fsnotify_sb_info(sb)) {
> > > +			fsnotify_inode(inode, FS_UNMOUNT);
> > > +			fsnotify_inode_delete(inode);
> > > +		}
> > 
> > This will not work because you are in unsafe iterator holding
> > sb->s_inode_list_lock. To be able to call into fsnotify, you need to do the
> > iget / iput dance and releasing of s_inode_list_lock which does not work
> > when a filesystem has its own inodes iterator AFAICT... That's why I've
> > called it a layering violation.
> 
> The whole point of the iget/iput dance is to stabilise the
> s_inodes list iteration whilst it is unlocked - the actual fsnotify
> calls don't need an inode reference to work correctly.
> 
> IOWs, we don't need to run the fsnotify stuff right here - we can
> defer that like we do with the dispose list for all the inodes we
> mark as I_FREEING here.
> 
> So if we pass a structure:
> 
> struct evict_inode_args {
> 	struct list_head	dispose;
> 	struct list_head	fsnotify;
> };
> 
> If we use __iget() instead of requiring an inode state flag to keep
> the inode off the LRU for the fsnotify cleanup, then the code
> fragment above becomes:
> 
> 	if (atomic_read(&inode->i_count)) {
> 		if (post_unmount && fsnotify_sb_info(sb)) {
> 			__iget(inode);
> 			inode_lru_list_del(inode);
> 			spin_unlock(&inode->i_lock);
> 			list_add(&inode->i_lru, &args->fsnotify);
> 		}

Nit: Need to release i_lock in else branch here.  Otherwise interesting
idea. Yes, something like this could work even in unsafe iterator.

> 		return INO_ITER_DONE;
> 	}
> And then once we return to evict_inodes(), we do this:
> 
> 	while (!list_empty(args->fsnotify)) {
> 		struct inode *inode
> 
> 		inode = list_first_entry(head, struct inode, i_lru);
>                 list_del_init(&inode->i_lru);
> 
> 		fsnotify_inode(inode, FS_UNMOUNT);
> 		fsnotify_inode_delete(inode);
> 		iput(inode);
> 		cond_resched();
> 	}
> 
> And so now all the fsnotify cleanup is done outside the traversal in
> one large batch from evict_inodes().

Yup.

> As for the landlock code, I think it needs to have it's own internal
> tracking mechanism and not search the sb inode list for inodes that
> it holds references to. LSM cleanup should be run before before we
> get to tearing down the inode cache, not after....

Well, I think LSM cleanup could in principle be handled together with the
fsnotify cleanup but I didn't check the details.


								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 16:17                   ` Jan Kara
@ 2024-10-04  0:46                     ` Dave Chinner
  2024-10-04  7:21                       ` Christian Brauner
  0 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2024-10-04  0:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-bcachefs,
	kent.overstreet, torvalds, Mickaël Salaün, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein

On Thu, Oct 03, 2024 at 06:17:31PM +0200, Jan Kara wrote:
> On Thu 03-10-24 23:59:51, Dave Chinner wrote:
> > As for the landlock code, I think it needs to have it's own internal
> > tracking mechanism and not search the sb inode list for inodes that
> > it holds references to. LSM cleanup should be run before before we
> > get to tearing down the inode cache, not after....
> 
> Well, I think LSM cleanup could in principle be handled together with the
> fsnotify cleanup but I didn't check the details.

I'm not sure how we tell if an inode potentially has a LSM related
reference hanging off it. The landlock code looks to make an
assumption in that the only referenced inodes it sees will have a
valid inode->i_security pointer if landlock is enabled. i.e. it
calls landlock_inode(inode) and dereferences the returned value
without ever checking if inode->i_security is NULL or not.

I mean, we could do a check for inode->i_security when the refcount
is elevated and replace the security_sb_delete hook with an
security_evict_inode hook similar to the proposed fsnotify eviction
from evict_inodes().

But screwing with LSM instructure looks ....  obnoxiously complex
from the outside...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-04  0:46                     ` Dave Chinner
@ 2024-10-04  7:21                       ` Christian Brauner
  2024-10-04 12:14                         ` Christoph Hellwig
  2024-10-04 22:57                         ` Dave Chinner
  0 siblings, 2 replies; 72+ messages in thread
From: Christian Brauner @ 2024-10-04  7:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Fri, Oct 04, 2024 at 10:46:27AM GMT, Dave Chinner wrote:
> On Thu, Oct 03, 2024 at 06:17:31PM +0200, Jan Kara wrote:
> > On Thu 03-10-24 23:59:51, Dave Chinner wrote:
> > > As for the landlock code, I think it needs to have it's own internal
> > > tracking mechanism and not search the sb inode list for inodes that
> > > it holds references to. LSM cleanup should be run before before we
> > > get to tearing down the inode cache, not after....
> > 
> > Well, I think LSM cleanup could in principle be handled together with the
> > fsnotify cleanup but I didn't check the details.
> 
> I'm not sure how we tell if an inode potentially has a LSM related
> reference hanging off it. The landlock code looks to make an
> assumption in that the only referenced inodes it sees will have a
> valid inode->i_security pointer if landlock is enabled. i.e. it
> calls landlock_inode(inode) and dereferences the returned value
> without ever checking if inode->i_security is NULL or not.
> 
> I mean, we could do a check for inode->i_security when the refcount
> is elevated and replace the security_sb_delete hook with an
> security_evict_inode hook similar to the proposed fsnotify eviction
> from evict_inodes().
> 
> But screwing with LSM instructure looks ....  obnoxiously complex
> from the outside...

Imho, please just focus on the immediate feedback and ignore all the
extra bells and whistles that we could or should do. I prefer all of
that to be done after this series lands.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/7] vfs: add inode iteration superblock method
  2024-10-02  1:33 ` [PATCH 2/7] vfs: add inode iteration superblock method Dave Chinner
  2024-10-03  7:12   ` Christoph Hellwig
@ 2024-10-04  9:53   ` kernel test robot
  1 sibling, 0 replies; 72+ messages in thread
From: kernel test robot @ 2024-10-04  9:53 UTC (permalink / raw)
  To: Dave Chinner, linux-fsdevel
  Cc: oe-kbuild-all, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds

Hi Dave,

kernel test robot noticed the following build warnings:

[auto build test WARNING on brauner-vfs/vfs.all]
[also build test WARNING on xfs-linux/for-next axboe-block/for-next linus/master v6.12-rc1 next-20241004]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Dave-Chinner/vfs-replace-invalidate_inodes-with-evict_inodes/20241002-094254
base:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.all
patch link:    https://lore.kernel.org/r/20241002014017.3801899-3-david%40fromorbit.com
patch subject: [PATCH 2/7] vfs: add inode iteration superblock method
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20241004/202410041724.REiCiIEQ-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241004/202410041724.REiCiIEQ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410041724.REiCiIEQ-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> fs/super.c:183: warning: Function parameter or struct member 'private_data' not described in 'super_iter_inodes'
>> fs/super.c:183: warning: Function parameter or struct member 'flags' not described in 'super_iter_inodes'
>> fs/super.c:241: warning: bad line: 
>> fs/super.c:260: warning: Function parameter or struct member 'private_data' not described in 'super_iter_inodes_unsafe'


vim +183 fs/super.c

   169	
   170	/**
   171	 * super_iter_inodes - iterate all the cached inodes on a superblock
   172	 * @sb: superblock to iterate
   173	 * @iter_fn: callback to run on every inode found.
   174	 *
   175	 * This function iterates all cached inodes on a superblock that are not in
   176	 * the process of being initialised or torn down. It will run @iter_fn() with
   177	 * a valid, referenced inode, so it is safe for the caller to do anything
   178	 * it wants with the inode except drop the reference the iterator holds.
   179	 *
   180	 */
   181	int super_iter_inodes(struct super_block *sb, ino_iter_fn iter_fn,
   182			void *private_data, int flags)
 > 183	{
   184		struct inode *inode, *old_inode = NULL;
   185		int ret = 0;
   186	
   187		spin_lock(&sb->s_inode_list_lock);
   188		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
   189			spin_lock(&inode->i_lock);
   190			if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
   191				spin_unlock(&inode->i_lock);
   192				continue;
   193			}
   194	
   195			/*
   196			 * Skip over zero refcount inode if the caller only wants
   197			 * referenced inodes to be iterated.
   198			 */
   199			if ((flags & INO_ITER_REFERENCED) &&
   200			    !atomic_read(&inode->i_count)) {
   201				spin_unlock(&inode->i_lock);
   202				continue;
   203			}
   204	
   205			__iget(inode);
   206			spin_unlock(&inode->i_lock);
   207			spin_unlock(&sb->s_inode_list_lock);
   208			iput(old_inode);
   209	
   210			ret = iter_fn(inode, private_data);
   211	
   212			old_inode = inode;
   213			if (ret == INO_ITER_ABORT) {
   214				ret = 0;
   215				break;
   216			}
   217			if (ret < 0)
   218				break;
   219	
   220			cond_resched();
   221			spin_lock(&sb->s_inode_list_lock);
   222		}
   223		spin_unlock(&sb->s_inode_list_lock);
   224		iput(old_inode);
   225		return ret;
   226	}
   227	
   228	/**
   229	 * super_iter_inodes_unsafe - unsafely iterate all the inodes on a superblock
   230	 * @sb: superblock to iterate
   231	 * @iter_fn: callback to run on every inode found.
   232	 *
   233	 * This is almost certainly not the function you want. It is for internal VFS
   234	 * operations only. Please use super_iter_inodes() instead. If you must use
   235	 * this function, please add a comment explaining why it is necessary and the
   236	 * locking that makes it safe to use this function.
   237	 *
   238	 * This function iterates all cached inodes on a superblock that are attached to
   239	 * the superblock. It will pass each inode to @iter_fn unlocked and without
   240	 * having performed any existences checks on it.
 > 241	
   242	 * @iter_fn must perform all necessary state checks on the inode itself to
   243	 * ensure safe operation. super_iter_inodes_unsafe() only guarantees that the
   244	 * inode exists and won't be freed whilst the callback is running.
   245	 *
   246	 * @iter_fn must not block. It is run in an atomic context that is not allowed
   247	 * to sleep to provide the inode existence guarantees. If the callback needs to
   248	 * do blocking operations it needs to track the inode itself and defer those
   249	 * operations until after the iteration completes.
   250	 *
   251	 * @iter_fn must provide conditional reschedule checks itself. If rescheduling
   252	 * or deferred processing is needed, it must return INO_ITER_ABORT to return to
   253	 * the high level function to perform those operations. It can then restart the
   254	 * iteration again. The high level code must provide forwards progress
   255	 * guarantees if they are necessary.
   256	 *
   257	 */
   258	void super_iter_inodes_unsafe(struct super_block *sb, ino_iter_fn iter_fn,
   259			void *private_data)
 > 260	{
   261		struct inode *inode;
   262		int ret;
   263	
   264		rcu_read_lock();
   265		spin_lock(&sb->s_inode_list_lock);
   266		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
   267			ret = iter_fn(inode, private_data);
   268			if (ret == INO_ITER_ABORT)
   269				break;
   270		}
   271		spin_unlock(&sb->s_inode_list_lock);
   272		rcu_read_unlock();
   273	}
   274	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 3/7] vfs: convert vfs inode iterators to super_iter_inodes_unsafe()
  2024-10-02  1:33 ` [PATCH 3/7] vfs: convert vfs inode iterators to super_iter_inodes_unsafe() Dave Chinner
  2024-10-03  7:14   ` Christoph Hellwig
@ 2024-10-04 10:55   ` kernel test robot
  1 sibling, 0 replies; 72+ messages in thread
From: kernel test robot @ 2024-10-04 10:55 UTC (permalink / raw)
  To: Dave Chinner, linux-fsdevel
  Cc: oe-kbuild-all, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds

Hi Dave,

kernel test robot noticed the following build warnings:

[auto build test WARNING on brauner-vfs/vfs.all]
[also build test WARNING on xfs-linux/for-next axboe-block/for-next linus/master v6.12-rc1 next-20241004]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Dave-Chinner/vfs-replace-invalidate_inodes-with-evict_inodes/20241002-094254
base:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.all
patch link:    https://lore.kernel.org/r/20241002014017.3801899-4-david%40fromorbit.com
patch subject: [PATCH 3/7] vfs: convert vfs inode iterators to super_iter_inodes_unsafe()
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20241004/202410041848.j3wt7yFP-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241004/202410041848.j3wt7yFP-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410041848.j3wt7yFP-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> fs/inode.c:874: warning: Function parameter or struct member 'inode' not described in 'evict_inode_fn'
>> fs/inode.c:874: warning: Function parameter or struct member 'data' not described in 'evict_inode_fn'
>> fs/inode.c:874: warning: expecting prototype for evict_inodes(). Prototype was for evict_inode_fn() instead


vim +874 fs/inode.c

^1da177e4c3f41 Linus Torvalds 2005-04-16  863  
63997e98a3be68 Al Viro        2010-10-25  864  /**
63997e98a3be68 Al Viro        2010-10-25  865   * evict_inodes	- evict all evictable inodes for a superblock
63997e98a3be68 Al Viro        2010-10-25  866   * @sb:		superblock to operate on
63997e98a3be68 Al Viro        2010-10-25  867   *
63997e98a3be68 Al Viro        2010-10-25  868   * Make sure that no inodes with zero refcount are retained.  This is
1751e8a6cb935e Linus Torvalds 2017-11-27  869   * called by superblock shutdown after having SB_ACTIVE flag removed,
63997e98a3be68 Al Viro        2010-10-25  870   * so any inode reaching zero refcount during or after that call will
63997e98a3be68 Al Viro        2010-10-25  871   * be immediately evicted.
^1da177e4c3f41 Linus Torvalds 2005-04-16  872   */
f3df82b20474b6 Dave Chinner   2024-10-02  873  static int evict_inode_fn(struct inode *inode, void *data)
^1da177e4c3f41 Linus Torvalds 2005-04-16 @874  {
f3df82b20474b6 Dave Chinner   2024-10-02  875  	struct list_head *dispose = data;
250df6ed274d76 Dave Chinner   2011-03-22  876  
250df6ed274d76 Dave Chinner   2011-03-22  877  	spin_lock(&inode->i_lock);
f3df82b20474b6 Dave Chinner   2024-10-02  878  	if (atomic_read(&inode->i_count) ||
f3df82b20474b6 Dave Chinner   2024-10-02  879  	    (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
250df6ed274d76 Dave Chinner   2011-03-22  880  		spin_unlock(&inode->i_lock);
f3df82b20474b6 Dave Chinner   2024-10-02  881  		return INO_ITER_DONE;
250df6ed274d76 Dave Chinner   2011-03-22  882  	}
63997e98a3be68 Al Viro        2010-10-25  883  
63997e98a3be68 Al Viro        2010-10-25  884  	inode->i_state |= I_FREEING;
02afc410f363f9 Dave Chinner   2011-03-22  885  	inode_lru_list_del(inode);
250df6ed274d76 Dave Chinner   2011-03-22  886  	spin_unlock(&inode->i_lock);
f3df82b20474b6 Dave Chinner   2024-10-02  887  	list_add(&inode->i_lru, dispose);
ac05fbb4006241 Josef Bacik    2015-03-04  888  
ac05fbb4006241 Josef Bacik    2015-03-04  889  	/*
f3df82b20474b6 Dave Chinner   2024-10-02  890  	 * If we've run long enough to need rescheduling, abort the
f3df82b20474b6 Dave Chinner   2024-10-02  891  	 * iteration so we can return to evict_inodes() and dispose of the
f3df82b20474b6 Dave Chinner   2024-10-02  892  	 * inodes before collecting more inodes to evict.
ac05fbb4006241 Josef Bacik    2015-03-04  893  	 */
f3df82b20474b6 Dave Chinner   2024-10-02  894  	if (need_resched())
f3df82b20474b6 Dave Chinner   2024-10-02  895  		return INO_ITER_ABORT;
f3df82b20474b6 Dave Chinner   2024-10-02  896  	return INO_ITER_DONE;
ac05fbb4006241 Josef Bacik    2015-03-04  897  }
63997e98a3be68 Al Viro        2010-10-25  898  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-04  7:21                       ` Christian Brauner
@ 2024-10-04 12:14                         ` Christoph Hellwig
  2024-10-04 13:49                           ` Jan Kara
  2024-10-04 22:57                         ` Dave Chinner
  1 sibling, 1 reply; 72+ messages in thread
From: Christoph Hellwig @ 2024-10-04 12:14 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Dave Chinner, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Fri, Oct 04, 2024 at 09:21:19AM +0200, Christian Brauner wrote:
> > But screwing with LSM instructure looks ....  obnoxiously complex
> > from the outside...
> 
> Imho, please just focus on the immediate feedback and ignore all the
> extra bells and whistles that we could or should do. I prefer all of
> that to be done after this series lands.

For the LSM mess: absolutely.  For fsnotify it seems like Dave has
a good idea to integrate it, and it removes the somewhat awkward
need for the reffed flag.  So if that delayed notify idea works out
I'd prefer to see that in over the flag.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-04 12:14                         ` Christoph Hellwig
@ 2024-10-04 13:49                           ` Jan Kara
  2024-10-04 18:15                             ` Paul Moore
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Kara @ 2024-10-04 13:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Dave Chinner, Jan Kara, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Fri 04-10-24 05:14:36, Christoph Hellwig wrote:
> On Fri, Oct 04, 2024 at 09:21:19AM +0200, Christian Brauner wrote:
> > > But screwing with LSM instructure looks ....  obnoxiously complex
> > > from the outside...
> > 
> > Imho, please just focus on the immediate feedback and ignore all the
> > extra bells and whistles that we could or should do. I prefer all of
> > that to be done after this series lands.
> 
> For the LSM mess: absolutely.  For fsnotify it seems like Dave has
> a good idea to integrate it, and it removes the somewhat awkward
> need for the reffed flag.  So if that delayed notify idea works out
> I'd prefer to see that in over the flag.

As I wrote in one of the emails in this (now huge) thread, I'm fine with
completely dropping that inode->i_refcount check from the
fsnotify_unmount_inodes(). It made sense when it was called before
evict_inodes() but after 1edc8eb2e931 ("fs: call fsnotify_sb_delete after
evict_inodes") the usefulness of this check is rather doubtful. So we can
drop the awkward flag regardless whether we unify evict_inodes() with
fsnotify_unmount_inodes() or not. I have no strong preference whether the
unification happens as part of this patch set or later on so it's up to
Dave as far as I'm concerned.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-04 13:49                           ` Jan Kara
@ 2024-10-04 18:15                             ` Paul Moore
  0 siblings, 0 replies; 72+ messages in thread
From: Paul Moore @ 2024-10-04 18:15 UTC (permalink / raw)
  To: Mickaël Salaün, Günther Noack
  Cc: Jan Kara, Christoph Hellwig, Christian Brauner, Dave Chinner,
	linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds, Mickaël Salaün, Jann Horn, Serge Hallyn,
	Kees Cook, linux-security-module, Amir Goldstein

On Fri, Oct 4, 2024 at 9:49 AM Jan Kara <jack@suse.cz> wrote:
> On Fri 04-10-24 05:14:36, Christoph Hellwig wrote:
> > On Fri, Oct 04, 2024 at 09:21:19AM +0200, Christian Brauner wrote:
> > > > But screwing with LSM instructure looks ....  obnoxiously complex
> > > > from the outside...
> > >
> > > Imho, please just focus on the immediate feedback and ignore all the
> > > extra bells and whistles that we could or should do. I prefer all of
> > > that to be done after this series lands.
> >
> > For the LSM mess: absolutely.  For fsnotify it seems like Dave has
> > a good idea to integrate it, and it removes the somewhat awkward
> > need for the reffed flag.  So if that delayed notify idea works out
> > I'd prefer to see that in over the flag.
>
> As I wrote in one of the emails in this (now huge) thread, I'm fine with
> completely dropping that inode->i_refcount check from the
> fsnotify_unmount_inodes(). It made sense when it was called before
> evict_inodes() but after 1edc8eb2e931 ("fs: call fsnotify_sb_delete after
> evict_inodes") the usefulness of this check is rather doubtful. So we can
> drop the awkward flag regardless whether we unify evict_inodes() with
> fsnotify_unmount_inodes() or not. I have no strong preference whether the
> unification happens as part of this patch set or later on so it's up to
> Dave as far as I'm concerned.

I didn't get a chance to look at this thread until just now and I'm
noticing that the email used for Mickaël is likely not the best, I'm
adding the email he uses in MAINTAINERS as well as that of Günther
Noack, a designated Landlock reviewer.

Mickaël, Günther, the lore link for the full discussion is below:

https://lore.kernel.org/all/Zv5GfY1WS_aaczZM@infradead.org

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-04  7:21                       ` Christian Brauner
  2024-10-04 12:14                         ` Christoph Hellwig
@ 2024-10-04 22:57                         ` Dave Chinner
  2024-10-05 15:21                           ` Mickaël Salaün
  1 sibling, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2024-10-04 22:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, torvalds,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Fri, Oct 04, 2024 at 09:21:19AM +0200, Christian Brauner wrote:
> On Fri, Oct 04, 2024 at 10:46:27AM GMT, Dave Chinner wrote:
> > On Thu, Oct 03, 2024 at 06:17:31PM +0200, Jan Kara wrote:
> > > On Thu 03-10-24 23:59:51, Dave Chinner wrote:
> > > > As for the landlock code, I think it needs to have it's own internal
> > > > tracking mechanism and not search the sb inode list for inodes that
> > > > it holds references to. LSM cleanup should be run before before we
> > > > get to tearing down the inode cache, not after....
> > > 
> > > Well, I think LSM cleanup could in principle be handled together with the
> > > fsnotify cleanup but I didn't check the details.
> > 
> > I'm not sure how we tell if an inode potentially has a LSM related
> > reference hanging off it. The landlock code looks to make an
> > assumption in that the only referenced inodes it sees will have a
> > valid inode->i_security pointer if landlock is enabled. i.e. it
> > calls landlock_inode(inode) and dereferences the returned value
> > without ever checking if inode->i_security is NULL or not.
> > 
> > I mean, we could do a check for inode->i_security when the refcount
> > is elevated and replace the security_sb_delete hook with an
> > security_evict_inode hook similar to the proposed fsnotify eviction
> > from evict_inodes().
> > 
> > But screwing with LSM instructure looks ....  obnoxiously complex
> > from the outside...
> 
> Imho, please just focus on the immediate feedback and ignore all the
> extra bells and whistles that we could or should do. I prefer all of
> that to be done after this series lands.

Actually, it's not as bad as I thought it was going to be. I've
already moved both fsnotify and LSM inode eviction to
evict_inodes() as preparatory patches...

Dave Chinner (2):
      vfs: move fsnotify inode eviction to evict_inodes()
      vfs, lsm: rework lsm inode eviction at unmount

 fs/inode.c                    |  52 +++++++++++++---
 fs/notify/fsnotify.c          |  60 -------------------
 fs/super.c                    |   8 +--
 include/linux/lsm_hook_defs.h |   2 +-
 include/linux/security.h      |   2 +-
 security/landlock/fs.c        | 134 ++++++++++--------------------------------
 security/security.c           |  31 ++++++----
7 files changed, 99 insertions(+), 190 deletions(-)

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-04 22:57                         ` Dave Chinner
@ 2024-10-05 15:21                           ` Mickaël Salaün
  2024-10-05 16:03                             ` Mickaël Salaün
  2024-10-05 16:03                             ` Paul Moore
  0 siblings, 2 replies; 72+ messages in thread
From: Mickaël Salaün @ 2024-10-05 15:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet, torvalds, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein,
	Paul Moore, Günther Noack

On Sat, Oct 05, 2024 at 08:57:32AM +1000, Dave Chinner wrote:
> On Fri, Oct 04, 2024 at 09:21:19AM +0200, Christian Brauner wrote:
> > On Fri, Oct 04, 2024 at 10:46:27AM GMT, Dave Chinner wrote:
> > > On Thu, Oct 03, 2024 at 06:17:31PM +0200, Jan Kara wrote:
> > > > On Thu 03-10-24 23:59:51, Dave Chinner wrote:
> > > > > As for the landlock code, I think it needs to have it's own internal
> > > > > tracking mechanism and not search the sb inode list for inodes that
> > > > > it holds references to. LSM cleanup should be run before before we
> > > > > get to tearing down the inode cache, not after....
> > > > 
> > > > Well, I think LSM cleanup could in principle be handled together with the
> > > > fsnotify cleanup but I didn't check the details.
> > > 
> > > I'm not sure how we tell if an inode potentially has a LSM related
> > > reference hanging off it. The landlock code looks to make an
> > > assumption in that the only referenced inodes it sees will have a
> > > valid inode->i_security pointer if landlock is enabled. i.e. it
> > > calls landlock_inode(inode) and dereferences the returned value
> > > without ever checking if inode->i_security is NULL or not.

Correct, i_security should always be valid when this hook is called
because it means that at least Landlock is enabled and then i_security
refers to a valid LSM blob.

> > > 
> > > I mean, we could do a check for inode->i_security when the refcount
> > > is elevated and replace the security_sb_delete hook with an
> > > security_evict_inode hook similar to the proposed fsnotify eviction
> > > from evict_inodes().

That would be nice.

> > > 
> > > But screwing with LSM instructure looks ....  obnoxiously complex
> > > from the outside...
> > 
> > Imho, please just focus on the immediate feedback and ignore all the
> > extra bells and whistles that we could or should do. I prefer all of
> > that to be done after this series lands.
> 
> Actually, it's not as bad as I thought it was going to be. I've
> already moved both fsnotify and LSM inode eviction to
> evict_inodes() as preparatory patches...

Good, please Cc me and Günther on related patch series.

FYI, we have the two release_inodes tests to check this hook in
tools/testing/selftests/landlock/fs_test.c

> 
> Dave Chinner (2):
>       vfs: move fsnotify inode eviction to evict_inodes()
>       vfs, lsm: rework lsm inode eviction at unmount
> 
>  fs/inode.c                    |  52 +++++++++++++---
>  fs/notify/fsnotify.c          |  60 -------------------
>  fs/super.c                    |   8 +--
>  include/linux/lsm_hook_defs.h |   2 +-
>  include/linux/security.h      |   2 +-
>  security/landlock/fs.c        | 134 ++++++++++--------------------------------
>  security/security.c           |  31 ++++++----
> 7 files changed, 99 insertions(+), 190 deletions(-)
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-05 15:21                           ` Mickaël Salaün
@ 2024-10-05 16:03                             ` Mickaël Salaün
  2024-10-05 16:03                             ` Paul Moore
  1 sibling, 0 replies; 72+ messages in thread
From: Mickaël Salaün @ 2024-10-05 16:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet, torvalds, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein,
	Paul Moore, Günther Noack

On Sat, Oct 05, 2024 at 05:21:30PM +0200, Mickaël Salaün wrote:
> On Sat, Oct 05, 2024 at 08:57:32AM +1000, Dave Chinner wrote:

> > Actually, it's not as bad as I thought it was going to be. I've
> > already moved both fsnotify and LSM inode eviction to
> > evict_inodes() as preparatory patches...
> 
> Good, please Cc me and Günther on related patch series.
> 
> FYI, we have the two release_inodes tests to check this hook in
> tools/testing/selftests/landlock/fs_test.c
> 
> > 
> > Dave Chinner (2):
> >       vfs: move fsnotify inode eviction to evict_inodes()
> >       vfs, lsm: rework lsm inode eviction at unmount
> > 
> >  fs/inode.c                    |  52 +++++++++++++---
> >  fs/notify/fsnotify.c          |  60 -------------------
> >  fs/super.c                    |   8 +--
> >  include/linux/lsm_hook_defs.h |   2 +-
> >  include/linux/security.h      |   2 +-
> >  security/landlock/fs.c        | 134 ++++++++++--------------------------------

Please run clang-format -i security/landlock/fs.c

> >  security/security.c           |  31 ++++++----
> > 7 files changed, 99 insertions(+), 190 deletions(-)
> > 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-05 15:21                           ` Mickaël Salaün
  2024-10-05 16:03                             ` Mickaël Salaün
@ 2024-10-05 16:03                             ` Paul Moore
  1 sibling, 0 replies; 72+ messages in thread
From: Paul Moore @ 2024-10-05 16:03 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Dave Chinner, Christian Brauner, Jan Kara, Christoph Hellwig,
	linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	torvalds, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein, Günther Noack

On Sat, Oct 5, 2024 at 11:21 AM Mickaël Salaün <mic@digikod.net> wrote:
> On Sat, Oct 05, 2024 at 08:57:32AM +1000, Dave Chinner wrote:

...

> > Actually, it's not as bad as I thought it was going to be. I've
> > already moved both fsnotify and LSM inode eviction to
> > evict_inodes() as preparatory patches...
>
> Good, please Cc me and Günther on related patch series.

As well as the LSM list since the LSM framework looks to have some
changes as well.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-03 11:57       ` Jan Kara
  2024-10-03 12:11         ` Christoph Hellwig
@ 2024-10-07 20:37         ` Linus Torvalds
  2024-10-07 23:33           ` Dave Chinner
  1 sibling, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2024-10-07 20:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, Mickaël Salaün,
	Jann Horn, Serge Hallyn, Kees Cook, linux-security-module,
	Amir Goldstein

On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
>
> Fair enough. If we go with the iterator variant I've suggested to Dave in
> [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> Landlocks hook_sb_delete() into a single iteration relatively easily. But
> I'd wait with that convertion until this series lands.

Honza, I looked at this a bit more, particularly with an eye of "what
happens if we just end up making the inode lifetimes subject to the
dentry lifetimes" as suggested by Dave elsewhere.

And honestly, the whole inode list use by the fsnotify layer seems to
kind of suck. But I may be entirely missing something, so maybe I'm
very wrong for some reason.

The reason I say it "seems to kind of suck" is that the whole final

                /* for each watch, send FS_UNMOUNT and then remove it */
                fsnotify_inode(inode, FS_UNMOUNT);

                fsnotify_inode_delete(inode);

sequence seems to be entirely timing-dependent, and largely pointless and wrong.

Why?

Because inodes with no users will get removed at completely arbitrary
times under memory pressure in evict() -> destroy_inode(), and
obviously with I_DONTCACHE that ends up happening even earlier when
the dentry is removed.

So the whole "send FS_UNMOUNT and then remove it " thing seems to be
entirely bogus, and depending on memory pressure, lots of inodes will
only see the fsnotify_inode_delete() at eviction time and never get
the FS_UNMOUNT notification anyway.

So I get the feeling that we'd be better off entirely removing the
sb->s_inodes use from fsnotify, and replace this "get rid of them at
umount" with something like this instead:

  diff --git a/fs/dcache.c b/fs/dcache.c
  index 0f6b16ba30d0..aa2558de8d1f 100644
  --- a/fs/dcache.c
  +++ b/fs/dcache.c
  @@ -406,6 +406,7 @@ static void dentry_unlink_inode(struct dentry * dentry)
        spin_unlock(&inode->i_lock);
        if (!inode->i_nlink)
                fsnotify_inoderemove(inode);
  +     fsnotify_inode_delete(inode);
        if (dentry->d_op && dentry->d_op->d_iput)
                dentry->d_op->d_iput(dentry, inode);
        else
  diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
  index 278620e063ab..ea91cc216028 100644
  --- a/include/linux/fsnotify.h
  +++ b/include/linux/fsnotify.h
  @@ -261,7 +261,6 @@ static inline void
fsnotify_vfsmount_delete(struct vfsmount *mnt)
   static inline void fsnotify_inoderemove(struct inode *inode)
   {
        fsnotify_inode(inode, FS_DELETE_SELF);
  -     __fsnotify_inode_delete(inode);
   }

   /*

which makes the fsnotify_inode_delete() happen when the inode is
removed from the dentry.

Then at umount time, the dentry shrinking will deal with all live
dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
just the root dentry inodes?

Wouldn't that make things much cleaner, and remove at least *one* odd
use of the nasty s_inodes list?

I have this feeling that maybe we can just remove the other users too
using similar models. I think the LSM layer use (in landlock) is bogus
for exactly the same reason - there's really no reason to keep things
around for a random cached inode without a dentry.

And I wonder if the quota code (which uses the s_inodes list to enable
quotas on already mounted filesystems) could for all the same reasons
just walk the dentry tree instead (and remove_dquot_ref similarly
could just remove it at dentry_unlink_inode() time)?

It really feels like most (all?) of the s_inode list users are
basically historical, and shouldn't use that list at all. And there
aren't _that_ many of them. I think Dave was right in just saying that
this list should go away entirely (or was it somebody else who made
that comment?)

                   Linus

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-07 20:37         ` Linus Torvalds
@ 2024-10-07 23:33           ` Dave Chinner
  2024-10-08  0:28             ` Linus Torvalds
  2024-10-08  8:57             ` Amir Goldstein
  0 siblings, 2 replies; 72+ messages in thread
From: Dave Chinner @ 2024-10-07 23:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, Mickaël Salaün,
	Jann Horn, Serge Hallyn, Kees Cook, linux-security-module,
	Amir Goldstein

On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
> >
> > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > I'd wait with that convertion until this series lands.
> 
> Honza, I looked at this a bit more, particularly with an eye of "what
> happens if we just end up making the inode lifetimes subject to the
> dentry lifetimes" as suggested by Dave elsewhere.

....

> which makes the fsnotify_inode_delete() happen when the inode is
> removed from the dentry.

There may be other inode references being held that make
the inode live longer than the dentry cache. When should the
fsnotify marks be removed from the inode in that case? Do they need
to remain until, e.g, writeback completes?

> Then at umount time, the dentry shrinking will deal with all live
> dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> just the root dentry inodes?

I don't think even that is necessary, because
shrink_dcache_for_umount() drops the sb->s_root dentry after
trimming the dentry tree. Hence the dcache drop would cleanup all
inode references, roots included.

> Wouldn't that make things much cleaner, and remove at least *one* odd
> use of the nasty s_inodes list?

Yes, it would, but someone who knows exactly when the fsnotify
marks can be removed needs to chime in here...

> I have this feeling that maybe we can just remove the other users too
> using similar models. I think the LSM layer use (in landlock) is bogus
> for exactly the same reason - there's really no reason to keep things
> around for a random cached inode without a dentry.

Perhaps, but I'm not sure what the landlock code is actually trying
to do. It seems to be trying to avoid races between syscalls
releasing inode references and unmount calling security_sb_delete()
to clean up inode references that it has leaked. This implies that
it's not a) tracking inodes itself, and b) not cleaning up internal
state early enough in unmount.

Hence, to me, the lifecycle and reference counting of inode related
objects in landlock doesn't seem quite right, and the use of the
security_sb_delete() callout appears to be papering over an internal
lifecycle issue.

I'd love to get rid of it altogether.

> And I wonder if the quota code (which uses the s_inodes list to enable
> quotas on already mounted filesystems) could for all the same reasons
> just walk the dentry tree instead (and remove_dquot_ref similarly
> could just remove it at dentry_unlink_inode() time)?

I don't think that will work because we have to be able to modify
quota in evict() processing. This is especially true for unlinked
inodes being evicted from cache, but also the dquots need to stay
attached until writeback completes.

Hence I don't think we can remove the quota refs from the inode
before we call iput_final(), and so I think quotaoff (at least)
still needs to iterate inodes...

> It really feels like most (all?) of the s_inode list users are
> basically historical, and shouldn't use that list at all. And there
> aren't _that_ many of them. I think Dave was right in just saying that
> this list should go away entirely (or was it somebody else who made
> that comment?)

Yeah, I said that it should go away entirely.

My view of this whole s_inodes list is that subsystems that are
taking references to inodes *must* track or manage the references to
the inodes themselves.

The canonical example is the VFS itself: evict_inodes() doesn't need
to iterate s_inodes at all. It can walk the inode LRU to purge all
the unreferenced cached inodes from memory. iput_final() guarantees
that all unreferenced inodes are either put on the LRU or torn down
immediately.

Hence I think that it is a poor architectural decision to require
superblock teardown to clean up inode references random subsystems
have *leaked* to prevent UAFs.  It forces the sb to track all
inodes whether the VFS actually needs to track them or not.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-07 23:33           ` Dave Chinner
@ 2024-10-08  0:28             ` Linus Torvalds
  2024-10-08  0:54               ` Linus Torvalds
  2024-10-08 12:59               ` Mickaël Salaün
  2024-10-08  8:57             ` Amir Goldstein
  1 sibling, 2 replies; 72+ messages in thread
From: Linus Torvalds @ 2024-10-08  0:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, Mickaël Salaün,
	Jann Horn, Serge Hallyn, Kees Cook, linux-security-module,
	Amir Goldstein

On Mon, 7 Oct 2024 at 16:33, Dave Chinner <david@fromorbit.com> wrote:
>
> There may be other inode references being held that make
> the inode live longer than the dentry cache. When should the
> fsnotify marks be removed from the inode in that case? Do they need
> to remain until, e.g, writeback completes?

Note that my idea is to just remove the fsnotify marks when the dentry
discards the inode.

That means that yes, the inode may still have a lifetime after the
dentry (because of other references, _or_ just because I_DONTCACHE
isn't set and we keep caching the inode).

BUT - fsnotify won't care. There won't be any fsnotify marks on that
inode any more, and without a dentry that points to it, there's no way
to add such marks.

(A new dentry may be re-attached to such an inode, and then fsnotify
could re-add new marks, but that doesn't change anything - the next
time the dentry is detached, the marks would go away again).

And yes, this changes the timing on when fsnotify events happen, but
what I'm actually hoping for is that Jan will agree that it doesn't
actually matter semantically.

> > Then at umount time, the dentry shrinking will deal with all live
> > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > just the root dentry inodes?
>
> I don't think even that is necessary, because
> shrink_dcache_for_umount() drops the sb->s_root dentry after
> trimming the dentry tree. Hence the dcache drop would cleanup all
> inode references, roots included.

Ahh - even better.

I didn't actually look very closely at the actual umount path, I was
looking just at the fsnotify_inoderemove() place in
dentry_unlink_inode() and went "couldn't we do _this_ instead?"

> > Wouldn't that make things much cleaner, and remove at least *one* odd
> > use of the nasty s_inodes list?
>
> Yes, it would, but someone who knows exactly when the fsnotify
> marks can be removed needs to chime in here...

Yup. Honza?

(Aside: I don't actually know if you prefer Jan or Honza, so I use
both randomly and interchangeably?)

> > I have this feeling that maybe we can just remove the other users too
> > using similar models. I think the LSM layer use (in landlock) is bogus
> > for exactly the same reason - there's really no reason to keep things
> > around for a random cached inode without a dentry.
>
> Perhaps, but I'm not sure what the landlock code is actually trying
> to do.

Yeah, I wouldn't be surprised if it's just confused - it's very odd.

But I'd be perfectly happy just removing one use at a time - even if
we keep the s_inodes list around because of other users, it would
still be "one less thing".

> Hence, to me, the lifecycle and reference counting of inode related
> objects in landlock doesn't seem quite right, and the use of the
> security_sb_delete() callout appears to be papering over an internal
> lifecycle issue.
>
> I'd love to get rid of it altogether.

Yeah, I think the inode lifetime is just so random these days that
anything that depends on it is questionable.

The quota case is probably the only thing where the inode lifetime
*really* makes sense, and that's the one where I looked at the code
and went "I *hope* this can be converted to traversing the dentry
tree", but at the same time it did look sensible to make it be about
inodes.

If we can convert the quota side to be based on dentry lifetimes, it
will almost certainly then have to react to the places that do
"d_add()" when re-connecting an inode to a dentry at lookup time.

So yeah, the quota code looks worse, but even if we could just remove
fsnotify and landlock, I'd still be much happier.

             Linus

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08  0:28             ` Linus Torvalds
@ 2024-10-08  0:54               ` Linus Torvalds
  2024-10-09  9:49                 ` Jan Kara
  2024-10-08 12:59               ` Mickaël Salaün
  1 sibling, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2024-10-08  0:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-bcachefs, kent.overstreet, Mickaël Salaün,
	Jann Horn, Serge Hallyn, Kees Cook, linux-security-module,
	Amir Goldstein

On Mon, 7 Oct 2024 at 17:28, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And yes, this changes the timing on when fsnotify events happen, but
> what I'm actually hoping for is that Jan will agree that it doesn't
> actually matter semantically.

.. and yes, I realize it might actually matter. fsnotify does do
'ihold()' to hold an inode ref, and with this that would actually be
more or less pointless, because the mark would be removed _despite_
such a ref.

So maybe it's not an option to do what I suggested. I don't know the
users well enough.

         Linus

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-07 23:33           ` Dave Chinner
  2024-10-08  0:28             ` Linus Torvalds
@ 2024-10-08  8:57             ` Amir Goldstein
  2024-10-08 11:23               ` Jan Kara
  1 sibling, 1 reply; 72+ messages in thread
From: Amir Goldstein @ 2024-10-08  8:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module

On Tue, Oct 8, 2024 at 1:33 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> > On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
> > >
> > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > > I'd wait with that convertion until this series lands.
> >
> > Honza, I looked at this a bit more, particularly with an eye of "what
> > happens if we just end up making the inode lifetimes subject to the
> > dentry lifetimes" as suggested by Dave elsewhere.
>
> ....
>
> > which makes the fsnotify_inode_delete() happen when the inode is
> > removed from the dentry.
>
> There may be other inode references being held that make
> the inode live longer than the dentry cache. When should the
> fsnotify marks be removed from the inode in that case? Do they need
> to remain until, e.g, writeback completes?
>

fsnotify inode marks remain until explicitly removed or until sb
is unmounted (*), so other inode references are irrelevant to
inode mark removal.

(*) fanotify has "evictable" inode marks, which do not hold inode
reference and go away on inode evict, but those mark evictions
do not generate any event (i.e. there is no FAN_UNMOUNT).

> > Then at umount time, the dentry shrinking will deal with all live
> > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > just the root dentry inodes?
>
> I don't think even that is necessary, because
> shrink_dcache_for_umount() drops the sb->s_root dentry after
> trimming the dentry tree. Hence the dcache drop would cleanup all
> inode references, roots included.
>
> > Wouldn't that make things much cleaner, and remove at least *one* odd
> > use of the nasty s_inodes list?
>
> Yes, it would, but someone who knows exactly when the fsnotify
> marks can be removed needs to chime in here...
>
> > I have this feeling that maybe we can just remove the other users too
> > using similar models. I think the LSM layer use (in landlock) is bogus
> > for exactly the same reason - there's really no reason to keep things
> > around for a random cached inode without a dentry.
>
> Perhaps, but I'm not sure what the landlock code is actually trying
> to do. It seems to be trying to avoid races between syscalls
> releasing inode references and unmount calling security_sb_delete()
> to clean up inode references that it has leaked. This implies that
> it's not a) tracking inodes itself, and b) not cleaning up internal
> state early enough in unmount.
>
> Hence, to me, the lifecycle and reference counting of inode related
> objects in landlock doesn't seem quite right, and the use of the
> security_sb_delete() callout appears to be papering over an internal
> lifecycle issue.
>
> I'd love to get rid of it altogether.
>
> > And I wonder if the quota code (which uses the s_inodes list to enable
> > quotas on already mounted filesystems) could for all the same reasons
> > just walk the dentry tree instead (and remove_dquot_ref similarly
> > could just remove it at dentry_unlink_inode() time)?
>
> I don't think that will work because we have to be able to modify
> quota in evict() processing. This is especially true for unlinked
> inodes being evicted from cache, but also the dquots need to stay
> attached until writeback completes.
>
> Hence I don't think we can remove the quota refs from the inode
> before we call iput_final(), and so I think quotaoff (at least)
> still needs to iterate inodes...
>
> > It really feels like most (all?) of the s_inode list users are
> > basically historical, and shouldn't use that list at all. And there
> > aren't _that_ many of them. I think Dave was right in just saying that
> > this list should go away entirely (or was it somebody else who made
> > that comment?)
>
> Yeah, I said that it should go away entirely.
>
> My view of this whole s_inodes list is that subsystems that are
> taking references to inodes *must* track or manage the references to
> the inodes themselves.
>
> The canonical example is the VFS itself: evict_inodes() doesn't need
> to iterate s_inodes at all. It can walk the inode LRU to purge all
> the unreferenced cached inodes from memory. iput_final() guarantees
> that all unreferenced inodes are either put on the LRU or torn down
> immediately.
>
> Hence I think that it is a poor architectural decision to require
> superblock teardown to clean up inode references random subsystems
> have *leaked* to prevent UAFs.  It forces the sb to track all
> inodes whether the VFS actually needs to track them or not.
>

For fsnotify, I think we can/should maintain a list of marked inodes
inside sb->s_fsnotify_info, we can iterate this private list in
fsnotify_unmount_inodes() to remove the marks.

TBH, I am not sure I understand the suggested change for inode
lifetime. An inode can have a reference from dentry or from some
subsystem (e.g. fsnotify) which is responsible for putting their held
reference before unmount. What is the alternative?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08  8:57             ` Amir Goldstein
@ 2024-10-08 11:23               ` Jan Kara
  2024-10-08 12:16                 ` Christian Brauner
  2024-10-08 23:44                 ` Dave Chinner
  0 siblings, 2 replies; 72+ messages in thread
From: Jan Kara @ 2024-10-08 11:23 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, Linus Torvalds, Jan Kara, Christoph Hellwig,
	linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module

On Tue 08-10-24 10:57:22, Amir Goldstein wrote:
> On Tue, Oct 8, 2024 at 1:33 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> > > On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > > > I'd wait with that convertion until this series lands.
> > >
> > > Honza, I looked at this a bit more, particularly with an eye of "what
> > > happens if we just end up making the inode lifetimes subject to the
> > > dentry lifetimes" as suggested by Dave elsewhere.
> >
> > ....
> >
> > > which makes the fsnotify_inode_delete() happen when the inode is
> > > removed from the dentry.
> >
> > There may be other inode references being held that make
> > the inode live longer than the dentry cache. When should the
> > fsnotify marks be removed from the inode in that case? Do they need
> > to remain until, e.g, writeback completes?
> >
> 
> fsnotify inode marks remain until explicitly removed or until sb
> is unmounted (*), so other inode references are irrelevant to
> inode mark removal.
> 
> (*) fanotify has "evictable" inode marks, which do not hold inode
> reference and go away on inode evict, but those mark evictions
> do not generate any event (i.e. there is no FAN_UNMOUNT).

Yes. Amir beat me with the response so let me just add that FS_UMOUNT event
is for inotify which guarantees that either you get an event about somebody
unlinking the inode (e.g. IN_DELETE_SELF) or event about filesystem being
unmounted (IN_UMOUNT) if you place mark on some inode. I also don't see how
we would maintain this behavior with what Linus proposes.

> > > Then at umount time, the dentry shrinking will deal with all live
> > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > just the root dentry inodes?
> >
> > I don't think even that is necessary, because
> > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > trimming the dentry tree. Hence the dcache drop would cleanup all
> > inode references, roots included.
> >
> > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > use of the nasty s_inodes list?
> >
> > Yes, it would, but someone who knows exactly when the fsnotify
> > marks can be removed needs to chime in here...

So fsnotify needs a list of inodes for the superblock which have marks
attached and for which we hold inode reference. We can keep it inside
fsnotify code although it would practically mean another list_head for the
inode for this list (probably in our fsnotify_connector structure which
connects list of notification marks to the inode). If we actually get rid
of i_sb_list in struct inode, this will be a win for the overall system,
otherwise it is a net loss IMHO. So if we can figure out how to change
other s_inodes owners we can certainly do this fsnotify change.

> > > And I wonder if the quota code (which uses the s_inodes list to enable
> > > quotas on already mounted filesystems) could for all the same reasons
> > > just walk the dentry tree instead (and remove_dquot_ref similarly
> > > could just remove it at dentry_unlink_inode() time)?
> >
> > I don't think that will work because we have to be able to modify
> > quota in evict() processing. This is especially true for unlinked
> > inodes being evicted from cache, but also the dquots need to stay
> > attached until writeback completes.
> >
> > Hence I don't think we can remove the quota refs from the inode
> > before we call iput_final(), and so I think quotaoff (at least)
> > still needs to iterate inodes...

Yeah, I'm not sure how to get rid of the s_inodes use in quota code. One of
the things we need s_inodes list for is during quotaoff on a mounted
filesystem when we need to iterate all inodes which are referencing quota
structures and free them.  In theory we could keep a list of inodes
referencing quota structures but that would require adding list_head to
inode structure for filesystems that support quotas. Now for the sake of
full context I'll also say that enabling / disabling quotas on a mounted
filesystem is a legacy feature because it is quite easy that quota
accounting goes wrong with it. So ext4 and f2fs support for quite a few
years a mode where quota tracking is enabled on mount and disabled on
unmount (if appropriate fs feature is enabled) and you can only enable /
disable enforcement of quota limits during runtime.  So I could see us
deprecating this functionality altogether although jfs never adapted to
this new way we do quotas so we'd have to deal with that somehow.  But one
way or another it would take a significant amount of time before we can
completely remove this so it is out of question for this series.

I see one problem with the idea "whoever has a need to iterate inodes needs
to keep track of inodes it needs to iterate through". It is fine
conceptually but with s_inodes list we pay the cost only once and multiple
users benefit. With each subsystem tracking inodes we pay the cost for each
user (both in terms of memory and CPU). So if you don't use any of the
subsystems that need iteration, you win, but if you use two or more of
these subsystems, in particular those which need to track significant
portion of all inodes, you are losing.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08 11:23               ` Jan Kara
@ 2024-10-08 12:16                 ` Christian Brauner
  2024-10-09  0:03                   ` Dave Chinner
  2024-10-08 23:44                 ` Dave Chinner
  1 sibling, 1 reply; 72+ messages in thread
From: Christian Brauner @ 2024-10-08 12:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, Dave Chinner, Linus Torvalds, Christoph Hellwig,
	linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module

On Tue, Oct 08, 2024 at 01:23:44PM GMT, Jan Kara wrote:
> On Tue 08-10-24 10:57:22, Amir Goldstein wrote:
> > On Tue, Oct 8, 2024 at 1:33 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> > > > On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > > > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > > > > I'd wait with that convertion until this series lands.
> > > >
> > > > Honza, I looked at this a bit more, particularly with an eye of "what
> > > > happens if we just end up making the inode lifetimes subject to the
> > > > dentry lifetimes" as suggested by Dave elsewhere.
> > >
> > > ....
> > >
> > > > which makes the fsnotify_inode_delete() happen when the inode is
> > > > removed from the dentry.
> > >
> > > There may be other inode references being held that make
> > > the inode live longer than the dentry cache. When should the
> > > fsnotify marks be removed from the inode in that case? Do they need
> > > to remain until, e.g, writeback completes?
> > >
> > 
> > fsnotify inode marks remain until explicitly removed or until sb
> > is unmounted (*), so other inode references are irrelevant to
> > inode mark removal.
> > 
> > (*) fanotify has "evictable" inode marks, which do not hold inode
> > reference and go away on inode evict, but those mark evictions
> > do not generate any event (i.e. there is no FAN_UNMOUNT).
> 
> Yes. Amir beat me with the response so let me just add that FS_UMOUNT event
> is for inotify which guarantees that either you get an event about somebody
> unlinking the inode (e.g. IN_DELETE_SELF) or event about filesystem being
> unmounted (IN_UMOUNT) if you place mark on some inode. I also don't see how
> we would maintain this behavior with what Linus proposes.
> 
> > > > Then at umount time, the dentry shrinking will deal with all live
> > > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > > just the root dentry inodes?
> > >
> > > I don't think even that is necessary, because
> > > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > > trimming the dentry tree. Hence the dcache drop would cleanup all
> > > inode references, roots included.
> > >
> > > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > > use of the nasty s_inodes list?
> > >
> > > Yes, it would, but someone who knows exactly when the fsnotify
> > > marks can be removed needs to chime in here...
> 
> So fsnotify needs a list of inodes for the superblock which have marks
> attached and for which we hold inode reference. We can keep it inside
> fsnotify code although it would practically mean another list_head for the
> inode for this list (probably in our fsnotify_connector structure which
> connects list of notification marks to the inode). If we actually get rid
> of i_sb_list in struct inode, this will be a win for the overall system,
> otherwise it is a net loss IMHO. So if we can figure out how to change
> other s_inodes owners we can certainly do this fsnotify change.
> 
> > > > And I wonder if the quota code (which uses the s_inodes list to enable
> > > > quotas on already mounted filesystems) could for all the same reasons
> > > > just walk the dentry tree instead (and remove_dquot_ref similarly
> > > > could just remove it at dentry_unlink_inode() time)?
> > >
> > > I don't think that will work because we have to be able to modify
> > > quota in evict() processing. This is especially true for unlinked
> > > inodes being evicted from cache, but also the dquots need to stay
> > > attached until writeback completes.
> > >
> > > Hence I don't think we can remove the quota refs from the inode
> > > before we call iput_final(), and so I think quotaoff (at least)
> > > still needs to iterate inodes...
> 
> Yeah, I'm not sure how to get rid of the s_inodes use in quota code. One of
> the things we need s_inodes list for is during quotaoff on a mounted
> filesystem when we need to iterate all inodes which are referencing quota
> structures and free them.  In theory we could keep a list of inodes
> referencing quota structures but that would require adding list_head to
> inode structure for filesystems that support quotas. Now for the sake of
> full context I'll also say that enabling / disabling quotas on a mounted
> filesystem is a legacy feature because it is quite easy that quota
> accounting goes wrong with it. So ext4 and f2fs support for quite a few
> years a mode where quota tracking is enabled on mount and disabled on
> unmount (if appropriate fs feature is enabled) and you can only enable /
> disable enforcement of quota limits during runtime.  So I could see us
> deprecating this functionality altogether although jfs never adapted to
> this new way we do quotas so we'd have to deal with that somehow.  But one
> way or another it would take a significant amount of time before we can
> completely remove this so it is out of question for this series.

I still maintain that we don't need to solve the fsnotify and lsm rework
as part of this particular series.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08  0:28             ` Linus Torvalds
  2024-10-08  0:54               ` Linus Torvalds
@ 2024-10-08 12:59               ` Mickaël Salaün
  2024-10-09  0:21                 ` Dave Chinner
  1 sibling, 1 reply; 72+ messages in thread
From: Mickaël Salaün @ 2024-10-08 12:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein,
	Günther Noack

On Mon, Oct 07, 2024 at 05:28:57PM -0700, Linus Torvalds wrote:
> On Mon, 7 Oct 2024 at 16:33, Dave Chinner <david@fromorbit.com> wrote:
> >
> > There may be other inode references being held that make
> > the inode live longer than the dentry cache. When should the
> > fsnotify marks be removed from the inode in that case? Do they need
> > to remain until, e.g, writeback completes?
> 
> Note that my idea is to just remove the fsnotify marks when the dentry
> discards the inode.
> 
> That means that yes, the inode may still have a lifetime after the
> dentry (because of other references, _or_ just because I_DONTCACHE
> isn't set and we keep caching the inode).
> 
> BUT - fsnotify won't care. There won't be any fsnotify marks on that
> inode any more, and without a dentry that points to it, there's no way
> to add such marks.
> 
> (A new dentry may be re-attached to such an inode, and then fsnotify
> could re-add new marks, but that doesn't change anything - the next
> time the dentry is detached, the marks would go away again).
> 
> And yes, this changes the timing on when fsnotify events happen, but
> what I'm actually hoping for is that Jan will agree that it doesn't
> actually matter semantically.
> 
> > > Then at umount time, the dentry shrinking will deal with all live
> > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > just the root dentry inodes?
> >
> > I don't think even that is necessary, because
> > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > trimming the dentry tree. Hence the dcache drop would cleanup all
> > inode references, roots included.
> 
> Ahh - even better.
> 
> I didn't actually look very closely at the actual umount path, I was
> looking just at the fsnotify_inoderemove() place in
> dentry_unlink_inode() and went "couldn't we do _this_ instead?"
> 
> > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > use of the nasty s_inodes list?
> >
> > Yes, it would, but someone who knows exactly when the fsnotify
> > marks can be removed needs to chime in here...
> 
> Yup. Honza?
> 
> (Aside: I don't actually know if you prefer Jan or Honza, so I use
> both randomly and interchangeably?)
> 
> > > I have this feeling that maybe we can just remove the other users too
> > > using similar models. I think the LSM layer use (in landlock) is bogus
> > > for exactly the same reason - there's really no reason to keep things
> > > around for a random cached inode without a dentry.
> >
> > Perhaps, but I'm not sure what the landlock code is actually trying
> > to do.

In Landlock, inodes (see landlock_object) may be referenced by several
rulesets, either tied to a task's cred or a ruleset's file descriptor.
A ruleset may outlive its referenced inodes, and this should not block
related umounts.  security_sb_delete() is used to gracefully release
such references.

> 
> Yeah, I wouldn't be surprised if it's just confused - it's very odd.
> 
> But I'd be perfectly happy just removing one use at a time - even if
> we keep the s_inodes list around because of other users, it would
> still be "one less thing".
> 
> > Hence, to me, the lifecycle and reference counting of inode related
> > objects in landlock doesn't seem quite right, and the use of the
> > security_sb_delete() callout appears to be papering over an internal
> > lifecycle issue.
> >
> > I'd love to get rid of it altogether.

I'm not sure to fully understand the implications for now, but it would
definitely be good to simplify this lifetime management.  The only
requirement for Landlock is that inodes references should live as long
as the related inodes are accessible by user space or already in use.
The sooner these references are removed from related ruleset, the
better.

> 
> Yeah, I think the inode lifetime is just so random these days that
> anything that depends on it is questionable.
> 
> The quota case is probably the only thing where the inode lifetime
> *really* makes sense, and that's the one where I looked at the code
> and went "I *hope* this can be converted to traversing the dentry
> tree", but at the same time it did look sensible to make it be about
> inodes.
> 
> If we can convert the quota side to be based on dentry lifetimes, it
> will almost certainly then have to react to the places that do
> "d_add()" when re-connecting an inode to a dentry at lookup time.
> 
> So yeah, the quota code looks worse, but even if we could just remove
> fsnotify and landlock, I'd still be much happier.
> 
>              Linus

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08 11:23               ` Jan Kara
  2024-10-08 12:16                 ` Christian Brauner
@ 2024-10-08 23:44                 ` Dave Chinner
  2024-10-09  6:10                   ` Amir Goldstein
  2024-10-09 14:18                   ` Jan Kara
  1 sibling, 2 replies; 72+ messages in thread
From: Dave Chinner @ 2024-10-08 23:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, Linus Torvalds, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module

On Tue, Oct 08, 2024 at 01:23:44PM +0200, Jan Kara wrote:
> On Tue 08-10-24 10:57:22, Amir Goldstein wrote:
> > On Tue, Oct 8, 2024 at 1:33 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> > > > On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > > > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > > > > I'd wait with that convertion until this series lands.
> > > >
> > > > Honza, I looked at this a bit more, particularly with an eye of "what
> > > > happens if we just end up making the inode lifetimes subject to the
> > > > dentry lifetimes" as suggested by Dave elsewhere.
> > >
> > > ....
> > >
> > > > which makes the fsnotify_inode_delete() happen when the inode is
> > > > removed from the dentry.
> > >
> > > There may be other inode references being held that make
> > > the inode live longer than the dentry cache. When should the
> > > fsnotify marks be removed from the inode in that case? Do they need
> > > to remain until, e.g, writeback completes?
> > >
> > 
> > fsnotify inode marks remain until explicitly removed or until sb
> > is unmounted (*), so other inode references are irrelevant to
> > inode mark removal.
> > 
> > (*) fanotify has "evictable" inode marks, which do not hold inode
> > reference and go away on inode evict, but those mark evictions
> > do not generate any event (i.e. there is no FAN_UNMOUNT).
> 
> Yes. Amir beat me with the response so let me just add that FS_UMOUNT event
> is for inotify which guarantees that either you get an event about somebody
> unlinking the inode (e.g. IN_DELETE_SELF) or event about filesystem being
> unmounted (IN_UMOUNT) if you place mark on some inode. I also don't see how
> we would maintain this behavior with what Linus proposes.

Thanks. I didn't respond last night when I read Amir's decription
because I wanted to think it over. Knowing where the unmount event
requirement certainly helps.

I am probably missing something important, but it really seems to me
that the object reference counting model is the back to
front.  Currently the mark is being attached to the inode and then
the inode pinned by a reference count to make the mark attached
to the inode persistent until unmount. This then requires the inodes
to be swept by unmount because fsnotify has effectively leaked them
as it isn't tracking such inodes itself.

[ Keep in mind that I'm not saying this was a bad or wrong thing to
do because the s_inodes list was there to be able to do this sort of
lazy cleanup. But now that we want to remove the s_inodes list if at
all possible, it is a problem we need to solve differently. ]

AFAICT, inotify does not appear to require the inode to send events
- it only requires access to the inode mark itself. Hence it does
not the inode in cache to generate IN_UNMOUNT events, it just
needs the mark itself to be findable at unmount.  Do any of the
other backends that require unmount notifications that require
special access to the inode itself?

If not, and the fsnotify sb info is tracking these persistent marks,
then we don't need to iterate inodes at unmount. This means we don't
need to pin inodes when they have marks attached, and so the
dependency on the s_inodes list goes away.

With this inverted model, we need the first fsnotify event callout
after the inode is instantiated to look for a persistent mark for
the inode. We know how to do this efficiently - it's exactly the
same caching model we use for ACLs. On the first lookup, we check
the inode for ACL data and set the ACL pointer appropriately to
indicate that a lookup has been done and there are no ACLs
associated with the inode.

At this point, the fsnotify inode marks can all be removed from the
inode when it is being evicted and there's no need for fsnotify to
pin inodes at all.

> > > > Then at umount time, the dentry shrinking will deal with all live
> > > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > > just the root dentry inodes?
> > >
> > > I don't think even that is necessary, because
> > > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > > trimming the dentry tree. Hence the dcache drop would cleanup all
> > > inode references, roots included.
> > >
> > > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > > use of the nasty s_inodes list?
> > >
> > > Yes, it would, but someone who knows exactly when the fsnotify
> > > marks can be removed needs to chime in here...
> 
> So fsnotify needs a list of inodes for the superblock which have marks
> attached and for which we hold inode reference. We can keep it inside
> fsnotify code although it would practically mean another list_head for the
> inode for this list (probably in our fsnotify_connector structure which
> connects list of notification marks to the inode).

I don't think that is necessary. We need to get rid of the inode
reference, not move where we track inode references. The persistent
object is the fsnotify mark, not the cached inode. It's the mark
that needs to be persistent, and that's what the fsnotify code
should be tracking.

The fsnotify marks are much smaller than inodes, and there going to
be fewer cached marks than inodes, especially once inode pinning is
removed. Hence I think this will result in a net reduction in memory
footprint for "marked-until-unmount" configurations as we won't pin
nearly as many inodes in cache...

> If we actually get rid
> of i_sb_list in struct inode, this will be a win for the overall system,
> otherwise it is a net loss IMHO. So if we can figure out how to change
> other s_inodes owners we can certainly do this fsnotify change.

Yes, I am exploring what it would take to get rid of i_sb_list
altogether right now. That, I don't think this is a concern given
the difference in memory footprint of the same number of persistent
marks. i.e. "persistent mark, reclaimable inode" will always have a
significantly lower memory footprint than "persistent inode and
mark" under memory pressure....

> > > > And I wonder if the quota code (which uses the s_inodes list
> > > > to enable quotas on already mounted filesystems) could for
> > > > all the same reasons just walk the dentry tree instead (and
> > > > remove_dquot_ref similarly could just remove it at
> > > > dentry_unlink_inode() time)?
> > >
> > > I don't think that will work because we have to be able to
> > > modify quota in evict() processing. This is especially true
> > > for unlinked inodes being evicted from cache, but also the
> > > dquots need to stay attached until writeback completes.
> > >
> > > Hence I don't think we can remove the quota refs from the
> > > inode before we call iput_final(), and so I think quotaoff (at
> > > least) still needs to iterate inodes...
> 
> Yeah, I'm not sure how to get rid of the s_inodes use in quota
> code. One of the things we need s_inodes list for is during
> quotaoff on a mounted filesystem when we need to iterate all
> inodes which are referencing quota structures and free them.  In
> theory we could keep a list of inodes referencing quota structures
> but that would require adding list_head to inode structure for
> filesystems that support quotas.

I don't think that's quite true. Quota is not modular, so we can
lazily free quota objects even when quota is turned off. All we need
to ensure is that code checks whether quota is enabled, not for the
existence of quota objects attached to the inode.

Hence quota-off simply turns off all the quota operations in memory,
and normal inode eviction cleans up the stale quota objects
naturally.

My main question is why the quota-on add_dquot_ref() pass is
required. AFAICT all of the filesystem operations that will modify
quota call dquot_initialize() directly to attach the required dquots
to the inode before the operation is started. If that's true, then
why does quota-on need to do this for all the inodes that are
already in cache?

i.e. I'm not sure I understand why we need quota to do these
iterations at all...

> Now for the sake of
> full context I'll also say that enabling / disabling quotas on a mounted
> filesystem is a legacy feature because it is quite easy that quota
> accounting goes wrong with it. So ext4 and f2fs support for quite a few
> years a mode where quota tracking is enabled on mount and disabled on
> unmount (if appropriate fs feature is enabled) and you can only enable /
> disable enforcement of quota limits during runtime.

Sure, this is how XFS works, too. But I think this behaviour is
largely irrelevant because there are still filesystems out there
that do stuff the old way...

> So I could see us
> deprecating this functionality altogether although jfs never adapted to
> this new way we do quotas so we'd have to deal with that somehow.  But one
> way or another it would take a significant amount of time before we can
> completely remove this so it is out of question for this series.

I'm not sure that matters, though it adds to the reasons why we
should be removing old, unmaintained filesystems from the tree
and old, outdated formats from maintained filesystems....

> I see one problem with the idea "whoever has a need to iterate inodes needs
> to keep track of inodes it needs to iterate through". It is fine
> conceptually but with s_inodes list we pay the cost only once and multiple
> users benefit. With each subsystem tracking inodes we pay the cost for each
> user (both in terms of memory and CPU). So if you don't use any of the
> subsystems that need iteration, you win, but if you use two or more of
> these subsystems, in particular those which need to track significant
> portion of all inodes, you are losing.

AFAICT, most of the subsystems don't need to track inodes directly.

We don't need s_inodes for evict_inodes() - we have the inode LRU
tracking all unreferenced inodes on the superblock. The GFS2 use
case can probably walk the inode LRU directly, too.

It looks to me that we can avoid needing unmount iteration for
fsnotify, and I suspect landlock can likely use the same persistence
inversion as fsnotify (same persistent ruleset model).

The bdev superblock can implement it's own internal list using
inode->i_devices as this list_head is only used by chardev
inodes.

All that then remains is the page cache dropping code, and that's
not really critical to have exacting behaviour. We certainly
shouldn't be taking a runtime penalty just to optimise the rare
case of dropping caches..

IOWs, there aren't that many users, and I think there are ways to
make all these iterations go away without adding new per-inode
list heads to track inodes.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08 12:16                 ` Christian Brauner
@ 2024-10-09  0:03                   ` Dave Chinner
  0 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2024-10-09  0:03 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, Amir Goldstein, Linus Torvalds, Christoph Hellwig,
	linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module

On Tue, Oct 08, 2024 at 02:16:04PM +0200, Christian Brauner wrote:
> I still maintain that we don't need to solve the fsnotify and lsm rework
> as part of this particular series.

Sure, I heard you the first time. :)

However, the patchset I posted was just a means to start the
discussion with a concrete proposal. Now I'm trying to work out how
all the pieces of the bigger puzzle go together as people think
about what it means, not polish the first little step.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08 12:59               ` Mickaël Salaün
@ 2024-10-09  0:21                 ` Dave Chinner
  2024-10-09  9:23                   ` Mickaël Salaün
  0 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2024-10-09  0:21 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Linus Torvalds, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein,
	Günther Noack

On Tue, Oct 08, 2024 at 02:59:07PM +0200, Mickaël Salaün wrote:
> On Mon, Oct 07, 2024 at 05:28:57PM -0700, Linus Torvalds wrote:
> > On Mon, 7 Oct 2024 at 16:33, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > There may be other inode references being held that make
> > > the inode live longer than the dentry cache. When should the
> > > fsnotify marks be removed from the inode in that case? Do they need
> > > to remain until, e.g, writeback completes?
> > 
> > Note that my idea is to just remove the fsnotify marks when the dentry
> > discards the inode.
> > 
> > That means that yes, the inode may still have a lifetime after the
> > dentry (because of other references, _or_ just because I_DONTCACHE
> > isn't set and we keep caching the inode).
> > 
> > BUT - fsnotify won't care. There won't be any fsnotify marks on that
> > inode any more, and without a dentry that points to it, there's no way
> > to add such marks.
> > 
> > (A new dentry may be re-attached to such an inode, and then fsnotify
> > could re-add new marks, but that doesn't change anything - the next
> > time the dentry is detached, the marks would go away again).
> > 
> > And yes, this changes the timing on when fsnotify events happen, but
> > what I'm actually hoping for is that Jan will agree that it doesn't
> > actually matter semantically.
> > 
> > > > Then at umount time, the dentry shrinking will deal with all live
> > > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > > just the root dentry inodes?
> > >
> > > I don't think even that is necessary, because
> > > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > > trimming the dentry tree. Hence the dcache drop would cleanup all
> > > inode references, roots included.
> > 
> > Ahh - even better.
> > 
> > I didn't actually look very closely at the actual umount path, I was
> > looking just at the fsnotify_inoderemove() place in
> > dentry_unlink_inode() and went "couldn't we do _this_ instead?"
> > 
> > > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > > use of the nasty s_inodes list?
> > >
> > > Yes, it would, but someone who knows exactly when the fsnotify
> > > marks can be removed needs to chime in here...
> > 
> > Yup. Honza?
> > 
> > (Aside: I don't actually know if you prefer Jan or Honza, so I use
> > both randomly and interchangeably?)
> > 
> > > > I have this feeling that maybe we can just remove the other users too
> > > > using similar models. I think the LSM layer use (in landlock) is bogus
> > > > for exactly the same reason - there's really no reason to keep things
> > > > around for a random cached inode without a dentry.
> > >
> > > Perhaps, but I'm not sure what the landlock code is actually trying
> > > to do.
> 
> In Landlock, inodes (see landlock_object) may be referenced by several
> rulesets, either tied to a task's cred or a ruleset's file descriptor.
> A ruleset may outlive its referenced inodes, and this should not block
> related umounts.  security_sb_delete() is used to gracefully release
> such references.

Ah, there's the problem. The ruleset is persistent, not the inode.
Like fsnotify, the life cycle and reference counting is upside down.
The inode should cache the ruleset rather than the ruleset pinning
the inode.

See my reply to Jan about fsnotify.

> > Yeah, I wouldn't be surprised if it's just confused - it's very odd.
> > 
> > But I'd be perfectly happy just removing one use at a time - even if
> > we keep the s_inodes list around because of other users, it would
> > still be "one less thing".
> > 
> > > Hence, to me, the lifecycle and reference counting of inode related
> > > objects in landlock doesn't seem quite right, and the use of the
> > > security_sb_delete() callout appears to be papering over an internal
> > > lifecycle issue.
> > >
> > > I'd love to get rid of it altogether.
> 
> I'm not sure to fully understand the implications for now, but it would
> definitely be good to simplify this lifetime management.  The only
> requirement for Landlock is that inodes references should live as long
> as the related inodes are accessible by user space or already in use.
> The sooner these references are removed from related ruleset, the
> better.

I'm missing something.  Inodes are accessible to users even when
they are not in cache - we just read them from disk and instantiate
a new VFS inode.

So how do you attach the correct ruleset to a newly instantiated
inode?

i.e. If you can find the ruleset for any given inode that is brought
into cache (e.g. opening an existing, uncached file), then why do
you need to take inode references so they are never evicted?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08 23:44                 ` Dave Chinner
@ 2024-10-09  6:10                   ` Amir Goldstein
  2024-10-09 14:18                   ` Jan Kara
  1 sibling, 0 replies; 72+ messages in thread
From: Amir Goldstein @ 2024-10-09  6:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Linus Torvalds, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module

On Wed, Oct 9, 2024 at 1:44 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Oct 08, 2024 at 01:23:44PM +0200, Jan Kara wrote:
> > On Tue 08-10-24 10:57:22, Amir Goldstein wrote:
> > > On Tue, Oct 8, 2024 at 1:33 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> > > > > On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
> > > > > >
> > > > > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > > > > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > > > > > I'd wait with that convertion until this series lands.
> > > > >
> > > > > Honza, I looked at this a bit more, particularly with an eye of "what
> > > > > happens if we just end up making the inode lifetimes subject to the
> > > > > dentry lifetimes" as suggested by Dave elsewhere.
> > > >
> > > > ....
> > > >
> > > > > which makes the fsnotify_inode_delete() happen when the inode is
> > > > > removed from the dentry.
> > > >
> > > > There may be other inode references being held that make
> > > > the inode live longer than the dentry cache. When should the
> > > > fsnotify marks be removed from the inode in that case? Do they need
> > > > to remain until, e.g, writeback completes?
> > > >
> > >
> > > fsnotify inode marks remain until explicitly removed or until sb
> > > is unmounted (*), so other inode references are irrelevant to
> > > inode mark removal.
> > >
> > > (*) fanotify has "evictable" inode marks, which do not hold inode
> > > reference and go away on inode evict, but those mark evictions
> > > do not generate any event (i.e. there is no FAN_UNMOUNT).
> >
> > Yes. Amir beat me with the response so let me just add that FS_UMOUNT event
> > is for inotify which guarantees that either you get an event about somebody
> > unlinking the inode (e.g. IN_DELETE_SELF) or event about filesystem being
> > unmounted (IN_UMOUNT) if you place mark on some inode. I also don't see how
> > we would maintain this behavior with what Linus proposes.
>
> Thanks. I didn't respond last night when I read Amir's decription
> because I wanted to think it over. Knowing where the unmount event
> requirement certainly helps.
>
> I am probably missing something important, but it really seems to me
> that the object reference counting model is the back to
> front.  Currently the mark is being attached to the inode and then
> the inode pinned by a reference count to make the mark attached
> to the inode persistent until unmount. This then requires the inodes
> to be swept by unmount because fsnotify has effectively leaked them
> as it isn't tracking such inodes itself.
>
> [ Keep in mind that I'm not saying this was a bad or wrong thing to
> do because the s_inodes list was there to be able to do this sort of
> lazy cleanup. But now that we want to remove the s_inodes list if at
> all possible, it is a problem we need to solve differently. ]
>
> AFAICT, inotify does not appear to require the inode to send events
> - it only requires access to the inode mark itself. Hence it does
> not the inode in cache to generate IN_UNMOUNT events, it just
> needs the mark itself to be findable at unmount.  Do any of the
> other backends that require unmount notifications that require
> special access to the inode itself?
>

No other backend supports IN_UNMOUNT/FS_UNMOUNT.
We want to add unmount events support to fanotify, but those are
only going to be possible for watching a mount or an sb, not inodes.

> If not, and the fsnotify sb info is tracking these persistent marks,
> then we don't need to iterate inodes at unmount. This means we don't
> need to pin inodes when they have marks attached, and so the
> dependency on the s_inodes list goes away.
>
> With this inverted model, we need the first fsnotify event callout
> after the inode is instantiated to look for a persistent mark for
> the inode. We know how to do this efficiently - it's exactly the
> same caching model we use for ACLs. On the first lookup, we check
> the inode for ACL data and set the ACL pointer appropriately to
> indicate that a lookup has been done and there are no ACLs
> associated with the inode.
>
> At this point, the fsnotify inode marks can all be removed from the
> inode when it is being evicted and there's no need for fsnotify to
> pin inodes at all.
>
> > > > > Then at umount time, the dentry shrinking will deal with all live
> > > > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > > > just the root dentry inodes?
> > > >
> > > > I don't think even that is necessary, because
> > > > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > > > trimming the dentry tree. Hence the dcache drop would cleanup all
> > > > inode references, roots included.
> > > >
> > > > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > > > use of the nasty s_inodes list?
> > > >
> > > > Yes, it would, but someone who knows exactly when the fsnotify
> > > > marks can be removed needs to chime in here...
> >
> > So fsnotify needs a list of inodes for the superblock which have marks
> > attached and for which we hold inode reference. We can keep it inside
> > fsnotify code although it would practically mean another list_head for the
> > inode for this list (probably in our fsnotify_connector structure which
> > connects list of notification marks to the inode).
>
> I don't think that is necessary. We need to get rid of the inode
> reference, not move where we track inode references. The persistent
> object is the fsnotify mark, not the cached inode. It's the mark
> that needs to be persistent, and that's what the fsnotify code
> should be tracking.
>
> The fsnotify marks are much smaller than inodes, and there going to
> be fewer cached marks than inodes, especially once inode pinning is
> removed. Hence I think this will result in a net reduction in memory
> footprint for "marked-until-unmount" configurations as we won't pin
> nearly as many inodes in cache...
>

It is a feasible design which has all the benefits that you listed.
But it is a big change, just to get away from s_inodes
(much easier to maintain a private list of pinned inodes).

inotify (recursive tree watches for that matter) has been
inefficient that way for a long time, and users now have less
memory hogging solutions like fanotify mount and sb marks.
granted, not unprivileged users, but still.

So there needs to be a good justification to make this design change.
One such justification would be to provide the infrastructure to
the feature that Jan referred to as the "holy grail" in his LPC talk,
namely, subtree watches.

If we introduce code that looks up persistent "mark rules" on
inode instantiation, then we could use it to "reconnect" inotify
persistent inode marks (by ino/fid) or to establish automatic
marks based on subtree/path based rules.

audit code has something that resembles this and I suspect that
this Landlock is doing something similar (?), but I didn't check.
path based rules are always going to be elusive and tricky and
Al is always going to hate them ;)

Bottom line - good idea, not easy, requires allocating development resources.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-09  0:21                 ` Dave Chinner
@ 2024-10-09  9:23                   ` Mickaël Salaün
  0 siblings, 0 replies; 72+ messages in thread
From: Mickaël Salaün @ 2024-10-09  9:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet, Jann Horn,
	Serge Hallyn, Kees Cook, linux-security-module, Amir Goldstein,
	Günther Noack, Christian Brauner

On Wed, Oct 09, 2024 at 11:21:10AM +1100, Dave Chinner wrote:
> On Tue, Oct 08, 2024 at 02:59:07PM +0200, Mickaël Salaün wrote:
> > On Mon, Oct 07, 2024 at 05:28:57PM -0700, Linus Torvalds wrote:
> > > On Mon, 7 Oct 2024 at 16:33, Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > There may be other inode references being held that make
> > > > the inode live longer than the dentry cache. When should the
> > > > fsnotify marks be removed from the inode in that case? Do they need
> > > > to remain until, e.g, writeback completes?
> > > 
> > > Note that my idea is to just remove the fsnotify marks when the dentry
> > > discards the inode.
> > > 
> > > That means that yes, the inode may still have a lifetime after the
> > > dentry (because of other references, _or_ just because I_DONTCACHE
> > > isn't set and we keep caching the inode).
> > > 
> > > BUT - fsnotify won't care. There won't be any fsnotify marks on that
> > > inode any more, and without a dentry that points to it, there's no way
> > > to add such marks.
> > > 
> > > (A new dentry may be re-attached to such an inode, and then fsnotify
> > > could re-add new marks, but that doesn't change anything - the next
> > > time the dentry is detached, the marks would go away again).
> > > 
> > > And yes, this changes the timing on when fsnotify events happen, but
> > > what I'm actually hoping for is that Jan will agree that it doesn't
> > > actually matter semantically.
> > > 
> > > > > Then at umount time, the dentry shrinking will deal with all live
> > > > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > > > just the root dentry inodes?
> > > >
> > > > I don't think even that is necessary, because
> > > > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > > > trimming the dentry tree. Hence the dcache drop would cleanup all
> > > > inode references, roots included.
> > > 
> > > Ahh - even better.
> > > 
> > > I didn't actually look very closely at the actual umount path, I was
> > > looking just at the fsnotify_inoderemove() place in
> > > dentry_unlink_inode() and went "couldn't we do _this_ instead?"
> > > 
> > > > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > > > use of the nasty s_inodes list?
> > > >
> > > > Yes, it would, but someone who knows exactly when the fsnotify
> > > > marks can be removed needs to chime in here...
> > > 
> > > Yup. Honza?
> > > 
> > > (Aside: I don't actually know if you prefer Jan or Honza, so I use
> > > both randomly and interchangeably?)
> > > 
> > > > > I have this feeling that maybe we can just remove the other users too
> > > > > using similar models. I think the LSM layer use (in landlock) is bogus
> > > > > for exactly the same reason - there's really no reason to keep things
> > > > > around for a random cached inode without a dentry.
> > > >
> > > > Perhaps, but I'm not sure what the landlock code is actually trying
> > > > to do.
> > 
> > In Landlock, inodes (see landlock_object) may be referenced by several
> > rulesets, either tied to a task's cred or a ruleset's file descriptor.
> > A ruleset may outlive its referenced inodes, and this should not block
> > related umounts.  security_sb_delete() is used to gracefully release
> > such references.
> 
> Ah, there's the problem. The ruleset is persistent, not the inode.
> Like fsnotify, the life cycle and reference counting is upside down.
> The inode should cache the ruleset rather than the ruleset pinning
> the inode.

A ruleset needs to takes a reference to the inode as for an opened file
and keep it "alive" as long as it may be re-used by user space (i.e. as
long as the superblock exists).  One of the goal of a ruleset is to
identify inodes as long as they are accessible.  When a sandboxed
process request to open a file, its sandbox's ruleset checks against the
referenced inodes (in a nutshell).

In practice, rulesets reference a set of struct landlock_object which
references an inode or not (if it vanished).  There is only one
landlock_object referenced per inode.  This makes it possible to have a
dynamic N:M mapping between rulesets and inodes which enables a ruleset
to be deleted before its referenced inodes, or the other way around.

> 
> See my reply to Jan about fsnotify.
> 
> > > Yeah, I wouldn't be surprised if it's just confused - it's very odd.
> > > 
> > > But I'd be perfectly happy just removing one use at a time - even if
> > > we keep the s_inodes list around because of other users, it would
> > > still be "one less thing".
> > > 
> > > > Hence, to me, the lifecycle and reference counting of inode related
> > > > objects in landlock doesn't seem quite right, and the use of the
> > > > security_sb_delete() callout appears to be papering over an internal
> > > > lifecycle issue.
> > > >
> > > > I'd love to get rid of it altogether.
> > 
> > I'm not sure to fully understand the implications for now, but it would
> > definitely be good to simplify this lifetime management.  The only
> > requirement for Landlock is that inodes references should live as long
> > as the related inodes are accessible by user space or already in use.
> > The sooner these references are removed from related ruleset, the
> > better.
> 
> I'm missing something.  Inodes are accessible to users even when
> they are not in cache - we just read them from disk and instantiate
> a new VFS inode.
> 
> So how do you attach the correct ruleset to a newly instantiated
> inode?

We can see a Landlock ruleset as a set of weakly opened files/inodes.  A
Landolck ruleset call iget() to keep the related VFS inodes alive, which
means that when user space opens a file pointing to the same inode, the
same VFS inode will be re-used and then we can match it against a ruleset.

> 
> i.e. If you can find the ruleset for any given inode that is brought
> into cache (e.g. opening an existing, uncached file), then why do
> you need to take inode references so they are never evicted?

A landlock_object only keep a reference to an inode, not to the rulesets
pointing to it:
* inode -> 1 landlock_object or NULL
* landlock_object -> 1 inode or NULL
* ruleset -> N landlock_object

There are mainly two different operations:
1. Match 1 inode against a set of N inode references (i.e. a ruleset).
2. Drop the references of N rulesets (in practice 1 intermediate
   landlock_object) pointing to 1 inode.

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08  0:54               ` Linus Torvalds
@ 2024-10-09  9:49                 ` Jan Kara
  0 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2024-10-09  9:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Jan Kara, Christoph Hellwig, linux-fsdevel,
	linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module, Amir Goldstein

On Mon 07-10-24 17:54:16, Linus Torvalds wrote:
> On Mon, 7 Oct 2024 at 17:28, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And yes, this changes the timing on when fsnotify events happen, but
> > what I'm actually hoping for is that Jan will agree that it doesn't
> > actually matter semantically.
> 
> .. and yes, I realize it might actually matter. fsnotify does do
> 'ihold()' to hold an inode ref, and with this that would actually be
> more or less pointless, because the mark would be removed _despite_
> such a ref.
> 
> So maybe it's not an option to do what I suggested. I don't know the
> users well enough.

Yeah, we need to keep the notification mark alive either until the inode is
deleted or until the filesystem is unmounted to maintain behavior of
inotify and fanotify APIs.

That being said we could rework lifetime rules inside fsnotify subsystem as
Dave suggests so that fsnotify would not pin inodes, detach it's structures
from inodes on inode reclaim and associate notification marks with inodes
when they are loaded from disk.  But it's a relatively big overhaul.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()
  2024-10-08 23:44                 ` Dave Chinner
  2024-10-09  6:10                   ` Amir Goldstein
@ 2024-10-09 14:18                   ` Jan Kara
  1 sibling, 0 replies; 72+ messages in thread
From: Jan Kara @ 2024-10-09 14:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Amir Goldstein, Linus Torvalds, Christoph Hellwig,
	linux-fsdevel, linux-xfs, linux-bcachefs, kent.overstreet,
	Mickaël Salaün, Jann Horn, Serge Hallyn, Kees Cook,
	linux-security-module

On Wed 09-10-24 10:44:12, Dave Chinner wrote:
> On Tue, Oct 08, 2024 at 01:23:44PM +0200, Jan Kara wrote:
> > On Tue 08-10-24 10:57:22, Amir Goldstein wrote:
> > > On Tue, Oct 8, 2024 at 1:33 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> > > > > On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@suse.cz> wrote:
> > > > > >
> > > > > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > > > > Landlocks hook_sb_delete() into a single iteration relatively easily. But
> > > > > > I'd wait with that convertion until this series lands.
> > > > >
> > > > > Honza, I looked at this a bit more, particularly with an eye of "what
> > > > > happens if we just end up making the inode lifetimes subject to the
> > > > > dentry lifetimes" as suggested by Dave elsewhere.
> > > >
> > > > ....
> > > >
> > > > > which makes the fsnotify_inode_delete() happen when the inode is
> > > > > removed from the dentry.
> > > >
> > > > There may be other inode references being held that make
> > > > the inode live longer than the dentry cache. When should the
> > > > fsnotify marks be removed from the inode in that case? Do they need
> > > > to remain until, e.g, writeback completes?
> > > >
> > > 
> > > fsnotify inode marks remain until explicitly removed or until sb
> > > is unmounted (*), so other inode references are irrelevant to
> > > inode mark removal.
> > > 
> > > (*) fanotify has "evictable" inode marks, which do not hold inode
> > > reference and go away on inode evict, but those mark evictions
> > > do not generate any event (i.e. there is no FAN_UNMOUNT).
> > 
> > Yes. Amir beat me with the response so let me just add that FS_UMOUNT event
> > is for inotify which guarantees that either you get an event about somebody
> > unlinking the inode (e.g. IN_DELETE_SELF) or event about filesystem being
> > unmounted (IN_UMOUNT) if you place mark on some inode. I also don't see how
> > we would maintain this behavior with what Linus proposes.
> 
> Thanks. I didn't respond last night when I read Amir's decription
> because I wanted to think it over. Knowing where the unmount event
> requirement certainly helps.
> 
> I am probably missing something important, but it really seems to me
> that the object reference counting model is the back to
> front.  Currently the mark is being attached to the inode and then
> the inode pinned by a reference count to make the mark attached
> to the inode persistent until unmount. This then requires the inodes
> to be swept by unmount because fsnotify has effectively leaked them
> as it isn't tracking such inodes itself.
> 
> [ Keep in mind that I'm not saying this was a bad or wrong thing to
> do because the s_inodes list was there to be able to do this sort of
> lazy cleanup. But now that we want to remove the s_inodes list if at
> all possible, it is a problem we need to solve differently. ]

Yes, agreed.

> AFAICT, inotify does not appear to require the inode to send events
> - it only requires access to the inode mark itself. Hence it does
> not the inode in cache to generate IN_UNMOUNT events, it just
> needs the mark itself to be findable at unmount.  Do any of the
> other backends that require unmount notifications that require
> special access to the inode itself?

No, I don't think unmount notification requires looking at the inode and it
is inotify-specific thing as Amir wrote. We do require inode access when
generating fanotify events (to open fd where event happened) but that gets
handled separately by creating struct path when event happens and using it
for dentry_open() later when reporting to userspace so that carries its own
set on dentry + mnt references while the event is waiting in the queue.

> If not, and the fsnotify sb info is tracking these persistent marks,
> then we don't need to iterate inodes at unmount. This means we don't
> need to pin inodes when they have marks attached, and so the
> dependency on the s_inodes list goes away.
> 
> With this inverted model, we need the first fsnotify event callout
> after the inode is instantiated to look for a persistent mark for
> the inode. We know how to do this efficiently - it's exactly the
> same caching model we use for ACLs. On the first lookup, we check
> the inode for ACL data and set the ACL pointer appropriately to
> indicate that a lookup has been done and there are no ACLs
> associated with the inode.

Yes, I agree such scheme should be possible although a small snag I see is
that we need to keep in fsnotify mark enough info so that it can be
associated with an inode when it is read from the disk. And this info is
filesystem specific with uncertain size for filesystems which use iget5().
So I suspect we'll need some support from individual filesystems which is
always tedious.

> At this point, the fsnotify inode marks can all be removed from the
> inode when it is being evicted and there's no need for fsnotify to
> pin inodes at all.
> 
> > > > > Then at umount time, the dentry shrinking will deal with all live
> > > > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > > > just the root dentry inodes?
> > > >
> > > > I don't think even that is necessary, because
> > > > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > > > trimming the dentry tree. Hence the dcache drop would cleanup all
> > > > inode references, roots included.
> > > >
> > > > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > > > use of the nasty s_inodes list?
> > > >
> > > > Yes, it would, but someone who knows exactly when the fsnotify
> > > > marks can be removed needs to chime in here...
> > 
> > So fsnotify needs a list of inodes for the superblock which have marks
> > attached and for which we hold inode reference. We can keep it inside
> > fsnotify code although it would practically mean another list_head for the
> > inode for this list (probably in our fsnotify_connector structure which
> > connects list of notification marks to the inode).
> 
> I don't think that is necessary. We need to get rid of the inode
> reference, not move where we track inode references. The persistent
> object is the fsnotify mark, not the cached inode. It's the mark
> that needs to be persistent, and that's what the fsnotify code
> should be tracking.

Right, I was not precise here. We don't need a list of tracked inodes. We
are fine with a list of all marks for inodes on a superblock which we could
crawl on umount.

> The fsnotify marks are much smaller than inodes, and there going to
> be fewer cached marks than inodes, especially once inode pinning is
> removed. Hence I think this will result in a net reduction in memory
> footprint for "marked-until-unmount" configurations as we won't pin
> nearly as many inodes in cache...

I agree. If fsnotify marks stop pinning inodes, we'll probably win much
more memory by keeping inodes reclaimable than we loose by extra overhead
of the mark tracking.

> > > > > And I wonder if the quota code (which uses the s_inodes list
> > > > > to enable quotas on already mounted filesystems) could for
> > > > > all the same reasons just walk the dentry tree instead (and
> > > > > remove_dquot_ref similarly could just remove it at
> > > > > dentry_unlink_inode() time)?
> > > >
> > > > I don't think that will work because we have to be able to
> > > > modify quota in evict() processing. This is especially true
> > > > for unlinked inodes being evicted from cache, but also the
> > > > dquots need to stay attached until writeback completes.
> > > >
> > > > Hence I don't think we can remove the quota refs from the
> > > > inode before we call iput_final(), and so I think quotaoff (at
> > > > least) still needs to iterate inodes...
> > 
> > Yeah, I'm not sure how to get rid of the s_inodes use in quota
> > code. One of the things we need s_inodes list for is during
> > quotaoff on a mounted filesystem when we need to iterate all
> > inodes which are referencing quota structures and free them.  In
> > theory we could keep a list of inodes referencing quota structures
> > but that would require adding list_head to inode structure for
> > filesystems that support quotas.
> 
> I don't think that's quite true. Quota is not modular, so we can
> lazily free quota objects even when quota is turned off. All we need
> to ensure is that code checks whether quota is enabled, not for the
> existence of quota objects attached to the inode.
> 
> Hence quota-off simply turns off all the quota operations in memory,
> and normal inode eviction cleans up the stale quota objects
> naturally.

Ho, hum, possibly yes. I need to think a bit more about this.

> My main question is why the quota-on add_dquot_ref() pass is
> required. AFAICT all of the filesystem operations that will modify
> quota call dquot_initialize() directly to attach the required dquots
> to the inode before the operation is started. If that's true, then
> why does quota-on need to do this for all the inodes that are
> already in cache?

This is again for handling quotaon on already mounted filesystem. We
initialize quotas for the inode when opening a file so if some files are
already open when we do quotaon, we want to attach quota structures to
these inodes. I think this was kind of important to limit mismatch between
real usage and accounted usage when old style quotas were used e.g. for
root filesystem but to be fair this code was there when I became quota
maintainer in 1999 and I never dared to remove it :)

> > Now for the sake of
> > full context I'll also say that enabling / disabling quotas on a mounted
> > filesystem is a legacy feature because it is quite easy that quota
> > accounting goes wrong with it. So ext4 and f2fs support for quite a few
> > years a mode where quota tracking is enabled on mount and disabled on
> > unmount (if appropriate fs feature is enabled) and you can only enable /
> > disable enforcement of quota limits during runtime.
> 
> Sure, this is how XFS works, too. But I think this behaviour is
> largely irrelevant because there are still filesystems out there
> that do stuff the old way...
> 
> > So I could see us
> > deprecating this functionality altogether although jfs never adapted to
> > this new way we do quotas so we'd have to deal with that somehow.  But one
> > way or another it would take a significant amount of time before we can
> > completely remove this so it is out of question for this series.
> 
> I'm not sure that matters, though it adds to the reasons why we
> should be removing old, unmaintained filesystems from the tree
> and old, outdated formats from maintained filesystems....

True.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2024-10-09 14:18 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-02  1:33 [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Dave Chinner
2024-10-02  1:33 ` [PATCH 1/7] vfs: replace invalidate_inodes() with evict_inodes() Dave Chinner
2024-10-03  7:07   ` Christoph Hellwig
2024-10-03  9:20   ` Jan Kara
2024-10-02  1:33 ` [PATCH 2/7] vfs: add inode iteration superblock method Dave Chinner
2024-10-03  7:12   ` Christoph Hellwig
2024-10-03 10:35     ` Dave Chinner
2024-10-04  9:53   ` kernel test robot
2024-10-02  1:33 ` [PATCH 3/7] vfs: convert vfs inode iterators to super_iter_inodes_unsafe() Dave Chinner
2024-10-03  7:14   ` Christoph Hellwig
2024-10-03 10:45     ` Dave Chinner
2024-10-04 10:55   ` kernel test robot
2024-10-02  1:33 ` [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes() Dave Chinner
2024-10-03  7:23   ` lsm sb_delete hook, was " Christoph Hellwig
2024-10-03  7:38     ` Christoph Hellwig
2024-10-03 11:57       ` Jan Kara
2024-10-03 12:11         ` Christoph Hellwig
2024-10-03 12:26           ` Jan Kara
2024-10-03 12:39             ` Christoph Hellwig
2024-10-03 12:56               ` Jan Kara
2024-10-03 13:04                 ` Christoph Hellwig
2024-10-03 13:59                 ` Dave Chinner
2024-10-03 16:17                   ` Jan Kara
2024-10-04  0:46                     ` Dave Chinner
2024-10-04  7:21                       ` Christian Brauner
2024-10-04 12:14                         ` Christoph Hellwig
2024-10-04 13:49                           ` Jan Kara
2024-10-04 18:15                             ` Paul Moore
2024-10-04 22:57                         ` Dave Chinner
2024-10-05 15:21                           ` Mickaël Salaün
2024-10-05 16:03                             ` Mickaël Salaün
2024-10-05 16:03                             ` Paul Moore
2024-10-07 20:37         ` Linus Torvalds
2024-10-07 23:33           ` Dave Chinner
2024-10-08  0:28             ` Linus Torvalds
2024-10-08  0:54               ` Linus Torvalds
2024-10-09  9:49                 ` Jan Kara
2024-10-08 12:59               ` Mickaël Salaün
2024-10-09  0:21                 ` Dave Chinner
2024-10-09  9:23                   ` Mickaël Salaün
2024-10-08  8:57             ` Amir Goldstein
2024-10-08 11:23               ` Jan Kara
2024-10-08 12:16                 ` Christian Brauner
2024-10-09  0:03                   ` Dave Chinner
2024-10-08 23:44                 ` Dave Chinner
2024-10-09  6:10                   ` Amir Goldstein
2024-10-09 14:18                   ` Jan Kara
2024-10-02  1:33 ` [PATCH 5/7] vfs: add inode iteration superblock method Dave Chinner
2024-10-03  7:24   ` Christoph Hellwig
2024-10-02  1:33 ` [PATCH 6/7] xfs: implement sb->iter_vfs_inodes Dave Chinner
2024-10-03  7:30   ` Christoph Hellwig
2024-10-02  1:33 ` [PATCH 7/7] bcachefs: " Dave Chinner
2024-10-02 10:00 ` [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Christian Brauner
2024-10-02 12:34   ` Dave Chinner
2024-10-02 19:29     ` Kent Overstreet
2024-10-02 22:23       ` Dave Chinner
2024-10-02 23:20         ` Kent Overstreet
2024-10-03  1:41           ` Dave Chinner
2024-10-03  2:24             ` Kent Overstreet
2024-10-03  9:17             ` Jan Kara
2024-10-03  9:59               ` Dave Chinner
2024-10-02 19:49     ` Linus Torvalds
2024-10-02 20:28       ` Kent Overstreet
2024-10-02 23:17         ` Dave Chinner
2024-10-03  1:22           ` Kent Overstreet
2024-10-03  2:20             ` Dave Chinner
2024-10-03  2:42               ` Kent Overstreet
2024-10-03 11:45 ` Jan Kara
2024-10-03 12:18   ` Christoph Hellwig
2024-10-03 12:46     ` Jan Kara
2024-10-03 13:35       ` Dave Chinner
2024-10-03 13:03   ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).