Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability

linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Christian Brauner <brauner@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	 linux-bcachefs@vger.kernel.org, kent.overstreet@linux.dev,
	torvalds@linux-foundation.org
Subject: Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability
Date: Wed, 2 Oct 2024 12:00:01 +0200	[thread overview]
Message-ID: <20241002-lethargisch-hypnose-fd06ae7a0977@brauner> (raw)
In-Reply-To: <20241002014017.3801899-1-david@fromorbit.com>

On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> The management of the sb->s_inodes list is a scalability limitation;
> it is protected by a single lock and every inode that is
> instantiated has to be added to the list. Those inodes then need to
> be removed from the list when evicted from cache. Hence every inode
> that moves through the VFS inode cache must take this global scope
> lock twice.
> 
> This proves to be a significant limiting factor for concurrent file
> access workloads that repeatedly miss the dentry cache on lookup.
> Directory search and traversal workloads are particularly prone to
> these issues, though on XFS we have enough concurrency capability
> in file creation and unlink for the sb->s_inodes list to be a
> limitation there as well.
> 
> Previous efforts to solve this problem have
> largely centered around reworking the sb->s_inodes list into
> something more scalable such as this longstanding patchset does:
> 
> https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@fromorbit.com/
> 
> However, a recent discussion about inode cache behaviour that arose
> from the bcachefs 6.12-rc1 pull request opened a new direction for
> us to explore. With both XFS and bcachefs now providing their own
> per-superblock inode cache implementations, we should try to make
> use of these inode caches as first class citizens.
> 
> With that new direction in mind, it became obvious that XFS could
> elide the sb->s_inodes list completely - "the best part is no part"
> - if iteration was not reliant on open-coded sb->s_inodes list
> walks.
> 
> We already use the internal inode cache for iteration, and we have
> filters for selecting specific inodes to operate on with specific
> callback operations. If we had an abstraction for iterating
> all VFS inodes, we can easily implement that directly on the XFS
> inode cache.
> 
> This is what this patchset aims to implement.
> 
> There are two superblock iterator functions provided. The first is a
> generic iterator that provides safe, reference counted inodes for
> the callback to operate on. This is generally what most sb->s_inodes
> iterators use, and it allows the iterator to drop locks and perform
> blocking operations on the inode before moving to the next inode in
> the sb->s_inodes list.
> 
> There is one quirk to this interface - INO_ITER_REFERENCE - because
> fsnotify iterates the inode cache -after- evict_inodes() has been
> called during superblock shutdown to evict all non-referenced
> inodes. Hence it should only find referenced inodes, and it has
> a check to skip unreferenced inodes. This flag does the same.
> 
> However, I suspect this is now somewhat sub-optimal because LSMs can
> hold references to inodes beyond evict_inodes(), and they don't get
> torn down until after fsnotify evicts the referenced inodes it
> holds. However, the landlock LSM doesn't have checks for
> unreferenced inodes (i.e. doesn't use INO_ITER_REFERENCE), so this
> guard is not consistently applied.
> 
> I'm undecided on how best to handle this, but it does not need to be
> solved for this patchset to work. fsnotify and
> landlock don't need to run -after- evict_inodes(), but moving them
> to before evict_inodes() mean we now do three full inode cache
> iterations to evict all the inodes from the cache. That doesn't seem
> like a good idea when there might be hundreds of millions of cached
> inodes at unmount.
> 
> Similarly, needing the iterator to be aware that there should be no
> unreferenced inodes left when they run doesn't seem like a good
> idea, either. So perhaps the answer is that the iterator checks for
> SB_ACTIVE (or some other similar flag) that indicates the superblock
> is being torn down and so will skip zero-referenced inodes
> automatically in this case. Like I said - this doesn't need to be
> solved right now, it's just something to be aware of.
> 
> The second iterator is the "unsafe" iterator variant that only
> provides the callback with an existence guarantee. It does this by
> holding the rcu_read_lock() to guarantee that the inode is not freed
> from under the callback. There are no validity checks performed on
> the inode - it is entirely up to the callback to validate the inode
> can be operated on safely.
> 
> Hence the "unsafe" variant is only for very specific internal uses
> only. Nobody should be adding new uses of this function without
> as there are very few good reasons for external access to inodes
> without holding a valid reference. I have not decided whether the
> unsafe callbacks should require a lockdep_assert_in_rcu_read_lock()
> check in them to clearly document the context under which they are
> running.
> 
> The patchset converts all the open coded iterators to use these
> new iterator functions, which means the only use of sb->s_inodes
> is now confined to fs/super.c (iterator API) and fs/inode.c
> (add/remove API). A new superblock operation is then added to
> call out from the iterators into the filesystem to allow them to run
> the iteration instead of walking the sb->s_inodes list.
> 
> XFS is then converted to use this new superblock operation. I didn't
> use the existing iterator function for this functionality right now
> as it is currently based on radix tree tag lookups. It also uses a
> batched 'lookup and lock' mechanism that complicated matters as I
> developed this code. Hence I open coded a new, simpler cache walk
> for testing purposes.
> 
> Now that I have stuff working and I think I have the callback API
> semantics settled, batched radix tree lookups should still work to
> minimise the iteration overhead. Also, we might want to tag VFS
> inodes in the radix tree so that we can filter them efficiently for
> traversals. This would allow us to use the existing generic inode
> cache walker rather than a separate variant as this patch set
> implements. This can be done as future work, though.
> 
> In terms of scalability improvements, a quick 'will it scale' test
> demonstrates where the sb->s_inodes list hurts. Running a sharded,
> share-nothing cold cache workload with 100,000 files per thread in
> per-thread directories gives the following results on a 4-node 64p
> machine with 128GB RAM.
> 
> The workloads "walk", "chmod" and "unlink" are all directory
> traversal workloads that stream cold cache inodes into the cache.
> There is enough memory on this test machine that these indoes are
> not being reclaimed during the workload, and are being freed between
> steps via drop_caches (which iterates the inode cache and so
> explicitly tests the new iteration APIs!). Hence the sb->s_inodes
> scalability issues aren't as bad in these tests as when memory is
> tight and inodes are being reclaimed (i.e. the issues are worse in
> real workloads).
> 
> The "bulkstat" workload uses the XFS bulkstat ioctl to iterate
> inodes via walking the internal inode btrees. It uses
> d_mark_dontcache() so it is actually tearing down each inode as soon
> as it has been sampled by the bulkstat code. Hence it is doing two
> sb->s_inodes list manipulations per inode and so shows scalability
> issues much earlier than the other workloads.
> 
> Before:
> 
> Filesystem      Files  Threads  Create       Walk      Chmod      Unlink     Bulkstat
>        xfs     400000     4      4.269      3.225      4.557      7.316      1.306
>        xfs     800000     8      4.844      3.227      4.702      7.905      1.908
>        xfs    1600000    16      6.286      3.296      5.592      8.838      4.392
>        xfs    3200000    32      8.912      5.681      8.505     11.724      7.085
>        xfs    6400000    64     15.344     11.144     14.162     18.604     15.494
> 
> Bulkstat starts to show issues at 8 threads, walk and chmod between
> 16 and 32 threads, and unlink is limited by internal XFS stuff.
> Bulkstat is bottlenecked at about 400-450 thousand inodes/s by the
> sb->s_inodes list management.
> 
> After:
> 
> Filesystem      Files  Threads  Create       Walk      Chmod      Unlink     Bulkstat
>        xfs     400000     4      4.140      3.502      4.154      7.242      1.164
>        xfs     800000     8      4.637      2.836      4.444      7.896      1.093
>        xfs    1600000    16      5.549      3.054      5.213      8.696      1.107
>        xfs    3200000    32      8.387      3.218      6.867     10.668      1.125
>        xfs    6400000    64     14.112      3.953     10.365     18.620      1.270
> 
> When patched, walk shows little in way of scalability degradation
> out to 64 threads, chmod is significantly improved at 32-64 threads,
> and bulkstat shows perfect scalability out to 64 threads now.
> 
> I did a couple of other longer running, higher inode count tests
> with bulkstat to get an idea of inode cache streaming rates - 32
> million inodes scanned in 4.4 seconds at 64 threads. That's about
> 7.2 million inodes/s being streamed through the inode cache with the
> IO rates are peaking well above 5.5GB/s (near IO bound).
> 
> Hence raw VFS inode cache throughput sees a ~17x scalability
> improvement on XFS at 64 threads (and probably a -lot- more on
> higher CPU count machines).  That's far better performance than I
> ever got from the dlist conversion of the sb->s_inodes list in
> previous patchsets, so this seems like a much better direction to be
> heading for optimising the way we cache inodes.
> 
> I haven't done a lot of testing on this patchset yet - it boots and
> appears to work OK for block devices, ext4 and XFS, but checking
> stuff like quota on/off is still working properly on ext4 hasn't
> been done yet.
> 
> What do people think of moving towards per-sb inode caching and
> traversal mechanisms like this?

Patches 1-4 are great cleanups that I would like us to merge even
independent of the rest.

I don't have big conceptual issues with the series otherwise. The only
thing that makes me a bit uneasy is that we are now providing an api
that may encourage filesystems to do their own inode caching even if
they don't really have a need for it just because it's there. So really
a way that would've solved this issue generically would have been my
preference.

But the reality is that xfs has been doing that private inode cache for
a long time and reading through 5/7 and 6/7 it clearly provides value
for xfs. So I find it hard to object to adding ->iter_vfs_inodes()
(Though I would like to s/iter_vfs_inodes/iter_inodes/g).

next prev parent reply	other threads:[~2024-10-02 10:00 UTC|newest]

Thread overview: 72+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-02  1:33 [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Dave Chinner
2024-10-02  1:33 ` [PATCH 1/7] vfs: replace invalidate_inodes() with evict_inodes() Dave Chinner
2024-10-03  7:07   ` Christoph Hellwig
2024-10-03  9:20   ` Jan Kara
2024-10-02  1:33 ` [PATCH 2/7] vfs: add inode iteration superblock method Dave Chinner
2024-10-03  7:12   ` Christoph Hellwig
2024-10-03 10:35     ` Dave Chinner
2024-10-04  9:53   ` kernel test robot
2024-10-02  1:33 ` [PATCH 3/7] vfs: convert vfs inode iterators to super_iter_inodes_unsafe() Dave Chinner
2024-10-03  7:14   ` Christoph Hellwig
2024-10-03 10:45     ` Dave Chinner
2024-10-04 10:55   ` kernel test robot
2024-10-02  1:33 ` [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes() Dave Chinner
2024-10-03  7:23   ` lsm sb_delete hook, was " Christoph Hellwig
2024-10-03  7:38     ` Christoph Hellwig
2024-10-03 11:57       ` Jan Kara
2024-10-03 12:11         ` Christoph Hellwig
2024-10-03 12:26           ` Jan Kara
2024-10-03 12:39             ` Christoph Hellwig
2024-10-03 12:56               ` Jan Kara
2024-10-03 13:04                 ` Christoph Hellwig
2024-10-03 13:59                 ` Dave Chinner
2024-10-03 16:17                   ` Jan Kara
2024-10-04  0:46                     ` Dave Chinner
2024-10-04  7:21                       ` Christian Brauner
2024-10-04 12:14                         ` Christoph Hellwig
2024-10-04 13:49                           ` Jan Kara
2024-10-04 18:15                             ` Paul Moore
2024-10-04 22:57                         ` Dave Chinner
2024-10-05 15:21                           ` Mickaël Salaün
2024-10-05 16:03                             ` Mickaël Salaün
2024-10-05 16:03                             ` Paul Moore
2024-10-07 20:37         ` Linus Torvalds
2024-10-07 23:33           ` Dave Chinner
2024-10-08  0:28             ` Linus Torvalds
2024-10-08  0:54               ` Linus Torvalds
2024-10-09  9:49                 ` Jan Kara
2024-10-08 12:59               ` Mickaël Salaün
2024-10-09  0:21                 ` Dave Chinner
2024-10-09  9:23                   ` Mickaël Salaün
2024-10-08  8:57             ` Amir Goldstein
2024-10-08 11:23               ` Jan Kara
2024-10-08 12:16                 ` Christian Brauner
2024-10-09  0:03                   ` Dave Chinner
2024-10-08 23:44                 ` Dave Chinner
2024-10-09  6:10                   ` Amir Goldstein
2024-10-09 14:18                   ` Jan Kara
2024-10-02  1:33 ` [PATCH 5/7] vfs: add inode iteration superblock method Dave Chinner
2024-10-03  7:24   ` Christoph Hellwig
2024-10-02  1:33 ` [PATCH 6/7] xfs: implement sb->iter_vfs_inodes Dave Chinner
2024-10-03  7:30   ` Christoph Hellwig
2024-10-02  1:33 ` [PATCH 7/7] bcachefs: " Dave Chinner
2024-10-02 10:00 ` Christian Brauner [this message]
2024-10-02 12:34   ` [RFC PATCH 0/7] vfs: improving inode cache iteration scalability Dave Chinner
2024-10-02 19:29     ` Kent Overstreet
2024-10-02 22:23       ` Dave Chinner
2024-10-02 23:20         ` Kent Overstreet
2024-10-03  1:41           ` Dave Chinner
2024-10-03  2:24             ` Kent Overstreet
2024-10-03  9:17             ` Jan Kara
2024-10-03  9:59               ` Dave Chinner
2024-10-02 19:49     ` Linus Torvalds
2024-10-02 20:28       ` Kent Overstreet
2024-10-02 23:17         ` Dave Chinner
2024-10-03  1:22           ` Kent Overstreet
2024-10-03  2:20             ` Dave Chinner
2024-10-03  2:42               ` Kent Overstreet
2024-10-03 11:45 ` Jan Kara
2024-10-03 12:18   ` Christoph Hellwig
2024-10-03 12:46     ` Jan Kara
2024-10-03 13:35       ` Dave Chinner
2024-10-03 13:03   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241002-lethargisch-hypnose-fd06ae7a0977@brauner \
    --to=brauner@kernel.org \
    --cc=david@fromorbit.com \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-bcachefs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).