linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 00/54] fs: rework inode reference counting
@ 2025-08-26 15:39 Josef Bacik
  2025-08-26 15:39 ` [PATCH v2 01/54] fs: make the i_state flags an enum Josef Bacik
                   ` (57 more replies)
  0 siblings, 58 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
	brauner, viro, amir73il

v1: https://lore.kernel.org/linux-fsdevel/cover.1755806649.git.josef@toxicpanda.com/

v1->v2:
- Fixed all the things that Christian pointed out.
- Re-ordered some of the patches to the front in case Christian wants to take
  those first.
- Added a new patch for reading the current i_count and propagated that
  everywhere.
- Fixed the cifs build breakage.
- Removed I_REFERENCED since it's no longer needed.
- Remove I_LRU_ISOLATING since it's no longer needed.
- Reworked the drop_nlink/clear_nlink part to simply remove the inode from the
  LRU in the unlink path, and made this its own patch to make the behavior
  change clear.
- NOTE: I'm re-running fstests on this now, there was a slight issue with
  removing the drop_nlink/clear_nlink patch and so I had to add the unlink/rmdir
  patch to resolve it. I assume everything will be fine but just an FYI.
- NOTE #2: I reordered stuff, and I did a rebase and rebuild at every step, but
  I noticed this morning I still missed an odd rebase artifact, so by all means
  validate I didn't make any silly mistakes on the in-between patches.

--- Original email ---

Hello,

This series is the first part of a larger body of work geared towards solving a
variety of scalability issues in the VFS.

We have historically had a variety of foot-guns related to inode freeing.  We
have I_WILL_FREE and I_FREEING flags that indicated when the inode was in the
different stages of being reclaimed.  This lead to confusion, and bugs in cases
where one was checked but the other wasn't.  Additionally, it's frankly
confusing to have both of these flags and to deal with them in practice.

However, this exists because we have an odd behavior with inodes, we allow them
to have a 0 reference count and still be usable. This again is a pretty unfun
footgun, because generally speaking we want reference counts to be meaningful.

The problem with the way we reference inodes is the final iput(). The majority
of file systems do their final truncate of a unlinked inode in their
->evict_inode() callback, which happens when the inode is actually being
evicted. This can be a long process for large inodes, and thus isn't safe to
happen in a variety of contexts. Btrfs, for example, has an entire delayed iput
infrastructure to make sure that we do not do the final iput() in a dangerous
context. We cannot expand the use of this reference count to all the places the
inode is used, because there are cases where we would need to iput() in an IRQ
context  (end folio writeback) or other unsafe context, which is not allowed.

To that end, resolve this by introducing a new i_obj_count reference count. This
will be used to control when we can actually free the inode. We then can use
this reference count in all the places where we may reference the inode. This
removes another huge footgun, having ways to access the inode itself without
having an actual reference to it. The writeback code is one of the main places
where we see this. Inodes end up on all sorts of lists here without a proper
reference count. This allows us to protect the inode from being freed by giving
this an other code mechanisms to protect their access to the inode.

With this we can separate the concept of the inode being usable, and the inode
being freed.  The next part of the patch series is to stop allowing for inodes
to have an i_count of 0 and still be viable.  This comes with some warts. The
biggest wart is now if we choose to cache inodes in the LRU list we have to
remove the inode from the LRU list if we access it once it's on the LRU list.
This will result in more contention on the lru list lock, but in practice we
rarely have inodes that do not have a dentry, and if we do that inode is not
long for this world.

With not allowing inodes to hit a refcount of 0, we can take advantage of that
common pattern of using refcount_inc_not_zero() in all of the lockless places
where we do inode lookup in cache.  From there we can change all the users who
check I_WILL_FREE or I_FREEING to simply check the i_count. If it is 0 then they
aren't allowed to do their work, othrwise they can proceed as normal.

With all of that in place we can finally remove these two flags.

This is a large series, but it is mostly mechanical. I've kept the patches very
small, to make it easy to review and logic about each change. I have run this
through fstests for btrfs and ext4, xfs is currently going. I wanted to get this
out for review to make sure this big design changes are reasonable to everybody.

The series is based on vfs/vfs.all branch, which is based on 6.9-rc1. Thanks,

Josef

Josef Bacik (54):
  fs: make the i_state flags an enum
  fs: add an icount_read helper
  fs: rework iput logic
  fs: add an i_obj_count refcount to the inode
  fs: hold an i_obj_count reference in wait_sb_inodes
  fs: hold an i_obj_count reference for the i_wb_list
  fs: hold an i_obj_count reference for the i_io_list
  fs: hold an i_obj_count reference in writeback_sb_inodes
  fs: hold an i_obj_count reference while on the hashtable
  fs: hold an i_obj_count reference while on the LRU list
  fs: hold an i_obj_count reference while on the sb inode list
  fs: stop accessing ->i_count directly in f2fs and gfs2
  fs: hold an i_obj_count when we have an i_count reference
  fs: add an I_LRU flag to the inode
  fs: maintain a list of pinned inodes
  fs: delete the inode from the LRU list on lookup
  fs: remove the inode from the LRU list on unlink/rmdir
  fs: change evict_inodes to use iput instead of evict directly
  fs: hold a full ref while the inode is on a LRU
  fs: disallow 0 reference count inodes
  fs: make evict_inodes add to the dispose list under the i_lock
  fs: convert i_count to refcount_t
  fs: use refcount_inc_not_zero in igrab
  fs: use inode_tryget in find_inode*
  fs: update find_inode_*rcu to check the i_count count
  fs: use igrab in insert_inode_locked
  fs: remove I_WILL_FREE|I_FREEING check from __inode_add_lru
  fs: remove I_WILL_FREE|I_FREEING check in inode_pin_lru_isolating
  fs: use inode_tryget in evict_inodes
  fs: change evict_dentries_for_decrypted_inodes to use refcount
  block: use igrab in sync_bdevs
  bcachefs: use the refcount instead of I_WILL_FREE|I_FREEING
  btrfs: don't check I_WILL_FREE|I_FREEING
  fs: use igrab in drop_pagecache_sb
  fs: stop checking I_FREEING in d_find_alias_rcu
  ext4: stop checking I_WILL_FREE|IFREEING in ext4_check_map_extents_env
  fs: remove I_WILL_FREE|I_FREEING from fs-writeback.c
  gfs2: remove I_WILL_FREE|I_FREEING usage
  fs: remove I_WILL_FREE|I_FREEING check from dquot.c
  notify: remove I_WILL_FREE|I_FREEING checks in fsnotify_unmount_inodes
  xfs: remove I_FREEING check
  landlock: remove I_FREEING|I_WILL_FREE check
  fs: change inode_is_dirtytime_only to use refcount
  btrfs: remove references to I_FREEING
  ext4: remove reference to I_FREEING in inode.c
  ext4: remove reference to I_FREEING in orphan.c
  pnfs: use i_count refcount to determine if the inode is going away
  fs: remove some spurious I_FREEING references in inode.c
  xfs: remove reference to I_FREEING|I_WILL_FREE
  ocfs2: do not set I_WILL_FREE
  fs: remove I_FREEING|I_WILL_FREE
  fs: remove I_REFERENCED
  fs: remove I_LRU_ISOLATING flag
  fs: add documentation explaining the reference count rules for inodes

 Documentation/filesystems/vfs.rst        |  86 +++++
 arch/powerpc/platforms/cell/spufs/file.c |   2 +-
 block/bdev.c                             |   8 +-
 fs/bcachefs/fs.c                         |   3 +-
 fs/btrfs/inode.c                         |  11 +-
 fs/ceph/mds_client.c                     |   2 +-
 fs/crypto/keyring.c                      |   7 +-
 fs/dcache.c                              |   4 +-
 fs/drop_caches.c                         |  11 +-
 fs/ext4/ialloc.c                         |   4 +-
 fs/ext4/inode.c                          |   8 +-
 fs/ext4/orphan.c                         |   6 +-
 fs/f2fs/super.c                          |   4 +-
 fs/fs-writeback.c                        | 105 ++++--
 fs/gfs2/ops_fstype.c                     |  17 +-
 fs/hpfs/inode.c                          |   2 +-
 fs/inode.c                               | 422 ++++++++++++++---------
 fs/internal.h                            |   1 +
 fs/namei.c                               |  30 +-
 fs/nfs/inode.c                           |   4 +-
 fs/nfs/pnfs.c                            |   2 +-
 fs/notify/fsnotify.c                     |  26 +-
 fs/ocfs2/inode.c                         |   4 -
 fs/quota/dquot.c                         |   6 +-
 fs/smb/client/inode.c                    |   2 +-
 fs/super.c                               |   3 +
 fs/ubifs/super.c                         |   2 +-
 fs/xfs/scrub/common.c                    |   3 +-
 fs/xfs/xfs_bmap_util.c                   |   2 +-
 fs/xfs/xfs_inode.c                       |   2 +-
 fs/xfs/xfs_trace.h                       |   2 +-
 include/linux/fs.h                       | 285 ++++++++-------
 include/trace/events/filelock.h          |   2 +-
 include/trace/events/writeback.h         |   6 +-
 security/landlock/fs.c                   |  22 +-
 35 files changed, 684 insertions(+), 422 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 105+ messages in thread

end of thread, other threads:[~2025-09-02 21:16 UTC | newest]

Thread overview: 105+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
2025-08-26 15:39 ` [PATCH v2 01/54] fs: make the i_state flags an enum Josef Bacik
2025-08-26 15:39 ` [PATCH v2 02/54] fs: add an icount_read helper Josef Bacik
2025-08-26 22:18   ` Mateusz Guzik
2025-08-27 11:25   ` (subset) " Christian Brauner
2025-08-26 15:39 ` [PATCH v2 03/54] fs: rework iput logic Josef Bacik
2025-08-27 12:58   ` Mateusz Guzik
2025-08-27 14:18     ` Mateusz Guzik
2025-08-27 14:54       ` Josef Bacik
2025-08-27 14:57       ` Christian Brauner
2025-08-27 16:24         ` [PATCH] fs: revamp iput() Mateusz Guzik
2025-08-30 15:54           ` Mateusz Guzik
2025-09-01  8:50             ` Jan Kara
2025-09-01 10:39               ` Christian Brauner
2025-09-01 10:41             ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 04/54] fs: add an i_obj_count refcount to the inode Josef Bacik
2025-08-26 15:39 ` [PATCH v2 05/54] fs: hold an i_obj_count reference in wait_sb_inodes Josef Bacik
2025-08-26 15:39 ` [PATCH v2 06/54] fs: hold an i_obj_count reference for the i_wb_list Josef Bacik
2025-08-26 15:39 ` [PATCH v2 07/54] fs: hold an i_obj_count reference for the i_io_list Josef Bacik
2025-08-26 15:39 ` [PATCH v2 08/54] fs: hold an i_obj_count reference in writeback_sb_inodes Josef Bacik
2025-08-26 15:39 ` [PATCH v2 09/54] fs: hold an i_obj_count reference while on the hashtable Josef Bacik
2025-08-26 15:39 ` [PATCH v2 10/54] fs: hold an i_obj_count reference while on the LRU list Josef Bacik
2025-08-26 15:39 ` [PATCH v2 11/54] fs: hold an i_obj_count reference while on the sb inode list Josef Bacik
2025-08-26 15:39 ` [PATCH v2 12/54] fs: stop accessing ->i_count directly in f2fs and gfs2 Josef Bacik
2025-08-26 15:39 ` [PATCH v2 13/54] fs: hold an i_obj_count when we have an i_count reference Josef Bacik
2025-08-26 15:39 ` [PATCH v2 14/54] fs: add an I_LRU flag to the inode Josef Bacik
2025-08-26 15:39 ` [PATCH v2 15/54] fs: maintain a list of pinned inodes Josef Bacik
2025-08-27 15:20   ` Christian Brauner
2025-08-27 16:07     ` Josef Bacik
2025-08-28  8:24       ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 16/54] fs: delete the inode from the LRU list on lookup Josef Bacik
2025-08-27 21:46   ` Dave Chinner
2025-08-28 11:42     ` Josef Bacik
2025-09-02  4:07       ` Dave Chinner
2025-08-26 15:39 ` [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir Josef Bacik
2025-08-27 12:32   ` Christian Brauner
2025-08-27 16:08     ` Josef Bacik
2025-08-27 22:01     ` Dave Chinner
2025-08-28 11:46       ` Josef Bacik
2025-09-02  1:48         ` Dave Chinner
2025-08-28  9:00   ` Christian Brauner
2025-08-28  9:06   ` Christian Brauner
2025-08-28 10:43     ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 18/54] fs: change evict_inodes to use iput instead of evict directly Josef Bacik
2025-08-28 10:18   ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 19/54] fs: hold a full ref while the inode is on a LRU Josef Bacik
2025-08-28 10:51   ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 20/54] fs: disallow 0 reference count inodes Josef Bacik
2025-08-28 11:02   ` Christian Brauner
2025-08-28 11:44     ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 21/54] fs: make evict_inodes add to the dispose list under the i_lock Josef Bacik
2025-08-26 15:39 ` [PATCH v2 22/54] fs: convert i_count to refcount_t Josef Bacik
2025-08-28 12:00   ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 23/54] fs: use refcount_inc_not_zero in igrab Josef Bacik
2025-08-28 22:08   ` Eric Biggers
2025-08-29 13:42     ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 24/54] fs: use inode_tryget in find_inode* Josef Bacik
2025-08-26 15:39 ` [PATCH v2 25/54] fs: update find_inode_*rcu to check the i_count count Josef Bacik
2025-08-26 15:39 ` [PATCH v2 26/54] fs: use igrab in insert_inode_locked Josef Bacik
2025-08-28 12:15   ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 27/54] fs: remove I_WILL_FREE|I_FREEING check from __inode_add_lru Josef Bacik
2025-08-26 15:39 ` [PATCH v2 28/54] fs: remove I_WILL_FREE|I_FREEING check in inode_pin_lru_isolating Josef Bacik
2025-08-26 15:39 ` [PATCH v2 29/54] fs: use inode_tryget in evict_inodes Josef Bacik
2025-08-26 15:39 ` [PATCH v2 30/54] fs: change evict_dentries_for_decrypted_inodes to use refcount Josef Bacik
2025-08-28 12:25   ` Christian Brauner
2025-08-28 22:26     ` Eric Biggers
2025-08-29  7:38       ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 31/54] block: use igrab in sync_bdevs Josef Bacik
2025-08-26 15:39 ` [PATCH v2 32/54] bcachefs: use the refcount instead of I_WILL_FREE|I_FREEING Josef Bacik
2025-08-26 15:39 ` [PATCH v2 33/54] btrfs: don't check I_WILL_FREE|I_FREEING Josef Bacik
2025-08-26 15:39 ` [PATCH v2 34/54] fs: use igrab in drop_pagecache_sb Josef Bacik
2025-08-26 15:39 ` [PATCH v2 35/54] fs: stop checking I_FREEING in d_find_alias_rcu Josef Bacik
2025-08-26 15:39 ` [PATCH v2 36/54] ext4: stop checking I_WILL_FREE|IFREEING in ext4_check_map_extents_env Josef Bacik
2025-08-26 15:39 ` [PATCH v2 37/54] fs: remove I_WILL_FREE|I_FREEING from fs-writeback.c Josef Bacik
2025-08-26 15:39 ` [PATCH v2 38/54] gfs2: remove I_WILL_FREE|I_FREEING usage Josef Bacik
2025-08-26 15:39 ` [PATCH v2 39/54] fs: remove I_WILL_FREE|I_FREEING check from dquot.c Josef Bacik
2025-08-28 12:35   ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 40/54] notify: remove I_WILL_FREE|I_FREEING checks in fsnotify_unmount_inodes Josef Bacik
2025-08-26 15:39 ` [PATCH v2 41/54] xfs: remove I_FREEING check Josef Bacik
2025-08-26 15:39 ` [PATCH v2 42/54] landlock: remove I_FREEING|I_WILL_FREE check Josef Bacik
2025-08-26 15:39 ` [PATCH v2 43/54] fs: change inode_is_dirtytime_only to use refcount Josef Bacik
2025-08-26 22:06   ` Mateusz Guzik
2025-08-28 12:38     ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 44/54] btrfs: remove references to I_FREEING Josef Bacik
2025-08-26 15:39 ` [PATCH v2 45/54] ext4: remove reference to I_FREEING in inode.c Josef Bacik
2025-08-26 15:39 ` [PATCH v2 46/54] ext4: remove reference to I_FREEING in orphan.c Josef Bacik
2025-08-26 15:39 ` [PATCH v2 47/54] pnfs: use i_count refcount to determine if the inode is going away Josef Bacik
2025-08-26 15:39 ` [PATCH v2 48/54] fs: remove some spurious I_FREEING references in inode.c Josef Bacik
2025-08-28 12:40   ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 49/54] xfs: remove reference to I_FREEING|I_WILL_FREE Josef Bacik
2025-08-26 15:39 ` [PATCH v2 50/54] ocfs2: do not set I_WILL_FREE Josef Bacik
2025-08-26 15:39 ` [PATCH v2 51/54] fs: remove I_FREEING|I_WILL_FREE Josef Bacik
2025-08-28 12:42   ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 52/54] fs: remove I_REFERENCED Josef Bacik
2025-08-28 12:47   ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 53/54] fs: remove I_LRU_ISOLATING flag Josef Bacik
2025-08-28  0:26   ` Dave Chinner
2025-08-28 10:35     ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 54/54] fs: add documentation explaining the reference count rules for inodes Josef Bacik
2025-08-27  8:03 ` [syzbot ci] Re: fs: rework inode reference counting syzbot ci
2025-08-27 11:14 ` (subset) [PATCH v2 00/54] " Christian Brauner
2025-08-28 12:51 ` Christian Brauner
2025-08-28 21:22   ` Josef Bacik
2025-09-02 10:06 ` Mateusz Guzik
2025-09-02 21:16   ` Josef Bacik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).