* [PATCH v2 00/54] fs: rework inode reference counting
@ 2025-08-26 15:39 Josef Bacik
2025-08-26 15:39 ` [PATCH v2 01/54] fs: make the i_state flags an enum Josef Bacik
` (57 more replies)
0 siblings, 58 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
v1: https://lore.kernel.org/linux-fsdevel/cover.1755806649.git.josef@toxicpanda.com/
v1->v2:
- Fixed all the things that Christian pointed out.
- Re-ordered some of the patches to the front in case Christian wants to take
those first.
- Added a new patch for reading the current i_count and propagated that
everywhere.
- Fixed the cifs build breakage.
- Removed I_REFERENCED since it's no longer needed.
- Removed I_LRU_ISOLATING since it's no longer needed.
- Reworked the drop_nlink/clear_nlink part to simply remove the inode from the
LRU in the unlink path, and made this its own patch to make the behavior
change clear.
- NOTE: I'm re-running fstests on this now, there was a slight issue with
removing the drop_nlink/clear_nlink patch and so I had to add the unlink/rmdir
patch to resolve it. I assume everything will be fine but just an FYI.
- NOTE #2: I reordered stuff, and I did a rebase and rebuild at every step, but
I noticed this morning I still missed an odd rebase artifact, so by all means
validate I didn't make any silly mistakes on the in-between patches.
--- Original email ---
Hello,
This series is the first part of a larger body of work geared towards solving a
variety of scalability issues in the VFS.
We have historically had a variety of footguns related to inode freeing. We
have I_WILL_FREE and I_FREEING flags that indicate when the inode is in the
different stages of being reclaimed. This led to confusion, and to bugs in cases
where one was checked but the other wasn't. Additionally, it's frankly
confusing to have both of these flags and to deal with them in practice.
However, this exists because we have an odd behavior with inodes: we allow them
to have a 0 reference count and still be usable. This again is a pretty unfun
footgun, because generally speaking we want reference counts to be meaningful.
The problem with the way we reference inodes is the final iput(). The majority
of file systems do their final truncate of an unlinked inode in their
->evict_inode() callback, which happens when the inode is actually being
evicted. This can be a long process for large inodes, and thus isn't safe to
happen in a variety of contexts. Btrfs, for example, has an entire delayed iput
infrastructure to make sure that we do not do the final iput() in a dangerous
context. We cannot expand the use of this reference count to all the places the
inode is used, because there are cases where we would need to iput() in an IRQ
context (end folio writeback) or other unsafe context, which is not allowed.
To that end, resolve this by introducing a new i_obj_count reference count. This
will be used to control when we can actually free the inode. We then can use
this reference count in all the places where we may reference the inode. This
removes another huge footgun, having ways to access the inode itself without
having an actual reference to it. The writeback code is one of the main places
where we see this. Inodes end up on all sorts of lists here without a proper
reference count. The new count lets us protect the inode from being freed by
giving this and other code a mechanism to protect its access to the inode.
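
To make the split between "usable" and "still allocated" concrete, here is a
minimal userspace sketch. It is not the kernel code: struct toy_inode, its
fields and the toy_* helpers are hypothetical names, and it assumes that the
set of usage references collectively holds exactly one object reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for an inode; not the kernel struct. */
struct toy_inode {
	atomic_int use_count;	/* models i_count: inode is usable */
	atomic_int obj_count;	/* models i_obj_count: memory must stay valid */
	bool evicted;		/* the heavy teardown (truncate etc.) has run */
	bool freed;		/* the memory would have been released */
};

static void toy_init(struct toy_inode *ino)
{
	atomic_init(&ino->use_count, 1);
	/* Assumption: the usage references share one object reference. */
	atomic_init(&ino->obj_count, 1);
	ino->evicted = false;
	ino->freed = false;
}

/* Cheap pin for lists and other internal users; never triggers eviction. */
static void toy_obj_get(struct toy_inode *ino)
{
	atomic_fetch_add(&ino->obj_count, 1);
}

/* Dropping the last object reference only releases memory. */
static void toy_obj_put(struct toy_inode *ino)
{
	int old = atomic_fetch_sub(&ino->obj_count, 1);

	assert(old > 0);	/* catch underflow */
	if (old == 1)
		ino->freed = true;
}

/* Dropping the last usage reference runs eviction, then drops the object ref. */
static void toy_iput(struct toy_inode *ino)
{
	int old = atomic_fetch_sub(&ino->use_count, 1);

	assert(old > 0);	/* catch underflow */
	if (old == 1) {
		ino->evicted = true;
		toy_obj_put(ino);
	}
}
```

In this model a list holder can call toy_obj_put() from a context where
eviction would be unsafe, because that call can at most release memory and
never starts the heavy teardown.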
With this we can separate the concept of the inode being usable, and the inode
being freed. The next part of the patch series is to stop allowing for inodes
to have an i_count of 0 and still be viable. This comes with some warts. The
biggest wart is that if we choose to cache inodes on the LRU list, we now have
to remove an inode from the LRU list whenever we access it while it is there.
This will result in more contention on the lru list lock, but in practice we
rarely have inodes that do not have a dentry, and if we do that inode is not
long for this world.
Once usable inodes can no longer have a refcount of 0, we can take advantage of
the common pattern of using refcount_inc_not_zero() in all of the lockless
places where we do inode lookup in cache. From there we can change all the users
who check I_WILL_FREE or I_FREEING to simply check the i_count. If it is 0 then
they aren't allowed to do their work, otherwise they can proceed as normal.
With all of that in place we can finally remove these two flags.
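
As an illustration of the inc-not-zero pattern mentioned above, here is a small
userspace model built on C11 atomics. toy_inc_not_zero() and toy_put() are
hypothetical names; the kernel's refcount_inc_not_zero() additionally does
saturation and overflow checking that this sketch leaves out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the count is still non-zero. A count of 0 means
 * the object is on its way out and must not be resurrected.
 */
static bool toy_inc_not_zero(atomic_int *count)
{
	int old = atomic_load(count);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the final one. */
static bool toy_put(atomic_int *count)
{
	int old = atomic_fetch_sub(count, 1);

	assert(old > 0);	/* catch underflow */
	return old == 1;
}
```

A lockless lookup in this model would call toy_inc_not_zero() and skip the
object on failure, which is the role the I_WILL_FREE/I_FREEING checks play
today.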
This is a large series, but it is mostly mechanical. I've kept the patches very
small, to make it easy to review and reason about each change. I have run this
through fstests for btrfs and ext4; xfs is currently running. I wanted to get
this out for review to make sure these big design changes are reasonable to
everybody.
The series is based on vfs/vfs.all branch, which is based on 6.9-rc1. Thanks,
Josef
Josef Bacik (54):
fs: make the i_state flags an enum
fs: add an icount_read helper
fs: rework iput logic
fs: add an i_obj_count refcount to the inode
fs: hold an i_obj_count reference in wait_sb_inodes
fs: hold an i_obj_count reference for the i_wb_list
fs: hold an i_obj_count reference for the i_io_list
fs: hold an i_obj_count reference in writeback_sb_inodes
fs: hold an i_obj_count reference while on the hashtable
fs: hold an i_obj_count reference while on the LRU list
fs: hold an i_obj_count reference while on the sb inode list
fs: stop accessing ->i_count directly in f2fs and gfs2
fs: hold an i_obj_count when we have an i_count reference
fs: add an I_LRU flag to the inode
fs: maintain a list of pinned inodes
fs: delete the inode from the LRU list on lookup
fs: remove the inode from the LRU list on unlink/rmdir
fs: change evict_inodes to use iput instead of evict directly
fs: hold a full ref while the inode is on a LRU
fs: disallow 0 reference count inodes
fs: make evict_inodes add to the dispose list under the i_lock
fs: convert i_count to refcount_t
fs: use refcount_inc_not_zero in igrab
fs: use inode_tryget in find_inode*
fs: update find_inode_*rcu to check the i_count count
fs: use igrab in insert_inode_locked
fs: remove I_WILL_FREE|I_FREEING check from __inode_add_lru
fs: remove I_WILL_FREE|I_FREEING check in inode_pin_lru_isolating
fs: use inode_tryget in evict_inodes
fs: change evict_dentries_for_decrypted_inodes to use refcount
block: use igrab in sync_bdevs
bcachefs: use the refcount instead of I_WILL_FREE|I_FREEING
btrfs: don't check I_WILL_FREE|I_FREEING
fs: use igrab in drop_pagecache_sb
fs: stop checking I_FREEING in d_find_alias_rcu
ext4: stop checking I_WILL_FREE|IFREEING in ext4_check_map_extents_env
fs: remove I_WILL_FREE|I_FREEING from fs-writeback.c
gfs2: remove I_WILL_FREE|I_FREEING usage
fs: remove I_WILL_FREE|I_FREEING check from dquot.c
notify: remove I_WILL_FREE|I_FREEING checks in fsnotify_unmount_inodes
xfs: remove I_FREEING check
landlock: remove I_FREEING|I_WILL_FREE check
fs: change inode_is_dirtytime_only to use refcount
btrfs: remove references to I_FREEING
ext4: remove reference to I_FREEING in inode.c
ext4: remove reference to I_FREEING in orphan.c
pnfs: use i_count refcount to determine if the inode is going away
fs: remove some spurious I_FREEING references in inode.c
xfs: remove reference to I_FREEING|I_WILL_FREE
ocfs2: do not set I_WILL_FREE
fs: remove I_FREEING|I_WILL_FREE
fs: remove I_REFERENCED
fs: remove I_LRU_ISOLATING flag
fs: add documentation explaining the reference count rules for inodes
Documentation/filesystems/vfs.rst | 86 +++++
arch/powerpc/platforms/cell/spufs/file.c | 2 +-
block/bdev.c | 8 +-
fs/bcachefs/fs.c | 3 +-
fs/btrfs/inode.c | 11 +-
fs/ceph/mds_client.c | 2 +-
fs/crypto/keyring.c | 7 +-
fs/dcache.c | 4 +-
fs/drop_caches.c | 11 +-
fs/ext4/ialloc.c | 4 +-
fs/ext4/inode.c | 8 +-
fs/ext4/orphan.c | 6 +-
fs/f2fs/super.c | 4 +-
fs/fs-writeback.c | 105 ++++--
fs/gfs2/ops_fstype.c | 17 +-
fs/hpfs/inode.c | 2 +-
fs/inode.c | 422 ++++++++++++++---------
fs/internal.h | 1 +
fs/namei.c | 30 +-
fs/nfs/inode.c | 4 +-
fs/nfs/pnfs.c | 2 +-
fs/notify/fsnotify.c | 26 +-
fs/ocfs2/inode.c | 4 -
fs/quota/dquot.c | 6 +-
fs/smb/client/inode.c | 2 +-
fs/super.c | 3 +
fs/ubifs/super.c | 2 +-
fs/xfs/scrub/common.c | 3 +-
fs/xfs/xfs_bmap_util.c | 2 +-
fs/xfs/xfs_inode.c | 2 +-
fs/xfs/xfs_trace.h | 2 +-
include/linux/fs.h | 285 ++++++++-------
include/trace/events/filelock.h | 2 +-
include/trace/events/writeback.h | 6 +-
security/landlock/fs.c | 22 +-
35 files changed, 684 insertions(+), 422 deletions(-)
--
2.49.0
* [PATCH v2 01/54] fs: make the i_state flags an enum
From: Josef Bacik @ 2025-08-26 15:39 UTC
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Adjusting i_state flags always means updating the values manually. Bring
these forward into the 2020s and make a nice clean macro for defining
the i_state values as an enum, providing __ variants for the cases where
we need the bit position instead of the actual value, and leaving the
actual NAME as the 1U << bit value.
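
A reduced sketch of the layout described above, for readers who want to see
the bit-position/flag-value pairing in isolation. The TOY_* names and
toy_test_bit() are hypothetical; the kernel uses a leading __ for the
bit-position names, which is avoided here because that prefix is reserved in
userspace C.

```c
#include <assert.h>

/* Bit positions, for callers that need an index (e.g. wait-bit helpers). */
enum toy_state_bits {
	TOY_BIT_NEW	= 0U,
	TOY_BIT_SYNC	= 1U
};

/* Flag values; the plain NAME is always 1U << bit. */
enum toy_state_flags {
	TOY_NEW		= (1U << TOY_BIT_NEW),
	TOY_SYNC	= (1U << TOY_BIT_SYNC),
	TOY_DIRTY	= (1U << 2)
};

/* Test a flag by bit position rather than by mask. */
static inline int toy_test_bit(unsigned int state, unsigned int bit)
{
	assert(bit < 32);
	return (state >> bit) & 1U;
}
```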
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
include/linux/fs.h | 229 +++++++++++++++++++++++----------------------
1 file changed, 117 insertions(+), 112 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a346422f5066..3dbaf1ca1828 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -665,6 +665,122 @@ is_uncached_acl(struct posix_acl *acl)
#define IOP_MGTIME 0x0020
#define IOP_CACHED_LINK 0x0040
+/*
+ * Inode state bits. Protected by inode->i_lock
+ *
+ * Four bits determine the dirty state of the inode: I_DIRTY_SYNC,
+ * I_DIRTY_DATASYNC, I_DIRTY_PAGES, and I_DIRTY_TIME.
+ *
+ * Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
+ * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
+ * various stages of removing an inode.
+ *
+ * Two bits are used for locking and completion notification, I_NEW and I_SYNC.
+ *
+ * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
+ * fdatasync() (unless I_DIRTY_DATASYNC is also set).
+ * Timestamp updates are the usual cause.
+ * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of
+ * these changes separately from I_DIRTY_SYNC so that we
+ * don't have to write inode on fdatasync() when only
+ * e.g. the timestamps have changed.
+ * I_DIRTY_PAGES Inode has dirty pages. Inode itself may be clean.
+ * I_DIRTY_TIME The inode itself has dirty timestamps, and the
+ * lazytime mount option is enabled. We keep track of this
+ * separately from I_DIRTY_SYNC in order to implement
+ * lazytime. This gets cleared if I_DIRTY_INODE
+ * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But
+ * I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already
+ * in place because writeback might already be in progress
+ * and we don't want to lose the time update
+ * I_NEW Serves as both a mutex and completion notification.
+ * New inodes set I_NEW. If two processes both create
+ * the same inode, one of them will release its inode and
+ * wait for I_NEW to be released before returning.
+ * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can
+ * also cause waiting on I_NEW, without I_NEW actually
+ * being set. find_inode() uses this to prevent returning
+ * nearly-dead inodes.
+ * I_WILL_FREE Must be set when calling write_inode_now() if i_count
+ * is zero. I_FREEING must be set when I_WILL_FREE is
+ * cleared.
+ * I_FREEING Set when inode is about to be freed but still has dirty
+ * pages or buffers attached or the inode itself is still
+ * dirty.
+ * I_CLEAR Added by clear_inode(). In this state the inode is
+ * clean and can be destroyed. Inode keeps I_FREEING.
+ *
+ * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are
+ * prohibited for many purposes. iget() must wait for
+ * the inode to be completely released, then create it
+ * anew. Other functions will just ignore such inodes,
+ * if appropriate. I_NEW is used for waiting.
+ *
+ * I_SYNC Writeback of inode is running. The bit is set during
+ * data writeback, and cleared with a wakeup on the bit
+ * address once it is done. The bit is also used to pin
+ * the inode in memory for flusher thread.
+ *
+ * I_REFERENCED Marks the inode as recently references on the LRU list.
+ *
+ * I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to
+ * synchronize competing switching instances and to tell
+ * wb stat updates to grab the i_pages lock. See
+ * inode_switch_wbs_work_fn() for details.
+ *
+ * I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper
+ * and work dirs among overlayfs mounts.
+ *
+ * I_CREATING New object's inode in the middle of setting up.
+ *
+ * I_DONTCACHE Evict inode as soon as it is not used anymore.
+ *
+ * I_SYNC_QUEUED Inode is queued in b_io or b_more_io writeback lists.
+ * Used to detect that mark_inode_dirty() should not move
+ * inode between dirty lists.
+ *
+ * I_PINNING_FSCACHE_WB Inode is pinning an fscache object for writeback.
+ *
+ * I_LRU_ISOLATING Inode is pinned being isolated from LRU without holding
+ * i_count.
+ *
+ * Q: What is the difference between I_WILL_FREE and I_FREEING?
+ *
+ * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
+ * upon. There's one free address left.
+ */
+
+enum inode_state_bits {
+ __I_NEW = 0U,
+ __I_SYNC = 1U,
+ __I_LRU_ISOLATING = 2U
+};
+
+enum inode_state_flags_t {
+ I_NEW = (1U << __I_NEW),
+ I_SYNC = (1U << __I_SYNC),
+ I_LRU_ISOLATING = (1U << __I_LRU_ISOLATING),
+ I_DIRTY_SYNC = (1U << 3),
+ I_DIRTY_DATASYNC = (1U << 4),
+ I_DIRTY_PAGES = (1U << 5),
+ I_WILL_FREE = (1U << 6),
+ I_FREEING = (1U << 7),
+ I_CLEAR = (1U << 8),
+ I_REFERENCED = (1U << 9),
+ I_LINKABLE = (1U << 10),
+ I_DIRTY_TIME = (1U << 11),
+ I_WB_SWITCH = (1U << 12),
+ I_OVL_INUSE = (1U << 13),
+ I_CREATING = (1U << 14),
+ I_DONTCACHE = (1U << 15),
+ I_SYNC_QUEUED = (1U << 16),
+ I_PINNING_NETFS_WB = (1U << 17)
+};
+
+#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
+#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
+#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
+
/*
* Keep mostly read-only and often accessed (especially for
* the RCU path lookup and 'stat' data) fields at the beginning
@@ -723,7 +839,7 @@ struct inode {
#endif
/* Misc */
- u32 i_state;
+ enum inode_state_flags_t i_state;
/* 32-bit hole */
struct rw_semaphore i_rwsem;
@@ -2483,117 +2599,6 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
};
}
-/*
- * Inode state bits. Protected by inode->i_lock
- *
- * Four bits determine the dirty state of the inode: I_DIRTY_SYNC,
- * I_DIRTY_DATASYNC, I_DIRTY_PAGES, and I_DIRTY_TIME.
- *
- * Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
- * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
- * various stages of removing an inode.
- *
- * Two bits are used for locking and completion notification, I_NEW and I_SYNC.
- *
- * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
- * fdatasync() (unless I_DIRTY_DATASYNC is also set).
- * Timestamp updates are the usual cause.
- * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of
- * these changes separately from I_DIRTY_SYNC so that we
- * don't have to write inode on fdatasync() when only
- * e.g. the timestamps have changed.
- * I_DIRTY_PAGES Inode has dirty pages. Inode itself may be clean.
- * I_DIRTY_TIME The inode itself has dirty timestamps, and the
- * lazytime mount option is enabled. We keep track of this
- * separately from I_DIRTY_SYNC in order to implement
- * lazytime. This gets cleared if I_DIRTY_INODE
- * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But
- * I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already
- * in place because writeback might already be in progress
- * and we don't want to lose the time update
- * I_NEW Serves as both a mutex and completion notification.
- * New inodes set I_NEW. If two processes both create
- * the same inode, one of them will release its inode and
- * wait for I_NEW to be released before returning.
- * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can
- * also cause waiting on I_NEW, without I_NEW actually
- * being set. find_inode() uses this to prevent returning
- * nearly-dead inodes.
- * I_WILL_FREE Must be set when calling write_inode_now() if i_count
- * is zero. I_FREEING must be set when I_WILL_FREE is
- * cleared.
- * I_FREEING Set when inode is about to be freed but still has dirty
- * pages or buffers attached or the inode itself is still
- * dirty.
- * I_CLEAR Added by clear_inode(). In this state the inode is
- * clean and can be destroyed. Inode keeps I_FREEING.
- *
- * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are
- * prohibited for many purposes. iget() must wait for
- * the inode to be completely released, then create it
- * anew. Other functions will just ignore such inodes,
- * if appropriate. I_NEW is used for waiting.
- *
- * I_SYNC Writeback of inode is running. The bit is set during
- * data writeback, and cleared with a wakeup on the bit
- * address once it is done. The bit is also used to pin
- * the inode in memory for flusher thread.
- *
- * I_REFERENCED Marks the inode as recently references on the LRU list.
- *
- * I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to
- * synchronize competing switching instances and to tell
- * wb stat updates to grab the i_pages lock. See
- * inode_switch_wbs_work_fn() for details.
- *
- * I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper
- * and work dirs among overlayfs mounts.
- *
- * I_CREATING New object's inode in the middle of setting up.
- *
- * I_DONTCACHE Evict inode as soon as it is not used anymore.
- *
- * I_SYNC_QUEUED Inode is queued in b_io or b_more_io writeback lists.
- * Used to detect that mark_inode_dirty() should not move
- * inode between dirty lists.
- *
- * I_PINNING_FSCACHE_WB Inode is pinning an fscache object for writeback.
- *
- * I_LRU_ISOLATING Inode is pinned being isolated from LRU without holding
- * i_count.
- *
- * Q: What is the difference between I_WILL_FREE and I_FREEING?
- *
- * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
- * upon. There's one free address left.
- */
-#define __I_NEW 0
-#define I_NEW (1 << __I_NEW)
-#define __I_SYNC 1
-#define I_SYNC (1 << __I_SYNC)
-#define __I_LRU_ISOLATING 2
-#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING)
-
-#define I_DIRTY_SYNC (1 << 3)
-#define I_DIRTY_DATASYNC (1 << 4)
-#define I_DIRTY_PAGES (1 << 5)
-#define I_WILL_FREE (1 << 6)
-#define I_FREEING (1 << 7)
-#define I_CLEAR (1 << 8)
-#define I_REFERENCED (1 << 9)
-#define I_LINKABLE (1 << 10)
-#define I_DIRTY_TIME (1 << 11)
-#define I_WB_SWITCH (1 << 12)
-#define I_OVL_INUSE (1 << 13)
-#define I_CREATING (1 << 14)
-#define I_DONTCACHE (1 << 15)
-#define I_SYNC_QUEUED (1 << 16)
-#define I_PINNING_NETFS_WB (1 << 17)
-
-#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
-#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
-#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
-
extern void __mark_inode_dirty(struct inode *, int);
static inline void mark_inode_dirty(struct inode *inode)
{
--
2.49.0
* [PATCH v2 02/54] fs: add an icount_read helper
From: Josef Bacik @ 2025-08-26 15:39 UTC
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Instead of doing direct access to ->i_count, add a helper to handle
this. This will make it easier to convert i_count to a refcount later.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
arch/powerpc/platforms/cell/spufs/file.c | 2 +-
fs/btrfs/inode.c | 2 +-
fs/ceph/mds_client.c | 2 +-
fs/ext4/ialloc.c | 4 ++--
fs/fs-writeback.c | 2 +-
fs/hpfs/inode.c | 2 +-
fs/inode.c | 8 ++++----
fs/nfs/inode.c | 4 ++--
fs/notify/fsnotify.c | 2 +-
fs/smb/client/inode.c | 2 +-
fs/ubifs/super.c | 2 +-
fs/xfs/xfs_inode.c | 2 +-
fs/xfs/xfs_trace.h | 2 +-
include/linux/fs.h | 5 +++++
include/trace/events/filelock.h | 2 +-
security/landlock/fs.c | 2 +-
16 files changed, 25 insertions(+), 20 deletions(-)
diff --git a/arch/powerpc/platforms/cell/spufs/file.c b/arch/powerpc/platforms/cell/spufs/file.c
index d5a2c77bc908..ce839783c0df 100644
--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -1430,7 +1430,7 @@ static int spufs_mfc_open(struct inode *inode, struct file *file)
if (ctx->owner != current->mm)
return -EINVAL;
- if (atomic_read(&inode->i_count) != 1)
+ if (icount_read(inode) != 1)
return -EBUSY;
mutex_lock(&ctx->mapping_lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 784bd48b4da9..ac00554e8479 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4557,7 +4557,7 @@ static void btrfs_prune_dentries(struct btrfs_root *root)
inode = btrfs_find_first_inode(root, min_ino);
while (inode) {
- if (atomic_read(&inode->vfs_inode.i_count) > 1)
+ if (icount_read(&inode->vfs_inode) > 1)
d_prune_aliases(&inode->vfs_inode);
min_ino = btrfs_ino(inode) + 1;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 0f497c39ff82..62dba710504d 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2221,7 +2221,7 @@ static int trim_caps_cb(struct inode *inode, int mds, void *arg)
int count;
dput(dentry);
d_prune_aliases(inode);
- count = atomic_read(&inode->i_count);
+ count = icount_read(inode);
if (count == 1)
(*remaining)--;
doutc(cl, "%p %llx.%llx cap %p pruned, count now %d\n",
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index df4051613b29..ba4fd9aba1c1 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -252,10 +252,10 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
"nonexistent device\n", __func__, __LINE__);
return;
}
- if (atomic_read(&inode->i_count) > 1) {
+ if (icount_read(inode) > 1) {
ext4_msg(sb, KERN_ERR, "%s:%d: inode #%lu: count=%d",
__func__, __LINE__, inode->i_ino,
- atomic_read(&inode->i_count));
+ icount_read(inode));
return;
}
if (inode->i_nlink) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a6cc3d305b84..b6768ef3daa6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1767,7 +1767,7 @@ static int writeback_single_inode(struct inode *inode,
int ret = 0;
spin_lock(&inode->i_lock);
- if (!atomic_read(&inode->i_count))
+ if (!icount_read(inode))
WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
else
WARN_ON(inode->i_state & I_WILL_FREE);
diff --git a/fs/hpfs/inode.c b/fs/hpfs/inode.c
index a59e8fa630db..34008442ee26 100644
--- a/fs/hpfs/inode.c
+++ b/fs/hpfs/inode.c
@@ -184,7 +184,7 @@ void hpfs_write_inode(struct inode *i)
struct hpfs_inode_info *hpfs_inode = hpfs_i(i);
struct inode *parent;
if (i->i_ino == hpfs_sb(i->i_sb)->sb_root) return;
- if (hpfs_inode->i_rddir_off && !atomic_read(&i->i_count)) {
+ if (hpfs_inode->i_rddir_off && !icount_read(i)) {
if (*hpfs_inode->i_rddir_off)
pr_err("write_inode: some position still there\n");
kfree(hpfs_inode->i_rddir_off);
diff --git a/fs/inode.c b/fs/inode.c
index cc0f717a140d..a3673e1ed157 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -534,7 +534,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
{
if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE))
return;
- if (atomic_read(&inode->i_count))
+ if (icount_read(inode))
return;
if (!(inode->i_sb->s_flags & SB_ACTIVE))
return;
@@ -871,11 +871,11 @@ void evict_inodes(struct super_block *sb)
again:
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- if (atomic_read(&inode->i_count))
+ if (icount_read(inode))
continue;
spin_lock(&inode->i_lock);
- if (atomic_read(&inode->i_count)) {
+ if (icount_read(inode)) {
spin_unlock(&inode->i_lock);
continue;
}
@@ -937,7 +937,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
* unreclaimable for a while. Remove them lazily here; iput,
* sync, or the last page cache deletion will requeue them.
*/
- if (atomic_read(&inode->i_count) ||
+ if (icount_read(inode) ||
(inode->i_state & ~I_REFERENCED) ||
!mapping_shrinkable(&inode->i_data)) {
list_lru_isolate(lru, &inode->i_lru);
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 338ef77ae423..b52805951856 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -608,7 +608,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
inode->i_sb->s_id,
(unsigned long long)NFS_FILEID(inode),
nfs_display_fhandle_hash(fh),
- atomic_read(&inode->i_count));
+ icount_read(inode));
out:
return inode;
@@ -2229,7 +2229,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
dfprintk(VFS, "NFS: %s(%s/%lu fh_crc=0x%08x ct=%d info=0x%llx)\n",
__func__, inode->i_sb->s_id, inode->i_ino,
nfs_display_fhandle_hash(NFS_FH(inode)),
- atomic_read(&inode->i_count), fattr->valid);
+ icount_read(inode), fattr->valid);
if (!(fattr->valid & NFS_ATTR_FATTR_FILEID)) {
/* Only a mounted-on-fileid? Just exit */
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 079b868552c2..46bfc543f946 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -66,7 +66,7 @@ static void fsnotify_unmount_inodes(struct super_block *sb)
* removed all zero refcount inodes, in any case. Test to
* be sure.
*/
- if (!atomic_read(&inode->i_count)) {
+ if (!icount_read(inode)) {
spin_unlock(&inode->i_lock);
continue;
}
diff --git a/fs/smb/client/inode.c b/fs/smb/client/inode.c
index fe453a4b3dc8..515b82540840 100644
--- a/fs/smb/client/inode.c
+++ b/fs/smb/client/inode.c
@@ -2779,7 +2779,7 @@ int cifs_revalidate_dentry_attr(struct dentry *dentry)
}
cifs_dbg(FYI, "Update attributes: %s inode 0x%p count %d dentry: 0x%p d_time %ld jiffies %ld\n",
- full_path, inode, inode->i_count.counter,
+ full_path, inode, icount_read(inode),
dentry, cifs_get_time(dentry), jiffies);
again:
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index f3e3b2068608..a0269ba96e3d 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -358,7 +358,7 @@ static void ubifs_evict_inode(struct inode *inode)
goto out;
dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
- ubifs_assert(c, !atomic_read(&inode->i_count));
+ ubifs_assert(c, !icount_read(inode));
truncate_inode_pages_final(&inode->i_data);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 9c39251961a3..df8eab11dc48 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1035,7 +1035,7 @@ xfs_itruncate_extents_flags(
int error = 0;
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
- if (atomic_read(&VFS_I(ip)->i_count))
+ if (icount_read(VFS_I(ip)))
xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
ASSERT(new_size <= XFS_ISIZE(ip));
ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index ac344e42846c..79b8641880ab 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1152,7 +1152,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
TP_fast_assign(
__entry->dev = VFS_I(ip)->i_sb->s_dev;
__entry->ino = ip->i_ino;
- __entry->count = atomic_read(&VFS_I(ip)->i_count);
+ __entry->count = icount_read(VFS_I(ip));
__entry->pincount = atomic_read(&ip->i_pincount);
__entry->iflags = ip->i_flags;
__entry->caller_ip = caller_ip;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3dbaf1ca1828..56041d3387fe 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3372,6 +3372,11 @@ static inline void __iget(struct inode *inode)
atomic_inc(&inode->i_count);
}
+static inline int icount_read(const struct inode *inode)
+{
+ return atomic_read(&inode->i_count);
+}
+
extern void iget_failed(struct inode *);
extern void clear_inode(struct inode *);
extern void __destroy_inode(struct inode *);
diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h
index b8d1e00a7982..fdd36b1daa25 100644
--- a/include/trace/events/filelock.h
+++ b/include/trace/events/filelock.h
@@ -189,7 +189,7 @@ TRACE_EVENT(generic_add_lease,
__entry->i_ino = inode->i_ino;
__entry->wcount = atomic_read(&inode->i_writecount);
__entry->rcount = atomic_read(&inode->i_readcount);
- __entry->icount = atomic_read(&inode->i_count);
+ __entry->icount = icount_read(inode);
__entry->owner = fl->c.flc_owner;
__entry->flags = fl->c.flc_flags;
__entry->type = fl->c.flc_type;
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index c04f8879ad03..0bade2c5aa1d 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -1281,7 +1281,7 @@ static void hook_sb_delete(struct super_block *const sb)
struct landlock_object *object;
/* Only handles referenced inodes. */
- if (!atomic_read(&inode->i_count))
+ if (!icount_read(inode))
continue;
/*
--
2.49.0
* [PATCH v2 03/54] fs: rework iput logic
From: Josef Bacik @ 2025-08-26 15:39 UTC
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Currently, if we are the last iput, and we have the I_DIRTY_TIME bit
set, we will grab a reference on the inode again and then mark it dirty
and then redo the put. This is to make sure we delay the time update
for as long as possible.
We can rework this logic to simply dec i_count if it is not 1, and if it
is do the time update while still holding the i_count reference.
Then we can replace the atomic_dec_and_lock with locking the ->i_lock
and doing atomic_dec_and_test, since we did the atomic_add_unless above.
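
The control flow described above can be modelled in userspace roughly as
follows. This is a sketch under assumptions, not the patch itself:
toy_add_unless() imitates the semantics of atomic_add_unless() with C11
atomics, the lazytime work is reduced to setting a flag unconditionally
(the patch only does it when i_nlink and I_DIRTY_TIME are set), and the
i_lock around the final decrement is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Add 'a' to *v unless *v == u; returns true if the add happened. */
static bool toy_add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return true;
	}
	return false;
}

/* Returns true if the caller dropped the final reference. */
static bool toy_iput(atomic_int *count, bool *lazytime_flushed)
{
	int old;

	/* Fast path: we are not the last holder, so just decrement. */
	if (toy_add_unless(count, -1, 1))
		return false;

	/* We saw a count of 1: do the deferred work while still holding it. */
	*lazytime_flushed = true;

	/*
	 * Someone may have taken a new reference in the meantime, so the
	 * final decrement still has to check whether it reached zero.
	 */
	old = atomic_fetch_sub(count, 1);
	assert(old > 0);	/* catch underflow */
	return old == 1;
}
```

The point of the shape is that the deferred work runs with the reference
still held, so no re-grab and retry loop is needed.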
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index a3673e1ed157..13e80b434323 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1911,16 +1911,21 @@ void iput(struct inode *inode)
if (!inode)
return;
BUG_ON(inode->i_state & I_CLEAR);
-retry:
- if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
- if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
- atomic_inc(&inode->i_count);
- spin_unlock(&inode->i_lock);
- trace_writeback_lazytime_iput(inode);
- mark_inode_dirty_sync(inode);
- goto retry;
- }
+
+ if (atomic_add_unless(&inode->i_count, -1, 1))
+ return;
+
+ if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
+ trace_writeback_lazytime_iput(inode);
+ mark_inode_dirty_sync(inode);
+ }
+
+ spin_lock(&inode->i_lock);
+ if (atomic_dec_and_test(&inode->i_count)) {
+ /* iput_final() drops i_lock */
iput_final(inode);
+ } else {
+ spin_unlock(&inode->i_lock);
}
}
EXPORT_SYMBOL(iput);
--
2.49.0
* [PATCH v2 04/54] fs: add an i_obj_count refcount to the inode
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Currently the inode's lifetime is controlled by its main refcount,
i_count. We overload the eviction of the inode (the final unlink) with
the deletion of the in-memory object, which leads to some hilarity when
we iput() in unfortunate places.
In order to make this less of a footgun, we want to separate the notion
of "is this inode in use by a user" and "is this inode object currently
in use", since deleting an inode is a much heavier operation than
deleting the object and cleaning up its memory.
To that end, introduce ->i_obj_count to the inode. This will be used to
control the lifetime of the inode object itself. We will continue to use
the ->i_count refcount as normal to reduce the churn of adding a new
refcount to inode. Subsequent patches will expand the usage of
->i_obj_count for internal uses, and then I will separate out the
tear down operations of the inode, and then finally adjust the refcount
rules for ->i_count to be more consistent with all other refcounts.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 20 ++++++++++++++++++++
include/linux/fs.h | 12 ++++++++++++
2 files changed, 32 insertions(+)
diff --git a/fs/inode.c b/fs/inode.c
index 13e80b434323..d426f54c05d9 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -235,6 +235,7 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
inode->i_flags = 0;
inode->i_state = 0;
atomic64_set(&inode->i_sequence, 0);
+ refcount_set(&inode->i_obj_count, 1);
atomic_set(&inode->i_count, 1);
inode->i_op = &empty_iops;
inode->i_fop = &no_open_fops;
@@ -831,6 +832,11 @@ static void evict(struct inode *inode)
inode_wake_up_bit(inode, __I_NEW);
BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
+ /*
+ * refcount_dec_and_test must be used here to avoid the underflow
+ * warning.
+ */
+ WARN_ON(!refcount_dec_and_test(&inode->i_obj_count));
destroy_inode(inode);
}
@@ -1930,6 +1936,20 @@ void iput(struct inode *inode)
}
EXPORT_SYMBOL(iput);
+/**
+ * iobj_put - put a object reference on an inode
+ * @inode: inode to put
+ *
+ * Puts a object reference on an inode.
+ */
+void iobj_put(struct inode *inode)
+{
+ if (!inode)
+ return;
+ refcount_dec(&inode->i_obj_count);
+}
+EXPORT_SYMBOL(iobj_put);
+
#ifdef CONFIG_BLOCK
/**
* bmap - find a block number in a file
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 56041d3387fe..84f5218755c3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -871,6 +871,7 @@ struct inode {
#if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
atomic_t i_readcount; /* struct files open RO */
#endif
+ refcount_t i_obj_count;
union {
const struct file_operations *i_fop; /* former ->i_op->default_file_ops */
void (*free_inode)(struct inode *);
@@ -2809,6 +2810,7 @@ extern int current_umask(void);
extern void ihold(struct inode * inode);
extern void iput(struct inode *);
+extern void iobj_put(struct inode *inode);
int inode_update_timestamps(struct inode *inode, int flags);
int generic_update_time(struct inode *, int);
@@ -3364,6 +3366,16 @@ static inline bool is_zero_ino(ino_t ino)
return (u32)ino == 0;
}
+static inline void iobj_get(struct inode *inode)
+{
+ refcount_inc(&inode->i_obj_count);
+}
+
+static inline unsigned int iobj_count_read(const struct inode *inode)
+{
+ return refcount_read(&inode->i_obj_count);
+}
+
/*
* inode->i_lock must be held
*/
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 05/54] fs: hold an i_obj_count reference in wait_sb_inodes
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (3 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 04/54] fs: add an i_obj_count refcount to the inode Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 06/54] fs: hold an i_obj_count reference for the i_wb_list Josef Bacik
` (52 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
In wait_sb_inodes we need to hold a reference for the inode while we're
waiting on writeback to complete, so hold a reference on the inode object
during this operation.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fs-writeback.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b6768ef3daa6..acb229c194ac 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2704,6 +2704,7 @@ static void wait_sb_inodes(struct super_block *sb)
continue;
}
__iget(inode);
+ iobj_get(inode);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
@@ -2717,6 +2718,7 @@ static void wait_sb_inodes(struct super_block *sb)
cond_resched();
iput(inode);
+ iobj_put(inode);
rcu_read_lock();
spin_lock_irq(&sb->s_inode_wblist_lock);
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 06/54] fs: hold an i_obj_count reference for the i_wb_list
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (4 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 05/54] fs: hold an i_obj_count reference in wait_sb_inodes Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 07/54] fs: hold an i_obj_count reference for the i_io_list Josef Bacik
` (51 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
If we're holding the inode on one of the writeback lists we need to have
a reference on that inode. Grab a reference when we add i_wb_list to
something, drop it when it's removed.
This is potentially dangerous, because we remove the inode from the
i_wb_list potentially under IRQ via folio_end_writeback(). This will be
mitigated by making sure all writeback is completed on the final iput,
before the final iobj_put, preventing a potential free under IRQ.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fs-writeback.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index acb229c194ac..cb5e22169808 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1332,6 +1332,7 @@ void sb_mark_inode_writeback(struct inode *inode)
if (list_empty(&inode->i_wb_list)) {
spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
if (list_empty(&inode->i_wb_list)) {
+ iobj_get(inode);
list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
trace_sb_mark_inode_writeback(inode);
}
@@ -1345,16 +1346,27 @@ void sb_mark_inode_writeback(struct inode *inode)
void sb_clear_inode_writeback(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
+ struct inode *drop = NULL;
unsigned long flags;
if (!list_empty(&inode->i_wb_list)) {
spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
if (!list_empty(&inode->i_wb_list)) {
+ drop = inode;
list_del_init(&inode->i_wb_list);
trace_sb_clear_inode_writeback(inode);
}
spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
}
+
+ /*
+ * This can be called in IRQ context when we're clearing writeback on
+ * the folio. This should not be the last iobj_put() on the inode, we
+ * run all of the writeback before we free the inode in order to avoid
+ * this possibility.
+ */
+ VFS_WARN_ON_ONCE(drop && iobj_count_read(drop) < 2);
+ iobj_put(drop);
}
/*
@@ -2683,6 +2695,8 @@ static void wait_sb_inodes(struct super_block *sb)
* to preserve consistency between i_wb_list and the mapping
* writeback tag. Writeback completion is responsible to remove
* the inode from either list once the writeback tag is cleared.
+ * At that point the i_obj_count reference will be dropped for
+ * the i_wb_list reference.
*/
list_move_tail(&inode->i_wb_list, &sb->s_inodes_wb);
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 07/54] fs: hold an i_obj_count reference for the i_io_list
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (5 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 06/54] fs: hold an i_obj_count reference for the i_wb_list Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 08/54] fs: hold an i_obj_count reference in writeback_sb_inodes Josef Bacik
` (50 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
While the inode is attached to a list with its i_io_list member we need
to hold a reference on the object.
The put is under the i_lock in some cases, which could potentially be
unsafe. It isn't currently, because we hold an i_obj_count reference
throughout the entire lifetime of the inode, so this put will never be
the last one. Subsequent patches will make sure we're holding an
i_obj_count reference while we're manipulating this list to ensure this
continues to be safe.
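The "list membership owns one reference" rule can be sketched in userspace
roughly as follows. This is a simplified model under stated assumptions: the
sketch_* names are hypothetical, a plain int stands in for refcount_t, and no
locking is shown. It only mirrors the shape of the list_empty() checks in the
diff below.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list, modeled on struct list_head. */
struct sketch_list {
	struct sketch_list *next, *prev;
};

static void sketch_list_init(struct sketch_list *l)
{
	l->next = l->prev = l;
}

static bool sketch_list_empty(const struct sketch_list *l)
{
	return l->next == l;
}

static void sketch_list_del_init(struct sketch_list *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	sketch_list_init(e);
}

static void sketch_list_add(struct sketch_list *e, struct sketch_list *head)
{
	e->next = head->next;
	e->prev = head;
	head->next->prev = e;
	head->next = e;
}

/* Hypothetical stand-in for struct inode. */
struct sketch_inode {
	int i_obj_count;
	struct sketch_list i_io_list;
};

/* Moving between lists keeps the one reference; first insertion takes it. */
static void sketch_io_list_move(struct sketch_inode *inode,
				struct sketch_list *head)
{
	if (sketch_list_empty(&inode->i_io_list))
		inode->i_obj_count++;
	else
		sketch_list_del_init(&inode->i_io_list);
	sketch_list_add(&inode->i_io_list, head);
}

/* Removal drops the reference only if the inode was actually on a list. */
static void sketch_io_list_del(struct sketch_inode *inode)
{
	if (!sketch_list_empty(&inode->i_io_list)) {
		sketch_list_del_init(&inode->i_io_list);
		inode->i_obj_count--;
	}
}
```

The point of checking list_empty() on both sides is that a move between two
lists, or a repeated delete, must not change the count.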
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fs-writeback.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index cb5e22169808..cf7fab59e4d5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -72,6 +72,14 @@ static inline struct inode *wb_inode(struct list_head *head)
return list_entry(head, struct inode, i_io_list);
}
+static inline void inode_delete_from_io_list(struct inode *inode)
+{
+ if (!list_empty(&inode->i_io_list)) {
+ list_del_init(&inode->i_io_list);
+ iobj_put(inode);
+ }
+}
+
/*
* Include the creation of the trace points after defining the
* wb_writeback_work structure and inline functions so that the definition
@@ -123,6 +131,8 @@ static bool inode_io_list_move_locked(struct inode *inode,
assert_spin_locked(&inode->i_lock);
WARN_ON_ONCE(inode->i_state & I_FREEING);
+ if (list_empty(&inode->i_io_list))
+ iobj_get(inode);
list_move(&inode->i_io_list, head);
/* dirty_time doesn't count as dirty_io until expiration */
@@ -310,7 +320,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode,
if (wb != &wb->bdi->wb)
list_move(&inode->i_io_list, &wb->b_attached);
else
- list_del_init(&inode->i_io_list);
+ inode_delete_from_io_list(inode);
wb_io_lists_depopulated(wb);
}
@@ -1200,7 +1210,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode,
WARN_ON_ONCE(inode->i_state & I_FREEING);
inode->i_state &= ~I_SYNC_QUEUED;
- list_del_init(&inode->i_io_list);
+ inode_delete_from_io_list(inode);
wb_io_lists_depopulated(wb);
}
@@ -1308,16 +1318,23 @@ void wb_start_background_writeback(struct bdi_writeback *wb)
void inode_io_list_del(struct inode *inode)
{
struct bdi_writeback *wb;
+ bool drop = false;
wb = inode_to_wb_and_lock_list(inode);
spin_lock(&inode->i_lock);
inode->i_state &= ~I_SYNC_QUEUED;
- list_del_init(&inode->i_io_list);
+ if (!list_empty(&inode->i_io_list)) {
+ drop = true;
+ list_del_init(&inode->i_io_list);
+ }
wb_io_lists_depopulated(wb);
spin_unlock(&inode->i_lock);
spin_unlock(&wb->list_lock);
+
+ if (drop)
+ iobj_put(inode);
}
EXPORT_SYMBOL(inode_io_list_del);
@@ -1389,7 +1406,7 @@ static void redirty_tail_locked(struct inode *inode, struct bdi_writeback *wb)
* trigger assertions in inode_io_list_move_locked().
*/
if (inode->i_state & I_FREEING) {
- list_del_init(&inode->i_io_list);
+ inode_delete_from_io_list(inode);
wb_io_lists_depopulated(wb);
return;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 08/54] fs: hold an i_obj_count reference in writeback_sb_inodes
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (6 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 07/54] fs: hold an i_obj_count reference for the i_io_list Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 09/54] fs: hold an i_obj_count reference while on the hashtable Josef Bacik
` (49 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
We drop the wb list_lock while writing back inodes, and we could
manipulate the i_io_list while this is happening and drop our reference
for the inode. Protect this by holding the i_obj_count reference during
the writeback.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fs-writeback.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index cf7fab59e4d5..773b276328ec 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1977,6 +1977,7 @@ static long writeback_sb_inodes(struct super_block *sb,
trace_writeback_sb_inodes_requeue(inode);
continue;
}
+ iobj_get(inode);
spin_unlock(&wb->list_lock);
/*
@@ -1987,6 +1988,7 @@ static long writeback_sb_inodes(struct super_block *sb,
if (inode->i_state & I_SYNC) {
/* Wait for I_SYNC. This function drops i_lock... */
inode_sleep_on_writeback(inode);
+ iobj_put(inode);
/* Inode may be gone, start again */
spin_lock(&wb->list_lock);
continue;
@@ -2035,10 +2037,9 @@ static long writeback_sb_inodes(struct super_block *sb,
inode_sync_complete(inode);
spin_unlock(&inode->i_lock);
- if (unlikely(tmp_wb != wb)) {
- spin_unlock(&tmp_wb->list_lock);
- spin_lock(&wb->list_lock);
- }
+ spin_unlock(&tmp_wb->list_lock);
+ iobj_put(inode);
+ spin_lock(&wb->list_lock);
/*
* bail out to wb_writeback() often enough to check
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 09/54] fs: hold an i_obj_count reference while on the hashtable
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (7 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 08/54] fs: hold an i_obj_count reference in writeback_sb_inodes Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 10/54] fs: hold an i_obj_count reference while on the LRU list Josef Bacik
` (48 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
While the inode is on the hashtable we need to hold a reference to the
object itself.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/fs/inode.c b/fs/inode.c
index d426f54c05d9..0c063227d355 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -667,6 +667,7 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
spin_lock(&inode_hash_lock);
spin_lock(&inode->i_lock);
+ iobj_get(inode);
hlist_add_head_rcu(&inode->i_hash, b);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_hash_lock);
@@ -681,11 +682,16 @@ EXPORT_SYMBOL(__insert_inode_hash);
*/
void __remove_inode_hash(struct inode *inode)
{
+ bool putref;
+
spin_lock(&inode_hash_lock);
spin_lock(&inode->i_lock);
+ putref = !hlist_unhashed(&inode->i_hash) && !hlist_fake(&inode->i_hash);
hlist_del_init_rcu(&inode->i_hash);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_hash_lock);
+ if (putref)
+ iobj_put(inode);
}
EXPORT_SYMBOL(__remove_inode_hash);
@@ -1314,6 +1320,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
* caller is responsible for filling in the contents
*/
spin_lock(&inode->i_lock);
+ iobj_get(inode);
inode->i_state |= I_NEW;
hlist_add_head_rcu(&inode->i_hash, head);
spin_unlock(&inode->i_lock);
@@ -1451,6 +1458,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
if (!old) {
inode->i_ino = ino;
spin_lock(&inode->i_lock);
+ iobj_get(inode);
inode->i_state = I_NEW;
hlist_add_head_rcu(&inode->i_hash, head);
spin_unlock(&inode->i_lock);
@@ -1803,6 +1811,7 @@ int insert_inode_locked(struct inode *inode)
}
if (likely(!old)) {
spin_lock(&inode->i_lock);
+ iobj_get(inode);
inode->i_state |= I_NEW | I_CREATING;
hlist_add_head_rcu(&inode->i_hash, head);
spin_unlock(&inode->i_lock);
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 10/54] fs: hold an i_obj_count reference while on the LRU list
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (8 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 09/54] fs: hold an i_obj_count reference while on the hashtable Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 11/54] fs: hold an i_obj_count reference while on the sb inode list Josef Bacik
` (47 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
While on the LRU list we need to make sure the object itself does not
disappear, so hold an i_obj_count reference.
This is a little wonky currently as we're dropping the reference before
we call evict(), because currently we drop the last reference right
before we free the inode. This will be fixed in a future patch when the
freeing of the inode is moved under the control of the i_obj_count
reference.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 0c063227d355..0ca0a1725b3c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -542,10 +542,12 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
if (!mapping_shrinkable(&inode->i_data))
return;
- if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru))
+ if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
+ iobj_get(inode);
this_cpu_inc(nr_unused);
- else if (rotate)
+ } else if (rotate) {
inode->i_state |= I_REFERENCED;
+ }
}
struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
@@ -571,8 +573,10 @@ void inode_add_lru(struct inode *inode)
static void inode_lru_list_del(struct inode *inode)
{
- if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru))
+ if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
+ iobj_put(inode);
this_cpu_dec(nr_unused);
+ }
}
static void inode_pin_lru_isolating(struct inode *inode)
@@ -861,6 +865,15 @@ static void dispose_list(struct list_head *head)
inode = list_first_entry(head, struct inode, i_lru);
list_del_init(&inode->i_lru);
+ /*
+ * This is going right here for now only because we are
+ * currently not using the i_obj_count reference for anything,
+ * and it needs to hit 0 when we call evict().
+ *
+ * This will be moved when we change the lifetime rules in a
+ * future patch.
+ */
+ iobj_put(inode);
evict(inode);
cond_resched();
}
@@ -897,6 +910,7 @@ void evict_inodes(struct super_block *sb)
}
inode->i_state |= I_FREEING;
+ iobj_get(inode);
inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
list_add(&inode->i_lru, &dispose);
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 11/54] fs: hold an i_obj_count reference while on the sb inode list
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (9 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 10/54] fs: hold an i_obj_count reference while on the LRU list Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 12/54] fs: stop accessing ->i_count directly in f2fs and gfs2 Josef Bacik
` (46 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
We are holding this inode on an sb list, so make sure we're holding an
i_obj_count reference while it exists on the list.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/inode.c b/fs/inode.c
index 0ca0a1725b3c..b146b37f7097 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -630,6 +630,7 @@ void inode_sb_list_add(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
+ iobj_get(inode);
spin_lock(&sb->s_inode_list_lock);
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb->s_inode_list_lock);
@@ -644,6 +645,7 @@ static inline void inode_sb_list_del(struct inode *inode)
spin_lock(&sb->s_inode_list_lock);
list_del_init(&inode->i_sb_list);
spin_unlock(&sb->s_inode_list_lock);
+ iobj_put(inode);
}
}
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 12/54] fs: stop accessing ->i_count directly in f2fs and gfs2
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (10 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 11/54] fs: hold an i_obj_count reference while on the sb inode list Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 13/54] fs: hold an i_obj_count when we have an i_count reference Josef Bacik
` (45 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Instead of accessing ->i_count directly in these file systems, use the
appropriate __iget and iput helpers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/f2fs/super.c | 4 ++--
fs/gfs2/ops_fstype.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 1db024b20e29..2045642cfe3b 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -1750,7 +1750,7 @@ static int f2fs_drop_inode(struct inode *inode)
if ((!inode_unhashed(inode) && inode->i_state & I_SYNC)) {
if (!inode->i_nlink && !is_bad_inode(inode)) {
/* to avoid evict_inode call simultaneously */
- atomic_inc(&inode->i_count);
+ __iget(inode);
spin_unlock(&inode->i_lock);
/* should remain fi->extent_tree for writepage */
@@ -1769,7 +1769,7 @@ static int f2fs_drop_inode(struct inode *inode)
sb_end_intwrite(inode->i_sb);
spin_lock(&inode->i_lock);
- atomic_dec(&inode->i_count);
+ iput(inode);
}
trace_f2fs_drop_inode(inode, 0);
return 0;
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index efe99b732551..c770006f8889 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1754,7 +1754,7 @@ static void gfs2_evict_inodes(struct super_block *sb)
spin_unlock(&inode->i_lock);
continue;
}
- atomic_inc(&inode->i_count);
+ __iget(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&sb->s_inode_list_lock);
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 13/54] fs: hold an i_obj_count when we have an i_count reference
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (11 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 12/54] fs: stop accessing ->i_count directly in f2fs and gfs2 Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 14/54] fs: add an I_LRU flag to the inode Josef Bacik
` (44 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
This is the start of the semantic changes of inode lifetimes.
Unfortunately we have to do two things in one patch to be properly safe,
but this is the only case where this happens.
First we take and drop an i_obj_count reference every time we get an
i_count reference. This is because we will be changing the i_count
reference to be the indicator of a "live" inode.
The second thing we do is move the lifetime of the memory allocation
for the inode under the control of the i_obj_count reference.
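The two changes can be modeled together in a small userspace sketch. This is
an illustration only: the sketch_* names are hypothetical, plain ints stand in
for atomic_t and refcount_t, and only the evicting case of iput_final() is
modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct inode. */
struct sketch_inode {
	int i_count;		/* "live inode" references */
	int i_obj_count;	/* "object memory" references */
	bool evicted;		/* models evict() having run */
	bool destroyed;		/* models destroy_inode() having run */
};

static void sketch_iobj_get(struct sketch_inode *inode)
{
	inode->i_obj_count++;
}

/* The memory is released only when the last object reference goes away. */
static void sketch_iobj_put(struct sketch_inode *inode)
{
	if (--inode->i_obj_count == 0)
		inode->destroyed = true;
}

/* Every i_count reference is paired with an i_obj_count reference. */
static void sketch_ihold(struct sketch_inode *inode)
{
	sketch_iobj_get(inode);
	inode->i_count++;
}

static void sketch_iput(struct sketch_inode *inode)
{
	if (--inode->i_count == 0)
		inode->evicted = true;	/* eviction no longer frees memory */
	sketch_iobj_put(inode);
}
```

In this model an inode that is still referenced by some list (one extra
sketch_iobj_get()) is evicted on the last sketch_iput() but its memory stays
valid until that list drops its reference.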
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/inode.c | 4 +++-
fs/fs-writeback.c | 2 --
fs/inode.c | 28 +++++++++-------------------
include/linux/fs.h | 1 +
4 files changed, 13 insertions(+), 22 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ac00554e8479..e16df38e0eef 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3418,8 +3418,10 @@ void btrfs_add_delayed_iput(struct btrfs_inode *inode)
struct btrfs_fs_info *fs_info = inode->root->fs_info;
unsigned long flags;
- if (atomic_add_unless(&inode->vfs_inode.i_count, -1, 1))
+ if (atomic_add_unless(&inode->vfs_inode.i_count, -1, 1)) {
+ iobj_put(&inode->vfs_inode);
return;
+ }
WARN_ON_ONCE(test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state));
atomic_inc(&fs_info->nr_delayed_iputs);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 773b276328ec..b83d556d7ffe 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2736,7 +2736,6 @@ static void wait_sb_inodes(struct super_block *sb)
continue;
}
__iget(inode);
- iobj_get(inode);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
@@ -2750,7 +2749,6 @@ static void wait_sb_inodes(struct super_block *sb)
cond_resched();
iput(inode);
- iobj_put(inode);
rcu_read_lock();
spin_lock_irq(&sb->s_inode_wblist_lock);
diff --git a/fs/inode.c b/fs/inode.c
index b146b37f7097..ddaf282f7c25 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -527,6 +527,7 @@ static void init_once(void *foo)
*/
void ihold(struct inode *inode)
{
+ iobj_get(inode);
WARN_ON(atomic_inc_return(&inode->i_count) < 2);
}
EXPORT_SYMBOL(ihold);
@@ -843,13 +844,6 @@ static void evict(struct inode *inode)
*/
inode_wake_up_bit(inode, __I_NEW);
BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
-
- /*
- * refcount_dec_and_test must be used here to avoid the underflow
- * warning.
- */
- WARN_ON(!refcount_dec_and_test(&inode->i_obj_count));
- destroy_inode(inode);
}
/*
@@ -867,16 +861,8 @@ static void dispose_list(struct list_head *head)
inode = list_first_entry(head, struct inode, i_lru);
list_del_init(&inode->i_lru);
- /*
- * This is going right here for now only because we are
- * currently not using the i_obj_count reference for anything,
- * and it needs to hit 0 when we call evict().
- *
- * This will be moved when we change the lifetime rules in a
- * future patch.
- */
- iobj_put(inode);
evict(inode);
+ iobj_put(inode);
cond_resched();
}
}
@@ -1943,8 +1929,10 @@ void iput(struct inode *inode)
return;
BUG_ON(inode->i_state & I_CLEAR);
- if (atomic_add_unless(&inode->i_count, -1, 1))
+ if (atomic_add_unless(&inode->i_count, -1, 1)) {
+ iobj_put(inode);
return;
+ }
if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
trace_writeback_lazytime_iput(inode);
@@ -1958,6 +1946,7 @@ void iput(struct inode *inode)
} else {
spin_unlock(&inode->i_lock);
}
+ iobj_put(inode);
}
EXPORT_SYMBOL(iput);
@@ -1965,13 +1954,14 @@ EXPORT_SYMBOL(iput);
* iobj_put - put a object reference on an inode
* @inode: inode to put
*
- * Puts a object reference on an inode.
+ * Puts a object reference on an inode, free's it if we get to zero.
*/
void iobj_put(struct inode *inode)
{
if (!inode)
return;
- refcount_dec(&inode->i_obj_count);
+ if (refcount_dec_and_test(&inode->i_obj_count))
+ destroy_inode(inode);
}
EXPORT_SYMBOL(iobj_put);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 84f5218755c3..023ad47685be 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3381,6 +3381,7 @@ static inline unsigned int iobj_count_read(const struct inode *inode)
*/
static inline void __iget(struct inode *inode)
{
+ iobj_get(inode);
atomic_inc(&inode->i_count);
}
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 14/54] fs: add an I_LRU flag to the inode
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (12 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 13/54] fs: hold an i_obj_count when we have an i_count reference Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 15/54] fs: maintain a list of pinned inodes Josef Bacik
` (43 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
We will be adding another list for the inode to keep track of inodes
that are being cached for other reasons. This is necessary to make sure
we know which list the inode is on, and to differentiate it from the
private dispose lists.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 7 +++++++
include/linux/fs.h | 8 +++++++-
include/trace/events/writeback.h | 3 ++-
3 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index ddaf282f7c25..15ff3a0ff7ee 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -545,6 +545,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
iobj_get(inode);
+ inode->i_state |= I_LRU;
this_cpu_inc(nr_unused);
} else if (rotate) {
inode->i_state |= I_REFERENCED;
@@ -574,7 +575,11 @@ void inode_add_lru(struct inode *inode)
static void inode_lru_list_del(struct inode *inode)
{
+ if (!(inode->i_state & I_LRU))
+ return;
+
if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
+ inode->i_state &= ~I_LRU;
iobj_put(inode);
this_cpu_dec(nr_unused);
}
@@ -955,6 +960,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
(inode->i_state & ~I_REFERENCED) ||
!mapping_shrinkable(&inode->i_data)) {
list_lru_isolate(lru, &inode->i_lru);
+ inode->i_state &= ~I_LRU;
spin_unlock(&inode->i_lock);
this_cpu_dec(nr_unused);
return LRU_REMOVED;
@@ -991,6 +997,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
+ inode->i_state &= ~I_LRU;
list_lru_isolate_move(lru, &inode->i_lru, freeable);
spin_unlock(&inode->i_lock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 023ad47685be..e12c09b9fcaf 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -744,6 +744,11 @@ is_uncached_acl(struct posix_acl *acl)
* I_LRU_ISOLATING Inode is pinned being isolated from LRU without holding
* i_count.
*
+ * I_LRU Inode is on the LRU list and has an associated LRU
+ * reference count. Used to distinguish inodes where
+ * ->i_lru is on the LRU and those that are using ->i_lru
+ * for some other means.
+ *
* Q: What is the difference between I_WILL_FREE and I_FREEING?
*
* __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
@@ -774,7 +779,8 @@ enum inode_state_flags_t {
I_CREATING = (1U << 14),
I_DONTCACHE = (1U << 15),
I_SYNC_QUEUED = (1U << 16),
- I_PINNING_NETFS_WB = (1U << 17)
+ I_PINNING_NETFS_WB = (1U << 17),
+ I_LRU = (1U << 18)
};
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 1e23919c0da9..486f85aca84d 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -28,7 +28,8 @@
{I_DONTCACHE, "I_DONTCACHE"}, \
{I_SYNC_QUEUED, "I_SYNC_QUEUED"}, \
{I_PINNING_NETFS_WB, "I_PINNING_NETFS_WB"}, \
- {I_LRU_ISOLATING, "I_LRU_ISOLATING"} \
+ {I_LRU_ISOLATING, "I_LRU_ISOLATING"}, \
+ {I_LRU, "I_LRU"} \
)
/* enums need to be exported to user space */
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 15/54] fs: maintain a list of pinned inodes
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (13 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 14/54] fs: add an I_LRU flag to the inode Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-27 15:20 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 16/54] fs: delete the inode from the LRU list on lookup Josef Bacik
` (42 subsequent siblings)
57 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Currently we rely on dirty inodes and inodes with page cache attached
simply being left around on the system outside of an LRU. These inodes
are eventually reclaimed only because dirty writeback grabs a reference
on the inode and then iputs it when it's done, potentially placing it on
the LRU. For the cached case, the page cache deletion path calls
inode_add_lru() when the inode no longer has cached pages, so that the
inode object can eventually be freed. In the unmount case we walk all
inodes and free them, so this all works out fine.
But we want to eliminate 0 i_count objects as a concept, so we need a
mechanism to hold a reference on these pinned inodes. To that end, add a
list to the super block that contains any inodes that are cached for one
reason or another.
When we call inode_add_lru(), if the inode falls into one of these
categories, we will add it to the cached inode list and hold an
i_obj_count reference. If the inode does not fall into one of these
categories it will be moved to the normal LRU, which already holds an
i_obj_count reference.
In the dirty case we will delete it from the LRU if it is on one, and then
the iput after the writeout will make sure it's placed onto the correct
list at that point.
The page cache case will migrate it when it calls inode_add_lru() when
deleting pages from the page cache.
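The list selection described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: struct toy_inode, enum toy_list, and the toy_* helpers are hypothetical stand-ins, not kernel code, and locking is left out entirely. It tries to mirror one point from the patch: an inode that migrates from the cached list to the normal LRU keeps the single i_obj_count reference it already had.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; not the real struct inode. */
enum toy_list { TOY_NONE, TOY_LRU, TOY_CACHED };

struct toy_inode {
	int i_count;		/* active users */
	int i_obj_count;	/* object lifetime references */
	bool dirty;		/* models I_DIRTY_ALL | I_SYNC */
	bool has_pagecache;	/* models !mapping_shrinkable() */
	enum toy_list list;	/* which list ->i_lru would be on */
};

/* Models inode_needs_cached(): dirty or unshrinkable inodes stay pinned. */
static bool toy_needs_cached(const struct toy_inode *inode)
{
	return inode->dirty || inode->has_pagecache;
}

/* Models __inode_add_lru() as of this patch: pick a list, hold one ref. */
static void toy_add_lru(struct toy_inode *inode)
{
	assert(inode->i_obj_count >= 1);	/* object must still be alive */

	if (inode->i_count)
		return;
	if (toy_needs_cached(inode)) {
		/* Only an inode that is on no list joins the cached list. */
		if (inode->list == TOY_NONE) {
			inode->list = TOY_CACHED;
			inode->i_obj_count++;
		}
		return;
	}
	if (inode->list == TOY_CACHED) {
		/* Migrate: the cached list's reference carries over. */
		inode->list = TOY_LRU;
	} else if (inode->list == TOY_NONE) {
		inode->list = TOY_LRU;
		inode->i_obj_count++;
	}
}
```

In this model a dirty inode with no users lands on the cached list with one extra object reference, and a later call after it becomes clean moves it to the LRU without taking a second one.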
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fs-writeback.c | 8 +++
fs/inode.c | 102 +++++++++++++++++++++++++++++--
fs/internal.h | 1 +
fs/super.c | 3 +
include/linux/fs.h | 13 +++-
include/trace/events/writeback.h | 3 +-
6 files changed, 122 insertions(+), 8 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b83d556d7ffe..dbcb317e7113 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2736,6 +2736,14 @@ static void wait_sb_inodes(struct super_block *sb)
continue;
}
__iget(inode);
+
+ /*
+ * We could have potentially ended up on the cached LRU list, so
+ * remove ourselves from this list now that we have a reference,
+ * the iput will handle placing it back on the appropriate LRU
+ * list if necessary.
+ */
+ inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
diff --git a/fs/inode.c b/fs/inode.c
index 15ff3a0ff7ee..4d39f260901f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -319,6 +319,23 @@ void free_inode_nonrcu(struct inode *inode)
}
EXPORT_SYMBOL(free_inode_nonrcu);
+/*
+ * Some inodes need to stay pinned in memory because they are dirty or there are
+ * cached pages that the VM wants to keep around to avoid thrashing. This does
+ * the appropriate checks to see if we want to shield this inode from periodic
+ * reclaim. Must be called with ->i_lock held.
+ */
+static bool inode_needs_cached(struct inode *inode)
+{
+ lockdep_assert_held(&inode->i_lock);
+
+ if (inode->i_state & (I_DIRTY_ALL | I_SYNC))
+ return true;
+ if (!mapping_shrinkable(&inode->i_data))
+ return true;
+ return false;
+}
+
static void i_callback(struct rcu_head *head)
{
struct inode *inode = container_of(head, struct inode, i_rcu);
@@ -532,20 +549,67 @@ void ihold(struct inode *inode)
}
EXPORT_SYMBOL(ihold);
+static void inode_add_cached_lru(struct inode *inode)
+{
+ lockdep_assert_held(&inode->i_lock);
+
+ if (inode->i_state & I_CACHED_LRU)
+ return;
+ if (!list_empty(&inode->i_lru))
+ return;
+
+ inode->i_state |= I_CACHED_LRU;
+ iobj_get(inode);
+ spin_lock(&inode->i_sb->s_cached_inodes_lock);
+ list_add(&inode->i_lru, &inode->i_sb->s_cached_inodes);
+ spin_unlock(&inode->i_sb->s_cached_inodes_lock);
+}
+
+static bool __inode_del_cached_lru(struct inode *inode)
+{
+ lockdep_assert_held(&inode->i_lock);
+
+ if (!(inode->i_state & I_CACHED_LRU))
+ return false;
+
+ inode->i_state &= ~I_CACHED_LRU;
+ spin_lock(&inode->i_sb->s_cached_inodes_lock);
+ list_del_init(&inode->i_lru);
+ spin_unlock(&inode->i_sb->s_cached_inodes_lock);
+ return true;
+}
+
+static bool inode_del_cached_lru(struct inode *inode)
+{
+ if (__inode_del_cached_lru(inode)) {
+ iobj_put(inode);
+ return true;
+ }
+ return false;
+}
+
static void __inode_add_lru(struct inode *inode, bool rotate)
{
- if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE))
+ bool need_ref = true;
+
+ lockdep_assert_held(&inode->i_lock);
+
+ if (inode->i_state & (I_FREEING | I_WILL_FREE))
return;
if (icount_read(inode))
return;
if (!(inode->i_sb->s_flags & SB_ACTIVE))
return;
- if (!mapping_shrinkable(&inode->i_data))
+ if (inode_needs_cached(inode)) {
+ inode_add_cached_lru(inode);
return;
+ }
+ need_ref = __inode_del_cached_lru(inode) == false;
if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
- iobj_get(inode);
inode->i_state |= I_LRU;
+ if (need_ref)
+ iobj_get(inode);
this_cpu_inc(nr_unused);
} else if (rotate) {
inode->i_state |= I_REFERENCED;
@@ -573,8 +637,19 @@ void inode_add_lru(struct inode *inode)
__inode_add_lru(inode, false);
}
-static void inode_lru_list_del(struct inode *inode)
+/*
+ * Caller must be holding its own i_count reference on this inode in order to
+ * prevent this being the final iput.
+ *
+ * Needs inode->i_lock held.
+ */
+void inode_lru_list_del(struct inode *inode)
{
+ lockdep_assert_held(&inode->i_lock);
+
+ if (inode_del_cached_lru(inode))
+ return;
+
if (!(inode->i_state & I_LRU))
return;
@@ -950,6 +1025,22 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
if (!spin_trylock(&inode->i_lock))
return LRU_SKIP;
+ /*
+ * This inode is either dirty or has page cache we want to keep around,
+ * so move it to the cached list.
+ *
+ * We drop the extra i_obj_count reference we grab when adding it to the
+ * cached lru.
+ */
+ if (inode_needs_cached(inode)) {
+ list_lru_isolate(lru, &inode->i_lru);
+ inode_add_cached_lru(inode);
+ iobj_put(inode);
+ spin_unlock(&inode->i_lock);
+ this_cpu_dec(nr_unused);
+ return LRU_REMOVED;
+ }
+
/*
* Inodes can get referenced, redirtied, or repopulated while
* they're already on the LRU, and this can make them
@@ -957,8 +1048,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
* sync, or the last page cache deletion will requeue them.
*/
if (icount_read(inode) ||
- (inode->i_state & ~I_REFERENCED) ||
- !mapping_shrinkable(&inode->i_data)) {
+ (inode->i_state & ~I_REFERENCED)) {
list_lru_isolate(lru, &inode->i_lru);
inode->i_state &= ~I_LRU;
spin_unlock(&inode->i_lock);
diff --git a/fs/internal.h b/fs/internal.h
index 38e8aab27bbd..17ecee7056d5 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -207,6 +207,7 @@ extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
int dentry_needs_remove_privs(struct mnt_idmap *, struct dentry *dentry);
bool in_group_or_capable(struct mnt_idmap *idmap,
const struct inode *inode, vfsgid_t vfsgid);
+void inode_lru_list_del(struct inode *inode);
/*
* fs-writeback.c
diff --git a/fs/super.c b/fs/super.c
index a038848e8d1f..bf3e6d9055af 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -364,6 +364,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
spin_lock_init(&s->s_inode_list_lock);
INIT_LIST_HEAD(&s->s_inodes_wb);
spin_lock_init(&s->s_inode_wblist_lock);
+ INIT_LIST_HEAD(&s->s_cached_inodes);
+ spin_lock_init(&s->s_cached_inodes_lock);
s->s_count = 1;
atomic_set(&s->s_active, 1);
@@ -409,6 +411,7 @@ static void __put_super(struct super_block *s)
WARN_ON(s->s_dentry_lru.node);
WARN_ON(s->s_inode_lru.node);
WARN_ON(!list_empty(&s->s_mounts));
+ WARN_ON(!list_empty(&s->s_cached_inodes));
call_rcu(&s->rcu, destroy_super_rcu);
}
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e12c09b9fcaf..999ffea2aac1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -749,6 +749,9 @@ is_uncached_acl(struct posix_acl *acl)
* ->i_lru is on the LRU and those that are using ->i_lru
* for some other means.
*
+ * I_CACHED_LRU Inode is cached because it is dirty or isn't shrinkable,
+ * and thus is on the s_cached_inodes list.
+ *
* Q: What is the difference between I_WILL_FREE and I_FREEING?
*
* __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
@@ -780,7 +783,8 @@ enum inode_state_flags_t {
I_DONTCACHE = (1U << 15),
I_SYNC_QUEUED = (1U << 16),
I_PINNING_NETFS_WB = (1U << 17),
- I_LRU = (1U << 18)
+ I_LRU = (1U << 18),
+ I_CACHED_LRU = (1U << 19)
};
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
@@ -1579,6 +1583,13 @@ struct super_block {
spinlock_t s_inode_wblist_lock;
struct list_head s_inodes_wb; /* writeback inodes */
+
+ /*
+ * Cached inodes: any inodes whose reference is held by another
+ * mechanism, such as dirty inodes or unshrinkable inodes.
+ */
+ spinlock_t s_cached_inodes_lock;
+ struct list_head s_cached_inodes;
} __randomize_layout;
static inline struct user_namespace *i_user_ns(const struct inode *inode)
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 486f85aca84d..6949329c744a 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -29,7 +29,8 @@
{I_SYNC_QUEUED, "I_SYNC_QUEUED"}, \
{I_PINNING_NETFS_WB, "I_PINNING_NETFS_WB"}, \
{I_LRU_ISOLATING, "I_LRU_ISOLATING"}, \
- {I_LRU, "I_LRU"} \
+ {I_LRU, "I_LRU"}, \
+ {I_CACHED_LRU, "I_CACHED_LRU"} \
)
/* enums need to be exported to user space */
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 16/54] fs: delete the inode from the LRU list on lookup
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (14 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 15/54] fs: maintain a list of pinned inodes Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-27 21:46 ` Dave Chinner
2025-08-26 15:39 ` [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir Josef Bacik
` (41 subsequent siblings)
57 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
When we move to holding a full reference on the inode when it is on an
LRU list we need to have a mechanism to re-run the LRU add logic. The
use case for this is btrfs's snapshot delete: we will look up all the
inodes and try to drop them, but if they're on the LRU we will not call
->drop_inode() because their refcount will be elevated, so we won't know
that we need to drop the inode.
Fix this by simply removing the inode from its respective LRU list when
we grab a reference to it in a way that we have active users. This will
ensure that the logic to add the inode to the LRU or drop the inode will
be run on the final iput from the user.
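A minimal userspace sketch of why the removal matters, assuming the later state of the series in which an LRU holds a full reference. Everything here (struct toy_inode, want_drop, the toy_* helpers) is hypothetical and only models the counting, not the locking or the real ->drop_inode() call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; assumes the LRU holds a full i_count reference. */
struct toy_inode {
	int i_count;	/* full references, including the LRU's */
	bool on_lru;
	bool want_drop;	/* models ->drop_inode() returning true */
	bool evicted;
};

/* Take the inode off the LRU and drop the reference the LRU held. */
static void toy_lru_del(struct toy_inode *inode)
{
	if (!inode->on_lru)
		return;
	inode->on_lru = false;
	inode->i_count--;
}

/* Models a lookup; del_from_lru selects the behavior added by this patch. */
static void toy_lookup(struct toy_inode *inode, bool del_from_lru)
{
	assert(!inode->evicted);
	inode->i_count++;
	if (del_from_lru)
		toy_lru_del(inode);
}

/* The drop-or-cache decision only runs when the count reaches zero. */
static void toy_iput(struct toy_inode *inode)
{
	assert(inode->i_count > 0);
	if (--inode->i_count > 0)
		return;
	if (inode->want_drop) {
		inode->evicted = true;
	} else {
		inode->on_lru = true;
		inode->i_count = 1;	/* the LRU's reference */
	}
}
```

Without the removal, a lookup followed by an iput takes the count from 1 to 2 and back to 1, so the drop decision never runs and the inode stays on the LRU. With the removal, the same sequence reaches zero and the inode is evicted.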
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/inode.c b/fs/inode.c
index 4d39f260901f..399598e90693 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1146,6 +1146,7 @@ static struct inode *find_inode(struct super_block *sb,
return ERR_PTR(-ESTALE);
}
__iget(inode);
+ inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
return inode;
@@ -1187,6 +1188,7 @@ static struct inode *find_inode_fast(struct super_block *sb,
return ERR_PTR(-ESTALE);
}
__iget(inode);
+ inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
return inode;
@@ -1653,6 +1655,7 @@ struct inode *igrab(struct inode *inode)
spin_lock(&inode->i_lock);
if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
__iget(inode);
+ inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
} else {
spin_unlock(&inode->i_lock);
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (15 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 16/54] fs: delete the inode from the LRU list on lookup Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-27 12:32 ` Christian Brauner
` (2 more replies)
2025-08-26 15:39 ` [PATCH v2 18/54] fs: change evict_inodes to use iput instead of evict directly Josef Bacik
` (40 subsequent siblings)
57 siblings, 3 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
We can end up with an inode on the LRU list or the cached list, then at
some point in the future go to unlink that inode and then still have an
elevated i_count reference for that inode because it is on one of these
lists.
The more common case is the cached list. We open a file, write to it,
truncate some of it which triggers the inode_add_lru code in the
pagecache, adding it to the cached LRU. Then we unlink this inode, and
it exists until writeback or reclaim kicks in and removes the inode.
To handle this case, delete the inode from the LRU list when it is
unlinked, so we have the best case scenario for immediately freeing the
inode.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/namei.c | 30 +++++++++++++++++++++++++-----
1 file changed, 25 insertions(+), 5 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 138a693c2346..e56dcb5747e4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4438,6 +4438,7 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
struct dentry *dentry)
{
+ struct inode *inode = dentry->d_inode;
int error = may_delete(idmap, dir, dentry, 1);
if (error)
@@ -4447,11 +4448,11 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
return -EPERM;
dget(dentry);
- inode_lock(dentry->d_inode);
+ inode_lock(inode);
error = -EBUSY;
if (is_local_mountpoint(dentry) ||
- (dentry->d_inode->i_flags & S_KERNEL_FILE))
+ (inode->i_flags & S_KERNEL_FILE))
goto out;
error = security_inode_rmdir(dir, dentry);
@@ -4463,12 +4464,21 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
goto out;
shrink_dcache_parent(dentry);
- dentry->d_inode->i_flags |= S_DEAD;
+ inode->i_flags |= S_DEAD;
dont_mount(dentry);
detach_mounts(dentry);
out:
- inode_unlock(dentry->d_inode);
+ /*
+ * The inode may be on the LRU list, so delete it from the LRU at this
+ * point in order to make sure that the inode is freed as soon as
+ * possible.
+ */
+ spin_lock(&inode->i_lock);
+ inode_lru_list_del(inode);
+ spin_unlock(&inode->i_lock);
+
+ inode_unlock(inode);
dput(dentry);
if (!error)
d_delete_notify(dir, dentry);
@@ -4653,8 +4663,18 @@ int do_unlinkat(int dfd, struct filename *name)
dput(dentry);
}
inode_unlock(path.dentry->d_inode);
- if (inode)
+ if (inode) {
+ /*
+ * The LRU may be holding a reference, remove the inode from the
+ * LRU here before dropping our hopefully final reference on the
+ * inode.
+ */
+ spin_lock(&inode->i_lock);
+ inode_lru_list_del(inode);
+ spin_unlock(&inode->i_lock);
+
iput(inode); /* truncate the inode here */
+ }
inode = NULL;
if (delegated_inode) {
error = break_deleg_wait(&delegated_inode);
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 18/54] fs: change evict_inodes to use iput instead of evict directly
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (16 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-28 10:18 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 19/54] fs: hold a full ref while the inode is on a LRU Josef Bacik
` (39 subsequent siblings)
57 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
At evict_inodes() time, we no longer have SB_ACTIVE set, so we can
easily go through the normal iput path to clear any inodes. Update
dispose_list() to check how we need to free the inode, and then grab a
full reference to the inode while we're looping through the remaining
inodes, and simply iput them at the end.
Since we're just calling iput we don't really care about the i_count on
the inode at the current time. Remove the i_count checks and just call
iput on every inode we find.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 26 +++++++++++---------------
1 file changed, 11 insertions(+), 15 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 399598e90693..ede9118bb649 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -933,7 +933,7 @@ static void evict(struct inode *inode)
* Dispose-list gets a local list with local inodes in it, so it doesn't
* need to worry about list corruption and SMP locks.
*/
-static void dispose_list(struct list_head *head)
+static void dispose_list(struct list_head *head, bool for_lru)
{
while (!list_empty(head)) {
struct inode *inode;
@@ -941,8 +941,12 @@ static void dispose_list(struct list_head *head)
inode = list_first_entry(head, struct inode, i_lru);
list_del_init(&inode->i_lru);
- evict(inode);
- iobj_put(inode);
+ if (for_lru) {
+ evict(inode);
+ iobj_put(inode);
+ } else {
+ iput(inode);
+ }
cond_resched();
}
}
@@ -964,21 +968,13 @@ void evict_inodes(struct super_block *sb)
again:
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- if (icount_read(inode))
- continue;
-
spin_lock(&inode->i_lock);
- if (icount_read(inode)) {
- spin_unlock(&inode->i_lock);
- continue;
- }
if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
spin_unlock(&inode->i_lock);
continue;
}
- inode->i_state |= I_FREEING;
- iobj_get(inode);
+ __iget(inode);
inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
list_add(&inode->i_lru, &dispose);
@@ -991,13 +987,13 @@ void evict_inodes(struct super_block *sb)
if (need_resched()) {
spin_unlock(&sb->s_inode_list_lock);
cond_resched();
- dispose_list(&dispose);
+ dispose_list(&dispose, false);
goto again;
}
}
spin_unlock(&sb->s_inode_list_lock);
- dispose_list(&dispose);
+ dispose_list(&dispose, false);
}
EXPORT_SYMBOL_GPL(evict_inodes);
@@ -1108,7 +1104,7 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
inode_lru_isolate, &freeable);
- dispose_list(&freeable);
+ dispose_list(&freeable, true);
return freed;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 19/54] fs: hold a full ref while the inode is on a LRU
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (17 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 18/54] fs: change evict_inodes to use iput instead of evict directly Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-28 10:51 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 20/54] fs: disallow 0 reference count inodes Josef Bacik
` (38 subsequent siblings)
57 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
We want to eliminate 0 refcount inodes that can be used. To that end,
make the LRUs hold a full reference on the inode while it is on an LRU
list. From there we can change the eviction code to always just iput the
inode, and the LRU operations will just add or drop a full reference
where appropriate.
We also now must take into account unlink, and avoid adding the inode to
the LRU if it has an nlink of 0.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 87 +++++++++++++++++++++++++-----------------------------
1 file changed, 40 insertions(+), 47 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index ede9118bb649..9001f809add0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -555,11 +555,13 @@ static void inode_add_cached_lru(struct inode *inode)
if (inode->i_state & I_CACHED_LRU)
return;
+ if (inode->__i_nlink == 0)
+ return;
if (!list_empty(&inode->i_lru))
return;
inode->i_state |= I_CACHED_LRU;
- iobj_get(inode);
+ __iget(inode);
spin_lock(&inode->i_sb->s_cached_inodes_lock);
list_add(&inode->i_lru, &inode->i_sb->s_cached_inodes);
spin_unlock(&inode->i_sb->s_cached_inodes_lock);
@@ -582,7 +584,7 @@ static bool __inode_del_cached_lru(struct inode *inode)
static bool inode_del_cached_lru(struct inode *inode)
{
if (__inode_del_cached_lru(inode)) {
- iobj_put(inode);
+ iput(inode);
return true;
}
return false;
@@ -598,6 +600,8 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
return;
if (icount_read(inode))
return;
+ if (inode->__i_nlink == 0)
+ return;
if (!(inode->i_sb->s_flags & SB_ACTIVE))
return;
if (inode_needs_cached(inode)) {
@@ -609,7 +613,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
inode->i_state |= I_LRU;
if (need_ref)
- iobj_get(inode);
+ __iget(inode);
this_cpu_inc(nr_unused);
} else if (rotate) {
inode->i_state |= I_REFERENCED;
@@ -655,7 +659,7 @@ void inode_lru_list_del(struct inode *inode)
if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
inode->i_state &= ~I_LRU;
- iobj_put(inode);
+ iput(inode);
this_cpu_dec(nr_unused);
}
}
@@ -926,6 +930,7 @@ static void evict(struct inode *inode)
BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
}
+static void iput_evict(struct inode *inode);
/*
* dispose_list - dispose of the contents of a local list
* @head: the head of the list to free
@@ -933,20 +938,14 @@ static void evict(struct inode *inode)
* Dispose-list gets a local list with local inodes in it, so it doesn't
* need to worry about list corruption and SMP locks.
*/
-static void dispose_list(struct list_head *head, bool for_lru)
+static void dispose_list(struct list_head *head)
{
while (!list_empty(head)) {
struct inode *inode;
inode = list_first_entry(head, struct inode, i_lru);
list_del_init(&inode->i_lru);
-
- if (for_lru) {
- evict(inode);
- iobj_put(inode);
- } else {
- iput(inode);
- }
+ iput_evict(inode);
cond_resched();
}
}
@@ -987,13 +986,13 @@ void evict_inodes(struct super_block *sb)
if (need_resched()) {
spin_unlock(&sb->s_inode_list_lock);
cond_resched();
- dispose_list(&dispose, false);
+ dispose_list(&dispose);
goto again;
}
}
spin_unlock(&sb->s_inode_list_lock);
- dispose_list(&dispose, false);
+ dispose_list(&dispose);
}
EXPORT_SYMBOL_GPL(evict_inodes);
@@ -1031,22 +1030,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
if (inode_needs_cached(inode)) {
list_lru_isolate(lru, &inode->i_lru);
inode_add_cached_lru(inode);
- iobj_put(inode);
- spin_unlock(&inode->i_lock);
- this_cpu_dec(nr_unused);
- return LRU_REMOVED;
- }
-
- /*
- * Inodes can get referenced, redirtied, or repopulated while
- * they're already on the LRU, and this can make them
- * unreclaimable for a while. Remove them lazily here; iput,
- * sync, or the last page cache deletion will requeue them.
- */
- if (icount_read(inode) ||
- (inode->i_state & ~I_REFERENCED)) {
- list_lru_isolate(lru, &inode->i_lru);
- inode->i_state &= ~I_LRU;
+ iput(inode);
spin_unlock(&inode->i_lock);
this_cpu_dec(nr_unused);
return LRU_REMOVED;
@@ -1082,7 +1066,6 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
}
WARN_ON(inode->i_state & I_NEW);
- inode->i_state |= I_FREEING;
inode->i_state &= ~I_LRU;
list_lru_isolate_move(lru, &inode->i_lru, freeable);
spin_unlock(&inode->i_lock);
@@ -1104,7 +1087,7 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
inode_lru_isolate, &freeable);
- dispose_list(&freeable, true);
+ dispose_list(&freeable);
return freed;
}
@@ -1967,7 +1950,7 @@ EXPORT_SYMBOL(generic_delete_inode);
* in cache if fs is alive, sync and evict if fs is
* shutting down.
*/
-static void iput_final(struct inode *inode)
+static void iput_final(struct inode *inode, bool skip_lru)
{
struct super_block *sb = inode->i_sb;
const struct super_operations *op = inode->i_sb->s_op;
@@ -1981,7 +1964,7 @@ static void iput_final(struct inode *inode)
else
drop = generic_drop_inode(inode);
- if (!drop &&
+ if (!drop && !skip_lru &&
!(inode->i_state & I_DONTCACHE) &&
(sb->s_flags & SB_ACTIVE)) {
__inode_add_lru(inode, true);
@@ -1989,6 +1972,8 @@ static void iput_final(struct inode *inode)
return;
}
+ WARN_ON(!list_empty(&inode->i_lru));
+
state = inode->i_state;
if (!drop) {
WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
@@ -2003,23 +1988,12 @@ static void iput_final(struct inode *inode)
}
WRITE_ONCE(inode->i_state, state | I_FREEING);
- if (!list_empty(&inode->i_lru))
- inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
evict(inode);
}
-/**
- * iput - put an inode
- * @inode: inode to put
- *
- * Puts an inode, dropping its usage count. If the inode use count hits
- * zero, the inode is then freed and may also be destroyed.
- *
- * Consequently, iput() can sleep.
- */
-void iput(struct inode *inode)
+static void __iput(struct inode *inode, bool skip_lru)
{
if (!inode)
return;
@@ -2038,12 +2012,31 @@ void iput(struct inode *inode)
spin_lock(&inode->i_lock);
if (atomic_dec_and_test(&inode->i_count)) {
/* iput_final() drops i_lock */
- iput_final(inode);
+ iput_final(inode, skip_lru);
} else {
spin_unlock(&inode->i_lock);
}
iobj_put(inode);
}
+
+static void iput_evict(struct inode *inode)
+{
+ __iput(inode, true);
+}
+
+/**
+ * iput - put an inode
+ * @inode: inode to put
+ *
+ * Puts an inode, dropping its usage count. If the inode use count hits
+ * zero, the inode is then freed and may also be destroyed.
+ *
+ * Consequently, iput() can sleep.
+ */
+void iput(struct inode *inode)
+{
+ __iput(inode, false);
+}
EXPORT_SYMBOL(iput);
/**
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 20/54] fs: disallow 0 reference count inodes
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (18 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 19/54] fs: hold a full ref while the inode is on a LRU Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-28 11:02 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 21/54] fs: make evict_inodes add to the dispose list under the i_lock Josef Bacik
` (37 subsequent siblings)
57 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Now that we take a full reference for inodes on the LRU, move the logic
to add the inode to the LRU to before we drop our last reference. This
allows us to ensure that if the inode has a reference count it can be
used, and we no longer hold onto inodes that have a 0 reference count.
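The ordering described above can be modeled in userspace as follows. This is a sketch under assumptions, not the kernel implementation: struct toy_inode and the toy_* helpers are hypothetical, and the checks for I_DONTCACHE, SB_ACTIVE, ->drop_inode(), and all locking are omitted. It shows the LRU taking its own reference before the caller's reference is dropped, so a live inode never sits at a count of 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the put path after this patch. */
struct toy_inode {
	int i_count;
	unsigned int nlink;
	bool on_lru;
	bool evicted;
};

/* Models __inode_add_lru(): only runs when the caller holds the last ref. */
static void toy_add_lru(struct toy_inode *inode)
{
	if (inode->i_count != 1)
		return;
	if (inode->nlink == 0)
		return;		/* unlinked inodes should be freed, not cached */
	if (inode->on_lru)
		return;
	inode->on_lru = true;
	inode->i_count++;	/* the LRU's own full reference */
}

/* Models __iput(): try the LRU first, then drop the caller's reference. */
static void toy_iput(struct toy_inode *inode)
{
	assert(inode->i_count > 0);

	/* Fast path, in the spirit of "decrement unless the count is 1". */
	if (inode->i_count > 1) {
		inode->i_count--;
		return;
	}

	toy_add_lru(inode);
	if (--inode->i_count == 0)
		inode->evicted = true;	/* stands in for iput_final() */
}
```

A linked inode whose last user goes away ends up on the LRU with a count of 1, while an inode with nlink == 0 skips the LRU and is evicted right away.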
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 61 ++++++++++++++++++++++++++++++++++++------------------
1 file changed, 41 insertions(+), 20 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 9001f809add0..d1668f7fb73e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -598,7 +598,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
if (inode->i_state & (I_FREEING | I_WILL_FREE))
return;
- if (icount_read(inode))
+ if (icount_read(inode) != 1)
return;
if (inode->__i_nlink == 0)
return;
@@ -1950,28 +1950,11 @@ EXPORT_SYMBOL(generic_delete_inode);
* in cache if fs is alive, sync and evict if fs is
* shutting down.
*/
-static void iput_final(struct inode *inode, bool skip_lru)
+static void iput_final(struct inode *inode, bool drop)
{
- struct super_block *sb = inode->i_sb;
- const struct super_operations *op = inode->i_sb->s_op;
unsigned long state;
- int drop;
WARN_ON(inode->i_state & I_NEW);
-
- if (op->drop_inode)
- drop = op->drop_inode(inode);
- else
- drop = generic_drop_inode(inode);
-
- if (!drop && !skip_lru &&
- !(inode->i_state & I_DONTCACHE) &&
- (sb->s_flags & SB_ACTIVE)) {
- __inode_add_lru(inode, true);
- spin_unlock(&inode->i_lock);
- return;
- }
-
WARN_ON(!list_empty(&inode->i_lru));
state = inode->i_state;
@@ -1993,8 +1976,37 @@ static void iput_final(struct inode *inode, bool skip_lru)
evict(inode);
}
+static bool maybe_add_lru(struct inode *inode, bool skip_lru)
+{
+ const struct super_operations *op = inode->i_sb->s_op;
+ const struct super_block *sb = inode->i_sb;
+ bool drop = false;
+
+ if (op->drop_inode)
+ drop = op->drop_inode(inode);
+ else
+ drop = generic_drop_inode(inode);
+
+ if (drop)
+ return drop;
+
+ if (skip_lru)
+ return drop;
+
+ if (inode->i_state & I_DONTCACHE)
+ return drop;
+
+ if (!(sb->s_flags & SB_ACTIVE))
+ return drop;
+
+ __inode_add_lru(inode, true);
+ return drop;
+}
+
static void __iput(struct inode *inode, bool skip_lru)
{
+ bool drop;
+
if (!inode)
return;
BUG_ON(inode->i_state & I_CLEAR);
@@ -2010,9 +2022,18 @@ static void __iput(struct inode *inode, bool skip_lru)
}
spin_lock(&inode->i_lock);
+
+ /*
+ * If we want to keep the inode around on an LRU we will grab a ref to
+ * the inode when we add it to the LRU list, so we can safely drop the
+ * caller's reference after this. If we didn't add the inode to the LRU
+ * then the refcount will still be 1 and we can do the final iput.
+ */
+ drop = maybe_add_lru(inode, skip_lru);
+
if (atomic_dec_and_test(&inode->i_count)) {
/* iput_final() drops i_lock */
- iput_final(inode, skip_lru);
+ iput_final(inode, drop);
} else {
spin_unlock(&inode->i_lock);
}
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 21/54] fs: make evict_inodes add to the dispose list under the i_lock
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (19 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 20/54] fs: disallow 0 reference count inodes Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 22/54] fs: convert i_count to refcount_t Josef Bacik
` (36 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
In the future when we only serialize the freeing of the inode on the
reference count we could potentially be relying on ->i_lru to be
consistent, which means we need it to be consistent under the ->i_lock.
Move the list_add in evict_inodes() to under the ->i_lock to prevent
potential races where we think the inode isn't on a list but is going to
be added to the private dispose list.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/inode.c b/fs/inode.c
index d1668f7fb73e..1992db5cd70a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -975,8 +975,8 @@ void evict_inodes(struct super_block *sb)
__iget(inode);
inode_lru_list_del(inode);
- spin_unlock(&inode->i_lock);
list_add(&inode->i_lru, &dispose);
+ spin_unlock(&inode->i_lock);
/*
* We can have a ton of inodes to evict at unmount time given
--
2.49.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* [PATCH v2 22/54] fs: convert i_count to refcount_t
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (20 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 21/54] fs: make evict_inodes add to the dispose list under the i_lock Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-28 12:00 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 23/54] fs: use refcount_inc_not_zero in igrab Josef Bacik
` (35 subsequent siblings)
57 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Now that we do not allow i_count to drop to 0 and be used we can convert
it to a refcount_t and benefit from the protections those helpers add.
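For readers less familiar with the helper the diff below switches to, here is a userspace approximation of the "decrement unless the value is 1" operation using C11 atomics. The name toy_dec_not_one is hypothetical, and the model deliberately ignores the saturation and warning behavior that the kernel's refcount_t adds. It only illustrates the fast path that __iput() and btrfs_add_delayed_iput() rely on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Toy model of "decrement unless the value is 1".
 * Returns true if it decremented, false if the count was 1.
 */
static bool toy_dec_not_one(atomic_int *count)
{
	int old = atomic_load(count);

	while (old != 1) {
		assert(old > 1);	/* this model does not handle 0 */
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return true;
		/* The failed exchange reloaded 'old'; retry. */
	}
	return false;
}
```

When the helper returns false, the caller holds the last reference and has to take the slow path under ->i_lock, which is where the LRU decision and the final decrement happen.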
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/inode.c | 2 +-
fs/inode.c | 9 +++++----
include/linux/fs.h | 6 +++---
3 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e16df38e0eef..eb9496342346 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3418,7 +3418,7 @@ void btrfs_add_delayed_iput(struct btrfs_inode *inode)
struct btrfs_fs_info *fs_info = inode->root->fs_info;
unsigned long flags;
- if (atomic_add_unless(&inode->vfs_inode.i_count, -1, 1)) {
+ if (refcount_dec_not_one(&inode->vfs_inode.i_count)) {
iobj_put(&inode->vfs_inode);
return;
}
diff --git a/fs/inode.c b/fs/inode.c
index 1992db5cd70a..0be1c137bf1e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -236,7 +236,7 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
inode->i_state = 0;
atomic64_set(&inode->i_sequence, 0);
refcount_set(&inode->i_obj_count, 1);
- atomic_set(&inode->i_count, 1);
+ refcount_set(&inode->i_count, 1);
inode->i_op = &empty_iops;
inode->i_fop = &no_open_fops;
inode->i_ino = 0;
@@ -545,7 +545,8 @@ static void init_once(void *foo)
void ihold(struct inode *inode)
{
iobj_get(inode);
- WARN_ON(atomic_inc_return(&inode->i_count) < 2);
+ refcount_inc(&inode->i_count);
+ WARN_ON(icount_read(inode) < 2);
}
EXPORT_SYMBOL(ihold);
@@ -2011,7 +2012,7 @@ static void __iput(struct inode *inode, bool skip_lru)
return;
BUG_ON(inode->i_state & I_CLEAR);
- if (atomic_add_unless(&inode->i_count, -1, 1)) {
+ if (refcount_dec_not_one(&inode->i_count)) {
iobj_put(inode);
return;
}
@@ -2031,7 +2032,7 @@ static void __iput(struct inode *inode, bool skip_lru)
*/
drop = maybe_add_lru(inode, skip_lru);
- if (atomic_dec_and_test(&inode->i_count)) {
+ if (refcount_dec_and_test(&inode->i_count)) {
/* iput_final() drops i_lock */
iput_final(inode, drop);
} else {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 999ffea2aac1..fc23e37ca250 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -875,7 +875,7 @@ struct inode {
};
atomic64_t i_version;
atomic64_t i_sequence; /* see futex */
- atomic_t i_count;
+ refcount_t i_count;
atomic_t i_dio_count;
atomic_t i_writecount;
#if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
@@ -3399,12 +3399,12 @@ static inline unsigned int iobj_count_read(const struct inode *inode)
static inline void __iget(struct inode *inode)
{
iobj_get(inode);
- atomic_inc(&inode->i_count);
+ refcount_inc(&inode->i_count);
}
static inline int icount_read(const struct inode *inode)
{
- return atomic_read(&inode->i_count);
+ return refcount_read(&inode->i_count);
}
extern void iget_failed(struct inode *);
--
2.49.0
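For readers less familiar with the refcount helpers used in patch 22, here is
a minimal userspace sketch of the two operations __iput() relies on. This is
not the kernel implementation: struct model_refcount, model_dec_not_one() and
model_dec_and_test() are hypothetical names built on C11 atomics, and the
saturation and warning behaviour of the real refcount_t is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's refcount_t. */
struct model_refcount {
	atomic_uint refs;
};

/*
 * Model of refcount_dec_not_one(): drop a reference unless we hold the
 * last one. Returns true if the decrement happened.
 */
bool model_dec_not_one(struct model_refcount *r)
{
	unsigned int old = atomic_load(&r->refs);

	while (old != 1) {
		if (old == 0)
			return false;	/* simplification: never underflow */
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->refs, &old, old - 1))
			return true;
	}
	return false;
}

/* Model of refcount_dec_and_test(): true when the count reaches 0. */
bool model_dec_and_test(struct model_refcount *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* dropping a reference we do not hold is a bug */
	return old == 1;
}
```

The fast path in __iput() corresponds to model_dec_not_one() succeeding; only
the holder of the final reference falls through to the slower path that ends
in the dec-and-test.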
* [PATCH v2 23/54] fs: use refcount_inc_not_zero in igrab
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (21 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 22/54] fs: convert i_count to refcount_t Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-28 22:08 ` Eric Biggers
2025-08-26 15:39 ` [PATCH v2 24/54] fs: use inode_tryget in find_inode* Josef Bacik
` (34 subsequent siblings)
57 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
We are going to use igrab everywhere we want to acquire a live inode.
Update it to do a refcount_inc_not_zero on the i_count, and if
successful grab a reference to i_obj_count. Add a comment explaining
why we do this and why it is safe.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 26 +++++++++++++-------------
include/linux/fs.h | 30 ++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+), 13 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 0be1c137bf1e..66402786cf8f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1632,20 +1632,20 @@ EXPORT_SYMBOL(iunique);
struct inode *igrab(struct inode *inode)
{
+ lockdep_assert_not_held(&inode->i_lock);
+
+ inode = inode_tryget(inode);
+ if (!inode)
+ return NULL;
+
+ /*
+ * If this inode is on the LRU, take it off so that we can re-run the
+ * LRU logic on the next iput().
+ */
spin_lock(&inode->i_lock);
- if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
- __iget(inode);
- inode_lru_list_del(inode);
- spin_unlock(&inode->i_lock);
- } else {
- spin_unlock(&inode->i_lock);
- /*
- * Handle the case where s_op->clear_inode is not been
- * called yet, and somebody is calling igrab
- * while the inode is getting freed.
- */
- inode = NULL;
- }
+ inode_lru_list_del(inode);
+ spin_unlock(&inode->i_lock);
+
return inode;
}
EXPORT_SYMBOL(igrab);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fc23e37ca250..b13d057ad0d7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3393,6 +3393,36 @@ static inline unsigned int iobj_count_read(const struct inode *inode)
return refcount_read(&inode->i_obj_count);
}
+static inline struct inode *inode_tryget(struct inode *inode)
+{
+ /*
+ * We are using inode_tryget() because we're interested in getting a
+ * live reference to the inode, which is ->i_count. Normally we would
+ * grab i_obj_count first, as it is the higher priority reference.
+ * However we're only interested in making sure we have a live inode,
+ * and we know that if we get a reference for i_count then we can safely
+ * acquire i_obj_count because we always drop i_obj_count after dropping
+ * an i_count reference.
+ *
+ * This is meant to be used either in a place where we have an existing
+ * i_obj_count reference on the inode, or under rcu_read_lock() so we
+ * know we're safe in accessing this inode still.
+ */
+ VFS_WARN_ON_ONCE(!iobj_count_read(inode) && !rcu_read_lock_held());
+
+ if (refcount_inc_not_zero(&inode->i_count)) {
+ iobj_get(inode);
+ return inode;
+ }
+
+ /*
+ * If we failed to increment the reference count, then the
+ * inode is being freed or has been freed. We return NULL
+ * in this case.
+ */
+ return NULL;
+}
+
/*
* inode->i_lock must be held
*/
--
2.49.0
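The ordering that inode_tryget() in patch 23 depends on can be sketched in
userspace as follows. This is only an illustration under assumptions:
struct model_inode, inc_not_zero() and model_inode_tryget() are hypothetical
names, plain C11 atomics stand in for refcount_t, and the RCU / existing
i_obj_count precondition from the patch comment is assumed to hold for the
caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two counters used in this series. */
struct model_inode {
	atomic_uint i_count;		/* "live" references */
	atomic_uint i_obj_count;	/* references keeping the object around */
};

/* Take a reference only if the counter has not already reached zero. */
static bool inc_not_zero(atomic_uint *v)
{
	unsigned int old = atomic_load(v);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Model of inode_tryget(): the live reference is taken first. Once it is
 * held, taking the object reference is safe because the object reference
 * is always dropped after the live one.
 */
struct model_inode *model_inode_tryget(struct model_inode *inode)
{
	assert(inode != NULL);

	if (!inc_not_zero(&inode->i_count))
		return NULL;	/* inode is being freed or already freed */

	atomic_fetch_add(&inode->i_obj_count, 1);
	return inode;
}
```

A failed tryget leaves both counters untouched, which is what lets callers
such as igrab() simply return NULL without any cleanup.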
* [PATCH v2 24/54] fs: use inode_tryget in find_inode*
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (22 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 23/54] fs: use refcount_inc_not_zero in igrab Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 25/54] fs: update find_inode_*rcu to check the i_count count Josef Bacik
` (33 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Now that we never drop the i_count to 0 for valid objects, rework the
logic in the find_inode* helpers to use inode_tryget() to see if they
have a live inode. If this fails we can wait for the inode to be freed
as we know it's currently being evicted.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 66402786cf8f..4ed2e8ff5334 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1093,6 +1093,7 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
}
static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked);
+
/*
* Called with the inode lock held.
*/
@@ -1116,16 +1117,15 @@ static struct inode *find_inode(struct super_block *sb,
if (!test(inode, data))
continue;
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
- __wait_on_freeing_inode(inode, is_inode_hash_locked);
- goto repeat;
- }
if (unlikely(inode->i_state & I_CREATING)) {
spin_unlock(&inode->i_lock);
rcu_read_unlock();
return ERR_PTR(-ESTALE);
}
- __iget(inode);
+ if (!inode_tryget(inode)) {
+ __wait_on_freeing_inode(inode, is_inode_hash_locked);
+ goto repeat;
+ }
inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
@@ -1158,16 +1158,15 @@ static struct inode *find_inode_fast(struct super_block *sb,
if (inode->i_sb != sb)
continue;
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
- __wait_on_freeing_inode(inode, is_inode_hash_locked);
- goto repeat;
- }
if (unlikely(inode->i_state & I_CREATING)) {
spin_unlock(&inode->i_lock);
rcu_read_unlock();
return ERR_PTR(-ESTALE);
}
- __iget(inode);
+ if (!inode_tryget(inode)) {
+ __wait_on_freeing_inode(inode, is_inode_hash_locked);
+ goto repeat;
+ }
inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
--
2.49.0
* [PATCH v2 25/54] fs: update find_inode_*rcu to check the i_count count
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (23 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 24/54] fs: use inode_tryget in find_inode* Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 26/54] fs: use igrab in insert_inode_locked Josef Bacik
` (32 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
These two helpers are always used under RCU and don't appear to mind if
the inode state changes between the time of check and the time of use.
Update them to use the i_count refcount instead of I_WILL_FREE or
I_FREEING.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 4ed2e8ff5334..8ae9ed9605ef 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1823,7 +1823,7 @@ struct inode *find_inode_rcu(struct super_block *sb, unsigned long hashval,
hlist_for_each_entry_rcu(inode, head, i_hash) {
if (inode->i_sb == sb &&
- !(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)) &&
+ icount_read(inode) > 0 &&
test(inode, data))
return inode;
}
@@ -1862,8 +1862,8 @@ struct inode *find_inode_by_ino_rcu(struct super_block *sb,
hlist_for_each_entry_rcu(inode, head, i_hash) {
if (inode->i_ino == ino &&
inode->i_sb == sb &&
- !(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)))
- return inode;
+ icount_read(inode) > 0)
+ return inode;
}
return NULL;
}
--
2.49.0
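The check that patch 25 introduces in the RCU lookups amounts to "skip
entries whose live count is zero". A simplified, single-threaded sketch of
that filter is below; struct model_inode and model_find_by_ino() are
hypothetical names, and a flat array stands in for the hash chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified inode: just the fields the lookup inspects. */
struct model_inode {
	unsigned long i_ino;
	unsigned int i_count;
};

/*
 * Return the first entry matching 'ino' that still has a live reference.
 * No reference is taken here: as with the RCU helpers, the result is only
 * a snapshot and may change after the check.
 */
struct model_inode *model_find_by_ino(struct model_inode *tbl, size_t n,
				      unsigned long ino)
{
	assert(tbl != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (tbl[i].i_ino == ino && tbl[i].i_count > 0)
			return &tbl[i];
	}
	return NULL;
}
```

Because the count is only read, a caller that needs the inode to stay alive
would still have to take a reference separately.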
* [PATCH v2 26/54] fs: use igrab in insert_inode_locked
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (24 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 25/54] fs: update find_inode_*rcu to check the i_count count Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-28 12:15 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 27/54] fs: remove I_WILL_FREE|I_FREEING check from __inode_add_lru Josef Bacik
` (31 subsequent siblings)
57 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Follow the same pattern as in find_inode*. Instead of checking for
I_WILL_FREE|I_FREEING simply call igrab() and if it succeeds we're done.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 8ae9ed9605ef..d34da95a3295 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1883,11 +1883,8 @@ int insert_inode_locked(struct inode *inode)
continue;
if (old->i_sb != sb)
continue;
- spin_lock(&old->i_lock);
- if (old->i_state & (I_FREEING|I_WILL_FREE)) {
- spin_unlock(&old->i_lock);
+ if (!igrab(old))
continue;
- }
break;
}
if (likely(!old)) {
@@ -1899,12 +1896,13 @@ int insert_inode_locked(struct inode *inode)
spin_unlock(&inode_hash_lock);
return 0;
}
+ spin_lock(&old->i_lock);
if (unlikely(old->i_state & I_CREATING)) {
spin_unlock(&old->i_lock);
spin_unlock(&inode_hash_lock);
+ iput(old);
return -EBUSY;
}
- __iget(old);
spin_unlock(&old->i_lock);
spin_unlock(&inode_hash_lock);
wait_on_inode(old);
--
2.49.0
* [PATCH v2 27/54] fs: remove I_WILL_FREE|I_FREEING check from __inode_add_lru
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (25 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 26/54] fs: use igrab in insert_inode_locked Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 28/54] fs: remove I_WILL_FREE|I_FREEING check in inode_pin_lru_isolating Josef Bacik
` (30 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
We only want to add the inode to the LRU if the current caller is
potentially the last one dropping a reference. If the refcount is 0 the
inode is being deleted, and if the refcount is > 1 there is another ref
holder who can add the inode to the LRU list. The existing
icount_read(inode) != 1 check covers both cases, so the
I_WILL_FREE|I_FREEING check is redundant.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index d34da95a3295..082addba546c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -597,8 +597,6 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
lockdep_assert_held(&inode->i_lock);
- if (inode->i_state & (I_FREEING | I_WILL_FREE))
- return;
if (icount_read(inode) != 1)
return;
if (inode->__i_nlink == 0)
--
2.49.0
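The reasoning in patch 27 can be written down as a small predicate. This is a
sketch only: struct model_inode and model_may_add_lru() are hypothetical
names, and the assumption that an inode with a zero link count is not added
to the LRU is inferred from the __i_nlink test visible in the diff context,
not confirmed by it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified inode: just the fields the decision needs. */
struct model_inode {
	unsigned int i_count;	/* live references */
	unsigned int i_nlink;	/* link count */
};

/* True when the caller may be the last holder of a still-linked inode. */
bool model_may_add_lru(const struct model_inode *inode)
{
	assert(inode != NULL);

	/*
	 * 0 means the inode is being deleted, > 1 means another holder
	 * exists and will handle the LRU on its own final put.
	 */
	if (inode->i_count != 1)
		return false;

	/* Assumption: unlinked inodes are not worth caching. */
	if (inode->i_nlink == 0)
		return false;

	return true;
}
```

With the count check in place, a separate "is this inode being freed" flag
test adds nothing, which is the point of the patch.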
* [PATCH v2 28/54] fs: remove I_WILL_FREE|I_FREEING check in inode_pin_lru_isolating
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (26 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 27/54] fs: remove I_WILL_FREE|I_FREEING check from __inode_add_lru Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 29/54] fs: use inode_tryget in evict_inodes Josef Bacik
` (29 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
If the inode is on the LRU list then it has a valid reference and we do
not need to check for these flags.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/inode.c b/fs/inode.c
index 082addba546c..2ceceb30be4d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -666,7 +666,7 @@ void inode_lru_list_del(struct inode *inode)
static void inode_pin_lru_isolating(struct inode *inode)
{
lockdep_assert_held(&inode->i_lock);
- WARN_ON(inode->i_state & (I_LRU_ISOLATING | I_FREEING | I_WILL_FREE));
+ WARN_ON(inode->i_state & I_LRU_ISOLATING);
inode->i_state |= I_LRU_ISOLATING;
}
--
2.49.0
* [PATCH v2 29/54] fs: use inode_tryget in evict_inodes
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (27 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 28/54] fs: remove I_WILL_FREE|I_FREEING check in inode_pin_lru_isolating Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 30/54] fs: change evict_dentries_for_decrypted_inodes to use refcount Josef Bacik
` (28 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Instead of checking I_WILL_FREE|I_FREEING we can simply use
inode_tryget() to determine if we have a live inode that can be evicted.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 2ceceb30be4d..eb74f7b5e967 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -967,12 +967,16 @@ void evict_inodes(struct super_block *sb)
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+ if (inode->i_state & I_NEW) {
+ spin_unlock(&inode->i_lock);
+ continue;
+ }
+
+ if (!inode_tryget(inode)) {
spin_unlock(&inode->i_lock);
continue;
}
- __iget(inode);
inode_lru_list_del(inode);
list_add(&inode->i_lru, &dispose);
spin_unlock(&inode->i_lock);
--
2.49.0
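Patch 29 and several of the following patches share one shape: walk a list,
skip inodes that are still being set up, and skip inodes for which a live
reference can no longer be taken. A single-threaded sketch of that loop is
below. All names (struct model_inode, MODEL_I_NEW, model_evict_scan()) are
hypothetical, a flat array replaces sb->s_inodes, and locking is omitted
entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_I_NEW 0x1u	/* hypothetical stand-in for I_NEW */

/* Hypothetical, simplified inode for the scan. */
struct model_inode {
	unsigned int i_state;
	unsigned int i_count;	/* live references */
	bool on_dispose;	/* models membership of the dispose list */
};

/* Queue every live, fully set up inode; return how many were queued. */
size_t model_evict_scan(struct model_inode *inodes, size_t n)
{
	size_t queued = 0;

	assert(inodes != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		struct model_inode *inode = &inodes[i];

		if (inode->i_state & MODEL_I_NEW)
			continue;	/* still being initialised */

		if (inode->i_count == 0)
			continue;	/* tryget would fail: already dying */

		inode->i_count++;	/* models a successful tryget */
		inode->on_dispose = true;
		queued++;
	}
	return queued;
}
```

The I_NEW test stays a flag check, while liveness is decided purely by
whether a reference can still be taken.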
* [PATCH v2 30/54] fs: change evict_dentries_for_decrypted_inodes to use refcount
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (28 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 29/54] fs: use inode_tryget in evict_inodes Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-28 12:25 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 31/54] block: use igrab in sync_bdevs Josef Bacik
` (27 subsequent siblings)
57 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Instead of checking for I_WILL_FREE|I_FREEING simply use the refcount to
make sure we have a live inode.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/crypto/keyring.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/crypto/keyring.c b/fs/crypto/keyring.c
index 7557f6a88b8f..969db498149a 100644
--- a/fs/crypto/keyring.c
+++ b/fs/crypto/keyring.c
@@ -956,13 +956,16 @@ static void evict_dentries_for_decrypted_inodes(struct fscrypt_master_key *mk)
list_for_each_entry(ci, &mk->mk_decrypted_inodes, ci_master_key_link) {
inode = ci->ci_inode;
+
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
+ if (inode->i_state & I_NEW) {
spin_unlock(&inode->i_lock);
continue;
}
- __iget(inode);
spin_unlock(&inode->i_lock);
+
+ if (!igrab(inode))
+ continue;
spin_unlock(&mk->mk_decrypted_inodes_lock);
shrink_dcache_inode(inode);
--
2.49.0
* [PATCH v2 31/54] block: use igrab in sync_bdevs
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (29 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 30/54] fs: change evict_dentries_for_decrypted_inodes to use refcount Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 32/54] bcachefs: use the refcount instead of I_WILL_FREE|I_FREEING Josef Bacik
` (26 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Instead of checking I_WILL_FREE or I_FREEING simply grab a reference to
the inode, as it will only succeed if the inode is live.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
block/bdev.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index b77ddd12dc06..94ffc0b5a68c 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1265,13 +1265,15 @@ void sync_bdevs(bool wait)
struct block_device *bdev;
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
- mapping->nrpages == 0) {
+ if (inode->i_state & I_NEW || mapping->nrpages == 0) {
spin_unlock(&inode->i_lock);
continue;
}
- __iget(inode);
spin_unlock(&inode->i_lock);
+
+ if (!igrab(inode))
+ continue;
+
spin_unlock(&blockdev_superblock->s_inode_list_lock);
/*
* We hold a reference to 'inode' so it couldn't have been
--
2.49.0
* [PATCH v2 32/54] bcachefs: use the refcount instead of I_WILL_FREE|I_FREEING
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (30 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 31/54] block: use igrab in sync_bdevs Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 33/54] btrfs: don't check I_WILL_FREE|I_FREEING Josef Bacik
` (25 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
We can use the refcount to decide if the inode is alive instead of these
flags.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/bcachefs/fs.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index 687af0eea0c2..7244c5a4b4cb 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -347,7 +347,7 @@ static struct bch_inode_info *bch2_inode_hash_find(struct bch_fs *c, struct btre
spin_unlock(&inode->v.i_lock);
return NULL;
}
- if ((inode->v.i_state & (I_FREEING|I_WILL_FREE))) {
+ if (!icount_read(&inode->v)) {
if (!trans) {
__wait_on_freeing_inode(c, inode, inum);
} else {
@@ -2225,7 +2225,6 @@ void bch2_evict_subvolume_inodes(struct bch_fs *c, snapshot_id_list *s)
continue;
if (!(inode->v.i_state & I_DONTCACHE) &&
- !(inode->v.i_state & I_FREEING) &&
igrab(&inode->v)) {
this_pass_clean = false;
--
2.49.0
* [PATCH v2 33/54] btrfs: don't check I_WILL_FREE|I_FREEING
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (31 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 32/54] bcachefs: use the refcount instead of I_WILL_FREE|I_FREEING Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 34/54] fs: use igrab in drop_pagecache_sb Josef Bacik
` (24 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
btrfs has its own per-root inode list for snapshot uses, and it has a
sanity check to make sure we're not overwriting a live inode when we add
one to the root's xarray. Change this to check the refcount to validate
it's not a live inode.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index eb9496342346..69aab55648b9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3860,7 +3860,7 @@ static int btrfs_add_inode_to_root(struct btrfs_inode *inode, bool prealloc)
ASSERT(ret != -ENOMEM);
return ret;
} else if (existing) {
- WARN_ON(!(existing->vfs_inode.i_state & (I_WILL_FREE | I_FREEING)));
+ WARN_ON(icount_read(&existing->vfs_inode));
}
return 0;
--
2.49.0
* [PATCH v2 34/54] fs: use igrab in drop_pagecache_sb
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (32 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 33/54] btrfs: don't check I_WILL_FREE|I_FREEING Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 35/54] fs: stop checking I_FREEING in d_find_alias_rcu Josef Bacik
` (23 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Just use igrab to see if the inode is valid instead of checking
I_FREEING|I_WILL_FREE.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/drop_caches.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 019a8b4eaaf9..852ccf8e84cb 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -23,18 +23,15 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
- /*
- * We must skip inodes in unusual state. We may also skip
- * inodes without pages but we deliberately won't in case
- * we need to reschedule to avoid softlockups.
- */
- if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+ if ((inode->i_state & I_NEW) ||
(mapping_empty(inode->i_mapping) && !need_resched())) {
spin_unlock(&inode->i_lock);
continue;
}
- __iget(inode);
spin_unlock(&inode->i_lock);
+
+ if (!igrab(inode))
+ continue;
spin_unlock(&sb->s_inode_list_lock);
invalidate_mapping_pages(inode->i_mapping, 0, -1);
--
2.49.0
* [PATCH v2 35/54] fs: stop checking I_FREEING in d_find_alias_rcu
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (33 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 34/54] fs: use igrab in drop_pagecache_sb Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 36/54] ext4: stop checking I_WILL_FREE|I_FREEING in ext4_check_map_extents_env Josef Bacik
` (22 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Instead of checking for I_FREEING, check the refcount of the inode to
see if it is alive.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/dcache.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 60046ae23d51..3f3bd1373d92 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1072,8 +1072,8 @@ struct dentry *d_find_alias_rcu(struct inode *inode)
spin_lock(&inode->i_lock);
// ->i_dentry and ->i_rcu are colocated, but the latter won't be
- // used without having I_FREEING set, which means no aliases left
- if (likely(!(inode->i_state & I_FREEING) && !hlist_empty(l))) {
+ // used until i_count has dropped to zero, which means no aliases left
+ if (likely(icount_read(inode) && !hlist_empty(l))) {
if (S_ISDIR(inode->i_mode)) {
de = hlist_entry(l->first, struct dentry, d_u.d_alias);
} else {
--
2.49.0
* [PATCH v2 36/54] ext4: stop checking I_WILL_FREE|I_FREEING in ext4_check_map_extents_env
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (34 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 35/54] fs: stop checking I_FREEING in d_find_alias_rcu Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 37/54] fs: remove I_WILL_FREE|I_FREEING from fs-writeback.c Josef Bacik
` (21 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Instead of checking I_WILL_FREE|I_FREEING, check the refcount to see if
the inode is alive.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/ext4/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5b7a15db4953..2c777b0f225b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -425,7 +425,7 @@ void ext4_check_map_extents_env(struct inode *inode)
if (!S_ISREG(inode->i_mode) ||
IS_NOQUOTA(inode) || IS_VERITY(inode) ||
is_special_ino(inode->i_sb, inode->i_ino) ||
- (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) ||
+ (inode->i_state & I_NEW) || !icount_read(inode) ||
ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE) ||
ext4_verity_in_progress(inode))
return;
--
2.49.0
* [PATCH v2 37/54] fs: remove I_WILL_FREE|I_FREEING from fs-writeback.c
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (35 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 36/54] ext4: stop checking I_WILL_FREE|I_FREEING in ext4_check_map_extents_env Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 38/54] gfs2: remove I_WILL_FREE|I_FREEING usage Josef Bacik
` (20 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Now that we have the reference count to indicate live inodes, we can
remove all of the uses of I_WILL_FREE and I_FREEING in fs-writeback.c
and use the appropriate reference count checks.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fs-writeback.c | 47 ++++++++++++++++++++++++++---------------------
1 file changed, 26 insertions(+), 21 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index dbcb317e7113..1594cb09be72 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -129,7 +129,7 @@ static bool inode_io_list_move_locked(struct inode *inode,
{
assert_spin_locked(&wb->list_lock);
assert_spin_locked(&inode->i_lock);
- WARN_ON_ONCE(inode->i_state & I_FREEING);
+ WARN_ON_ONCE(!icount_read(inode));
if (list_empty(&inode->i_io_list))
iobj_get(inode);
@@ -314,7 +314,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode,
{
assert_spin_locked(&wb->list_lock);
assert_spin_locked(&inode->i_lock);
- WARN_ON_ONCE(inode->i_state & I_FREEING);
+ WARN_ON_ONCE(!icount_read(inode));
inode->i_state &= ~I_SYNC_QUEUED;
if (wb != &wb->bdi->wb)
@@ -415,10 +415,10 @@ static bool inode_do_switch_wbs(struct inode *inode,
xa_lock_irq(&mapping->i_pages);
/*
- * Once I_FREEING or I_WILL_FREE are visible under i_lock, the eviction
- * path owns the inode and we shouldn't modify ->i_io_list.
+ * Once the refcount is 0, the eviction path owns the inode and we
+ * shouldn't modify ->i_io_list.
*/
- if (unlikely(inode->i_state & (I_FREEING | I_WILL_FREE)))
+ if (unlikely(!icount_read(inode)))
goto skip_switch;
trace_inode_switch_wbs(inode, old_wb, new_wb);
@@ -570,13 +570,16 @@ static bool inode_prepare_wbs_switch(struct inode *inode,
/* while holding I_WB_SWITCH, no one else can update the association */
spin_lock(&inode->i_lock);
if (!(inode->i_sb->s_flags & SB_ACTIVE) ||
- inode->i_state & (I_WB_SWITCH | I_FREEING | I_WILL_FREE) ||
+ inode->i_state & I_WB_SWITCH ||
inode_to_wb(inode) == new_wb) {
spin_unlock(&inode->i_lock);
return false;
}
+ if (!inode_tryget(inode)) {
+ spin_unlock(&inode->i_lock);
+ return false;
+ }
inode->i_state |= I_WB_SWITCH;
- __iget(inode);
spin_unlock(&inode->i_lock);
return true;
@@ -1207,7 +1210,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode,
{
assert_spin_locked(&wb->list_lock);
assert_spin_locked(&inode->i_lock);
- WARN_ON_ONCE(inode->i_state & I_FREEING);
+ WARN_ON_ONCE(!icount_read(inode));
inode->i_state &= ~I_SYNC_QUEUED;
inode_delete_from_io_list(inode);
@@ -1405,7 +1408,7 @@ static void redirty_tail_locked(struct inode *inode, struct bdi_writeback *wb)
* tracking. Flush worker will ignore this inode anyway and it will
* trigger assertions in inode_io_list_move_locked().
*/
- if (inode->i_state & I_FREEING) {
+ if (!icount_read(inode)) {
inode_delete_from_io_list(inode);
wb_io_lists_depopulated(wb);
return;
@@ -1621,7 +1624,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
struct writeback_control *wbc,
unsigned long dirtied_before)
{
- if (inode->i_state & I_FREEING)
+ if (!icount_read(inode))
return;
/*
@@ -1787,7 +1790,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* whether it is a data-integrity sync (%WB_SYNC_ALL) or not (%WB_SYNC_NONE).
*
* To prevent the inode from going away, either the caller must have a reference
- * to the inode, or the inode must have I_WILL_FREE or I_FREEING set.
+ * to the inode, or the inode must have a zero refcount.
*/
static int writeback_single_inode(struct inode *inode,
struct writeback_control *wbc)
@@ -1797,9 +1800,7 @@ static int writeback_single_inode(struct inode *inode,
spin_lock(&inode->i_lock);
if (!icount_read(inode))
- WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
- else
- WARN_ON(inode->i_state & I_WILL_FREE);
+ WARN_ON(inode->i_state & I_NEW);
if (inode->i_state & I_SYNC) {
/*
@@ -1837,7 +1838,7 @@ static int writeback_single_inode(struct inode *inode,
* If the inode is freeing, its i_io_list shoudn't be updated
* as it can be finally deleted at this moment.
*/
- if (!(inode->i_state & I_FREEING)) {
+ if (icount_read(inode)) {
/*
* If the inode is now fully clean, then it can be safely
* removed from its writeback list (if any). Otherwise the
@@ -1957,7 +1958,7 @@ static long writeback_sb_inodes(struct super_block *sb,
* kind writeout is handled by the freer.
*/
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+ if ((inode->i_state & I_NEW) || !icount_read(inode)) {
redirty_tail_locked(inode, wb);
spin_unlock(&inode->i_lock);
continue;
@@ -2615,7 +2616,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if (inode_unhashed(inode))
goto out_unlock;
}
- if (inode->i_state & I_FREEING)
+ if (!icount_read(inode))
goto out_unlock;
/*
@@ -2729,13 +2730,17 @@ static void wait_sb_inodes(struct super_block *sb)
spin_unlock_irq(&sb->s_inode_wblist_lock);
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+ if (inode->i_state & I_NEW) {
+ spin_unlock(&inode->i_lock);
+ spin_lock_irq(&sb->s_inode_wblist_lock);
+ continue;
+ }
+
+ if (!inode_tryget(inode)) {
spin_unlock(&inode->i_lock);
-
spin_lock_irq(&sb->s_inode_wblist_lock);
continue;
}
- __iget(inode);
/*
* We could have potentially ended up on the cached LRU list, so
@@ -2886,7 +2891,7 @@ EXPORT_SYMBOL(sync_inodes_sb);
* This function commits an inode to disk immediately if it is dirty. This is
* primarily needed by knfsd.
*
- * The caller must either have a ref on the inode or must have set I_WILL_FREE.
+ * The caller must either have a ref on the inode or the inode must have a zero refcount.
*/
int write_inode_now(struct inode *inode, int sync)
{
--
2.49.0
* [PATCH v2 38/54] gfs2: remove I_WILL_FREE|I_FREEING usage
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (36 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 37/54] fs: remove I_WILL_FREE|I_FREEING from fs-writeback.c Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 39/54] fs: remove I_WILL_FREE|I_FREEING check from dquot.c Josef Bacik
` (19 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Now that we have the reference count to check if the inode is live, use
that instead of checking I_WILL_FREE|I_FREEING.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/gfs2/ops_fstype.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index c770006f8889..2b481fdc903d 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1745,17 +1745,26 @@ static void gfs2_evict_inodes(struct super_block *sb)
struct gfs2_sbd *sdp = sb->s_fs_info;
set_bit(SDF_EVICTING, &sdp->sd_flags);
-
+again:
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
- if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) &&
- !need_resched()) {
+ if ((inode->i_state & I_NEW) && !need_resched()) {
spin_unlock(&inode->i_lock);
continue;
}
- __iget(inode);
spin_unlock(&inode->i_lock);
+
+ if (!igrab(inode)) {
+ if (need_resched()) {
+ spin_unlock(&sb->s_inode_list_lock);
+ iput(toput_inode);
+ toput_inode = NULL;
+ cond_resched();
+ goto again;
+ }
+ continue;
+ }
spin_unlock(&sb->s_inode_list_lock);
iput(toput_inode);
--
2.49.0
* [PATCH v2 39/54] fs: remove I_WILL_FREE|I_FREEING check from dquot.c
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (37 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 38/54] gfs2: remove I_WILL_FREE|I_FREEING usage Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-28 12:35 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 40/54] notify: remove I_WILL_FREE|I_FREEING checks in fsnotify_unmount_inodes Josef Bacik
` (18 subsequent siblings)
57 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
We can use the reference count to see if the inode is live.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/quota/dquot.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index df4a9b348769..90e69653c261 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -1030,14 +1030,16 @@ static int add_dquot_ref(struct super_block *sb, int type)
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
- if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+ if ((inode->i_state & I_NEW) ||
!atomic_read(&inode->i_writecount) ||
!dqinit_needed(inode, type)) {
spin_unlock(&inode->i_lock);
continue;
}
- __iget(inode);
spin_unlock(&inode->i_lock);
+
+ if (!igrab(inode))
+ continue;
spin_unlock(&sb->s_inode_list_lock);
#ifdef CONFIG_QUOTA_DEBUG
--
2.49.0
* [PATCH v2 40/54] notify: remove I_WILL_FREE|I_FREEING checks in fsnotify_unmount_inodes
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (38 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 39/54] fs: remove I_WILL_FREE|I_FREEING check from dquot.c Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 41/54] xfs: remove I_FREEING check Josef Bacik
` (17 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
We can now just use igrab() to make sure we've got a live inode and
remove the I_WILL_FREE|I_FREEING checks.
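For readers less familiar with the pattern, the "take a reference only if
the inode is still live" behaviour relied on here can be sketched in
userspace C. This is only a model under stated assumptions: struct
model_inode, model_tryget() and model_put() are hypothetical names, C11
atomics stand in for the kernel's refcount handling, and none of the
kernel's locking or i_state handling is represented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct inode; only the refcount is modelled. */
struct model_inode {
	atomic_int count;
};

/* Take a reference only if the count is non-zero, i.e. the object is live. */
static bool model_tryget(struct model_inode *inode)
{
	int old = atomic_load(&inode->count);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&inode->count, &old, old + 1))
			return true;
	}
	/* Count already hit zero: the object is being torn down. */
	return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool model_put(struct model_inode *inode)
{
	int old = atomic_fetch_sub(&inode->count, 1);

	assert(old > 0);
	return old == 1;
}
```

In this model a walker that gets false back from model_tryget() simply
skips the object, which mirrors the "if (!igrab(inode)) continue;" shape
in the diff below.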
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/notify/fsnotify.c | 26 ++++----------------------
1 file changed, 4 insertions(+), 22 deletions(-)
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 46bfc543f946..25996ad2a130 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -46,33 +46,15 @@ static void fsnotify_unmount_inodes(struct super_block *sb)
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- /*
- * We cannot __iget() an inode in state I_FREEING,
- * I_WILL_FREE, or I_NEW which is fine because by that point
- * the inode cannot have any associated watches.
- */
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+ if (inode->i_state & I_NEW) {
spin_unlock(&inode->i_lock);
continue;
}
-
- /*
- * If i_count is zero, the inode cannot have any watches and
- * doing an __iget/iput with SB_ACTIVE clear would actually
- * evict all inodes with zero i_count from icache which is
- * unnecessarily violent and may in fact be illegal to do.
- * However, we should have been called /after/ evict_inodes
- * removed all zero refcount inodes, in any case. Test to
- * be sure.
- */
- if (!icount_read(inode)) {
- spin_unlock(&inode->i_lock);
- continue;
- }
-
- __iget(inode);
spin_unlock(&inode->i_lock);
+
+ if (!igrab(inode))
+ continue;
spin_unlock(&sb->s_inode_list_lock);
iput(iput_inode);
--
2.49.0
* [PATCH v2 41/54] xfs: remove I_FREEING check
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (39 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 40/54] notify: remove I_WILL_FREE|I_FREEING checks in fsnotify_unmount_inodes Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 42/54] landlock: remove I_FREEING|I_WILL_FREE check Josef Bacik
` (16 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
We can simply use the reference count to see if this inode is alive.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/xfs/xfs_bmap_util.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 06ca11731e43..cf6d915e752e 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -514,7 +514,7 @@ xfs_can_free_eofblocks(
* Caller must either hold the exclusive io lock; or be inactivating
* the inode, which guarantees there are no other users of the inode.
*/
- if (!(VFS_I(ip)->i_state & I_FREEING))
+ if (icount_read(VFS_I(ip)))
xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
/* prealloc/delalloc exists only on regular files */
--
2.49.0
* [PATCH v2 42/54] landlock: remove I_FREEING|I_WILL_FREE check
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (40 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 41/54] xfs: remove I_FREEING check Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 43/54] fs: change inode_is_dirtytime_only to use refcount Josef Bacik
` (15 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
We have the reference count that we can use to see if the inode is
alive.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
security/landlock/fs.c | 22 ++++------------------
1 file changed, 4 insertions(+), 18 deletions(-)
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index 0bade2c5aa1d..fc7e577b56e1 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -1280,23 +1280,8 @@ static void hook_sb_delete(struct super_block *const sb)
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
struct landlock_object *object;
- /* Only handles referenced inodes. */
- if (!icount_read(inode))
- continue;
-
- /*
- * Protects against concurrent modification of inode (e.g.
- * from get_inode_object()).
- */
spin_lock(&inode->i_lock);
- /*
- * Checks I_FREEING and I_WILL_FREE to protect against a race
- * condition when release_inode() just called iput(), which
- * could lead to a NULL dereference of inode->security or a
- * second call to iput() for the same Landlock object. Also
- * checks I_NEW because such inode cannot be tied to an object.
- */
- if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
+ if (inode->i_state & I_NEW) {
spin_unlock(&inode->i_lock);
continue;
}
@@ -1308,10 +1293,11 @@ static void hook_sb_delete(struct super_block *const sb)
spin_unlock(&inode->i_lock);
continue;
}
- /* Keeps a reference to this inode until the next loop walk. */
- __iget(inode);
spin_unlock(&inode->i_lock);
+ if (!igrab(inode))
+ continue;
+
/*
* If there is no concurrent release_inode() ongoing, then we
* are in charge of calling iput() on this inode, otherwise we
--
2.49.0
* [PATCH v2 43/54] fs: change inode_is_dirtytime_only to use refcount
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (41 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 42/54] landlock: remove I_FREEING|I_WILL_FREE check Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 22:06 ` Mateusz Guzik
2025-08-26 15:39 ` [PATCH v2 44/54] btrfs: remove references to I_FREEING Josef Bacik
` (14 subsequent siblings)
57 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
We don't need the I_WILL_FREE|I_FREEING check; we can use the refcount
to see if the inode is valid.
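As a rough illustration of the new predicate, the check can be modelled
as "only the dirty-time bit is set out of {dirty-time, new}, and the
refcount is non-zero". The flag values and the name
model_is_dirtytime_only() below are hypothetical and chosen for the
sketch only; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
enum {
	MODEL_I_NEW        = 1u << 0,
	MODEL_I_DIRTY_TIME = 1u << 1,
};

/*
 * True when, of the two modelled flags, only the dirty-time bit is set
 * and the object still has at least one reference.
 */
static bool model_is_dirtytime_only(unsigned int state, int count)
{
	assert(count >= 0);	/* a reference count is never negative */
	return (state & (MODEL_I_DIRTY_TIME | MODEL_I_NEW)) ==
		MODEL_I_DIRTY_TIME && count != 0;
}
```

The zero-count case is the one that used to be covered by the
I_FREEING | I_WILL_FREE bits in the mask.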
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
include/linux/fs.h | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b13d057ad0d7..531a6d0afa75 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2628,6 +2628,11 @@ static inline void mark_inode_dirty_sync(struct inode *inode)
__mark_inode_dirty(inode, I_DIRTY_SYNC);
}
+static inline int icount_read(const struct inode *inode)
+{
+ return refcount_read(&inode->i_count);
+}
+
/*
* Returns true if the given inode itself only has dirty timestamps (its pages
* may still be dirty) and isn't currently being allocated or freed.
@@ -2639,8 +2644,8 @@ static inline void mark_inode_dirty_sync(struct inode *inode)
*/
static inline bool inode_is_dirtytime_only(struct inode *inode)
{
- return (inode->i_state & (I_DIRTY_TIME | I_NEW |
- I_FREEING | I_WILL_FREE)) == I_DIRTY_TIME;
+ return (inode->i_state & (I_DIRTY_TIME | I_NEW)) == I_DIRTY_TIME &&
+ icount_read(inode);
}
extern void inc_nlink(struct inode *inode);
@@ -3432,11 +3437,6 @@ static inline void __iget(struct inode *inode)
refcount_inc(&inode->i_count);
}
-static inline int icount_read(const struct inode *inode)
-{
- return refcount_read(&inode->i_count);
-}
-
extern void iget_failed(struct inode *);
extern void clear_inode(struct inode *);
extern void __destroy_inode(struct inode *);
--
2.49.0
* [PATCH v2 44/54] btrfs: remove references to I_FREEING
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (42 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 43/54] fs: change inode_is_dirtytime_only to use refcount Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 45/54] ext4: remove reference to I_FREEING in inode.c Josef Bacik
` (13 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
We do not need the ASSERT in our evict truncate code, and in the
invalidate path we can simply check for i_count == 0 to see if we're
evicting the inode.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/inode.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 69aab55648b9..ea6884e5c791 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5338,7 +5338,6 @@ static void evict_inode_truncate_pages(struct inode *inode)
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
struct rb_node *node;
- ASSERT(inode->i_state & I_FREEING);
truncate_inode_pages_final(&inode->i_data);
btrfs_drop_extent_map_range(BTRFS_I(inode), 0, (u64)-1, false);
@@ -7448,7 +7447,7 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset,
u64 page_start = folio_pos(folio);
u64 page_end = page_start + folio_size(folio) - 1;
u64 cur;
- int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
+ int inode_evicting = icount_read(&inode->vfs_inode) == 0;
/*
* We have folio locked so no new ordered extent can be created on this
--
2.49.0
* [PATCH v2 45/54] ext4: remove reference to I_FREEING in inode.c
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (43 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 44/54] btrfs: remove references to I_FREEING Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 46/54] ext4: remove reference to I_FREEING in orphan.c Josef Bacik
` (12 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Instead of checking I_FREEING, simply check the i_count reference to see
if this inode is going away.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/ext4/inode.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2c777b0f225b..178448fb73df 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -199,8 +199,8 @@ void ext4_evict_inode(struct inode *inode)
* For inodes with journalled data, transaction commit could have
* dirtied the inode. And for inodes with dioread_nolock, unwritten
* extents converting worker could merge extents and also have dirtied
- * the inode. Flush worker is ignoring it because of I_FREEING flag but
- * we still need to remove the inode from the writeback lists.
+ * the inode. Flush worker is ignoring it because of the 0 i_count
+ * but we still need to remove the inode from the writeback lists.
*/
if (!list_empty_careful(&inode->i_io_list))
inode_io_list_del(inode);
@@ -4581,7 +4581,7 @@ int ext4_truncate(struct inode *inode)
* or it's a completely new inode. In those cases we might not
* have i_rwsem locked because it's not necessary.
*/
- if (!(inode->i_state & (I_NEW|I_FREEING)))
+ if (!(inode->i_state & I_NEW) && icount_read(inode) > 0)
WARN_ON(!inode_is_locked(inode));
trace_ext4_truncate_enter(inode);
--
2.49.0
* [PATCH v2 46/54] ext4: remove reference to I_FREEING in orphan.c
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (44 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 45/54] ext4: remove reference to I_FREEING in inode.c Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 47/54] pnfs: use i_count refcount to determine if the inode is going away Josef Bacik
` (11 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
We can use the i_count refcount to see if this inode is being freed.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/ext4/orphan.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/orphan.c b/fs/ext4/orphan.c
index 524d4658fa40..9ef693b9ad06 100644
--- a/fs/ext4/orphan.c
+++ b/fs/ext4/orphan.c
@@ -107,7 +107,8 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
if (!sbi->s_journal || is_bad_inode(inode))
return 0;
- WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
+ WARN_ON_ONCE(!(inode->i_state & I_NEW) &&
+ icount_read(inode) > 0 &&
!inode_is_locked(inode));
/*
* Inode orphaned in orphan file or in orphan list?
@@ -236,7 +237,8 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
return 0;
- WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
+ WARN_ON_ONCE(!(inode->i_state & I_NEW) &&
+ icount_read(inode) > 0 &&
!inode_is_locked(inode));
if (ext4_test_inode_state(inode, EXT4_STATE_ORPHAN_FILE))
return ext4_orphan_file_del(handle, inode);
--
2.49.0
* [PATCH v2 47/54] pnfs: use i_count refcount to determine if the inode is going away
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (45 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 46/54] ext4: remove reference to I_FREEING in orphan.c Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 48/54] fs: remove some spurious I_FREEING references in inode.c Josef Bacik
` (10 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Remove the I_FREEING and I_CLEAR check in PNFS and replace it with an
i_count reference check, which will indicate that the inode is going
away.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/nfs/pnfs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index a3135b5af7ee..e400e3334c75 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -317,7 +317,7 @@ pnfs_put_layout_hdr(struct pnfs_layout_hdr *lo)
WARN_ONCE(1, "NFS: BUG unfreed layout segments.\n");
pnfs_detach_layout_hdr(lo);
/* Notify pnfs_destroy_layout_final() that we're done */
- if (inode->i_state & (I_FREEING | I_CLEAR))
+ if (icount_read(inode) == 0)
wake_up_var_locked(lo, &inode->i_lock);
spin_unlock(&inode->i_lock);
pnfs_free_layout_hdr(lo);
--
2.49.0
* [PATCH v2 48/54] fs: remove some spurious I_FREEING references in inode.c
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (46 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 47/54] pnfs: use i_count refcount to determine if the inode is going away Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-28 12:40 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 49/54] xfs: remove reference to I_FREEING|I_WILL_FREE Josef Bacik
` (9 subsequent siblings)
57 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Now that we have the i_count reference count rules set so that we only
go into these evict paths with a 0 count, update the sanity checks to
check that instead of I_FREEING.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index eb74f7b5e967..da38c9fbb9a7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -858,7 +858,7 @@ void clear_inode(struct inode *inode)
*/
xa_unlock_irq(&inode->i_data.i_pages);
BUG_ON(!list_empty(&inode->i_data.i_private_list));
- BUG_ON(!(inode->i_state & I_FREEING));
+ BUG_ON(icount_read(inode) != 0);
BUG_ON(inode->i_state & I_CLEAR);
BUG_ON(!list_empty(&inode->i_wb_list));
/* don't need i_lock here, no concurrent mods to i_state */
@@ -871,19 +871,19 @@ EXPORT_SYMBOL(clear_inode);
* to. We remove any pages still attached to the inode and wait for any IO that
* is still in progress before finally destroying the inode.
*
- * An inode must already be marked I_FREEING so that we avoid the inode being
+ * An inode must already have an i_count of 0 so that we avoid the inode being
* moved back onto lists if we race with other code that manipulates the lists
* (e.g. writeback_single_inode). The caller is responsible for setting this.
*
* An inode must already be removed from the LRU list before being evicted from
- * the cache. This should occur atomically with setting the I_FREEING state
- * flag, so no inodes here should ever be on the LRU when being evicted.
+ * the cache. This should always be the case as the LRU list holds an i_count
+ * reference on the inode, and we only evict inodes with an i_count of 0.
*/
static void evict(struct inode *inode)
{
const struct super_operations *op = inode->i_sb->s_op;
- BUG_ON(!(inode->i_state & I_FREEING));
+ BUG_ON(icount_read(inode) != 0);
BUG_ON(!list_empty(&inode->i_lru));
if (!list_empty(&inode->i_io_list))
@@ -897,8 +897,8 @@ static void evict(struct inode *inode)
/*
* Wait for flusher thread to be done with the inode so that filesystem
* does not start destroying it while writeback is still running. Since
- * the inode has I_FREEING set, flusher thread won't start new work on
- * the inode. We just have to wait for running writeback to finish.
+ * the inode has a 0 i_count, flusher thread won't start new work on the
+ * inode. We just have to wait for running writeback to finish.
*/
inode_wait_for_writeback(inode);
spin_unlock(&inode->i_lock);
--
2.49.0
* [PATCH v2 49/54] xfs: remove reference to I_FREEING|I_WILL_FREE
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (47 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 48/54] fs: remove some spurious I_FREEING references in inode.c Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 50/54] ocfs2: do not set I_WILL_FREE Josef Bacik
` (8 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
xfs already uses igrab(), which does the correct thing, so simply
remove the reference to these flags.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/xfs/scrub/common.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 2ef7742be7d3..b0bb40490493 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -1086,8 +1086,7 @@ xchk_install_handle_inode(
/*
* Install an already-referenced inode for scrubbing. Get our own reference to
- * the inode to make disposal simpler. The inode must not be in I_FREEING or
- * I_WILL_FREE state!
+ * the inode to make disposal simpler. The inode must be live.
*/
int
xchk_install_live_inode(
--
2.49.0
* [PATCH v2 50/54] ocfs2: do not set I_WILL_FREE
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (48 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 49/54] xfs: remove reference to I_FREEING|I_WILL_FREE Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 51/54] fs: remove I_FREEING|I_WILL_FREE Josef Bacik
` (7 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
This is a subtle behavior change. Before this change, ocfs2 would keep
this inode from being discovered and used while it was doing this
because I_WILL_FREE was set. However, now we call ->drop_inode()
before we drop the last i_count refcount, so we could potentially race
here with somebody else who grabs a reference to this inode.
This isn't bad; the inode is still live and concurrent accesses will be
safe. But we could potentially end up writing this inode multiple times
if there are concurrent accesses while we're trying to drop the inode.
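The interleaving described above can be walked through with a small
single-threaded model. This is a deliberately simplified sketch, not the
real code path: struct model_inode, model_tryget() and model_drop() are
hypothetical names, the "writes" counter stands in for write_inode_now(),
and the concurrent lookup is simulated by a boolean argument rather than
a second thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: a refcount plus a counter of write-outs. */
struct model_inode {
	int count;
	int writes;
};

/* A lookup only succeeds while the count is still non-zero. */
static bool model_tryget(struct model_inode *inode)
{
	if (inode->count == 0)
		return false;
	inode->count++;
	return true;
}

/*
 * Model of the drop path: write the inode out while the caller's
 * reference is still held, then drop that reference. When
 * concurrent_lookup is true, another user finds the inode during the
 * write-out. Returns true if the count reached zero.
 */
static bool model_drop(struct model_inode *inode, bool concurrent_lookup)
{
	assert(inode->count > 0);
	inode->writes++;		/* stands in for write_inode_now() */
	if (concurrent_lookup)
		model_tryget(inode);	/* succeeds: count is still non-zero */
	return --inode->count == 0;
}
```

Calling model_drop() once with a concurrent lookup and then once more
without leaves writes at 2, which is the "written multiple times" case
the commit message mentions.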
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/ocfs2/inode.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 14bf440ea4df..d3c79d9a9635 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -1306,13 +1306,9 @@ int ocfs2_drop_inode(struct inode *inode)
trace_ocfs2_drop_inode((unsigned long long)oi->ip_blkno,
inode->i_nlink, oi->ip_flags);
- assert_spin_locked(&inode->i_lock);
- inode->i_state |= I_WILL_FREE;
spin_unlock(&inode->i_lock);
write_inode_now(inode, 1);
spin_lock(&inode->i_lock);
- WARN_ON(inode->i_state & I_NEW);
- inode->i_state &= ~I_WILL_FREE;
return 1;
}
--
2.49.0
* [PATCH v2 51/54] fs: remove I_FREEING|I_WILL_FREE
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (49 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 50/54] ocfs2: do not set I_WILL_FREE Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-28 12:42 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 52/54] fs: remove I_REFERENCED Josef Bacik
` (6 subsequent siblings)
57 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Now that we're using the i_count reference count as the ultimate arbiter
of whether or not an inode is live, we can remove the I_FREEING and
I_WILL_FREE flags.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 8 ++---
include/linux/fs.h | 56 +++++++++++++-------------------
include/trace/events/writeback.h | 2 --
3 files changed, 25 insertions(+), 41 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index da38c9fbb9a7..8f61761ca021 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -862,7 +862,7 @@ void clear_inode(struct inode *inode)
BUG_ON(inode->i_state & I_CLEAR);
BUG_ON(!list_empty(&inode->i_wb_list));
/* don't need i_lock here, no concurrent mods to i_state */
- inode->i_state = I_FREEING | I_CLEAR;
+ inode->i_state = I_CLEAR;
}
EXPORT_SYMBOL(clear_inode);
@@ -926,7 +926,7 @@ static void evict(struct inode *inode)
* This also means we don't need any fences for the call below.
*/
inode_wake_up_bit(inode, __I_NEW);
- BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
+ BUG_ON(inode->i_state != I_CLEAR);
}
static void iput_evict(struct inode *inode);
@@ -1959,7 +1959,6 @@ static void iput_final(struct inode *inode, bool drop)
state = inode->i_state;
if (!drop) {
- WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
spin_unlock(&inode->i_lock);
write_inode_now(inode, 1);
@@ -1967,10 +1966,7 @@ static void iput_final(struct inode *inode, bool drop)
spin_lock(&inode->i_lock);
state = inode->i_state;
WARN_ON(state & I_NEW);
- state &= ~I_WILL_FREE;
}
-
- WRITE_ONCE(inode->i_state, state | I_FREEING);
spin_unlock(&inode->i_lock);
evict(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 531a6d0afa75..2a7e7fc96431 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -672,8 +672,8 @@ is_uncached_acl(struct posix_acl *acl)
* I_DIRTY_DATASYNC, I_DIRTY_PAGES, and I_DIRTY_TIME.
*
* Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
- * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
- * various stages of removing an inode.
+ * until that flag is cleared. I_CLEAR is set when the inode is clean and ready
+ * to be freed.
*
* Two bits are used for locking and completion notification, I_NEW and I_SYNC.
*
@@ -697,24 +697,18 @@ is_uncached_acl(struct posix_acl *acl)
* New inodes set I_NEW. If two processes both create
* the same inode, one of them will release its inode and
* wait for I_NEW to be released before returning.
- * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can
- * also cause waiting on I_NEW, without I_NEW actually
- * being set. find_inode() uses this to prevent returning
+ * Inodes with an i_count == 0 or I_CLEAR state can also
+ * cause waiting on I_NEW, without I_NEW actually being
+ * set. find_inode() uses this to prevent returning
* nearly-dead inodes.
- * I_WILL_FREE Must be set when calling write_inode_now() if i_count
- * is zero. I_FREEING must be set when I_WILL_FREE is
- * cleared.
- * I_FREEING Set when inode is about to be freed but still has dirty
- * pages or buffers attached or the inode itself is still
- * dirty.
* I_CLEAR Added by clear_inode(). In this state the inode is
- * clean and can be destroyed. Inode keeps I_FREEING.
+ * clean and can be destroyed.
*
- * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are
- * prohibited for many purposes. iget() must wait for
- * the inode to be completely released, then create it
- * anew. Other functions will just ignore such inodes,
- * if appropriate. I_NEW is used for waiting.
+ * Inodes that have i_count == 0 or I_CLEAR are prohibited
+ * for many purposes. iget() must wait for the inode to be
+ * completely released, then create it anew. Other
+ * functions will just ignore such inodes, if appropriate.
+ * I_NEW is used for waiting.
*
* I_SYNC Writeback of inode is running. The bit is set during
* data writeback, and cleared with a wakeup on the bit
@@ -752,8 +746,6 @@ is_uncached_acl(struct posix_acl *acl)
* I_CACHED_LRU Inode is cached because it is dirty or isn't shrinkable,
* and thus is on the s_cached_inode_lru list.
*
- * Q: What is the difference between I_WILL_FREE and I_FREEING?
- *
* __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
* upon. There's one free address left.
*/
@@ -771,20 +763,18 @@ enum inode_state_flags_t {
I_DIRTY_SYNC = (1U << 3),
I_DIRTY_DATASYNC = (1U << 4),
I_DIRTY_PAGES = (1U << 5),
- I_WILL_FREE = (1U << 6),
- I_FREEING = (1U << 7),
- I_CLEAR = (1U << 8),
- I_REFERENCED = (1U << 9),
- I_LINKABLE = (1U << 10),
- I_DIRTY_TIME = (1U << 11),
- I_WB_SWITCH = (1U << 12),
- I_OVL_INUSE = (1U << 13),
- I_CREATING = (1U << 14),
- I_DONTCACHE = (1U << 15),
- I_SYNC_QUEUED = (1U << 16),
- I_PINNING_NETFS_WB = (1U << 17),
- I_LRU = (1U << 18),
- I_CACHED_LRU = (1U << 19)
+ I_CLEAR = (1U << 6),
+ I_REFERENCED = (1U << 7),
+ I_LINKABLE = (1U << 8),
+ I_DIRTY_TIME = (1U << 9),
+ I_WB_SWITCH = (1U << 10),
+ I_OVL_INUSE = (1U << 11),
+ I_CREATING = (1U << 12),
+ I_DONTCACHE = (1U << 13),
+ I_SYNC_QUEUED = (1U << 14),
+ I_PINNING_NETFS_WB = (1U << 15),
+ I_LRU = (1U << 16),
+ I_CACHED_LRU = (1U << 17)
};
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 6949329c744a..58ee61f3d91d 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -15,8 +15,6 @@
{I_DIRTY_DATASYNC, "I_DIRTY_DATASYNC"}, \
{I_DIRTY_PAGES, "I_DIRTY_PAGES"}, \
{I_NEW, "I_NEW"}, \
- {I_WILL_FREE, "I_WILL_FREE"}, \
- {I_FREEING, "I_FREEING"}, \
{I_CLEAR, "I_CLEAR"}, \
{I_SYNC, "I_SYNC"}, \
{I_DIRTY_TIME, "I_DIRTY_TIME"}, \
--
2.49.0
* [PATCH v2 52/54] fs: remove I_REFERENCED
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (50 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 51/54] fs: remove I_FREEING|I_WILL_FREE Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-28 12:47 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 53/54] fs: remove I_LRU_ISOLATING flag Josef Bacik
` (5 subsequent siblings)
57 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Because we have referenced inodes on the LRU we've had to change the
behavior to make sure we remove the inode from the LRU when we reference
it.
We do this to account for the fact that we may end up with an inode on
the LRU list, and then unlink the inode. We want the last iput() in the
unlink() to actually evict the inode ideally, so we don't want it to
stick around on the LRU and be evicted that way.
With that behavior change we no longer need I_REFERENCED, as we're
always removing the inode from the LRU list on a subsequent access if
it's on the LRU.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 36 +++++++-------------------------
include/linux/fs.h | 22 +++++++++----------
include/trace/events/writeback.h | 1 -
3 files changed, 17 insertions(+), 42 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 8f61761ca021..4f77db7aca75 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -591,7 +591,12 @@ static bool inode_del_cached_lru(struct inode *inode)
return false;
}
-static void __inode_add_lru(struct inode *inode, bool rotate)
+/*
+ * Add inode to LRU if needed (inode is unused and clean).
+ *
+ * Needs inode->i_lock held.
+ */
+void inode_add_lru(struct inode *inode)
{
bool need_ref = true;
@@ -614,8 +619,6 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
if (need_ref)
__iget(inode);
this_cpu_inc(nr_unused);
- } else if (rotate) {
- inode->i_state |= I_REFERENCED;
}
}
@@ -630,16 +633,6 @@ struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
}
EXPORT_SYMBOL(inode_bit_waitqueue);
-/*
- * Add inode to LRU if needed (inode is unused and clean).
- *
- * Needs inode->i_lock held.
- */
-void inode_add_lru(struct inode *inode)
-{
- __inode_add_lru(inode, false);
-}
-
/*
* Caller must be holding it's own i_count reference on this inode in order to
* prevent this being the final iput.
@@ -1001,14 +994,6 @@ EXPORT_SYMBOL_GPL(evict_inodes);
/*
* Isolate the inode from the LRU in preparation for freeing it.
- *
- * If the inode has the I_REFERENCED flag set, then it means that it has been
- * used recently - the flag is set in iput_final(). When we encounter such an
- * inode, clear the flag and move it to the back of the LRU so it gets another
- * pass through the LRU before it gets reclaimed. This is necessary because of
- * the fact we are doing lazy LRU updates to minimise lock contention so the
- * LRU does not have strict ordering. Hence we don't want to reclaim inodes
- * with this flag set because they are the inodes that are out of order.
*/
static enum lru_status inode_lru_isolate(struct list_head *item,
struct list_lru_one *lru, void *arg)
@@ -1039,13 +1024,6 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
return LRU_REMOVED;
}
- /* Recently referenced inodes get one more pass */
- if (inode->i_state & I_REFERENCED) {
- inode->i_state &= ~I_REFERENCED;
- spin_unlock(&inode->i_lock);
- return LRU_ROTATE;
- }
-
/*
* On highmem systems, mapping_shrinkable() permits dropping
* page cache in order to free up struct inodes: lowmem might
@@ -1995,7 +1973,7 @@ static bool maybe_add_lru(struct inode *inode, bool skip_lru)
if (!(sb->s_flags & SB_ACTIVE))
return drop;
- __inode_add_lru(inode, true);
+ inode_add_lru(inode);
return drop;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2a7e7fc96431..39cde53c1b3b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -715,7 +715,6 @@ is_uncached_acl(struct posix_acl *acl)
* address once it is done. The bit is also used to pin
* the inode in memory for flusher thread.
*
- * I_REFERENCED Marks the inode as recently references on the LRU list.
*
* I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to
* synchronize competing switching instances and to tell
@@ -764,17 +763,16 @@ enum inode_state_flags_t {
I_DIRTY_DATASYNC = (1U << 4),
I_DIRTY_PAGES = (1U << 5),
I_CLEAR = (1U << 6),
- I_REFERENCED = (1U << 7),
- I_LINKABLE = (1U << 8),
- I_DIRTY_TIME = (1U << 9),
- I_WB_SWITCH = (1U << 10),
- I_OVL_INUSE = (1U << 11),
- I_CREATING = (1U << 12),
- I_DONTCACHE = (1U << 13),
- I_SYNC_QUEUED = (1U << 14),
- I_PINNING_NETFS_WB = (1U << 15),
- I_LRU = (1U << 16),
- I_CACHED_LRU = (1U << 17)
+ I_LINKABLE = (1U << 7),
+ I_DIRTY_TIME = (1U << 8),
+ I_WB_SWITCH = (1U << 9),
+ I_OVL_INUSE = (1U << 10),
+ I_CREATING = (1U << 11),
+ I_DONTCACHE = (1U << 12),
+ I_SYNC_QUEUED = (1U << 13),
+ I_PINNING_NETFS_WB = (1U << 14),
+ I_LRU = (1U << 15),
+ I_CACHED_LRU = (1U << 16)
};
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 58ee61f3d91d..b419b8060dda 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -18,7 +18,6 @@
{I_CLEAR, "I_CLEAR"}, \
{I_SYNC, "I_SYNC"}, \
{I_DIRTY_TIME, "I_DIRTY_TIME"}, \
- {I_REFERENCED, "I_REFERENCED"}, \
{I_LINKABLE, "I_LINKABLE"}, \
{I_WB_SWITCH, "I_WB_SWITCH"}, \
{I_OVL_INUSE, "I_OVL_INUSE"}, \
--
2.49.0
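The behavior change described in the commit message above can be sketched with a small userspace model. This is a minimal illustration under stated assumptions, not the kernel code: struct model_inode and the model_*() helpers are hypothetical, plain integers stand in for refcount_t and i_lock, and the LRU is reduced to a single flag. It shows why, once every live reference pulls the inode off the LRU, there is no "recently referenced" state left for I_REFERENCED to record, and why the final put in an unlink path reaches zero instead of stopping at 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct inode; not kernel code. */
struct model_inode {
	int i_count; /* live references, including the LRU's own */
	bool on_lru; /* models the I_LRU flag */
};

/* Park the inode on the LRU; the LRU takes its own live reference. */
static void model_lru_add(struct model_inode *inode)
{
	if (!inode->on_lru) {
		inode->on_lru = true;
		inode->i_count++;
	}
}

/* Take the inode off the LRU and drop the LRU's reference. */
static void model_lru_del(struct model_inode *inode)
{
	if (inode->on_lru) {
		inode->on_lru = false;
		inode->i_count--;
	}
}

/* Grab a live reference; any access pulls the inode off the LRU. */
static void model_grab(struct model_inode *inode)
{
	assert(inode->i_count > 0); /* zero-count inodes are never revived */
	inode->i_count++;
	model_lru_del(inode);
}

/* Drop a live reference; returns true if this was the final put. */
static bool model_put(struct model_inode *inode)
{
	assert(inode->i_count > 0);
	return --inode->i_count == 0;
}
```

Starting from an inode with one user reference, model_lru_add() followed by model_put() leaves it idle on the LRU with a count of 1. A later model_grab() and model_put() pair, as in a lookup followed by unlink, then ends at zero, so the model evicts on that put rather than waiting for LRU reclaim.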
* [PATCH v2 53/54] fs: remove I_LRU_ISOLATING flag
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (51 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 52/54] fs: remove I_REFERENCED Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-28 0:26 ` Dave Chinner
2025-08-26 15:39 ` [PATCH v2 54/54] fs: add documentation explaining the reference count rules for inodes Josef Bacik
` (4 subsequent siblings)
57 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
If the inode is on the LRU it has a full reference and thus no longer
needs to be pinned while it is being isolated.
Remove the I_LRU_ISOLATING flag and associated helper functions
(inode_pin_lru_isolating, inode_unpin_lru_isolating, and
inode_wait_for_lru_isolating) as they are no longer needed.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 46 --------------------------------
include/linux/fs.h | 39 ++++++++++++---------------
include/trace/events/writeback.h | 1 -
3 files changed, 17 insertions(+), 69 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 4f77db7aca75..77f009edd5df 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -656,49 +656,6 @@ void inode_lru_list_del(struct inode *inode)
}
}
-static void inode_pin_lru_isolating(struct inode *inode)
-{
- lockdep_assert_held(&inode->i_lock);
- WARN_ON(inode->i_state & I_LRU_ISOLATING);
- inode->i_state |= I_LRU_ISOLATING;
-}
-
-static void inode_unpin_lru_isolating(struct inode *inode)
-{
- spin_lock(&inode->i_lock);
- WARN_ON(!(inode->i_state & I_LRU_ISOLATING));
- inode->i_state &= ~I_LRU_ISOLATING;
- /* Called with inode->i_lock which ensures memory ordering. */
- inode_wake_up_bit(inode, __I_LRU_ISOLATING);
- spin_unlock(&inode->i_lock);
-}
-
-static void inode_wait_for_lru_isolating(struct inode *inode)
-{
- struct wait_bit_queue_entry wqe;
- struct wait_queue_head *wq_head;
-
- lockdep_assert_held(&inode->i_lock);
- if (!(inode->i_state & I_LRU_ISOLATING))
- return;
-
- wq_head = inode_bit_waitqueue(&wqe, inode, __I_LRU_ISOLATING);
- for (;;) {
- prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
- /*
- * Checking I_LRU_ISOLATING with inode->i_lock guarantees
- * memory ordering.
- */
- if (!(inode->i_state & I_LRU_ISOLATING))
- break;
- spin_unlock(&inode->i_lock);
- schedule();
- spin_lock(&inode->i_lock);
- }
- finish_wait(wq_head, &wqe.wq_entry);
- WARN_ON(inode->i_state & I_LRU_ISOLATING);
-}
-
/**
* inode_sb_list_add - add inode to the superblock list of inodes
* @inode: inode to add
@@ -885,7 +842,6 @@ static void evict(struct inode *inode)
inode_sb_list_del(inode);
spin_lock(&inode->i_lock);
- inode_wait_for_lru_isolating(inode);
/*
* Wait for flusher thread to be done with the inode so that filesystem
@@ -1030,7 +986,6 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
* be under pressure before the cache inside the highmem zone.
*/
if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
- inode_pin_lru_isolating(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&lru->lock);
if (remove_inode_buffers(inode)) {
@@ -1042,7 +997,6 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
__count_vm_events(PGINODESTEAL, reap);
mm_account_reclaimed_pages(reap);
}
- inode_unpin_lru_isolating(inode);
return LRU_RETRY;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 39cde53c1b3b..61113026efd5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -734,9 +734,6 @@ is_uncached_acl(struct posix_acl *acl)
*
* I_PINNING_FSCACHE_WB Inode is pinning an fscache object for writeback.
*
- * I_LRU_ISOLATING Inode is pinned being isolated from LRU without holding
- * i_count.
- *
* I_LRU Inode is on the LRU list and has an associated LRU
* reference count. Used to distinguish inodes where
* ->i_lru is on the LRU and those that are using ->i_lru
@@ -745,34 +742,32 @@ is_uncached_acl(struct posix_acl *acl)
* I_CACHED_LRU Inode is cached because it is dirty or isn't shrinkable,
* and thus is on the s_cached_inode_lru list.
*
- * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
- * upon. There's one free address left.
+ * __I_{SYNC,NEW} are used to derive unique addresses to wait upon. There are
+ * two free addresses left.
*/
enum inode_state_bits {
__I_NEW = 0U,
- __I_SYNC = 1U,
- __I_LRU_ISOLATING = 2U
+ __I_SYNC = 1U
};
enum inode_state_flags_t {
I_NEW = (1U << __I_NEW),
I_SYNC = (1U << __I_SYNC),
- I_LRU_ISOLATING = (1U << __I_LRU_ISOLATING),
- I_DIRTY_SYNC = (1U << 3),
- I_DIRTY_DATASYNC = (1U << 4),
- I_DIRTY_PAGES = (1U << 5),
- I_CLEAR = (1U << 6),
- I_LINKABLE = (1U << 7),
- I_DIRTY_TIME = (1U << 8),
- I_WB_SWITCH = (1U << 9),
- I_OVL_INUSE = (1U << 10),
- I_CREATING = (1U << 11),
- I_DONTCACHE = (1U << 12),
- I_SYNC_QUEUED = (1U << 13),
- I_PINNING_NETFS_WB = (1U << 14),
- I_LRU = (1U << 15),
- I_CACHED_LRU = (1U << 16)
+ I_DIRTY_SYNC = (1U << 2),
+ I_DIRTY_DATASYNC = (1U << 3),
+ I_DIRTY_PAGES = (1U << 4),
+ I_CLEAR = (1U << 5),
+ I_LINKABLE = (1U << 6),
+ I_DIRTY_TIME = (1U << 7),
+ I_WB_SWITCH = (1U << 8),
+ I_OVL_INUSE = (1U << 9),
+ I_CREATING = (1U << 10),
+ I_DONTCACHE = (1U << 11),
+ I_SYNC_QUEUED = (1U << 12),
+ I_PINNING_NETFS_WB = (1U << 13),
+ I_LRU = (1U << 14),
+ I_CACHED_LRU = (1U << 15)
};
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index b419b8060dda..a5b73d25eda6 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -25,7 +25,6 @@
{I_DONTCACHE, "I_DONTCACHE"}, \
{I_SYNC_QUEUED, "I_SYNC_QUEUED"}, \
{I_PINNING_NETFS_WB, "I_PINNING_NETFS_WB"}, \
- {I_LRU_ISOLATING, "I_LRU_ISOLATING"}, \
{I_LRU, "I_LRU"}, \
{I_CACHED_LRU, "I_CACHED_LRU"} \
)
--
2.49.0
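The "two free addresses" remark in the updated comment follows from how wait addresses are derived. A rough userspace sketch, assuming i_state is a 32-bit field and that each waitable bit maps to one byte offset inside it; the model_* names and MODEL_I_* constants are hypothetical, and the kernel's own helper is not reproduced here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the waitable state bits left after this patch. */
enum model_state_bits {
	MODEL_I_NEW = 0,
	MODEL_I_SYNC = 1,
};

/* Hypothetical userspace stand-in for struct inode; not kernel code. */
struct model_inode {
	uint32_t i_state;
};

/*
 * Derive a unique address to wait on for a given bit: one distinct address
 * per byte of i_state, so a 4-byte field offers four of them.
 */
static void *model_state_wait_address(struct model_inode *inode,
				      unsigned int bit)
{
	assert(bit < sizeof(inode->i_state));
	return (char *)&inode->i_state + bit;
}

/* Number of wait addresses still unused for a given count of waitable bits. */
static size_t model_free_wait_addresses(unsigned int bits_in_use)
{
	return sizeof(((struct model_inode *)0)->i_state) - bits_in_use;
}
```

Under these assumptions, three waitable bits (with I_LRU_ISOLATING) leave one free address, and two waitable bits leave two, which matches the before and after wording of the comment.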
* [PATCH v2 54/54] fs: add documentation explaining the reference count rules for inodes
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (52 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 53/54] fs: remove I_LRU_ISOLATING flag Josef Bacik
@ 2025-08-26 15:39 ` Josef Bacik
2025-08-27 8:03 ` [syzbot ci] Re: fs: rework inode reference counting syzbot ci
` (3 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-26 15:39 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
Now that we've made these changes to the inode, document the reference
count rules in the vfs documentation.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
Documentation/filesystems/vfs.rst | 86 +++++++++++++++++++++++++++++++
1 file changed, 86 insertions(+)
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 229eb90c96f2..e285cf0499ab 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -457,6 +457,92 @@ The Inode Object
An inode object represents an object within the filesystem.
+Reference counting rules
+------------------------
+
+The inode is reference counted in two distinct ways, an i_obj_count refcount and
+an i_count refcount. These control two different lifetimes of the inode. The
+i_obj_count is the simplest, think of it as a reference count on the object
+itself. When the i_obj_count reaches zero, the inode is freed. Inode freeing
+happens in the RCU context, so the inode is not freed immediately, but rather
+after a grace period.
+
+The i_count reference is the indicator that the inode is "alive". That is to
+say, it is available for use by all the ways that a user can access the inode.
+Once this count reaches zero, we begin the process of evicting the inode. This
+is where the final truncate of an unlinked inode will normally occur. Once
+i_count has reached 0, only the final iput() is allowed to do things like
+writeback, truncate, etc. All users that want to do this style of operation
+must use igrab() or, in very rare and specific circumstances, use
+inode_tryget().
+
+Every access to an inode must include one of these two references. Generally
+i_obj_count is reserved for internal VFS references, the s_inode_list for
+example. All file systems should use igrab()/lookup() to get a live reference on
+the inode, with very few exceptions.
+
+LRU rules
+---------
+
+This is tightly coupled with the reference counting rules above. If the inode is
+being held on an LRU it must be holding both an i_count and an i_obj_count
+reference. This is because we need the inode to be "live" while it is on the LRU
+so it can be accessed again in the future.
+
+This is different from how we traditionally operated. Traditionally we put 0
+refcount objects on the LRU, and then when eviction happened we would remove the
+inode from the LRU if it had a non-zero refcount, or evict it if it had a zero
+refcount.
+
+Now the rules are much simpler. The LRU has a live reference on the inode. That
+means that eviction simply has to remove the inode from the LRU and call
+iput_evict(), which will make sure the inode is not re-added to the LRU when
+putting the reference. If there are other active references to the inode, then
+when those references are dropped the inode will be added back to the LRU.
+
+We have two uses for i_lru, one is for the normal inactive inode LRU, and the
+other is for pinned inodes that are pinned because they are dirty or because
+they have pagecache attached to them.
+
+The dirty case is easy to reason about. If the inode is dirty we cannot reclaim
+it until it has been written back. The inode gets added to the super block's
+cached inode list when it is dirty, and removed when it is clean.
+
+The pagecache case is a little more complex. The VM wants to pin inodes into
+memory as long as they have pagecache. This is because the pagecache has much
+better reclaim logic, it accounts for thrashing and refaulting, so it needs to
+be the ultimate arbiter of when an inode can be reclaimed. The inode remains on
+the cached list as long as it has pagecache to account for this. When pages are
+removed from the inode the VM calls inode_add_lru() to see if the inode still
+needs to be on the cached list or on the inactive LRU.
+
+Holding a live reference on the inode has one drawback. We must remove the inode
+from the LRU in more cases than previously, which can increase contention on the
+LRU. In practice this won't be a problem, because we only put an inode on the
+LRU when it doesn't have a dentry associated with it. When we grab a live
+reference to an inode we must delete it from the LRU in order to make sure that
+any unlink operation results in the inode being removed on the final iput().
+
+Consider the case where we've removed the last dentry from an inode and the
+inode is added to the LRU list. We then look up the inode to do an unlink. The
+final iput in the unlink path will just reduce the i_count to 1, and the inode
+will not be truly removed until eviction or unmount. To avoid this we have two
+choices: make sure we delete the inode from the LRU at
+drop_nlink()/clear_nlink() time, or make sure we delete the inode from the LRU
+when we grab a live reference to it. We cannot do the drop at
+drop_nlink()/clear_nlink() time because we could be holding the i_lock.
+Additionally there are awkward things like BTRFS subvolume delete that do not
+use the nlink of the subvolume as the indicator that it needs to be removed, and
+so we would have to audit all of the possible unlink paths to make sure we
+properly deleted the inode from the LRU. Instead, to provide a more robust
+system, we remove an inode from the LRU at igrab() time. Internally where we're
+already holding the i_lock and use inode_tryget() we will delete the inode from
+the LRU at this point.
+
+The other case is in the unlink path itself. If there was a truncate at all we
+could have ended up on the cached list, so we already have an elevated i_count.
+Removing the inode from the LRU explicitly at this stage is necessary to make
+sure the inode is freed as soon as possible.
struct inode_operations
-----------------------
--
2.49.0
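The two lifetimes described in the documentation above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel implementation: struct model_inode and the model_*() functions are hypothetical names, C11 atomics stand in for refcount_t, eviction and RCU freeing are reduced to setting flags, and the set of live references is assumed to pin the object through a single i_obj_count reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct inode; not kernel code. */
struct model_inode {
	atomic_int i_count;     /* "live" references */
	atomic_int i_obj_count; /* references on the object itself */
	bool evicted;           /* set when the last live reference is dropped */
	bool freed;             /* set when the last object reference is dropped */
};

/* A new inode starts with one live reference, which also pins the object. */
static void model_inode_init(struct model_inode *inode)
{
	atomic_init(&inode->i_count, 1);
	atomic_init(&inode->i_obj_count, 1);
	inode->evicted = false;
	inode->freed = false;
}

/* Take an object reference; the caller must already hold some reference. */
static void model_iobj_get(struct model_inode *inode)
{
	int old = atomic_fetch_add(&inode->i_obj_count, 1);

	assert(old > 0); /* taking a reference on a freed object is a bug */
	(void)old;       /* keeps NDEBUG builds warning-free */
}

/* Drop an object reference; the inode is freed when the count hits zero. */
static void model_iobj_put(struct model_inode *inode)
{
	if (atomic_fetch_sub(&inode->i_obj_count, 1) == 1)
		inode->freed = true;
}

/* Take a live reference only if the inode is still alive (i_count != 0). */
static bool model_igrab(struct model_inode *inode)
{
	int old = atomic_load(&inode->i_count);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&inode->i_count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a live reference; the final put evicts and releases the object ref. */
static void model_iput(struct model_inode *inode)
{
	if (atomic_fetch_sub(&inode->i_count, 1) == 1) {
		inode->evicted = true;
		model_iobj_put(inode);
	}
}
```

In this model, an internal list that holds only an object reference (via model_iobj_get()) keeps the memory valid after the last model_iput(), but model_igrab() fails from that point on, which is the distinction between "allocated" and "alive" that the text draws.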
* Re: [PATCH v2 43/54] fs: change inode_is_dirtytime_only to use refcount
2025-08-26 15:39 ` [PATCH v2 43/54] fs: change inode_is_dirtytime_only to use refcount Josef Bacik
@ 2025-08-26 22:06 ` Mateusz Guzik
2025-08-28 12:38 ` Christian Brauner
0 siblings, 1 reply; 105+ messages in thread
From: Mateusz Guzik @ 2025-08-26 22:06 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
On Tue, Aug 26, 2025 at 11:39:43AM -0400, Josef Bacik wrote:
> We don't need the I_WILL_FREE|I_FREEING check, we can use the refcount
> to see if the inode is valid.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> include/linux/fs.h | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b13d057ad0d7..531a6d0afa75 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2628,6 +2628,11 @@ static inline void mark_inode_dirty_sync(struct inode *inode)
> __mark_inode_dirty(inode, I_DIRTY_SYNC);
> }
>
> +static inline int icount_read(const struct inode *inode)
> +{
> + return refcount_read(&inode->i_count);
> +}
> +
> /*
> * Returns true if the given inode itself only has dirty timestamps (its pages
> * may still be dirty) and isn't currently being allocated or freed.
> @@ -2639,8 +2644,8 @@ static inline void mark_inode_dirty_sync(struct inode *inode)
> */
> static inline bool inode_is_dirtytime_only(struct inode *inode)
> {
> - return (inode->i_state & (I_DIRTY_TIME | I_NEW |
> - I_FREEING | I_WILL_FREE)) == I_DIRTY_TIME;
> + return (inode->i_state & (I_DIRTY_TIME | I_NEW)) == I_DIRTY_TIME &&
> + icount_read(inode);
> }
>
> extern void inc_nlink(struct inode *inode);
> @@ -3432,11 +3437,6 @@ static inline void __iget(struct inode *inode)
> refcount_inc(&inode->i_count);
> }
>
> -static inline int icount_read(const struct inode *inode)
> -{
> - return refcount_read(&inode->i_count);
> -}
> -
> extern void iget_failed(struct inode *);
> extern void clear_inode(struct inode *);
> extern void __destroy_inode(struct inode *);
> --
> 2.49.0
>
nit: I would change the diff introducing icount_read() to already place
it in the right spot. As is, this is going to mess with blame for no good
reason.
* Re: [PATCH v2 02/54] fs: add an icount_read helper
2025-08-26 15:39 ` [PATCH v2 02/54] fs: add an icount_read helper Josef Bacik
@ 2025-08-26 22:18 ` Mateusz Guzik
2025-08-27 11:25 ` (subset) " Christian Brauner
1 sibling, 0 replies; 105+ messages in thread
From: Mateusz Guzik @ 2025-08-26 22:18 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
On Tue, Aug 26, 2025 at 11:39:02AM -0400, Josef Bacik wrote:
> Instead of doing direct access to ->i_count, add a helper to handle
> this. This will make it easier to convert i_count to a refcount later.
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 079b868552c2..46bfc543f946 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -66,7 +66,7 @@ static void fsnotify_unmount_inodes(struct super_block *sb)
> * removed all zero refcount inodes, in any case. Test to
> * be sure.
> */
> - if (!atomic_read(&inode->i_count)) {
> + if (!icount_read(inode)) {
> spin_unlock(&inode->i_lock);
> continue;
> }
[snip]
> +static inline int icount_read(const struct inode *inode)
> +{
> + return atomic_read(&inode->i_count);
> +}
> +
> extern void iget_failed(struct inode *);
> extern void clear_inode(struct inode *);
> extern void __destroy_inode(struct inode *);
The placement issue I mentioned in another e-mail aside, I would
recommend further error-proofing this.
Above I quoted an example user which treats i_count == 0 as special.
While moving this into helpers is definitely a step in the right
direction, I think having consumers open-code this check is avoidably
error-prone.
Notably, as is, there is nothing to indicate whether the consumer expects
the value to remain stable or is perhaps doing a quick check for other
reasons.
As such, specific naming aside, I would create 2 variants:
1. icount_read_unstable() -- the value can change from under you
arbitrarily. I don't think there are any consumers for this sucker atm.
2. icount_read() -- the caller expects that the 0<->1 transition is
guaranteed not to take place; notably, if the value is found to be 0, it
stays at 0. To that end the caller is expected to hold the inode spinlock,
*and* the fact that the lock is held is asserted with lockdep.
All that aside, I think open-coding "is the inode unused" with an
explicit count check is bad form -- a dedicated helper for that would
also be nice.
My 3 CZK.
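The split proposed above could look roughly like the following. This is a userspace sketch under stated assumptions, not a patch: every name carries a model_ prefix to mark it as hypothetical, a boolean stands in for inode->i_lock together with its lockdep state, and assert() plays the role that lockdep_assert_held() would play in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct inode; not kernel code. */
struct model_inode {
	atomic_int i_count;
	bool i_lock_held; /* stands in for inode->i_lock plus lockdep state */
};

static void model_inode_init(struct model_inode *inode, int count)
{
	atomic_init(&inode->i_count, count);
	inode->i_lock_held = false;
}

static void model_inode_lock(struct model_inode *inode)
{
	assert(!inode->i_lock_held);
	inode->i_lock_held = true;
}

static void model_inode_unlock(struct model_inode *inode)
{
	assert(inode->i_lock_held);
	inode->i_lock_held = false;
}

/* Variant 1: a snapshot that may be stale as soon as it is returned. */
static int model_icount_read_unstable(struct model_inode *inode)
{
	return atomic_load(&inode->i_count);
}

/*
 * Variant 2: the caller holds the inode lock, so the 0<->1 transition is
 * assumed not to happen while the result is being acted on.
 */
static int model_icount_read(struct model_inode *inode)
{
	assert(inode->i_lock_held); /* models lockdep_assert_held() */
	return atomic_load(&inode->i_count);
}

/* Dedicated "is the inode unused" helper instead of an open-coded check. */
static bool model_inode_is_unused(struct model_inode *inode)
{
	return model_icount_read(inode) == 0;
}
```

A caller such as the fsnotify loop quoted above would then take the lock and ask model_inode_is_unused() rather than comparing the count against zero itself, which makes the stability expectation visible at the call site.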
* [syzbot ci] Re: fs: rework inode reference counting
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (53 preceding siblings ...)
2025-08-26 15:39 ` [PATCH v2 54/54] fs: add documentation explaining the reference count rules for inodes Josef Bacik
@ 2025-08-27 8:03 ` syzbot ci
2025-08-27 11:14 ` (subset) [PATCH v2 00/54] " Christian Brauner
` (2 subsequent siblings)
57 siblings, 0 replies; 105+ messages in thread
From: syzbot ci @ 2025-08-27 8:03 UTC (permalink / raw)
To: amir73il, brauner, josef, kernel-team, linux-btrfs, linux-ext4,
linux-fsdevel, linux-xfs, viro
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v2] fs: rework inode reference counting
https://lore.kernel.org/all/cover.1756222464.git.josef@toxicpanda.com
* [PATCH v2 01/54] fs: make the i_state flags an enum
* [PATCH v2 02/54] fs: add an icount_read helper
* [PATCH v2 03/54] fs: rework iput logic
* [PATCH v2 04/54] fs: add an i_obj_count refcount to the inode
* [PATCH v2 05/54] fs: hold an i_obj_count reference in wait_sb_inodes
* [PATCH v2 06/54] fs: hold an i_obj_count reference for the i_wb_list
* [PATCH v2 07/54] fs: hold an i_obj_count reference for the i_io_list
* [PATCH v2 08/54] fs: hold an i_obj_count reference in writeback_sb_inodes
* [PATCH v2 09/54] fs: hold an i_obj_count reference while on the hashtable
* [PATCH v2 10/54] fs: hold an i_obj_count reference while on the LRU list
* [PATCH v2 11/54] fs: hold an i_obj_count reference while on the sb inode list
* [PATCH v2 12/54] fs: stop accessing ->i_count directly in f2fs and gfs2
* [PATCH v2 13/54] fs: hold an i_obj_count when we have an i_count reference
* [PATCH v2 14/54] fs: add an I_LRU flag to the inode
* [PATCH v2 15/54] fs: maintain a list of pinned inodes
* [PATCH v2 16/54] fs: delete the inode from the LRU list on lookup
* [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir
* [PATCH v2 18/54] fs: change evict_inodes to use iput instead of evict directly
* [PATCH v2 19/54] fs: hold a full ref while the inode is on a LRU
* [PATCH v2 20/54] fs: disallow 0 reference count inodes
* [PATCH v2 21/54] fs: make evict_inodes add to the dispose list under the i_lock
* [PATCH v2 22/54] fs: convert i_count to refcount_t
* [PATCH v2 23/54] fs: use refcount_inc_not_zero in igrab
* [PATCH v2 24/54] fs: use inode_tryget in find_inode*
* [PATCH v2 25/54] fs: update find_inode_*rcu to check the i_count count
* [PATCH v2 26/54] fs: use igrab in insert_inode_locked
* [PATCH v2 27/54] fs: remove I_WILL_FREE|I_FREEING check from __inode_add_lru
* [PATCH v2 28/54] fs: remove I_WILL_FREE|I_FREEING check in inode_pin_lru_isolating
* [PATCH v2 29/54] fs: use inode_tryget in evict_inodes
* [PATCH v2 30/54] fs: change evict_dentries_for_decrypted_inodes to use refcount
* [PATCH v2 31/54] block: use igrab in sync_bdevs
* [PATCH v2 32/54] bcachefs: use the refcount instead of I_WILL_FREE|I_FREEING
* [PATCH v2 33/54] btrfs: don't check I_WILL_FREE|I_FREEING
* [PATCH v2 34/54] fs: use igrab in drop_pagecache_sb
* [PATCH v2 35/54] fs: stop checking I_FREEING in d_find_alias_rcu
* [PATCH v2 36/54] ext4: stop checking I_WILL_FREE|IFREEING in ext4_check_map_extents_env
* [PATCH v2 37/54] fs: remove I_WILL_FREE|I_FREEING from fs-writeback.c
* [PATCH v2 38/54] gfs2: remove I_WILL_FREE|I_FREEING usage
* [PATCH v2 39/54] fs: remove I_WILL_FREE|I_FREEING check from dquot.c
* [PATCH v2 40/54] notify: remove I_WILL_FREE|I_FREEING checks in fsnotify_unmount_inodes
* [PATCH v2 41/54] xfs: remove I_FREEING check
* [PATCH v2 42/54] landlock: remove I_FREEING|I_WILL_FREE check
* [PATCH v2 43/54] fs: change inode_is_dirtytime_only to use refcount
* [PATCH v2 44/54] btrfs: remove references to I_FREEING
* [PATCH v2 45/54] ext4: remove reference to I_FREEING in inode.c
* [PATCH v2 46/54] ext4: remove reference to I_FREEING in orphan.c
* [PATCH v2 47/54] pnfs: use i_count refcount to determine if the inode is going away
* [PATCH v2 48/54] fs: remove some spurious I_FREEING references in inode.c
* [PATCH v2 49/54] xfs: remove reference to I_FREEING|I_WILL_FREE
* [PATCH v2 50/54] ocfs2: do not set I_WILL_FREE
* [PATCH v2 51/54] fs: remove I_FREEING|I_WILL_FREE
* [PATCH v2 52/54] fs: remove I_REFERENCED
* [PATCH v2 53/54] fs: remove I_LRU_ISOLATING flag
* [PATCH v2 54/54] fs: add documentation explaining the reference count rules for inodes
and found the following issue:
kernel build error
Full report is available here:
https://ci.syzbot.org/series/ccd4eafa-7a13-48d6-93b6-f40c03262bea
***
kernel build error
tree: torvalds
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base: fab1beda7597fac1cecc01707d55eadb6bbe773c
arch: amd64
compiler: Debian clang version 20.1.7 (++20250616065708+6146a88f6049-1~exp1~20250616065826.132), Debian LLD 20.1.7
config: https://ci.syzbot.org/builds/162c03ae-2d30-4085-ab1e-a2dd1c8403eb/config
fs/bcachefs/fs.c:350:20: error: incompatible pointer types passing 'struct bch_inode_info *' to parameter of type 'const struct inode *' [-Werror,-Wincompatible-pointer-types]
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
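The class of build error reported above can be reproduced in miniature. The sketch below is an illustration only: the struct definitions are userspace stand-ins, the member name v and the field ei_flags are assumptions, and bch_icount_read() is a hypothetical wrapper. The actual call at fs/bcachefs/fs.c:350 is not shown in the report and is not reproduced here. The point is simply that a helper taking a const struct inode * needs the embedded inode, not the filesystem's wrapper structure.

```c
#include <assert.h>

/* Userspace stand-ins for kernel types; not the real definitions. */
struct inode {
	int i_count;
};

struct bch_inode_info {
	struct inode v;         /* embedded VFS inode; member name is an assumption */
	unsigned long ei_flags; /* hypothetical filesystem-private state */
};

/* Same shape as the helper from patch 02/54, reduced to a plain int read. */
static int icount_read(const struct inode *inode)
{
	return inode->i_count;
}

/*
 * Passing the wrapper pointer itself to icount_read() is what triggers
 * -Wincompatible-pointer-types; passing the embedded inode type-checks.
 */
static int bch_icount_read(const struct bch_inode_info *ei)
{
	return icount_read(&ei->v);
}
```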
* Re: (subset) [PATCH v2 00/54] fs: rework inode reference counting
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (54 preceding siblings ...)
2025-08-27 8:03 ` [syzbot ci] Re: fs: rework inode reference counting syzbot ci
@ 2025-08-27 11:14 ` Christian Brauner
2025-08-28 12:51 ` Christian Brauner
2025-09-02 10:06 ` Mateusz Guzik
57 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-27 11:14 UTC (permalink / raw)
To: Josef Bacik
Cc: Christian Brauner, linux-fsdevel, linux-btrfs, kernel-team,
linux-ext4, linux-xfs, viro, amir73il
On Tue, 26 Aug 2025 11:39:00 -0400, Josef Bacik wrote:
> v1: https://lore.kernel.org/linux-fsdevel/cover.1755806649.git.josef@toxicpanda.com/
>
> v1->v2:
> - Fixed all the things that Christian pointed out.
> - Re-ordered some of the patches to the front in case Christian wants to take
> those first.
> - Added a new patch for reading the current i_count and propagated that
> everywhere.
> - Fixed the cifs build breakage.
> - Removed I_REFERENCED since it's no longer needed.
> - Remove I_LRU_ISOLATING since it's no longer needed.
> - Reworked the drop_nlink/clear_nlink part to simply remove the inode from the
> LRU in the unlink path, and made this its own patch to make the behavior
> change clear.
> - NOTE: I'm re-running fstests on this now, there was a slight issue with
> removing the drop_nlink/clear_nlink patch and so I had to add the unlink/rmdir
> patch to resolve it. I assume everything will be fine but just an FYI.
> - NOTE #2: I reordered stuff, and I did a rebase and rebuild at every step, but
> I noticed this morning I still missed an odd rebase artifact, so by all means
> validate I didn't make any silly mistakes on the in-between patches.
>
> [...]
Applied to the vfs-6.18.inode.refcount.preliminaries branch of the vfs/vfs.git tree.
Patches in the vfs-6.18.inode.refcount.preliminaries branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.18.inode.refcount.preliminaries
[01/54] fs: make the i_state flags an enum
https://git.kernel.org/vfs/vfs/c/02ec4868cf6f
[03/54] fs: rework iput logic
https://git.kernel.org/vfs/vfs/c/b92e5104e8a9
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: (subset) [PATCH v2 02/54] fs: add an icount_read helper
2025-08-26 15:39 ` [PATCH v2 02/54] fs: add an icount_read helper Josef Bacik
2025-08-26 22:18 ` Mateusz Guzik
@ 2025-08-27 11:25 ` Christian Brauner
1 sibling, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-27 11:25 UTC (permalink / raw)
To: Josef Bacik
Cc: Christian Brauner, linux-fsdevel, linux-btrfs, kernel-team,
linux-ext4, linux-xfs, viro, amir73il
On Tue, 26 Aug 2025 11:39:02 -0400, Josef Bacik wrote:
> Instead of doing direct access to ->i_count, add a helper to handle
> this. This will make it easier to convert i_count to a refcount later.
>
>
I moved icount_read() right after mark_inode_dirty_sync() as suggested somewhere in the thread.
---
Applied to the vfs-6.18.inode.refcount.preliminaries branch of the vfs/vfs.git tree.
Patches in the vfs-6.18.inode.refcount.preliminaries branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.18.inode.refcount.preliminaries
[02/54] fs: add an icount_read helper
https://git.kernel.org/vfs/vfs/c/2927b1487093
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir
2025-08-26 15:39 ` [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir Josef Bacik
@ 2025-08-27 12:32 ` Christian Brauner
2025-08-27 16:08 ` Josef Bacik
2025-08-27 22:01 ` Dave Chinner
2025-08-28 9:00 ` Christian Brauner
2025-08-28 9:06 ` Christian Brauner
2 siblings, 2 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-27 12:32 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Tue, Aug 26, 2025 at 11:39:17AM -0400, Josef Bacik wrote:
> We can end up with an inode on the LRU list or the cached list, then at
> some point in the future go to unlink that inode and then still have an
> elevated i_count reference for that inode because it is on one of these
> lists.
>
> The more common case is the cached list. We open a file, write to it,
> truncate some of it which triggers the inode_add_lru code in the
> pagecache, adding it to the cached LRU. Then we unlink this inode, and
> it exists until writeback or reclaim kicks in and removes the inode.
>
> To handle this case, delete the inode from the LRU list when it is
> unlinked, so we have the best case scenario for immediately freeing the
> inode.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
I'm not too fond of this particular change. I think it's really misplaced
and the correct place is indeed drop_nlink() and clear_nlink().
I'm pretty sure that the number of callers that hold i_lock around
drop_nlink() and clear_nlink() is relatively small. So it might just be
preferable to add drop_nlink_locked() and clear_nlink_locked() and just
switch the few places over to them. I think you have tooling to give you a
preliminary glimpse of what and how many callers do this...
> fs/namei.c | 30 +++++++++++++++++++++++++-----
> 1 file changed, 25 insertions(+), 5 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 138a693c2346..e56dcb5747e4 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -4438,6 +4438,7 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
> int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> struct dentry *dentry)
> {
> + struct inode *inode = dentry->d_inode;
> int error = may_delete(idmap, dir, dentry, 1);
>
> if (error)
> @@ -4447,11 +4448,11 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> return -EPERM;
>
> dget(dentry);
> - inode_lock(dentry->d_inode);
> + inode_lock(inode);
>
> error = -EBUSY;
> if (is_local_mountpoint(dentry) ||
> - (dentry->d_inode->i_flags & S_KERNEL_FILE))
> + (inode->i_flags & S_KERNEL_FILE))
> goto out;
>
> error = security_inode_rmdir(dir, dentry);
> @@ -4463,12 +4464,21 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> goto out;
>
> shrink_dcache_parent(dentry);
> - dentry->d_inode->i_flags |= S_DEAD;
> + inode->i_flags |= S_DEAD;
> dont_mount(dentry);
> detach_mounts(dentry);
>
> out:
> - inode_unlock(dentry->d_inode);
> + /*
> + * The inode may be on the LRU list, so delete it from the LRU at this
> + * point in order to make sure that the inode is freed as soon as
> + * possible.
> + */
> + spin_lock(&inode->i_lock);
> + inode_lru_list_del(inode);
> + spin_unlock(&inode->i_lock);
> +
> + inode_unlock(inode);
> dput(dentry);
> if (!error)
> d_delete_notify(dir, dentry);
> @@ -4653,8 +4663,18 @@ int do_unlinkat(int dfd, struct filename *name)
> dput(dentry);
> }
> inode_unlock(path.dentry->d_inode);
> - if (inode)
> + if (inode) {
> + /*
> + * The LRU may be holding a reference, remove the inode from the
> + * LRU here before dropping our hopefully final reference on the
> + * inode.
> + */
> + spin_lock(&inode->i_lock);
> + inode_lru_list_del(inode);
> + spin_unlock(&inode->i_lock);
> +
> iput(inode); /* truncate the inode here */
> + }
> inode = NULL;
> if (delegated_inode) {
> error = break_deleg_wait(&delegated_inode);
> --
> 2.49.0
>
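The drop_nlink_locked()/clear_nlink_locked() idea raised at the top of this
reply can be sketched in userspace C. This is only an illustration under
stated assumptions: struct nlink_model, the toy spinlock and the on_lru flag
are hypothetical stand-ins, not kernel code, and the real helpers may well
end up looking different. The point is the split: the plain variant takes
the lock itself, while the _locked variant is for callers that already hold
it, and the LRU removal happens in one place when the link count hits zero.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct inode; only what the sketch needs. */
struct nlink_model {
	atomic_flag i_lock;	/* toy spinlock standing in for ->i_lock */
	unsigned int i_nlink;
	bool on_lru;		/* true while the inode sits on an LRU list */
};

static void nlink_model_init(struct nlink_model *inode, unsigned int nlink,
			     bool on_lru)
{
	atomic_flag_clear(&inode->i_lock);
	inode->i_nlink = nlink;
	inode->on_lru = on_lru;
}

static void nlink_model_lock(struct nlink_model *inode)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set(&inode->i_lock))
		;
}

static void nlink_model_unlock(struct nlink_model *inode)
{
	atomic_flag_clear(&inode->i_lock);
}

/* For callers that already hold the lock. */
static void model_drop_nlink_locked(struct nlink_model *inode)
{
	if (inode->i_nlink == 0)
		return;
	inode->i_nlink--;
	if (inode->i_nlink == 0)
		inode->on_lru = false;	/* stands in for inode_lru_list_del() */
}

/* For everyone else: take the lock, then reuse the locked variant. */
static void model_drop_nlink(struct nlink_model *inode)
{
	nlink_model_lock(inode);
	model_drop_nlink_locked(inode);
	nlink_model_unlock(inode);
}
```

With this shape only the callers that already hold i_lock need to be
switched over; everything else keeps calling the unsuffixed helper.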
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 03/54] fs: rework iput logic
2025-08-26 15:39 ` [PATCH v2 03/54] fs: rework iput logic Josef Bacik
@ 2025-08-27 12:58 ` Mateusz Guzik
2025-08-27 14:18 ` Mateusz Guzik
0 siblings, 1 reply; 105+ messages in thread
From: Mateusz Guzik @ 2025-08-27 12:58 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
On Tue, Aug 26, 2025 at 11:39:03AM -0400, Josef Bacik wrote:
> Currently, if we are the last iput, and we have the I_DIRTY_TIME bit
> set, we will grab a reference on the inode again and then mark it dirty
> and then redo the put. This is to make sure we delay the time update
> for as long as possible.
>
> We can rework this logic to simply dec i_count if it is not 1, and if it
> is do the time update while still holding the i_count reference.
>
> Then we can replace the atomic_dec_and_lock with locking the ->i_lock
> and doing atomic_dec_and_test, since we did the atomic_add_unless above.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 23 ++++++++++++++---------
> 1 file changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index a3673e1ed157..13e80b434323 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1911,16 +1911,21 @@ void iput(struct inode *inode)
> if (!inode)
> return;
> BUG_ON(inode->i_state & I_CLEAR);
> -retry:
> - if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
> - if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> - atomic_inc(&inode->i_count);
> - spin_unlock(&inode->i_lock);
> - trace_writeback_lazytime_iput(inode);
> - mark_inode_dirty_sync(inode);
> - goto retry;
> - }
> +
> + if (atomic_add_unless(&inode->i_count, -1, 1))
> + return;
> +
> + if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> + trace_writeback_lazytime_iput(inode);
> + mark_inode_dirty_sync(inode);
> + }
> +
> + spin_lock(&inode->i_lock);
> + if (atomic_dec_and_test(&inode->i_count)) {
> + /* iput_final() drops i_lock */
> iput_final(inode);
> + } else {
> + spin_unlock(&inode->i_lock);
> }
> }
> EXPORT_SYMBOL(iput);
> --
> 2.49.0
>
This changes semantics though.
In the stock kernel the I_DIRTY_TIME business is guaranteed to be sorted
out before the call to iput_final().
In principle the flag may reappear after mark_inode_dirty_sync() returns
and before the retried atomic_dec_and_lock succeeds, in which case it
will get cleared again.
With your change the flag is only handled once and should it reappear
before you take the ->i_lock, it will stay there.
I agree the stock handling is pretty crap though.
Your change should test the flag again after taking the spin lock but
before messing with the refcount and if need be unlock + retry.
It would not hurt to assert in iput_final() that the spin lock is held and
that this flag is not set.
Here is my diff to your diff to illustrate + a cosmetic change, not even
compile-tested:
diff --git a/fs/inode.c b/fs/inode.c
index 421e248b690f..a9ae0c790b5d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1911,7 +1911,7 @@ void iput(struct inode *inode)
if (!inode)
return;
BUG_ON(inode->i_state & I_CLEAR);
-
+retry:
if (atomic_add_unless(&inode->i_count, -1, 1))
return;
@@ -1921,12 +1921,19 @@ void iput(struct inode *inode)
}
spin_lock(&inode->i_lock);
+
+ if (atomic_read(&inode->i_count) == 1 && inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
+ spin_unlock(&inode->i_lock);
+ goto retry;
+ }
+
- if (atomic_dec_and_test(&inode->i_count)) {
+ if (!atomic_dec_and_test(&inode->i_count)) {
- /* iput_final() drops i_lock */
- iput_final(inode);
- } else {
spin_unlock(&inode->i_lock);
+ return;
}
+
+ /* iput_final() drops i_lock */
+ iput_final(inode);
}
EXPORT_SYMBOL(iput);
^ permalink raw reply related [flat|nested] 105+ messages in thread
* Re: [PATCH v2 03/54] fs: rework iput logic
2025-08-27 12:58 ` Mateusz Guzik
@ 2025-08-27 14:18 ` Mateusz Guzik
2025-08-27 14:54 ` Josef Bacik
2025-08-27 14:57 ` Christian Brauner
0 siblings, 2 replies; 105+ messages in thread
From: Mateusz Guzik @ 2025-08-27 14:18 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
On Wed, Aug 27, 2025 at 02:58:51PM +0200, Mateusz Guzik wrote:
> [...]
Sorry for spam, but the more I look at this the more fucky the entire
ordeal appears to me.
Before I get to the crux, as a side note I did a quick check if atomics
for i_count make any sense to begin with and I think they do, here is a
sample output from a friend tracing the ref value on iput:
bpftrace -e 'kprobe:iput /arg0 != 0/ { @[((struct inode *)arg0)->i_count.counter] = count(); }'
@[5]: 66
@[4]: 4625
@[3]: 11086
@[2]: 30937
@[1]: 151785
... so plenty of non-last refs after all.
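To put a number on "plenty": the probe fires on entry to iput(), so the
value it records is the count before the decrement, and the @[1] bucket is
the one where the caller is (barring races) dropping the last reference.
The small C sketch below just redoes the arithmetic on the buckets pasted
above; the struct and function names are made up for the illustration.

```c
#include <stddef.h>

/* One histogram bucket: i_count seen on entry to iput() and its hit count. */
struct iput_bucket {
	int refs;
	unsigned long hits;
};

/* Values copied from the bpftrace output above. */
static const struct iput_bucket iput_hist[] = {
	{ 5, 66 }, { 4, 4625 }, { 3, 11086 }, { 2, 30937 }, { 1, 151785 },
};

/* Sum the hits of every bucket whose reference count is at least min_refs. */
static unsigned long iput_hits_at_least(int min_refs)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < sizeof(iput_hist) / sizeof(iput_hist[0]); i++)
		if (iput_hist[i].refs >= min_refs)
			sum += iput_hist[i].hits;
	return sum;
}
```

iput_hits_at_least(1) gives 198499 calls in total and iput_hits_at_least(2)
gives 46714, so roughly 23.5% of the iput() calls in this sample were not
the final put.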
I completely agree the mandatory ref trip to handle I_DIRTY_TIME is lame
and needs to be addressed.
But I'm uneasy about maintaining the invariant that iput_final() does
not see the flag if i_nlink != 0 and my proposal as pasted is dodgy af
on this front.
While here some nits:
1. it makes sense to try mere atomics just in case someone else messed
with the count between handling of the dirty flag and taking the spin lock
2. according to my quick test with bpftrace the I_DIRTY_TIME flag is
seen way less frequently than i_nlink != 0, so it makes sense to swap
the order in which they are checked. Interested parties can try it out
with:
bpftrace -e 'kprobe:iput /arg0 != 0/ { @[((struct inode *)arg0)->i_nlink != 0, ((struct inode *)arg0)->i_state & (1 << 11)] = count(); }'
3. touch up the iput_final() unlock comment
All that said, how about something like the thing below as the final
routine building off of your change. I can't submit a proper patch and
can't even compile-test. I don't need any credit should this get
grabbed.
void iput(struct inode *inode)
{
if (!inode)
return;
BUG_ON(inode->i_state & I_CLEAR);
retry:
if (atomic_add_unless(&inode->i_count, -1, 1))
return;
if ((inode->i_state & I_DIRTY_TIME) && inode->i_nlink) {
trace_writeback_lazytime_iput(inode);
mark_inode_dirty_sync(inode);
goto retry;
}
spin_lock(&inode->i_lock);
if ((inode->i_state & I_DIRTY_TIME) && inode->i_nlink) {
spin_unlock(&inode->i_lock);
goto retry;
}
if (!atomic_dec_and_test(&inode->i_count)) {
spin_unlock(&inode->i_lock);
return;
}
/*
* iput_final() drops ->i_lock, we can't assert on it as the inode may
* be deallocated by the time it returns
*/
iput_final(inode);
}
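For anyone who wants to poke at the control flow outside the kernel, here is
a single-threaded userspace model of the routine above, using C11 atomics.
It is a sketch under stated assumptions, not the kernel implementation:
struct iput_model and its fields are hypothetical, model_add_unless() only
mimics the semantics of atomic_add_unless(), the lazytime writeback is
reduced to clearing a bool, and ->i_lock is left out because nothing runs
concurrently here. What it does show is that the retry is only reached when
the count was 1 and nothing was decremented, so going around again does not
consume a reference.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct inode; only what the model needs. */
struct iput_model {
	atomic_int i_count;
	unsigned int i_nlink;
	bool dirty_time;	/* models I_DIRTY_TIME in i_state */
	int lazytime_flushes;	/* times the mark_inode_dirty_sync() stand-in ran */
	bool finalized;		/* set by the iput_final() stand-in */
};

/* Add a to *v unless *v == u; returns true if the add happened. */
static bool model_add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	do {
		if (old == u)
			return false;
	} while (!atomic_compare_exchange_weak(v, &old, old + a));
	return true;
}

static void model_iput(struct iput_model *inode)
{
retry:
	/* Fast path: not the last reference, just drop it. */
	if (model_add_unless(&inode->i_count, -1, 1))
		return;

	/* Slow path: the count was 1 and has not been touched yet. */
	if (inode->dirty_time && inode->i_nlink) {
		inode->dirty_time = false;	/* mark_inode_dirty_sync() stand-in */
		inode->lazytime_flushes++;
		goto retry;
	}

	/* The real code takes ->i_lock here and rechecks the flag. */
	if (atomic_fetch_sub(&inode->i_count, 1) != 1)
		return;
	inode->finalized = true;	/* iput_final() stand-in */
}
```

Starting from a count of 2 with the flag set, the first model_iput() returns
through the fast path without looking at the flag, and the second one
flushes once, retries, and only then drops the last reference.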
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 03/54] fs: rework iput logic
2025-08-27 14:18 ` Mateusz Guzik
@ 2025-08-27 14:54 ` Josef Bacik
2025-08-27 14:57 ` Christian Brauner
1 sibling, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-27 14:54 UTC (permalink / raw)
To: Mateusz Guzik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
On Wed, Aug 27, 2025 at 04:18:55PM +0200, Mateusz Guzik wrote:
> [...]
Thanks for this Mateusz, you're right I completely changed the logic by not
doing this under the i_lock. This update looks reasonable to me, thank you for
the analysis and review!
Josef
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 03/54] fs: rework iput logic
2025-08-27 14:18 ` Mateusz Guzik
2025-08-27 14:54 ` Josef Bacik
@ 2025-08-27 14:57 ` Christian Brauner
2025-08-27 16:24 ` [PATCH] fs: revamp iput() Mateusz Guzik
1 sibling, 1 reply; 105+ messages in thread
From: Christian Brauner @ 2025-08-27 14:57 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Josef Bacik, linux-fsdevel, linux-btrfs, kernel-team, linux-ext4,
linux-xfs, viro, amir73il
On Wed, Aug 27, 2025 at 04:18:55PM +0200, Mateusz Guzik wrote:
> On Wed, Aug 27, 2025 at 02:58:51PM +0200, Mateusz Guzik wrote:
> > [...]
> >
> > This changes semantics though.
> >
> > In the stock kernel the I_DIRTY_TIME business is guaranteed to be sorted
> > out before the call to iput_final().
> >
> > In principle the flag may reappear after mark_inode_dirty_sync() returns
> > and before the retried atomic_dec_and_lock succeeds, in which case it
> > will get cleared again.
> >
> > With your change the flag is only handled once and should it reappear
> > before you take the ->i_lock, it will stay there.
Yeah, good spotting.
> [...]
> While here some nits:
> 1. it makes sense to try mere atomics just in case someone else messed
> with the count between handling of the dirty flag and taking the spin lock
Which on mainline is a thing for sure.
> 2. according to my quick test with bpftrace the I_DIRTY_TIME flag is
> seen way less frequently than i_nlink != 0, so it makes sense to swap
> the order in which they are checked. Interested parties can try it out
> with:
> bpftrace -e 'kprobe:iput /arg0 != 0/ { @[((struct inode *)arg0)->i_nlink != 0, ((struct inode *)arg0)->i_state & (1 << 11)] = count(); }'
> 3. touch up the iput_final() unlock comment
>
> All that said, how about something like the thing below as the final
> routine building off of your change. I can't submit a proper patch and
> can't even compile-test. I don't need any credit should this get
> grabbed.
>
> void iput(struct inode *inode)
> {
> if (!inode)
> return;
> BUG_ON(inode->i_state & I_CLEAR);
> retry:
> if (atomic_add_unless(&inode->i_count, -1, 1))
> return;
>
> if ((inode->i_state & I_DIRTY_TIME) && inode->i_nlink) {
> trace_writeback_lazytime_iput(inode);
> mark_inode_dirty_sync(inode);
> goto retry;
> }
>
> spin_lock(&inode->i_lock);
> if ((inode->i_state & I_DIRTY_TIME) && inode->i_nlink) {
> spin_unlock(&inode->i_lock);
> goto retry;
> }
>
> if (!atomic_dec_and_test(&inode->i_count)) {
> spin_unlock(&inode->i_lock);
> return;
> }
>
> /*
> * iput_final() drops ->i_lock, we can't assert on it as the inode may
> * be deallocated by the time it returns
> */
> iput_final(inode);
> }
I've taken this. Though I had Josef convince me that the retry is sane
and doesn't end up stealing a ref. Thanks.
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 15/54] fs: maintain a list of pinned inodes
2025-08-26 15:39 ` [PATCH v2 15/54] fs: maintain a list of pinned inodes Josef Bacik
@ 2025-08-27 15:20 ` Christian Brauner
2025-08-27 16:07 ` Josef Bacik
0 siblings, 1 reply; 105+ messages in thread
From: Christian Brauner @ 2025-08-27 15:20 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Tue, Aug 26, 2025 at 11:39:15AM -0400, Josef Bacik wrote:
> Currently we have relied on dirty inodes and inodes with cache on them
> to simply be left hanging around on the system outside of an LRU. The
> only way to make sure these inodes are eventually reclaimed is because
> dirty writeback will grab a reference on the inode and then iput it when
> it's done, potentially getting it on the LRU. For the cached case the
> page cache deletion path will call inode_add_lru when the inode no
> longer has cached pages in order to make sure the inode object can be
> freed eventually. In the unmount case we walk all inodes and free them
> so this all works out fine.
>
> But we want to eliminate 0 i_count objects as a concept, so we need a
> mechanism to hold a reference on these pinned inodes. To that end, add a
> list to the super block that contains any inodes that are cached for one
> reason or another.
>
> When we call inode_add_lru(), if the inode falls into one of these
> categories, we will add it to the cached inode list and hold an
> i_obj_count reference. If the inode does not fall into one of these
> categories it will be moved to the normal LRU, which already holds an
> i_obj_count reference.
>
> In the dirty case we will delete it from the LRU if it is on one, and then
> the iput after the writeout will make sure it's placed onto the correct
> list at that point.
>
> The page cache case will migrate it when it calls inode_add_lru() when
> deleting pages from the page cache.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
Ok, I'm trying to wrap my head around the justification for this new
list. Currently we have inodes with a zero reference count that aren't
on any LRU. They just appear on sb->i_sb_list and are, e.g., dealt with
during umount (sync_filesystem() followed by evict_inodes()).
So they're either dealt with by writeback or by the page cache and are
eventually put on the regular LRU, or the filesystem shuts down before
that happens.
They're easy to handle and recognize because their inode->i_count is
zero.
Now you make the LRUs hold a full reference so an inode can be grabbed
from the LRU again, avoiding the zombie resurrection from zero. So to
recognize inodes that are pinned internally due to being dirty or having
pagecache pages attached to them, you need to track them in a new list.
Otherwise you can't really differentiate them or tell when to move them
onto the LRU after writeback and the pagecache are done with them.
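The list selection described here can be written down as a small userspace
C sketch. It loosely mirrors the inode_needs_cached() check in the quoted
patch below, but everything in it is an assumption made for illustration:
struct pinned_model and its two bools are hypothetical and collapse the
I_DIRTY_ALL | I_SYNC and mapping_shrinkable() tests into plain flags.

```c
#include <stdbool.h>

/* Which list an inode goes on when inode_add_lru() is called. */
enum model_list {
	MODEL_LIST_LRU,		/* regular LRU, subject to periodic reclaim */
	MODEL_LIST_CACHED,	/* pinned list, shielded from periodic reclaim */
};

/* Hypothetical stand-in for the inode state the decision looks at. */
struct pinned_model {
	bool dirty;		/* models I_DIRTY_ALL | I_SYNC */
	bool pagecache_pinned;	/* models !mapping_shrinkable() */
};

/* An inode is pinned if it is dirty or still has page cache worth keeping. */
static bool model_needs_cached(const struct pinned_model *inode)
{
	return inode->dirty || inode->pagecache_pinned;
}

static enum model_list model_pick_list(const struct pinned_model *inode)
{
	return model_needs_cached(inode) ? MODEL_LIST_CACHED : MODEL_LIST_LRU;
}
```

Both lists hold a reference in this scheme; the difference is only which
list the inode sits on, which is what makes the pinned ones recognizable
once a zero i_count no longer exists.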
> fs/fs-writeback.c | 8 +++
> fs/inode.c | 102 +++++++++++++++++++++++++++++--
> fs/internal.h | 1 +
> fs/super.c | 3 +
> include/linux/fs.h | 13 +++-
> include/trace/events/writeback.h | 3 +-
> 6 files changed, 122 insertions(+), 8 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index b83d556d7ffe..dbcb317e7113 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -2736,6 +2736,14 @@ static void wait_sb_inodes(struct super_block *sb)
> continue;
> }
> __iget(inode);
> +
> + /*
> + * We could have potentially ended up on the cached LRU list, so
> + * remove ourselves from this list now that we have a reference,
> + * the iput will handle placing it back on the appropriate LRU
> + * list if necessary.
> + */
> + inode_lru_list_del(inode);
> spin_unlock(&inode->i_lock);
> rcu_read_unlock();
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 15ff3a0ff7ee..4d39f260901f 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -319,6 +319,23 @@ void free_inode_nonrcu(struct inode *inode)
> }
> EXPORT_SYMBOL(free_inode_nonrcu);
>
> +/*
> + * Some inodes need to stay pinned in memory because they are dirty or there are
> + * cached pages that the VM wants to keep around to avoid thrashing. This does
> + * the appropriate checks to see if we want to shield this inode from periodic
> + * reclaim. Must be called with ->i_lock held.
> + */
> +static bool inode_needs_cached(struct inode *inode)
> +{
> + lockdep_assert_held(&inode->i_lock);
> +
> + if (inode->i_state & (I_DIRTY_ALL | I_SYNC))
> + return true;
> + if (!mapping_shrinkable(&inode->i_data))
> + return true;
> + return false;
> +}
> +
> static void i_callback(struct rcu_head *head)
> {
> struct inode *inode = container_of(head, struct inode, i_rcu);
> @@ -532,20 +549,67 @@ void ihold(struct inode *inode)
> }
> EXPORT_SYMBOL(ihold);
>
> +static void inode_add_cached_lru(struct inode *inode)
> +{
> + lockdep_assert_held(&inode->i_lock);
> +
> + if (inode->i_state & I_CACHED_LRU)
> + return;
> + if (!list_empty(&inode->i_lru))
> + return;
> +
> + inode->i_state |= I_CACHED_LRU;
> + iobj_get(inode);
> + spin_lock(&inode->i_sb->s_cached_inodes_lock);
> + list_add(&inode->i_lru, &inode->i_sb->s_cached_inodes);
> + spin_unlock(&inode->i_sb->s_cached_inodes_lock);
> +}
> +
> +static bool __inode_del_cached_lru(struct inode *inode)
> +{
> + lockdep_assert_held(&inode->i_lock);
> +
> + if (!(inode->i_state & I_CACHED_LRU))
> + return false;
> +
> + inode->i_state &= ~I_CACHED_LRU;
> + spin_lock(&inode->i_sb->s_cached_inodes_lock);
> + list_del_init(&inode->i_lru);
> + spin_unlock(&inode->i_sb->s_cached_inodes_lock);
> + return true;
> +}
> +
> +static bool inode_del_cached_lru(struct inode *inode)
> +{
> + if (__inode_del_cached_lru(inode)) {
> + iobj_put(inode);
> + return true;
> + }
> + return false;
> +}
> +
> static void __inode_add_lru(struct inode *inode, bool rotate)
> {
> - if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE))
> + bool need_ref = true;
> +
> + lockdep_assert_held(&inode->i_lock);
> +
> + if (inode->i_state & (I_FREEING | I_WILL_FREE))
> return;
> if (icount_read(inode))
> return;
> if (!(inode->i_sb->s_flags & SB_ACTIVE))
> return;
> - if (!mapping_shrinkable(&inode->i_data))
> + if (inode_needs_cached(inode)) {
> + inode_add_cached_lru(inode);
> return;
> + }
>
> + need_ref = __inode_del_cached_lru(inode) == false;
> if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
> - iobj_get(inode);
> inode->i_state |= I_LRU;
> + if (need_ref)
> + iobj_get(inode);
> this_cpu_inc(nr_unused);
> } else if (rotate) {
> inode->i_state |= I_REFERENCED;
> @@ -573,8 +637,19 @@ void inode_add_lru(struct inode *inode)
> __inode_add_lru(inode, false);
> }
>
> -static void inode_lru_list_del(struct inode *inode)
> +/*
> + * Caller must be holding its own i_count reference on this inode in order to
> + * prevent this being the final iput.
> + *
> + * Needs inode->i_lock held.
> + */
> +void inode_lru_list_del(struct inode *inode)
> {
> + lockdep_assert_held(&inode->i_lock);
> +
> + if (inode_del_cached_lru(inode))
> + return;
> +
> if (!(inode->i_state & I_LRU))
> return;
>
> @@ -950,6 +1025,22 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> if (!spin_trylock(&inode->i_lock))
> return LRU_SKIP;
>
> + /*
> + * This inode is either dirty or has page cache we want to keep around,
> + * so move it to the cached list.
> + *
> + * We drop the extra i_obj_count reference we grab when adding it to the
> + * cached lru.
> + */
> + if (inode_needs_cached(inode)) {
> + list_lru_isolate(lru, &inode->i_lru);
> + inode_add_cached_lru(inode);
> + iobj_put(inode);
> + spin_unlock(&inode->i_lock);
> + this_cpu_dec(nr_unused);
> + return LRU_REMOVED;
> + }
> +
> /*
> * Inodes can get referenced, redirtied, or repopulated while
> * they're already on the LRU, and this can make them
> @@ -957,8 +1048,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> * sync, or the last page cache deletion will requeue them.
> */
> if (icount_read(inode) ||
> - (inode->i_state & ~I_REFERENCED) ||
> - !mapping_shrinkable(&inode->i_data)) {
> + (inode->i_state & ~I_REFERENCED)) {
> list_lru_isolate(lru, &inode->i_lru);
> inode->i_state &= ~I_LRU;
> spin_unlock(&inode->i_lock);
> diff --git a/fs/internal.h b/fs/internal.h
> index 38e8aab27bbd..17ecee7056d5 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -207,6 +207,7 @@ extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
> int dentry_needs_remove_privs(struct mnt_idmap *, struct dentry *dentry);
> bool in_group_or_capable(struct mnt_idmap *idmap,
> const struct inode *inode, vfsgid_t vfsgid);
> +void inode_lru_list_del(struct inode *inode);
>
> /*
> * fs-writeback.c
> diff --git a/fs/super.c b/fs/super.c
> index a038848e8d1f..bf3e6d9055af 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -364,6 +364,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> spin_lock_init(&s->s_inode_list_lock);
> INIT_LIST_HEAD(&s->s_inodes_wb);
> spin_lock_init(&s->s_inode_wblist_lock);
> + INIT_LIST_HEAD(&s->s_cached_inodes);
> + spin_lock_init(&s->s_cached_inodes_lock);
>
> s->s_count = 1;
> atomic_set(&s->s_active, 1);
> @@ -409,6 +411,7 @@ static void __put_super(struct super_block *s)
> WARN_ON(s->s_dentry_lru.node);
> WARN_ON(s->s_inode_lru.node);
> WARN_ON(!list_empty(&s->s_mounts));
> + WARN_ON(!list_empty(&s->s_cached_inodes));
> call_rcu(&s->rcu, destroy_super_rcu);
> }
> }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e12c09b9fcaf..999ffea2aac1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -749,6 +749,9 @@ is_uncached_acl(struct posix_acl *acl)
> * ->i_lru is on the LRU and those that are using ->i_lru
> * for some other means.
> *
> + * I_CACHED_LRU Inode is cached because it is dirty or isn't shrinkable,
> + * and thus is on the s_cached_inode_lru list.
> + *
> * Q: What is the difference between I_WILL_FREE and I_FREEING?
> *
> * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
> @@ -780,7 +783,8 @@ enum inode_state_flags_t {
> I_DONTCACHE = (1U << 15),
> I_SYNC_QUEUED = (1U << 16),
> I_PINNING_NETFS_WB = (1U << 17),
> - I_LRU = (1U << 18)
> + I_LRU = (1U << 18),
> + I_CACHED_LRU = (1U << 19)
> };
>
> #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
> @@ -1579,6 +1583,13 @@ struct super_block {
>
> spinlock_t s_inode_wblist_lock;
> struct list_head s_inodes_wb; /* writeback inodes */
> +
> + /*
> + * Cached inodes: inodes whose reference is held by another
> + * mechanism, such as dirty inodes or unshrinkable inodes.
> + */
> + spinlock_t s_cached_inodes_lock;
> + struct list_head s_cached_inodes;
> } __randomize_layout;
>
> static inline struct user_namespace *i_user_ns(const struct inode *inode)
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index 486f85aca84d..6949329c744a 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -29,7 +29,8 @@
> {I_SYNC_QUEUED, "I_SYNC_QUEUED"}, \
> {I_PINNING_NETFS_WB, "I_PINNING_NETFS_WB"}, \
> {I_LRU_ISOLATING, "I_LRU_ISOLATING"}, \
> - {I_LRU, "I_LRU"} \
> + {I_LRU, "I_LRU"}, \
> + {I_CACHED_LRU, "I_CACHED_LRU"} \
> )
>
> /* enums need to be exported to user space */
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 15/54] fs: maintain a list of pinned inodes
2025-08-27 15:20 ` Christian Brauner
@ 2025-08-27 16:07 ` Josef Bacik
2025-08-28 8:24 ` Christian Brauner
0 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-27 16:07 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Wed, Aug 27, 2025 at 05:20:17PM +0200, Christian Brauner wrote:
> On Tue, Aug 26, 2025 at 11:39:15AM -0400, Josef Bacik wrote:
> > Currently we have relied on dirty inodes and inodes with cache on them
> > to simply be left hanging around on the system outside of an LRU. The
> > only way to make sure these inodes are eventually reclaimed is because
> > dirty writeback will grab a reference on the inode and then iput it when
> > it's done, potentially getting it on the LRU. For the cached case the
> > page cache deletion path will call inode_add_lru when the inode no
> > longer has cached pages in order to make sure the inode object can be
> > freed eventually. In the unmount case we walk all inodes and free them
> > so this all works out fine.
> >
> > But we want to eliminate 0 i_count objects as a concept, so we need a
> > mechanism to hold a reference on these pinned inodes. To that end, add a
> > list to the super block that contains any inodes that are cached for one
> > reason or another.
> >
> > When we call inode_add_lru(), if the inode falls into one of these
> > categories, we will add it to the cached inode list and hold an
> > i_obj_count reference. If the inode does not fall into one of these
> > categories it will be moved to the normal LRU, which already holds an
> > i_obj_count reference.
> >
> > The dirty case we will delete it from the LRU if it is on one, and then
> > the iput after the writeout will make sure it's placed onto the correct
> > list at that point.
> >
> > The page cache case will migrate it when it calls inode_add_lru() when
> > deleting pages from the page cache.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
>
> Ok, I'm trying to wrap my head around the justification for this new
> list. Currently we have inodes with a zero reference count that aren't
> on any LRU. They just appear on sb->i_sb_list and are e.g., dealt with
> during umount (sync_filesystem() followed by evict_inodes()).
>
> So they're either dealt with by writeback or by the page cache and are
> eventually put on the regular LRU or the filesystem shuts down before
> that happens.
>
> They're easy to handle and recognize because their inode->i_count is
> zero.
>
> Now you make the LRUs hold a full reference so it can be grabbed from
> the LRU again avoiding the zombie resurrection from zero. So to
> recognize inodes that are pinned internally due to being dirty or having
> pagecache pages attached to it you need to track them in a new list
> otherwise you can't really differentiate them and when to move them onto
> the LRU after writeback and pagecache is done with them.
>
Exactly. We need to put them somewhere so we can account for their reference.

We could technically just use a flag and not have a list for this, and just use
the flag to indicate that the inode is pinned and the flag has a full reference
associated with it.

I did it this way because if I had a nickel for every time I needed to figure
out where a zombie inode was and had to do the most grotesque drgn magic to find
it, I'd have like 15 cents, which isn't a lot but weird that it's happened 3
times. Having a list makes it easier from a debugging perspective.

But again, we have ->s_inodes, and I can just scan that list and look for
I_CACHED_LRU. We'd still need to hold a full reference for that, but it would
eliminate the need for another list if that's preferable? Thanks,
Josef
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir
2025-08-27 12:32 ` Christian Brauner
@ 2025-08-27 16:08 ` Josef Bacik
2025-08-27 22:01 ` Dave Chinner
1 sibling, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-27 16:08 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Wed, Aug 27, 2025 at 02:32:49PM +0200, Christian Brauner wrote:
> On Tue, Aug 26, 2025 at 11:39:17AM -0400, Josef Bacik wrote:
> > We can end up with an inode on the LRU list or the cached list, then at
> > some point in the future go to unlink that inode and then still have an
> > elevated i_count reference for that inode because it is on one of these
> > lists.
> >
> > The more common case is the cached list. We open a file, write to it,
> > truncate some of it which triggers the inode_add_lru code in the
> > pagecache, adding it to the cached LRU. Then we unlink this inode, and
> > it exists until writeback or reclaim kicks in and removes the inode.
> >
> > To handle this case, delete the inode from the LRU list when it is
> > unlinked, so we have the best case scenario for immediately freeing the
> > inode.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
>
> I'm not too fond of this particular change I think it's really misplaced
> and the correct place is indeed drop_nlink() and clear_nlink().
>
> I'm pretty sure that the number of callers that hold i_lock around
> drop_nlink() and clear_nlink() is relatively small. So it might just be
> > preferable to add drop_nlink_locked() and clear_nlink_locked() and just
> switch the few places over to it. I think you have tooling to give you a
> preliminary glimpse what and how many callers do this...
Fair, I'll make the weird french guy figure it out. Thanks,
Josef
^ permalink raw reply [flat|nested] 105+ messages in thread
* [PATCH] fs: revamp iput()
2025-08-27 14:57 ` Christian Brauner
@ 2025-08-27 16:24 ` Mateusz Guzik
2025-08-30 15:54 ` Mateusz Guzik
0 siblings, 1 reply; 105+ messages in thread
From: Mateusz Guzik @ 2025-08-27 16:24 UTC (permalink / raw)
To: brauner
Cc: viro, jack, linux-kernel, linux-fsdevel, josef, kernel-team,
amir73il, linux-btrfs, linux-ext4, linux-xfs, Mateusz Guzik
The material change is I_DIRTY_TIME handling without a spurious ref
acquire/release cycle.

While here a bunch of smaller changes:
1. predict there is an inode -- bpftrace suggests one is passed the vast
   majority of the time
2. convert BUG_ON into VFS_BUG_ON_INODE
3. assert on ->i_count
4. assert ->i_lock is not held
5. flip the order of I_DIRTY_TIME and nlink count checks as the former
   is less likely to be true

I verified atomic_read(&inode->i_count) does not show up in asm if
debug is disabled.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
The routine kept annoying me, so here is a further revised variant.

I verified this compiles, but I still cannot runtime test. I'm sorry for
that. My Signed-off-by is conditional on a good samaritan making sure it
works :)

diff compared to the thing I sent "informally":
- if (unlikely(!inode))
- asserts
- slightly reworded iput_final commentary
- unlikely() on the second I_DIRTY_TIME check

Given the revamp I think it makes sense to attribute the change to me,
hence a "proper" mail.

The thing surviving from the submission by Josef is:
+	if (atomic_add_unless(&inode->i_count, -1, 1))
+		return;

And of course he is the one who brought up the spurious refcount trip in
the first place.

I'm happy with Reported-by, Co-developed-by or whatever other credit
as you guys see fit.

That aside I think it would be nice if NULL inodes passed to iput
became illegal, but that's a different story for another day.
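
For readers unfamiliar with the "decrement unless this is the last
reference" idiom that the fast path relies on, the following is a
userspace sketch using C11 atomics. The model_* names are made up for
illustration and the compare-exchange loop is only an approximation of
what the kernel's atomic_add_unless() does; it is not the kernel
implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace stand-in for atomic_add_unless(): add @a to @v unless @v
 * currently equals @u. Returns true if the add happened.
 */
static bool model_add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	do {
		if (old == u)
			return false;
		/* on failure the compare-exchange reloads @old for us */
	} while (!atomic_compare_exchange_weak(v, &old, old + a));

	return true;
}

/*
 * Hypothetical helper: drop a reference without taking any lock, unless
 * it looks like the last one. A false return means the caller has to
 * fall back to the locked slow path.
 */
static bool model_put_fast(atomic_int *count)
{
	/* same caveat as in the patch: this check is inherently racy */
	assert(atomic_load(count) >= 1);

	return model_add_unless(count, -1, 1);
}
```

With a count of 3, two calls succeed and leave the count at 1; the third
call returns false and leaves the count untouched, which is the point
where the real iput() would go on to take ->i_lock.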
fs/inode.c | 46 +++++++++++++++++++++++++++++++++++-----------
1 file changed, 35 insertions(+), 11 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 01ebdc40021e..01a554e11279 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1908,20 +1908,44 @@ static void iput_final(struct inode *inode)
*/
void iput(struct inode *inode)
{
- if (!inode)
+ if (unlikely(!inode))
return;
- BUG_ON(inode->i_state & I_CLEAR);
+
retry:
- if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
- if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
- atomic_inc(&inode->i_count);
- spin_unlock(&inode->i_lock);
- trace_writeback_lazytime_iput(inode);
- mark_inode_dirty_sync(inode);
- goto retry;
- }
- iput_final(inode);
+ lockdep_assert_not_held(&inode->i_lock);
+ VFS_BUG_ON_INODE(inode->i_state & I_CLEAR, inode);
+ /*
+ * Note this assert is technically racy as if the count is bogusly
+ * equal to one, then two CPUs racing to further drop it can both
+ * conclude it's fine.
+ */
+ VFS_BUG_ON_INODE(atomic_read(&inode->i_count) < 1, inode);
+
+ if (atomic_add_unless(&inode->i_count, -1, 1))
+ return;
+
+ if ((inode->i_state & I_DIRTY_TIME) && inode->i_nlink) {
+ trace_writeback_lazytime_iput(inode);
+ mark_inode_dirty_sync(inode);
+ goto retry;
}
+
+ spin_lock(&inode->i_lock);
+ if (unlikely((inode->i_state & I_DIRTY_TIME) && inode->i_nlink)) {
+ spin_unlock(&inode->i_lock);
+ goto retry;
+ }
+
+ if (!atomic_dec_and_test(&inode->i_count)) {
+ spin_unlock(&inode->i_lock);
+ return;
+ }
+
+ /*
+ * iput_final() drops ->i_lock, we can't assert on it as the inode may
+ * be deallocated by the time the call returns.
+ */
+ iput_final(inode);
}
EXPORT_SYMBOL(iput);
--
2.43.0
^ permalink raw reply related [flat|nested] 105+ messages in thread
* Re: [PATCH v2 16/54] fs: delete the inode from the LRU list on lookup
2025-08-26 15:39 ` [PATCH v2 16/54] fs: delete the inode from the LRU list on lookup Josef Bacik
@ 2025-08-27 21:46 ` Dave Chinner
2025-08-28 11:42 ` Josef Bacik
0 siblings, 1 reply; 105+ messages in thread
From: Dave Chinner @ 2025-08-27 21:46 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
On Tue, Aug 26, 2025 at 11:39:16AM -0400, Josef Bacik wrote:
> When we move to holding a full reference on the inode when it is on an
> LRU list we need to have a mechanism to re-run the LRU add logic. The
> use case for this is btrfs's snapshot delete, we will look up all the
> inodes and try to drop them, but if they're on the LRU we will not call
> ->drop_inode() because their refcount will be elevated, so we won't know
> that we need to drop the inode.
>
> Fix this by simply removing the inode from its respective LRU list when
> we grab a reference to it in a way that we have active users. This will
> ensure that the logic to add the inode to the LRU or drop the inode will
> be run on the final iput from the user.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Have you benchmarked this for scalability?

The whole point of lazy LRU removal was to remove LRU lock
contention from the hot lookup path. I suspect that putting the LRU
locks back inside the lookup path is going to cause performance
regressions...

FWIW, why do we even need the inode LRU anymore?

We certainly don't need it anymore to keep the working set in memory
because that's what the dentry cache LRU does (i.e. by pinning a
reference to the inode whilst the dentry is active).

And with the introduction of the cached inode list, we don't need
the inode LRU to keep unreferenced dirty inodes around whilst
they hang out on writeback lists. The inodes on the writeback lists
are now referenced and tracked on the cached inode list, so they
don't need special hooks in the mm/ code to handle the special
transition from "unreferenced writeback" to "unreferenced LRU"
anymore, they can just be dropped from the cached inode list....

So rather than jumping through hoops to maintain an LRU we likely
don't actually need and is likely to re-introduce old scalability
issues, why not remove it completely?
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir
2025-08-27 12:32 ` Christian Brauner
2025-08-27 16:08 ` Josef Bacik
@ 2025-08-27 22:01 ` Dave Chinner
2025-08-28 11:46 ` Josef Bacik
1 sibling, 1 reply; 105+ messages in thread
From: Dave Chinner @ 2025-08-27 22:01 UTC (permalink / raw)
To: Christian Brauner
Cc: Josef Bacik, linux-fsdevel, linux-btrfs, kernel-team, linux-ext4,
linux-xfs, viro, amir73il
On Wed, Aug 27, 2025 at 02:32:49PM +0200, Christian Brauner wrote:
> On Tue, Aug 26, 2025 at 11:39:17AM -0400, Josef Bacik wrote:
> > We can end up with an inode on the LRU list or the cached list, then at
> > some point in the future go to unlink that inode and then still have an
> > elevated i_count reference for that inode because it is on one of these
> > lists.
> >
> > The more common case is the cached list. We open a file, write to it,
> > truncate some of it which triggers the inode_add_lru code in the
> > pagecache, adding it to the cached LRU. Then we unlink this inode, and
> > it exists until writeback or reclaim kicks in and removes the inode.
> >
> > To handle this case, delete the inode from the LRU list when it is
> > unlinked, so we have the best case scenario for immediately freeing the
> > inode.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
>
> I'm not too fond of this particular change I think it's really misplaced
> and the correct place is indeed drop_nlink() and clear_nlink().
I don't really like putting it in drop_nlink because that then puts
the inode LRU in the middle of filesystem transactions when lots of
different filesystem locks are held.

If the LRU operations are in the VFS, then we know exactly what
locks are held when they are performed (current behaviour). However,
when done from the filesystem transaction context running
drop_nlink, we'll have different sets of locks and/or execution
contexts held for each different fs type.
> I'm pretty sure that the number of callers that hold i_lock around
> drop_nlink() and clear_nlink() is relatively small.
I think the calling context problem is wider than the obvious issue
with i_lock....
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 53/54] fs: remove I_LRU_ISOLATING flag
2025-08-26 15:39 ` [PATCH v2 53/54] fs: remove I_LRU_ISOLATING flag Josef Bacik
@ 2025-08-28 0:26 ` Dave Chinner
2025-08-28 10:35 ` Christian Brauner
0 siblings, 1 reply; 105+ messages in thread
From: Dave Chinner @ 2025-08-28 0:26 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
On Tue, Aug 26, 2025 at 11:39:53AM -0400, Josef Bacik wrote:
> If the inode is on the LRU it has a full reference and thus no longer
> needs to be pinned while it is being isolated.
>
> Remove the I_LRU_ISOLATING flag and associated helper functions
> (inode_pin_lru_isolating, inode_unpin_lru_isolating, and
> inode_wait_for_lru_isolating) as they are no longer needed.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
....
> @@ -745,34 +742,32 @@ is_uncached_acl(struct posix_acl *acl)
> * I_CACHED_LRU Inode is cached because it is dirty or isn't shrinkable,
> * and thus is on the s_cached_inode_lru list.
> *
> - * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
> - * upon. There's one free address left.
> + * __I_{SYNC,NEW} are used to derive unique addresses to wait upon. There are
> + * two free addresses left.
> */
>
> enum inode_state_bits {
> __I_NEW = 0U,
> - __I_SYNC = 1U,
> - __I_LRU_ISOLATING = 2U
> + __I_SYNC = 1U
> };
>
> enum inode_state_flags_t {
> I_NEW = (1U << __I_NEW),
> I_SYNC = (1U << __I_SYNC),
> - I_LRU_ISOLATING = (1U << __I_LRU_ISOLATING),
> - I_DIRTY_SYNC = (1U << 3),
> - I_DIRTY_DATASYNC = (1U << 4),
> - I_DIRTY_PAGES = (1U << 5),
> - I_CLEAR = (1U << 6),
> - I_LINKABLE = (1U << 7),
> - I_DIRTY_TIME = (1U << 8),
> - I_WB_SWITCH = (1U << 9),
> - I_OVL_INUSE = (1U << 10),
> - I_CREATING = (1U << 11),
> - I_DONTCACHE = (1U << 12),
> - I_SYNC_QUEUED = (1U << 13),
> - I_PINNING_NETFS_WB = (1U << 14),
> - I_LRU = (1U << 15),
> - I_CACHED_LRU = (1U << 16)
> + I_DIRTY_SYNC = (1U << 2),
> + I_DIRTY_DATASYNC = (1U << 3),
> + I_DIRTY_PAGES = (1U << 4),
> + I_CLEAR = (1U << 5),
> + I_LINKABLE = (1U << 6),
> + I_DIRTY_TIME = (1U << 7),
> + I_WB_SWITCH = (1U << 8),
> + I_OVL_INUSE = (1U << 9),
> + I_CREATING = (1U << 10),
> + I_DONTCACHE = (1U << 11),
> + I_SYNC_QUEUED = (1U << 12),
> + I_PINNING_NETFS_WB = (1U << 13),
> + I_LRU = (1U << 14),
> + I_CACHED_LRU = (1U << 15)
> };
This is a bit of a mess - we should reserve the first 4 bits for the
waitable inode_state_bits right from the start and not renumber the
other flag bits into that range. i.e. start the first non-waitable
bit at bit 4. That way every time we add/remove a waitable bit, we
don't have to rewrite the entire set of flags. i.e: something like:

enum inode_state_flags_t {
	I_NEW			= (1U << __I_NEW),
	I_SYNC			= (1U << __I_SYNC),
	// waitable bit 2 unused
	// waitable bit 3 unused
	I_DIRTY_SYNC		= (1U << 4),
	....

This will be much more blame friendly if we do it this way from the
start of this patch set.
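
A compilable sketch of that layout, with compile-time checks that the
two ranges cannot collide, might look like the following. All M_* and
MODEL_* names are hypothetical stand-ins chosen for this example; only
the idea of reserving the low four bits comes from the suggestion above.

```c
#include <assert.h>

/* Hypothetical: number of low bits reserved for waitable flags. */
#define MODEL_WAITABLE_BITS	4U
#define MODEL_WAITABLE_MASK	((1U << MODEL_WAITABLE_BITS) - 1U)

enum model_state_bits {
	M_BIT_NEW	= 0U,
	M_BIT_SYNC	= 1U
	/* bits 2 and 3 are reserved for future waitable flags */
};

enum model_state_flags {
	M_NEW			= (1U << M_BIT_NEW),
	M_SYNC			= (1U << M_BIT_SYNC),
	/* non-waitable flags start at a fixed offset and never move */
	M_DIRTY_SYNC		= (1U << (MODEL_WAITABLE_BITS + 0)),
	M_DIRTY_DATASYNC	= (1U << (MODEL_WAITABLE_BITS + 1)),
	M_DIRTY_PAGES		= (1U << (MODEL_WAITABLE_BITS + 2))
};

/* Waitable flags must stay inside the reserved range... */
static_assert(((unsigned int)(M_NEW | M_SYNC) & ~MODEL_WAITABLE_MASK) == 0,
	      "waitable flags must fit in the reserved low bits");
/* ...and ordinary flags must stay out of it. */
static_assert(((unsigned int)(M_DIRTY_SYNC | M_DIRTY_DATASYNC | M_DIRTY_PAGES) &
	       MODEL_WAITABLE_MASK) == 0,
	      "non-waitable flags must not use the reserved low bits");
```

Adding or removing a waitable bit then only touches the first enum, and
the numbering of every other flag stays stable across the series.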
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 15/54] fs: maintain a list of pinned inodes
2025-08-27 16:07 ` Josef Bacik
@ 2025-08-28 8:24 ` Christian Brauner
0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 8:24 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Wed, Aug 27, 2025 at 12:07:56PM -0400, Josef Bacik wrote:
> On Wed, Aug 27, 2025 at 05:20:17PM +0200, Christian Brauner wrote:
> > On Tue, Aug 26, 2025 at 11:39:15AM -0400, Josef Bacik wrote:
> > > Currently we have relied on dirty inodes and inodes with cache on them
> > > to simply be left hanging around on the system outside of an LRU. The
> > > only way to make sure these inodes are eventually reclaimed is because
> > > dirty writeback will grab a reference on the inode and then iput it when
> > > it's done, potentially getting it on the LRU. For the cached case the
> > > page cache deletion path will call inode_add_lru when the inode no
> > > longer has cached pages in order to make sure the inode object can be
> > > freed eventually. In the unmount case we walk all inodes and free them
> > > so this all works out fine.
> > >
> > > But we want to eliminate 0 i_count objects as a concept, so we need a
> > > mechanism to hold a reference on these pinned inodes. To that end, add a
> > > list to the super block that contains any inodes that are cached for one
> > > reason or another.
> > >
> > > When we call inode_add_lru(), if the inode falls into one of these
> > > categories, we will add it to the cached inode list and hold an
> > > i_obj_count reference. If the inode does not fall into one of these
> > > categories it will be moved to the normal LRU, which already holds an
> > > i_obj_count reference.
> > >
> > > The dirty case we will delete it from the LRU if it is on one, and then
> > > the iput after the writeout will make sure it's placed onto the correct
> > > list at that point.
> > >
> > > The page cache case will migrate it when it calls inode_add_lru() when
> > > deleting pages from the page cache.
> > >
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > > ---
> >
> > Ok, I'm trying to wrap my head around the justification for this new
> > > list. Currently we have inodes with a zero reference count that aren't
> > on any LRU. They just appear on sb->i_sb_list and are e.g., dealt with
> > during umount (sync_filesystem() followed by evict_inodes()).
> >
> > So they're either dealt with by writeback or by the page cache and are
> > eventually put on the regular LRU or the filesystem shuts down before
> > that happens.
> >
> > They're easy to handle and recognize because their inode->i_count is
> > zero.
> >
> > Now you make the LRUs hold a full reference so it can be grabbed from
> > the LRU again avoiding the zombie resurrection from zero. So to
> > recognize inodes that are pinned internally due to being dirty or having
> > pagecache pages attached to it you need to track them in a new list
> > otherwise you can't really differentiate them and when to move them onto
> > the LRU after writeback and pagecache is done with them.
> >
>
> Exactly. We need to put them somewhere so we can account for their reference.
>
> We could technically just use a flag and not have a list for this, and just use
> the flag to indicate that the inode is pinned and the flag has a full reference
> associated with it.
>
> I did it this way because if I had a nickel for every time I needed to figure
> out where a zombie inode was and had to do the most grotesque drgn magic to find
> it, I'd have like 15 cents, which isn't a lot but weird that it's happened 3
> times. Having a list makes it easier from a debugging perspective.
>
> But again, we have ->s_inodes, and I can just scan that list and look for
> I_CACHED_LRU. We'd still need to hold a full reference for that, but it would
> eliminate the need for another list if that's preferable? Thanks,
I don't mind the additional list and the sb struct is not very size
sensitive anyway.
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir
2025-08-26 15:39 ` [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir Josef Bacik
2025-08-27 12:32 ` Christian Brauner
@ 2025-08-28 9:00 ` Christian Brauner
2025-08-28 9:06 ` Christian Brauner
2 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 9:00 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Tue, Aug 26, 2025 at 11:39:17AM -0400, Josef Bacik wrote:
> We can end up with an inode on the LRU list or the cached list, then at
> some point in the future go to unlink that inode and then still have an
> elevated i_count reference for that inode because it is on one of these
> lists.
>
> The more common case is the cached list. We open a file, write to it,
> truncate some of it which triggers the inode_add_lru code in the
> pagecache, adding it to the cached LRU. Then we unlink this inode, and
> it exists until writeback or reclaim kicks in and removes the inode.
>
> To handle this case, delete the inode from the LRU list when it is
> unlinked, so we have the best case scenario for immediately freeing the
> inode.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/namei.c | 30 +++++++++++++++++++++++++-----
> 1 file changed, 25 insertions(+), 5 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 138a693c2346..e56dcb5747e4 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -4438,6 +4438,7 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
> int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> struct dentry *dentry)
> {
> + struct inode *inode = dentry->d_inode;
> int error = may_delete(idmap, dir, dentry, 1);
>
> if (error)
> @@ -4447,11 +4448,11 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> return -EPERM;
>
> dget(dentry);
> - inode_lock(dentry->d_inode);
> + inode_lock(inode);
>
> error = -EBUSY;
> if (is_local_mountpoint(dentry) ||
> - (dentry->d_inode->i_flags & S_KERNEL_FILE))
> + (inode->i_flags & S_KERNEL_FILE))
> goto out;
>
> error = security_inode_rmdir(dir, dentry);
> @@ -4463,12 +4464,21 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> goto out;
>
> shrink_dcache_parent(dentry);
> - dentry->d_inode->i_flags |= S_DEAD;
> + inode->i_flags |= S_DEAD;
> dont_mount(dentry);
> detach_mounts(dentry);
>
> out:
> - inode_unlock(dentry->d_inode);
> + /*
> + * The inode may be on the LRU list, so delete it from the LRU at this
> + * point in order to make sure that the inode is freed as soon as
> + * possible.
> + */
> + spin_lock(&inode->i_lock);
> + inode_lru_list_del(inode);
> + spin_unlock(&inode->i_lock);
I think it should be possible to optimize this with an appropriate
helper doing:

	static inline bool inode_on_lru(const struct inode *inode)
	{
		return !!(READ_ONCE(inode->i_state) & (I_LRU | I_CACHED_LRU));
	}

then

	if (inode_on_lru(inode)) {
		spin_lock(&inode->i_lock);
		inode_lru_list_del(inode);
		spin_unlock(&inode->i_lock);
	}

so you don't needlessly acquire i_lock.
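For illustration only, here is a userspace sketch of that check-then-lock pattern. It is not kernel code: the demo_* names, the flag values and the lock_count counter (which stands in for taking i_lock) are all made up for the example. The point it tries to show is that the lockless peek is only a hint, so the removal itself still rechecks the state once the lock is held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this demo; the kernel defines the real ones. */
#define DEMO_I_LRU        (1U << 0)
#define DEMO_I_CACHED_LRU (1U << 1)

struct demo_inode {
	unsigned int i_state;
	unsigned int lock_count;	/* how often the stand-in "i_lock" was taken */
};

/* Lockless peek at the state word, standing in for READ_ONCE(). */
static bool demo_inode_on_lru(const struct demo_inode *inode)
{
	unsigned int state = *(const volatile unsigned int *)&inode->i_state;

	return (state & (DEMO_I_LRU | DEMO_I_CACHED_LRU)) != 0;
}

/* Caller "holds" the lock; recheck because the lockless peek may be stale. */
static void demo_lru_list_del(struct demo_inode *inode)
{
	if (!(inode->i_state & (DEMO_I_LRU | DEMO_I_CACHED_LRU)))
		return;
	inode->i_state &= ~(DEMO_I_LRU | DEMO_I_CACHED_LRU);
}

static void demo_unlink_lru_del(struct demo_inode *inode)
{
	if (!demo_inode_on_lru(inode))
		return;			/* fast path: no lock taken */

	inode->lock_count++;		/* spin_lock(&inode->i_lock) in the kernel */
	demo_lru_list_del(inode);
	assert(!demo_inode_on_lru(inode));
					/* spin_unlock(&inode->i_lock) */
}
```

An inode that was never on either list goes through demo_unlink_lru_del() without touching the lock counter, which is the saving being asked for.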
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir
2025-08-26 15:39 ` [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir Josef Bacik
2025-08-27 12:32 ` Christian Brauner
2025-08-28 9:00 ` Christian Brauner
@ 2025-08-28 9:06 ` Christian Brauner
2025-08-28 10:43 ` Christian Brauner
2 siblings, 1 reply; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 9:06 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Tue, Aug 26, 2025 at 11:39:17AM -0400, Josef Bacik wrote:
> We can end up with an inode on the LRU list or the cached list, then at
> some point in the future go to unlink that inode and then still have an
> elevated i_count reference for that inode because it is on one of these
> lists.
>
> The more common case is the cached list. We open a file, write to it,
> truncate some of it which triggers the inode_add_lru code in the
> pagecache, adding it to the cached LRU. Then we unlink this inode, and
> it exists until writeback or reclaim kicks in and removes the inode.
>
> To handle this case, delete the inode from the LRU list when it is
> unlinked, so we have the best case scenario for immediately freeing the
> inode.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/namei.c | 30 +++++++++++++++++++++++++-----
> 1 file changed, 25 insertions(+), 5 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 138a693c2346..e56dcb5747e4 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -4438,6 +4438,7 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
> int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> struct dentry *dentry)
> {
> + struct inode *inode = dentry->d_inode;
> int error = may_delete(idmap, dir, dentry, 1);
>
> if (error)
> @@ -4447,11 +4448,11 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> return -EPERM;
>
> dget(dentry);
> - inode_lock(dentry->d_inode);
> + inode_lock(inode);
>
> error = -EBUSY;
> if (is_local_mountpoint(dentry) ||
> - (dentry->d_inode->i_flags & S_KERNEL_FILE))
> + (inode->i_flags & S_KERNEL_FILE))
> goto out;
>
> error = security_inode_rmdir(dir, dentry);
> @@ -4463,12 +4464,21 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> goto out;
>
> shrink_dcache_parent(dentry);
> - dentry->d_inode->i_flags |= S_DEAD;
> + inode->i_flags |= S_DEAD;
> dont_mount(dentry);
> detach_mounts(dentry);
>
> out:
> - inode_unlock(dentry->d_inode);
> + /*
> + * The inode may be on the LRU list, so delete it from the LRU at this
> + * point in order to make sure that the inode is freed as soon as
> + * possible.
> + */
> + spin_lock(&inode->i_lock);
> + inode_lru_list_del(inode);
> + spin_unlock(&inode->i_lock);
> +
> + inode_unlock(inode);
> dput(dentry);
> if (!error)
> d_delete_notify(dir, dentry);
> @@ -4653,8 +4663,18 @@ int do_unlinkat(int dfd, struct filename *name)
> dput(dentry);
Why are you doing that in do_unlinkat() instead of vfs_unlink() (as
you're doing it in vfs_rmdir() and not do_rmdir())?
Doing it in do_unlinkat() means any stacking filesystem such as
overlayfs will end up skipping the LRU list removal as they use
vfs_unlink() directly.
And does btrfs subvolume/snapshot deletion need special treatment for
this as well, since it's semantically equivalent to an rmdir?
> }
> inode_unlock(path.dentry->d_inode);
> - if (inode)
> + if (inode) {
> + /*
> + * The LRU may be holding a reference, remove the inode from the
> + * LRU here before dropping our hopefully final reference on the
> + * inode.
> + */
> + spin_lock(&inode->i_lock);
> + inode_lru_list_del(inode);
> + spin_unlock(&inode->i_lock);
> +
> iput(inode); /* truncate the inode here */
> + }
> inode = NULL;
> if (delegated_inode) {
> error = break_deleg_wait(&delegated_inode);
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 18/54] fs: change evict_inodes to use iput instead of evict directly
2025-08-26 15:39 ` [PATCH v2 18/54] fs: change evict_inodes to use iput instead of evict directly Josef Bacik
@ 2025-08-28 10:18 ` Christian Brauner
0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 10:18 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Tue, Aug 26, 2025 at 11:39:18AM -0400, Josef Bacik wrote:
> At evict_inodes() time, we no longer have SB_ACTIVE set, so we can
> easily go through the normal iput path to clear any inodes. Update
> dispose_list() to check how we need to free the inode, and then grab a
> full reference to the inode while we're looping through the remaining
> inodes, and simply iput them at the end.
>
> Since we're just calling iput we don't really care about the i_count on
> the inode at the current time. Remove the i_count checks and just call
> iput on every inode we find.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 26 +++++++++++---------------
> 1 file changed, 11 insertions(+), 15 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 399598e90693..ede9118bb649 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -933,7 +933,7 @@ static void evict(struct inode *inode)
> * Dispose-list gets a local list with local inodes in it, so it doesn't
> * need to worry about list corruption and SMP locks.
> */
> -static void dispose_list(struct list_head *head)
> +static void dispose_list(struct list_head *head, bool for_lru)
> {
> while (!list_empty(head)) {
> struct inode *inode;
> @@ -941,8 +941,12 @@ static void dispose_list(struct list_head *head)
> inode = list_first_entry(head, struct inode, i_lru);
> list_del_init(&inode->i_lru);
>
> - evict(inode);
> - iobj_put(inode);
> + if (for_lru) {
> + evict(inode);
> + iobj_put(inode);
> + } else {
> + iput(inode);
> + }
I would really like to see a transitional comment here, or at least some
more details in the commit message. Something like:

    Once the inode has gone through iput_final() and has ended up on
    the LRU, its inode->i_count will have gone to zero but the LRU
    still holds an inode->i_obj_count reference. So we need to evict
    and put that i_obj_count reference when disposing of the collected
    inodes.

    When the inodes have been collected via evict_inodes(), e.g., on
    umount(), they will now be made to hold a full reference that we
    can simply iput().

    Calling iput() in this case is safe as we no longer have
    SB_ACTIVE set and thus the inodes cannot be re-added to the LRU.

I think that's easier to follow and better describes the change.

But there's a wrinkle. It is not guaranteed that SB_ACTIVE has been
cleared by the time evict_inodes() runs. Afaict, both
bch2_fs_bdev_mark_dead() and fs_bdev_mark_dead() will be called with
SB_ACTIVE set from the block layer on device removal. So in that case
iput() would add the inodes back to the LRU.
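To make the concern concrete, here is a toy userspace model of it. It is not the kernel implementation: the demo_* types and the single count field are simplifications invented for the example, and i_obj_count is left out entirely. It only mimics the order of operations in this patch, where evict_inodes() takes a full reference, removes the inode from the LRU, and then relies on iput() to evict it.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_sb {
	bool active;		/* stands in for SB_ACTIVE */
};

struct demo_inode {
	struct demo_sb *sb;
	int count;		/* stands in for i_count */
	bool on_lru;
	bool evicted;
};

/* Final put: cache the inode if the sb is still active, else evict it. */
static void demo_iput(struct demo_inode *inode)
{
	assert(inode->count > 0);
	if (--inode->count > 0)
		return;
	if (inode->sb->active) {
		inode->on_lru = true;	/* back onto the LRU, not evicted */
		return;
	}
	inode->evicted = true;
}

/* One loop iteration of the reworked evict_inodes() plus dispose_list(). */
static void demo_evict_one(struct demo_inode *inode)
{
	inode->count++;		/* __iget() */
	inode->on_lru = false;	/* inode_lru_list_del() */
	demo_iput(inode);	/* dispose_list(..., false) */
}
```

With active cleared the inode ends up evicted; with it still set, the same call sequence leaves the inode sitting on the LRU again, which is the case described above.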
> cond_resched();
> }
> }
> @@ -964,21 +968,13 @@ void evict_inodes(struct super_block *sb)
> again:
> spin_lock(&sb->s_inode_list_lock);
> list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> - if (icount_read(inode))
> - continue;
> -
> spin_lock(&inode->i_lock);
> - if (icount_read(inode)) {
> - spin_unlock(&inode->i_lock);
> - continue;
> - }
> if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
> spin_unlock(&inode->i_lock);
> continue;
> }
>
> - inode->i_state |= I_FREEING;
> - iobj_get(inode);
> + __iget(inode);
> inode_lru_list_del(inode);
> spin_unlock(&inode->i_lock);
> list_add(&inode->i_lru, &dispose);
> @@ -991,13 +987,13 @@ void evict_inodes(struct super_block *sb)
> if (need_resched()) {
> spin_unlock(&sb->s_inode_list_lock);
> cond_resched();
> - dispose_list(&dispose);
> + dispose_list(&dispose, false);
> goto again;
> }
> }
> spin_unlock(&sb->s_inode_list_lock);
>
> - dispose_list(&dispose);
> + dispose_list(&dispose, false);
> }
> EXPORT_SYMBOL_GPL(evict_inodes);
>
> @@ -1108,7 +1104,7 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
>
> freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
> inode_lru_isolate, &freeable);
> - dispose_list(&freeable);
> + dispose_list(&freeable, true);
> return freed;
> }
>
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 53/54] fs: remove I_LRU_ISOLATING flag
2025-08-28 0:26 ` Dave Chinner
@ 2025-08-28 10:35 ` Christian Brauner
0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 10:35 UTC (permalink / raw)
To: Dave Chinner
Cc: Josef Bacik, linux-fsdevel, linux-btrfs, kernel-team, linux-ext4,
linux-xfs, viro, amir73il
On Thu, Aug 28, 2025 at 10:26:50AM +1000, Dave Chinner wrote:
> On Tue, Aug 26, 2025 at 11:39:53AM -0400, Josef Bacik wrote:
> > If the inode is on the LRU it has a full reference and thus no longer
> > needs to be pinned while it is being isolated.
> >
> > Remove the I_LRU_ISOLATING flag and associated helper functions
> > (inode_pin_lru_isolating, inode_unpin_lru_isolating, and
> > inode_wait_for_lru_isolating) as they are no longer needed.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>
> ....
> > @@ -745,34 +742,32 @@ is_uncached_acl(struct posix_acl *acl)
> > * I_CACHED_LRU Inode is cached because it is dirty or isn't shrinkable,
> > * and thus is on the s_cached_inode_lru list.
> > *
> > - * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
> > - * upon. There's one free address left.
> > + * __I_{SYNC,NEW} are used to derive unique addresses to wait upon. There are
> > + * two free address left.
> > */
> >
> > enum inode_state_bits {
> > __I_NEW = 0U,
> > - __I_SYNC = 1U,
> > - __I_LRU_ISOLATING = 2U
> > + __I_SYNC = 1U
> > };
> >
> > enum inode_state_flags_t {
> > I_NEW = (1U << __I_NEW),
> > I_SYNC = (1U << __I_SYNC),
> > - I_LRU_ISOLATING = (1U << __I_LRU_ISOLATING),
> > - I_DIRTY_SYNC = (1U << 3),
> > - I_DIRTY_DATASYNC = (1U << 4),
> > - I_DIRTY_PAGES = (1U << 5),
> > - I_CLEAR = (1U << 6),
> > - I_LINKABLE = (1U << 7),
> > - I_DIRTY_TIME = (1U << 8),
> > - I_WB_SWITCH = (1U << 9),
> > - I_OVL_INUSE = (1U << 10),
> > - I_CREATING = (1U << 11),
> > - I_DONTCACHE = (1U << 12),
> > - I_SYNC_QUEUED = (1U << 13),
> > - I_PINNING_NETFS_WB = (1U << 14),
> > - I_LRU = (1U << 15),
> > - I_CACHED_LRU = (1U << 16)
> > + I_DIRTY_SYNC = (1U << 2),
> > + I_DIRTY_DATASYNC = (1U << 3),
> > + I_DIRTY_PAGES = (1U << 4),
> > + I_CLEAR = (1U << 5),
> > + I_LINKABLE = (1U << 6),
> > + I_DIRTY_TIME = (1U << 7),
> > + I_WB_SWITCH = (1U << 8),
> > + I_OVL_INUSE = (1U << 9),
> > + I_CREATING = (1U << 10),
> > + I_DONTCACHE = (1U << 11),
> > + I_SYNC_QUEUED = (1U << 12),
> > + I_PINNING_NETFS_WB = (1U << 13),
> > + I_LRU = (1U << 14),
> > + I_CACHED_LRU = (1U << 15)
> > };
>
> This is a bit of a mess - we should reserve the first 4 bits for the
> waitable inode_state_bits right from the start and not renumber the
> other flag bits into that range. i.e. start the first non-waitable
> bit at bit 4. That way every time we add/remove a waitable bit, we
> don't have to rewrite the entire set of flags. i.e: something like:
>
> enum inode_state_flags_t {
> I_NEW = (1U << __I_NEW),
> I_SYNC = (1U << __I_SYNC),
> // waitable bit 2 unused
> // waitable bit 3 unused
> I_DIRTY_SYNC = (1U << 4),
> ....
>
> This will be much more blame friendly if we do it this way from the
> start of this patch set.
Thanks. I had this a bit differently locally, but I'll just change it to a
comment.
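As a rough standalone sketch of the layout Dave describes (userspace C, not the actual fs.h definitions): the DEMO_* names, the DEMO_WAITABLE_BITS constant and the compile-time checks are assumptions added for illustration. The idea is that bits 0-3 are reserved for waitable state and everything else starts at bit 4, so adding or removing a waitable bit never renumbers the other flags.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_WAITABLE_BITS 4U	/* hypothetical name: bits 0-3 are reserved */
#define DEMO_WAITABLE_MASK ((1U << DEMO_WAITABLE_BITS) - 1U)

enum demo_state_bits {
	DEMO_BIT_NEW  = 0U,
	DEMO_BIT_SYNC = 1U,
	/* waitable bit 2 unused */
	/* waitable bit 3 unused */
};

enum demo_state_flags {
	DEMO_I_NEW            = (1U << DEMO_BIT_NEW),
	DEMO_I_SYNC           = (1U << DEMO_BIT_SYNC),
	/* first non-waitable flag starts at bit 4 */
	DEMO_I_DIRTY_SYNC     = (1U << 4),
	DEMO_I_DIRTY_DATASYNC = (1U << 5),
	DEMO_I_DIRTY_PAGES    = (1U << 6),
};

/* Catch accidental overlap between the two ranges at build time. */
_Static_assert((DEMO_I_SYNC & ~DEMO_WAITABLE_MASK) == 0,
	       "waitable flag escaped the reserved range");
_Static_assert((DEMO_I_DIRTY_SYNC & DEMO_WAITABLE_MASK) == 0,
	       "regular flag landed in the reserved range");

static bool demo_flag_is_waitable(unsigned int flag)
{
	assert(flag != 0 && (flag & (flag - 1)) == 0);	/* exactly one bit set */
	return (flag & DEMO_WAITABLE_MASK) != 0;
}
```

Whether the reserved bits are marked with a comment or enforced with a build-time check like this is a matter of taste; the comment-only variant is what is being discussed in the thread.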
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir
2025-08-28 9:06 ` Christian Brauner
@ 2025-08-28 10:43 ` Christian Brauner
0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 10:43 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Thu, Aug 28, 2025 at 11:06:37AM +0200, Christian Brauner wrote:
> On Tue, Aug 26, 2025 at 11:39:17AM -0400, Josef Bacik wrote:
> > We can end up with an inode on the LRU list or the cached list, then at
> > some point in the future go to unlink that inode and then still have an
> > elevated i_count reference for that inode because it is on one of these
> > lists.
> >
> > The more common case is the cached list. We open a file, write to it,
> > truncate some of it which triggers the inode_add_lru code in the
> > pagecache, adding it to the cached LRU. Then we unlink this inode, and
> > it exists until writeback or reclaim kicks in and removes the inode.
> >
> > To handle this case, delete the inode from the LRU list when it is
> > unlinked, so we have the best case scenario for immediately freeing the
> > inode.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> > fs/namei.c | 30 +++++++++++++++++++++++++-----
> > 1 file changed, 25 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 138a693c2346..e56dcb5747e4 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -4438,6 +4438,7 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
> > int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> > struct dentry *dentry)
> > {
> > + struct inode *inode = dentry->d_inode;
> > int error = may_delete(idmap, dir, dentry, 1);
> >
> > if (error)
> > @@ -4447,11 +4448,11 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> > return -EPERM;
> >
> > dget(dentry);
> > - inode_lock(dentry->d_inode);
> > + inode_lock(inode);
> >
> > error = -EBUSY;
> > if (is_local_mountpoint(dentry) ||
> > - (dentry->d_inode->i_flags & S_KERNEL_FILE))
> > + (inode->i_flags & S_KERNEL_FILE))
> > goto out;
> >
> > error = security_inode_rmdir(dir, dentry);
> > @@ -4463,12 +4464,21 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> > goto out;
> >
> > shrink_dcache_parent(dentry);
> > - dentry->d_inode->i_flags |= S_DEAD;
> > + inode->i_flags |= S_DEAD;
> > dont_mount(dentry);
> > detach_mounts(dentry);
> >
> > out:
> > - inode_unlock(dentry->d_inode);
> > + /*
> > + * The inode may be on the LRU list, so delete it from the LRU at this
> > + * point in order to make sure that the inode is freed as soon as
> > + * possible.
> > + */
> > + spin_lock(&inode->i_lock);
> > + inode_lru_list_del(inode);
> > + spin_unlock(&inode->i_lock);
> > +
> > + inode_unlock(inode);
> > dput(dentry);
> > if (!error)
> > d_delete_notify(dir, dentry);
> > @@ -4653,8 +4663,18 @@ int do_unlinkat(int dfd, struct filename *name)
> > dput(dentry);
>
> Why are you doing that in do_unlinkat() instead of vfs_unlink() (as
> you're doing it in vfs_rmdir() and not do_rmdir())?
>
> Doing it in do_unlinkat() means any stacking filesystem such as
> overlayfs will end up skipping the LRU list removal as they use
> vfs_unlink() directly.
>
> And does btrfs subvolume/snapshot deletion special treatment as well for
> this as it's semantically equivalent to an rmdir?
Another thing. If this is unlinking an inode with multiple hardlinks
then vfs_unlink() might not remove the inode if it's not the last
hardlink. Do you really want to remove the inode from the LRU list in
that case as well?
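One way to read that question is "only drop the inode from the LRU once the link count actually reaches zero". The userspace toy below sketches that variant; it is not what the patch does, and the demo_* names and the two-field inode are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_inode {
	unsigned int nlink;	/* stands in for i_nlink */
	bool on_lru;
};

static void demo_lru_del(struct demo_inode *inode)
{
	inode->on_lru = false;
}

/* Drop one name; only pull the inode off the LRU if it was the last one. */
static void demo_unlink(struct demo_inode *inode)
{
	assert(inode->nlink > 0);
	inode->nlink--;
	if (inode->nlink == 0)
		demo_lru_del(inode);
}
```

With two hardlinks, the first demo_unlink() leaves the inode cached and only the second one removes it from the LRU.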
> > }
> > inode_unlock(path.dentry->d_inode);
> > - if (inode)
> > + if (inode) {
> > + /*
> > + * The LRU may be holding a reference, remove the inode from the
> > + * LRU here before dropping our hopefully final reference on the
> > + * inode.
> > + */
> > + spin_lock(&inode->i_lock);
> > + inode_lru_list_del(inode);
> > + spin_unlock(&inode->i_lock);
> > +
> > iput(inode); /* truncate the inode here */
> > + }
> > inode = NULL;
> > if (delegated_inode) {
> > error = break_deleg_wait(&delegated_inode);
> > --
> > 2.49.0
> >
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 19/54] fs: hold a full ref while the inode is on a LRU
2025-08-26 15:39 ` [PATCH v2 19/54] fs: hold a full ref while the inode is on a LRU Josef Bacik
@ 2025-08-28 10:51 ` Christian Brauner
0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 10:51 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Tue, Aug 26, 2025 at 11:39:19AM -0400, Josef Bacik wrote:
> We want to eliminate 0 refcount inodes that can be used. To that end,
> make the LRU's hold a full reference on the inode while it is on an LRU
> list. From there we can change the eviction code to always just iput the
> inode, and the LRU operations will just add or drop a full reference
> where appropriate.
>
> We also now must take into account unlink, and avoid adding the inode to
> the LRU if it has an nlink of 0.
It would be good to explain why. I'm pretty sure it's because of the
prior patch in the series that drops the inode from the LRU on
unlink/rmdir and you want to avoid adding it back.
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 87 +++++++++++++++++++++++++-----------------------------
> 1 file changed, 40 insertions(+), 47 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index ede9118bb649..9001f809add0 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -555,11 +555,13 @@ static void inode_add_cached_lru(struct inode *inode)
>
> if (inode->i_state & I_CACHED_LRU)
> return;
> + if (inode->__i_nlink == 0)
> + return;
> if (!list_empty(&inode->i_lru))
> return;
>
> inode->i_state |= I_CACHED_LRU;
> - iobj_get(inode);
> + __iget(inode);
> spin_lock(&inode->i_sb->s_cached_inodes_lock);
> list_add(&inode->i_lru, &inode->i_sb->s_cached_inodes);
> spin_unlock(&inode->i_sb->s_cached_inodes_lock);
> @@ -582,7 +584,7 @@ static bool __inode_del_cached_lru(struct inode *inode)
> static bool inode_del_cached_lru(struct inode *inode)
> {
> if (__inode_del_cached_lru(inode)) {
> - iobj_put(inode);
> + iput(inode);
> return true;
> }
> return false;
> @@ -598,6 +600,8 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> return;
> if (icount_read(inode))
> return;
> + if (inode->__i_nlink == 0)
> + return;
> if (!(inode->i_sb->s_flags & SB_ACTIVE))
> return;
> if (inode_needs_cached(inode)) {
> @@ -609,7 +613,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
> inode->i_state |= I_LRU;
> if (need_ref)
> - iobj_get(inode);
> + __iget(inode);
> this_cpu_inc(nr_unused);
> } else if (rotate) {
> inode->i_state |= I_REFERENCED;
> @@ -655,7 +659,7 @@ void inode_lru_list_del(struct inode *inode)
>
> if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
> inode->i_state &= ~I_LRU;
> - iobj_put(inode);
> + iput(inode);
> this_cpu_dec(nr_unused);
> }
> }
> @@ -926,6 +930,7 @@ static void evict(struct inode *inode)
> BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
> }
>
> +static void iput_evict(struct inode *inode);
> /*
> * dispose_list - dispose of the contents of a local list
> * @head: the head of the list to free
> @@ -933,20 +938,14 @@ static void evict(struct inode *inode)
> * Dispose-list gets a local list with local inodes in it, so it doesn't
> * need to worry about list corruption and SMP locks.
> */
> -static void dispose_list(struct list_head *head, bool for_lru)
> +static void dispose_list(struct list_head *head)
> {
> while (!list_empty(head)) {
> struct inode *inode;
>
> inode = list_first_entry(head, struct inode, i_lru);
> list_del_init(&inode->i_lru);
> -
> - if (for_lru) {
> - evict(inode);
> - iobj_put(inode);
> - } else {
> - iput(inode);
> - }
> + iput_evict(inode);
Ok, so this allows evict_inodes() to run on an SB_ACTIVE superblock.
> cond_resched();
> }
> }
> @@ -987,13 +986,13 @@ void evict_inodes(struct super_block *sb)
> if (need_resched()) {
> spin_unlock(&sb->s_inode_list_lock);
> cond_resched();
> - dispose_list(&dispose, false);
> + dispose_list(&dispose);
> goto again;
> }
> }
> spin_unlock(&sb->s_inode_list_lock);
>
> - dispose_list(&dispose, false);
> + dispose_list(&dispose);
> }
> EXPORT_SYMBOL_GPL(evict_inodes);
>
> @@ -1031,22 +1030,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> if (inode_needs_cached(inode)) {
> list_lru_isolate(lru, &inode->i_lru);
> inode_add_cached_lru(inode);
> - iobj_put(inode);
> - spin_unlock(&inode->i_lock);
> - this_cpu_dec(nr_unused);
> - return LRU_REMOVED;
> - }
> -
> - /*
> - * Inodes can get referenced, redirtied, or repopulated while
> - * they're already on the LRU, and this can make them
> - * unreclaimable for a while. Remove them lazily here; iput,
> - * sync, or the last page cache deletion will requeue them.
> - */
> - if (icount_read(inode) ||
> - (inode->i_state & ~I_REFERENCED)) {
> - list_lru_isolate(lru, &inode->i_lru);
> - inode->i_state &= ~I_LRU;
> + iput(inode);
> spin_unlock(&inode->i_lock);
> this_cpu_dec(nr_unused);
> return LRU_REMOVED;
> @@ -1082,7 +1066,6 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> }
>
> WARN_ON(inode->i_state & I_NEW);
> - inode->i_state |= I_FREEING;
> inode->i_state &= ~I_LRU;
> list_lru_isolate_move(lru, &inode->i_lru, freeable);
> spin_unlock(&inode->i_lock);
> @@ -1104,7 +1087,7 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
>
> freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
> inode_lru_isolate, &freeable);
> - dispose_list(&freeable, true);
> + dispose_list(&freeable);
> return freed;
> }
>
> @@ -1967,7 +1950,7 @@ EXPORT_SYMBOL(generic_delete_inode);
> * in cache if fs is alive, sync and evict if fs is
> * shutting down.
> */
> -static void iput_final(struct inode *inode)
> +static void iput_final(struct inode *inode, bool skip_lru)
> {
> struct super_block *sb = inode->i_sb;
> const struct super_operations *op = inode->i_sb->s_op;
> @@ -1981,7 +1964,7 @@ static void iput_final(struct inode *inode)
> else
> drop = generic_drop_inode(inode);
>
> - if (!drop &&
> + if (!drop && !skip_lru &&
> !(inode->i_state & I_DONTCACHE) &&
> (sb->s_flags & SB_ACTIVE)) {
> __inode_add_lru(inode, true);
> @@ -1989,6 +1972,8 @@ static void iput_final(struct inode *inode)
> return;
> }
>
> + WARN_ON(!list_empty(&inode->i_lru));
> +
> state = inode->i_state;
> if (!drop) {
> WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
> @@ -2003,23 +1988,12 @@ static void iput_final(struct inode *inode)
> }
>
> WRITE_ONCE(inode->i_state, state | I_FREEING);
> - if (!list_empty(&inode->i_lru))
> - inode_lru_list_del(inode);
> spin_unlock(&inode->i_lock);
>
> evict(inode);
> }
>
> -/**
> - * iput - put an inode
> - * @inode: inode to put
> - *
> - * Puts an inode, dropping its usage count. If the inode use count hits
> - * zero, the inode is then freed and may also be destroyed.
> - *
> - * Consequently, iput() can sleep.
> - */
> -void iput(struct inode *inode)
> +static void __iput(struct inode *inode, bool skip_lru)
> {
> if (!inode)
> return;
> @@ -2038,12 +2012,31 @@ void iput(struct inode *inode)
> spin_lock(&inode->i_lock);
> if (atomic_dec_and_test(&inode->i_count)) {
> /* iput_final() drops i_lock */
> - iput_final(inode);
> + iput_final(inode, skip_lru);
> } else {
> spin_unlock(&inode->i_lock);
> }
> iobj_put(inode);
> }
> +
> +static void iput_evict(struct inode *inode)
> +{
> + __iput(inode, true);
> +}
> +
> +/**
> + * iput - put an inode
> + * @inode: inode to put
> + *
> + * Puts an inode, dropping its usage count. If the inode use count hits
> + * zero, the inode is then freed and may also be destroyed.
> + *
> + * Consequently, iput() can sleep.
> + */
> +void iput(struct inode *inode)
> +{
> + __iput(inode, false);
> +}
> EXPORT_SYMBOL(iput);
>
> /**
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 20/54] fs: disallow 0 reference count inodes
2025-08-26 15:39 ` [PATCH v2 20/54] fs: disallow 0 reference count inodes Josef Bacik
@ 2025-08-28 11:02 ` Christian Brauner
2025-08-28 11:44 ` Josef Bacik
0 siblings, 1 reply; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 11:02 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Tue, Aug 26, 2025 at 11:39:20AM -0400, Josef Bacik wrote:
> Now that we take a full reference for inodes on the LRU, move the logic
> to add the inode to the LRU to before we drop our last reference. This
> allows us to ensure that if the inode has a reference count it can be
> used, and we no longer hold onto inodes that have a 0 reference count.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 61 ++++++++++++++++++++++++++++++++++++------------------
> 1 file changed, 41 insertions(+), 20 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 9001f809add0..d1668f7fb73e 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -598,7 +598,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
>
> if (inode->i_state & (I_FREEING | I_WILL_FREE))
> return;
> - if (icount_read(inode))
> + if (icount_read(inode) != 1)
> return;
> if (inode->__i_nlink == 0)
> return;
> @@ -1950,28 +1950,11 @@ EXPORT_SYMBOL(generic_delete_inode);
> * in cache if fs is alive, sync and evict if fs is
> * shutting down.
> */
> -static void iput_final(struct inode *inode, bool skip_lru)
> +static void iput_final(struct inode *inode, bool drop)
> {
> - struct super_block *sb = inode->i_sb;
> - const struct super_operations *op = inode->i_sb->s_op;
> unsigned long state;
> - int drop;
>
> WARN_ON(inode->i_state & I_NEW);
> -
> - if (op->drop_inode)
> - drop = op->drop_inode(inode);
> - else
> - drop = generic_drop_inode(inode);
> -
> - if (!drop && !skip_lru &&
> - !(inode->i_state & I_DONTCACHE) &&
> - (sb->s_flags & SB_ACTIVE)) {
> - __inode_add_lru(inode, true);
> - spin_unlock(&inode->i_lock);
> - return;
> - }
> -
> WARN_ON(!list_empty(&inode->i_lru));
>
> state = inode->i_state;
> @@ -1993,8 +1976,37 @@ static void iput_final(struct inode *inode, bool skip_lru)
> evict(inode);
> }
>
> +static bool maybe_add_lru(struct inode *inode, bool skip_lru)
> +{
> + const struct super_operations *op = inode->i_sb->s_op;
> + const struct super_block *sb = inode->i_sb;
> + bool drop = false;
> +
> + if (op->drop_inode)
> + drop = op->drop_inode(inode);
> + else
> + drop = generic_drop_inode(inode);
> +
> + if (drop)
> + return drop;
> +
> + if (skip_lru)
> + return drop;
> +
> + if (inode->i_state & I_DONTCACHE)
> + return drop;
> +
> + if (!(sb->s_flags & SB_ACTIVE))
> + return drop;
> +
> + __inode_add_lru(inode, true);
> + return drop;
> +}
> +
> static void __iput(struct inode *inode, bool skip_lru)
> {
> + bool drop;
> +
> if (!inode)
> return;
> BUG_ON(inode->i_state & I_CLEAR);
> @@ -2010,9 +2022,18 @@ static void __iput(struct inode *inode, bool skip_lru)
> }
>
> spin_lock(&inode->i_lock);
> +
> + /*
> + * If we want to keep the inode around on an LRU we will grab a ref to
> + * the inode when we add it to the LRU list, so we can safely drop the
> + * callers reference after this. If we didn't add the inode to the LRU
> + * then the refcount will still be 1 and we can do the final iput.
> + */
> + drop = maybe_add_lru(inode, skip_lru);
So before, we only put the inode on an LRU when we knew this was the
last reference. Now we're putting it on the LRU before we know that for
sure.
While __inode_add_lru() now checks whether this is potentially the last
reference we're going to put, someone could grab another full reference
in the window between that check, putting the inode on the LRU, and
atomic_dec_and_test().
So we are left with an inode on the LRU that previously would not have
ended up there. And then later we need to remove it again. I guess the
arguments are:
(1) It's not a big deal because if the shrinker runs we'll just toss that
inode from the LRU again.
(2) If it ended up being put on the cached LRU it'll stay there for at
least as long as the inode is referenced? I guess that's ok too.
(3) The race is not that common?
Anyway, again it would be nice to have some comments noting this
behavior and arguing why that's ok.
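To spell the window out, here is a deterministic userspace model of the two orderings. It is a sketch under simplifying assumptions, not the kernel code: there is no real concurrency, the racer_grabs_ref flag just injects the other reference at the interesting point, and the demo_* names are made up.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_inode {
	int count;	/* stands in for i_count */
	bool on_lru;
};

/* Old ordering: decrement first, add to the LRU only if that hit zero. */
static void demo_iput_old(struct demo_inode *inode, bool racer_grabs_ref)
{
	assert(inode->count > 0);
	if (racer_grabs_ref)
		inode->count++;		/* someone else takes a reference */
	if (--inode->count == 0)
		inode->on_lru = true;	/* LRU held no i_count reference */
}

/* New ordering: add to the LRU (taking a reference) before the decrement. */
static void demo_iput_new(struct demo_inode *inode, bool racer_grabs_ref)
{
	assert(inode->count > 0);
	if (inode->count == 1) {	/* looks like the last reference */
		inode->on_lru = true;
		inode->count++;		/* the LRU's own full reference */
	}
	if (racer_grabs_ref)
		inode->count++;		/* reference taken inside the window */
	inode->count--;			/* drop the caller's reference */
}
```

In the racy case the old ordering leaves the inode off the LRU with one user, while the new ordering leaves it on the LRU with one user plus the LRU's reference. That matches the "inode on the LRU that previously would not have ended up there" situation, and is the state the comment being asked for would need to justify.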
> +
> if (atomic_dec_and_test(&inode->i_count)) {
> /* iput_final() drops i_lock */
> - iput_final(inode, skip_lru);
> + iput_final(inode, drop);
> } else {
> spin_unlock(&inode->i_lock);
> }
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 16/54] fs: delete the inode from the LRU list on lookup
2025-08-27 21:46 ` Dave Chinner
@ 2025-08-28 11:42 ` Josef Bacik
2025-09-02 4:07 ` Dave Chinner
0 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-28 11:42 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
On Thu, Aug 28, 2025 at 07:46:56AM +1000, Dave Chinner wrote:
> On Tue, Aug 26, 2025 at 11:39:16AM -0400, Josef Bacik wrote:
> > When we move to holding a full reference on the inode when it is on an
> > LRU list we need to have a mechanism to re-run the LRU add logic. The
> > use case for this is btrfs's snapshot delete, we will lookup all the
> > inodes and try to drop them, but if they're on the LRU we will not call
> > ->drop_inode() because their refcount will be elevated, so we won't know
> > that we need to drop the inode.
> >
> > Fix this by simply removing the inode from it's respective LRU list when
> > we grab a reference to it in a way that we have active users. This will
> > ensure that the logic to add the inode to the LRU or drop the inode will
> > be run on the final iput from the user.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>
> Have you benchmarked this for scalability?
>
> The whole point of lazy LRU removal was to remove LRU lock
> contention from the hot lookup path. I suspect that putting the LRU
> locks back inside the lookup path is going to cause performance
> regressions...
>
> FWIW, why do we even need the inode LRU anymore?
>
> We certainly don't need it anymore to keep the working set in memory
> because that's what the dentry cache LRU does (i.e. by pinning a
> reference to the inode whilst the dentry is active).
>
> And with the introduction of the cached inode list, we don't need
> the inode LRU to track unreferenced dirty inodes around whilst
> they hang out on writeback lists. The inodes on the writeback lists
> are now referenced and tracked on the cached inode list, so they
> don't need special hooks in the mm/ code to handle the special
> transition from "unreferenced writeback" to "unreferenced LRU"
> anymore, they can just be dropped from the cached inode list....
>
> So rather than jumping through hoops to maintain an LRU we likely
> don't actually need and is likely to re-introduce old scalability
> issues, why not remove it completely?
That's next on the list, but we're already at 54 patches. This won't be a hot
path; we're not going to consistently find inodes on the LRU to remove.
My rough plan is:
1. Get this series merged.
2. Let it bake and see if any issues arise.
3. Remove the inode LRU completely.
4. Remove the i_hash and use an xarray for inode lookups.
The inode LRU removal is going to be a big change, and I want it to be separate
from this work in case we find that we do really need the LRU.
If that turns out to be the case then we can revisit whether this is a
scalability issue. Thanks,
Josef
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 20/54] fs: disallow 0 reference count inodes
2025-08-28 11:02 ` Christian Brauner
@ 2025-08-28 11:44 ` Josef Bacik
0 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-28 11:44 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Thu, Aug 28, 2025 at 01:02:31PM +0200, Christian Brauner wrote:
> On Tue, Aug 26, 2025 at 11:39:20AM -0400, Josef Bacik wrote:
> > Now that we take a full reference for inodes on the LRU, move the logic
> > to add the inode to the LRU to before we drop our last reference. This
> > allows us to ensure that if the inode has a reference count it can be
> > used, and we no longer hold onto inodes that have a 0 reference count.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> > fs/inode.c | 61 ++++++++++++++++++++++++++++++++++++------------------
> > 1 file changed, 41 insertions(+), 20 deletions(-)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 9001f809add0..d1668f7fb73e 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -598,7 +598,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> >
> > if (inode->i_state & (I_FREEING | I_WILL_FREE))
> > return;
> > - if (icount_read(inode))
> > + if (icount_read(inode) != 1)
> > return;
> > if (inode->__i_nlink == 0)
> > return;
> > @@ -1950,28 +1950,11 @@ EXPORT_SYMBOL(generic_delete_inode);
> > * in cache if fs is alive, sync and evict if fs is
> > * shutting down.
> > */
> > -static void iput_final(struct inode *inode, bool skip_lru)
> > +static void iput_final(struct inode *inode, bool drop)
> > {
> > - struct super_block *sb = inode->i_sb;
> > - const struct super_operations *op = inode->i_sb->s_op;
> > unsigned long state;
> > - int drop;
> >
> > WARN_ON(inode->i_state & I_NEW);
> > -
> > - if (op->drop_inode)
> > - drop = op->drop_inode(inode);
> > - else
> > - drop = generic_drop_inode(inode);
> > -
> > - if (!drop && !skip_lru &&
> > - !(inode->i_state & I_DONTCACHE) &&
> > - (sb->s_flags & SB_ACTIVE)) {
> > - __inode_add_lru(inode, true);
> > - spin_unlock(&inode->i_lock);
> > - return;
> > - }
> > -
> > WARN_ON(!list_empty(&inode->i_lru));
> >
> > state = inode->i_state;
> > @@ -1993,8 +1976,37 @@ static void iput_final(struct inode *inode, bool skip_lru)
> > evict(inode);
> > }
> >
> > +static bool maybe_add_lru(struct inode *inode, bool skip_lru)
> > +{
> > + const struct super_operations *op = inode->i_sb->s_op;
> > + const struct super_block *sb = inode->i_sb;
> > + bool drop = false;
> > +
> > + if (op->drop_inode)
> > + drop = op->drop_inode(inode);
> > + else
> > + drop = generic_drop_inode(inode);
> > +
> > + if (drop)
> > + return drop;
> > +
> > + if (skip_lru)
> > + return drop;
> > +
> > + if (inode->i_state & I_DONTCACHE)
> > + return drop;
> > +
> > + if (!(sb->s_flags & SB_ACTIVE))
> > + return drop;
> > +
> > + __inode_add_lru(inode, true);
> > + return drop;
> > +}
> > +
> > static void __iput(struct inode *inode, bool skip_lru)
> > {
> > + bool drop;
> > +
> > if (!inode)
> > return;
> > BUG_ON(inode->i_state & I_CLEAR);
> > @@ -2010,9 +2022,18 @@ static void __iput(struct inode *inode, bool skip_lru)
> > }
> >
> > spin_lock(&inode->i_lock);
> > +
> > + /*
> > + * If we want to keep the inode around on an LRU we will grab a ref to
> > + * the inode when we add it to the LRU list, so we can safely drop the
> > + * callers reference after this. If we didn't add the inode to the LRU
> > + * then the refcount will still be 1 and we can do the final iput.
> > + */
> > + drop = maybe_add_lru(inode, skip_lru);
>
> So before we only put the inode on an LRU when we knew this was the
> last reference. Now we're putting it on the LRU before we know that for
> sure.
>
> While __inode_add_lru() now checks whether this is potentially the last
> reference we're going to put, someone could grab another full reference
> in between the check, putting it on the LRU and atomic_dec_and_test().
> So we are left with an inode on the LRU that previously would not have
> ended up there. And then later we need to remove it again. I guess the
> arguments are:
>
> (1) It's not a big deal because if the shrinker runs we'll just toss that
> inode from the LRU again.
> (2) If it ended up being put on the cached LRU it'll stay there for at
> least as long as the inode is referenced? I guess that's ok too.
> (3) The race is not that common?
>
> Anyway, again it would be nice to have some comments noting this
> behavior and arguing why that's ok.
Yup I'll add a lengthy explanation. Thanks,
Josef
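
Editorial sketch of the ordering discussed above, as a single-threaded
userspace model using C11 atomics. It is not the kernel code: the toy_* names
are hypothetical, i_lock is not modelled, and any LRU removal the racing task
might perform is ignored. It only shows why an inode can now sit on the LRU
while another full reference exists.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for struct inode; not the kernel structure. */
struct toy_inode {
	atomic_int i_count;	/* full ("live") references */
	bool on_lru;		/* models membership of the LRU list */
};

static void toy_inode_init(struct toy_inode *inode)
{
	atomic_init(&inode->i_count, 1);	/* the caller's reference */
	inode->on_lru = false;
}

/* Models __inode_add_lru(): queue only when the caller holds the sole ref. */
static void toy_add_lru(struct toy_inode *inode)
{
	if (inode->on_lru || atomic_load(&inode->i_count) != 1)
		return;
	atomic_fetch_add(&inode->i_count, 1);	/* the LRU's own reference */
	inode->on_lru = true;
}

/*
 * Models the tail of __iput(). `racer` simulates another task taking a full
 * reference after the LRU check but before the caller's reference is dropped.
 * Returns true when the caller dropped the last reference (eviction path).
 */
static bool toy_iput(struct toy_inode *inode, bool skip_lru, bool racer)
{
	if (!skip_lru)
		toy_add_lru(inode);
	if (racer)
		atomic_fetch_add(&inode->i_count, 1);
	/* atomic_fetch_sub() returns the old value: 1 means we reached 0. */
	return atomic_fetch_sub(&inode->i_count, 1) == 1;
}
```

In this model the racing case leaves the inode on the LRU with a count of 2
(the LRU's reference plus the racer's), which is the state the review asks to
have documented. With skip_lru set, the same put reaches zero and takes the
eviction path.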
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir
2025-08-27 22:01 ` Dave Chinner
@ 2025-08-28 11:46 ` Josef Bacik
2025-09-02 1:48 ` Dave Chinner
0 siblings, 1 reply; 105+ messages in thread
From: Josef Bacik @ 2025-08-28 11:46 UTC (permalink / raw)
To: Dave Chinner
Cc: Christian Brauner, linux-fsdevel, linux-btrfs, kernel-team,
linux-ext4, linux-xfs, viro, amir73il
On Thu, Aug 28, 2025 at 08:01:39AM +1000, Dave Chinner wrote:
> On Wed, Aug 27, 2025 at 02:32:49PM +0200, Christian Brauner wrote:
> > On Tue, Aug 26, 2025 at 11:39:17AM -0400, Josef Bacik wrote:
> > > We can end up with an inode on the LRU list or the cached list, then at
> > > some point in the future go to unlink that inode and then still have an
> > > elevated i_count reference for that inode because it is on one of these
> > > lists.
> > >
> > > The more common case is the cached list. We open a file, write to it,
> > > truncate some of it which triggers the inode_add_lru code in the
> > > pagecache, adding it to the cached LRU. Then we unlink this inode, and
> > > it exists until writeback or reclaim kicks in and removes the inode.
> > >
> > > To handle this case, delete the inode from the LRU list when it is
> > > unlinked, so we have the best case scenario for immediately freeing the
> > > inode.
> > >
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > > ---
> >
> > I'm not too fond of this particular change; I think it's really misplaced
> > and the correct place is indeed drop_nlink() and clear_nlink().
>
> I don't really like putting it in drop_nlink because that then puts
> the inode LRU in the middle of filesystem transactions when lots of
> different filesystem locks are held.
>
> If the LRU operations are in the VFS, then we know exactly what
> locks are held when it is performed (current behaviour). However,
> when done from the filesystem transaction context running
> drop_nlink, we'll have different sets of locks and/or execution
> contexts held for each different fs type.
>
> > I'm pretty sure that the number of callers that hold i_lock around
> > drop_nlink() and clear_nlink() is relatively small.
>
> I think the calling context problem is wider than the obvious issue
> with i_lock....
This is an internal LRU, so yes potentially we could have locking issues, but
right now all LRU operations are nested inside of the i_lock, and this is purely
about object lifetime. I'm not concerned about this being in the bowels of any
filesystem because it's purely list manipulation.
And if it makes you feel better, the next patchset queued up for after the next
merge window is deleting the LRU, so you won't have to worry about it for long
:). Thanks,
Josef
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 22/54] fs: convert i_count to refcount_t
2025-08-26 15:39 ` [PATCH v2 22/54] fs: convert i_count to refcount_t Josef Bacik
@ 2025-08-28 12:00 ` Christian Brauner
0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 12:00 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Tue, Aug 26, 2025 at 11:39:22AM -0400, Josef Bacik wrote:
> Now that we do not allow i_count to drop to 0 and be used we can convert
> it to a refcount_t and benefit from the protections those helpers add.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/btrfs/inode.c | 2 +-
> fs/inode.c | 9 +++++----
> include/linux/fs.h | 6 +++---
> 3 files changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e16df38e0eef..eb9496342346 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3418,7 +3418,7 @@ void btrfs_add_delayed_iput(struct btrfs_inode *inode)
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> unsigned long flags;
>
> - if (atomic_add_unless(&inode->vfs_inode.i_count, -1, 1)) {
> + if (refcount_dec_not_one(&inode->vfs_inode.i_count)) {
Now this is the only place outside core VFS where we open-access
i_count. Add a helper and reuse it in iput() as well? icount_maybe_dec()?
icount_dec_not_one()?
> iobj_put(&inode->vfs_inode);
> return;
> }
> diff --git a/fs/inode.c b/fs/inode.c
> index 1992db5cd70a..0be1c137bf1e 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -236,7 +236,7 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
> inode->i_state = 0;
> atomic64_set(&inode->i_sequence, 0);
> refcount_set(&inode->i_obj_count, 1);
> - atomic_set(&inode->i_count, 1);
> + refcount_set(&inode->i_count, 1);
> inode->i_op = &empty_iops;
> inode->i_fop = &no_open_fops;
> inode->i_ino = 0;
> @@ -545,7 +545,8 @@ static void init_once(void *foo)
> void ihold(struct inode *inode)
> {
> iobj_get(inode);
> - WARN_ON(atomic_inc_return(&inode->i_count) < 2);
> + refcount_inc(&inode->i_count);
> + WARN_ON(icount_read(inode) < 2);
> }
> EXPORT_SYMBOL(ihold);
>
> @@ -2011,7 +2012,7 @@ static void __iput(struct inode *inode, bool skip_lru)
> return;
> BUG_ON(inode->i_state & I_CLEAR);
>
> - if (atomic_add_unless(&inode->i_count, -1, 1)) {
> + if (refcount_dec_not_one(&inode->i_count)) {
> iobj_put(inode);
> return;
> }
> @@ -2031,7 +2032,7 @@ static void __iput(struct inode *inode, bool skip_lru)
> */
> drop = maybe_add_lru(inode, skip_lru);
>
> - if (atomic_dec_and_test(&inode->i_count)) {
> + if (refcount_dec_and_test(&inode->i_count)) {
> /* iput_final() drops i_lock */
> iput_final(inode, drop);
> } else {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 999ffea2aac1..fc23e37ca250 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -875,7 +875,7 @@ struct inode {
> };
> atomic64_t i_version;
> atomic64_t i_sequence; /* see futex */
> - atomic_t i_count;
> + refcount_t i_count;
> atomic_t i_dio_count;
> atomic_t i_writecount;
> #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
> @@ -3399,12 +3399,12 @@ static inline unsigned int iobj_count_read(const struct inode *inode)
> static inline void __iget(struct inode *inode)
> {
> iobj_get(inode);
> - atomic_inc(&inode->i_count);
> + refcount_inc(&inode->i_count);
> }
>
> static inline int icount_read(const struct inode *inode)
> {
> - return atomic_read(&inode->i_count);
> + return refcount_read(&inode->i_count);
> }
>
> extern void iget_failed(struct inode *);
> --
> 2.49.0
>
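
Editorial sketch for the helper suggested above at the btrfs hunk. The name
icount_dec_not_one() is only a proposal in this thread, so everything below is
a hypothetical userspace model: it approximates the "decrement unless the
value is 1" semantics with a C11 compare-exchange loop and leaves out the
saturation handling that refcount_t provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for struct inode; only the counter is modelled. */
struct toy_inode {
	atomic_uint i_count;
};

/*
 * Drop one reference unless it is the last one. Returns true if the
 * decrement happened, false if the count was 1 and the caller must take
 * the slow path (locking, LRU decision, final put).
 *
 * A count of 0 would be a bug; this model simply refuses to underflow,
 * whereas the kernel helper has its own handling for that case.
 */
static bool toy_icount_dec_not_one(struct toy_inode *inode)
{
	unsigned int old = atomic_load(&inode->i_count);

	while (old > 1) {
		/* On failure `old` is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&inode->i_count, &old, old - 1))
			return true;
	}
	return false;
}
```

Both call sites in the diff (btrfs_add_delayed_iput() and __iput()) follow the
same shape: try the fast decrement first and fall through to the slow path
only when it returns false.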
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 26/54] fs: use igrab in insert_inode_locked
2025-08-26 15:39 ` [PATCH v2 26/54] fs: use igrab in insert_inode_locked Josef Bacik
@ 2025-08-28 12:15 ` Christian Brauner
0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 12:15 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Tue, Aug 26, 2025 at 11:39:26AM -0400, Josef Bacik wrote:
> Follow the same pattern in find_inode*. Instead of checking for
> I_WILL_FREE|I_FREEING simply call igrab() and if it succeeds we're done.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 8 +++-----
> 1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 8ae9ed9605ef..d34da95a3295 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1883,11 +1883,8 @@ int insert_inode_locked(struct inode *inode)
> continue;
> if (old->i_sb != sb)
> continue;
> - spin_lock(&old->i_lock);
> - if (old->i_state & (I_FREEING|I_WILL_FREE)) {
> - spin_unlock(&old->i_lock);
> + if (!igrab(old))
> continue;
> - }
> break;
> }
> if (likely(!old)) {
> @@ -1899,12 +1896,13 @@ int insert_inode_locked(struct inode *inode)
> spin_unlock(&inode_hash_lock);
> return 0;
> }
> + spin_lock(&old->i_lock);
> if (unlikely(old->i_state & I_CREATING)) {
> spin_unlock(&old->i_lock);
> spin_unlock(&inode_hash_lock);
> + iput(old);
> return -EBUSY;
> }
> - __iget(old);
> spin_unlock(&old->i_lock);
> spin_unlock(&inode_hash_lock);
> wait_on_inode(old);
> --
> 2.49.0
>
So looking at the function in full context:
int insert_inode_locked(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
ino_t ino = inode->i_ino;
struct hlist_head *head = inode_hashtable + hash(sb, ino);
while (1) {
struct inode *old = NULL;
spin_lock(&inode_hash_lock);
hlist_for_each_entry(old, head, i_hash) {
if (old->i_ino != ino)
continue;
if (old->i_sb != sb)
continue;
if (!igrab(old))
continue;
break;
}
if (likely(!old)) {
spin_lock(&inode->i_lock);
iobj_get(inode);
Sorry, this is probably me being confused.
Say we allocated a new inode; then we've definitely gone through
inode_init_always() and so i_obj_count == i_count == 1.
Then we insert it into the hash table. For that we only take an
i_obj_count but no i_count bringing it to 2.
So for the hashlist we only deal with i_obj_count.
Is that documented somewhere? I probably just read over it.
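
Editorial sketch of the two-counter reading described above, assuming that
reading is correct. It is a userspace model with hypothetical toy_* names, not
the kernel implementation: both counters start at 1, hash insertion pins only
the object, and a lookup takes a live reference first and an object reference
second, as the inode_tryget() comment elsewhere in this thread describes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_inode {
	atomic_uint i_obj_count;	/* keeps the memory alive */
	atomic_uint i_count;		/* keeps the inode live (usable) */
	bool hashed;
};

/* Models the state after inode_init_always(): both counters are 1. */
static void toy_inode_init(struct toy_inode *inode)
{
	atomic_init(&inode->i_obj_count, 1);
	atomic_init(&inode->i_count, 1);
	inode->hashed = false;
}

/* Hash insertion pins the object only; liveness is left untouched. */
static void toy_hash_insert(struct toy_inode *inode)
{
	atomic_fetch_add(&inode->i_obj_count, 1);
	inode->hashed = true;
}

/* Lookup side: succeed only while i_count is still nonzero. */
static bool toy_tryget(struct toy_inode *inode)
{
	unsigned int old = atomic_load(&inode->i_count);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&inode->i_count, &old, old + 1)) {
			/* Live ref obtained, so taking the object ref is safe. */
			atomic_fetch_add(&inode->i_obj_count, 1);
			return true;
		}
	}
	return false;
}
```

Under this model a hashed inode whose i_count has reached 0 can still be
found and safely inspected, but toy_tryget() refuses to revive it.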
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 30/54] fs: change evict_dentries_for_decrypted_inodes to use refcount
2025-08-26 15:39 ` [PATCH v2 30/54] fs: change evict_dentries_for_decrypted_inodes to use refcount Josef Bacik
@ 2025-08-28 12:25 ` Christian Brauner
2025-08-28 22:26 ` Eric Biggers
0 siblings, 1 reply; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 12:25 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Tue, Aug 26, 2025 at 11:39:30AM -0400, Josef Bacik wrote:
> Instead of checking for I_WILL_FREE|I_FREEING simply use the refcount to
> make sure we have a live inode.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
I have no idea how the lifetime of such decrypted inodes is managed.
I suppose they don't carry a separate reference but are still somehow
safe to be accessed based on the mk_decrypted_inodes list. In any case
something must hold an i_obj_count if we want to use igrab() since I
don't see any relevant rcu protection here.
> fs/crypto/keyring.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/crypto/keyring.c b/fs/crypto/keyring.c
> index 7557f6a88b8f..969db498149a 100644
> --- a/fs/crypto/keyring.c
> +++ b/fs/crypto/keyring.c
> @@ -956,13 +956,16 @@ static void evict_dentries_for_decrypted_inodes(struct fscrypt_master_key *mk)
>
> list_for_each_entry(ci, &mk->mk_decrypted_inodes, ci_master_key_link) {
> inode = ci->ci_inode;
> +
> spin_lock(&inode->i_lock);
> - if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
> + if (inode->i_state & I_NEW) {
> spin_unlock(&inode->i_lock);
> continue;
> }
> - __iget(inode);
> spin_unlock(&inode->i_lock);
> +
> + if (!igrab(inode))
> + continue;
> spin_unlock(&mk->mk_decrypted_inodes_lock);
>
> shrink_dcache_inode(inode);
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 39/54] fs: remove I_WILL_FREE|I_FREEING check from dquot.c
2025-08-26 15:39 ` [PATCH v2 39/54] fs: remove I_WILL_FREE|I_FREEING check from dquot.c Josef Bacik
@ 2025-08-28 12:35 ` Christian Brauner
0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 12:35 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Tue, Aug 26, 2025 at 11:39:39AM -0400, Josef Bacik wrote:
> We can use the reference count to see if the inode is live.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/quota/dquot.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index df4a9b348769..90e69653c261 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -1030,14 +1030,16 @@ static int add_dquot_ref(struct super_block *sb, int type)
> spin_lock(&sb->s_inode_list_lock);
> list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> spin_lock(&inode->i_lock);
> - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> + if ((inode->i_state & I_NEW) ||
> !atomic_read(&inode->i_writecount) ||
> !dqinit_needed(inode, type)) {
> spin_unlock(&inode->i_lock);
> continue;
> }
> - __iget(inode);
> spin_unlock(&inode->i_lock);
> +
> + if (!igrab(inode))
> + continue;
Using this to drop a comment that I mentioned to you. I think we should
have an iterator for this because we have the exact same pattern in so
many places it's annoying.
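
Editorial sketch of what such an iterator could look like. No iterator of this
kind is posted in this thread, so the macro and helper names are hypothetical,
and the model is single-threaded userspace C with no locking. It only captures
the repeated "skip new inodes, skip inodes that cannot be grabbed, visit the
rest with a reference held" pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_inode {
	unsigned int i_count;		/* 0 means the inode is being torn down */
	bool is_new;			/* stands in for I_NEW */
	struct toy_inode *next;		/* stands in for the i_sb_list linkage */
};

/* Stand-in for igrab(): take a reference only if the inode is still live. */
static struct toy_inode *toy_igrab(struct toy_inode *inode)
{
	if (inode->i_count == 0)
		return NULL;
	inode->i_count++;
	return inode;
}

/*
 * Return the first inode at or after `pos` that is not new and could be
 * grabbed, with a reference held; NULL at the end of the list.
 */
static struct toy_inode *toy_next_live(struct toy_inode *pos)
{
	for (; pos; pos = pos->next) {
		if (pos->is_new)
			continue;
		if (toy_igrab(pos))
			return pos;
	}
	return NULL;
}

/* Hypothetical iterator; the loop body must drop the reference it is handed. */
#define toy_for_each_live_inode(inode, head)			\
	for ((inode) = toy_next_live(head); (inode);		\
	     (inode) = toy_next_live((inode)->next))
```

A caller such as add_dquot_ref() would then keep only its own extra filters
(i_writecount, dqinit_needed()) in the loop body. A real version would also
have to handle s_inode_list_lock and the deferred iput of the previous inode,
which this sketch does not attempt.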
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 43/54] fs: change inode_is_dirtytime_only to use refcount
2025-08-26 22:06 ` Mateusz Guzik
@ 2025-08-28 12:38 ` Christian Brauner
0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 12:38 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Josef Bacik, linux-fsdevel, linux-btrfs, kernel-team, linux-ext4,
linux-xfs, viro, amir73il
On Wed, Aug 27, 2025 at 12:06:12AM +0200, Mateusz Guzik wrote:
> On Tue, Aug 26, 2025 at 11:39:43AM -0400, Josef Bacik wrote:
> > We don't need the I_WILL_FREE|I_FREEING check, we can use the refcount
> > to see if the inode is valid.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> > include/linux/fs.h | 14 +++++++-------
> > 1 file changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index b13d057ad0d7..531a6d0afa75 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2628,6 +2628,11 @@ static inline void mark_inode_dirty_sync(struct inode *inode)
> > __mark_inode_dirty(inode, I_DIRTY_SYNC);
> > }
> >
> > +static inline int icount_read(const struct inode *inode)
> > +{
> > + return refcount_read(&inode->i_count);
> > +}
> > +
> > /*
> > * Returns true if the given inode itself only has dirty timestamps (its pages
> > * may still be dirty) and isn't currently being allocated or freed.
> > @@ -2639,8 +2644,8 @@ static inline void mark_inode_dirty_sync(struct inode *inode)
> > */
> > static inline bool inode_is_dirtytime_only(struct inode *inode)
> > {
> > - return (inode->i_state & (I_DIRTY_TIME | I_NEW |
> > - I_FREEING | I_WILL_FREE)) == I_DIRTY_TIME;
> > + return (inode->i_state & (I_DIRTY_TIME | I_NEW)) == I_DIRTY_TIME &&
> > + icount_read(inode);
> > }
> >
> > extern void inc_nlink(struct inode *inode);
> > @@ -3432,11 +3437,6 @@ static inline void __iget(struct inode *inode)
> > refcount_inc(&inode->i_count);
> > }
> >
> > -static inline int icount_read(const struct inode *inode)
> > -{
> > - return refcount_read(&inode->i_count);
> > -}
> > -
> > extern void iget_failed(struct inode *);
> > extern void clear_inode(struct inode *);
> > extern void __destroy_inode(struct inode *);
> > --
> > 2.49.0
> >
>
> nit: I would change the diff introducing icount_read() to already place
> it in the right spot. As is this is going to mess with blame for no good
> reason.
Fwiw, I did that in the preliminaries patch. Just looked at your comment
here.
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 48/54] fs: remove some spurious I_FREEING references in inode.c
2025-08-26 15:39 ` [PATCH v2 48/54] fs: remove some spurious I_FREEING references in inode.c Josef Bacik
@ 2025-08-28 12:40 ` Christian Brauner
0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 12:40 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Tue, Aug 26, 2025 at 11:39:48AM -0400, Josef Bacik wrote:
> Now that we have the i_count reference count rules set so that we only
> go into these evict paths with a 0 count, update the sanity checks to
> check that instead of I_FREEING.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index eb74f7b5e967..da38c9fbb9a7 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -858,7 +858,7 @@ void clear_inode(struct inode *inode)
> */
> xa_unlock_irq(&inode->i_data.i_pages);
> BUG_ON(!list_empty(&inode->i_data.i_private_list));
> - BUG_ON(!(inode->i_state & I_FREEING));
> + BUG_ON(icount_read(inode) != 0);
> BUG_ON(inode->i_state & I_CLEAR);
> BUG_ON(!list_empty(&inode->i_wb_list));
These should probably all be WARN_ON()s.
> /* don't need i_lock here, no concurrent mods to i_state */
> @@ -871,19 +871,19 @@ EXPORT_SYMBOL(clear_inode);
> * to. We remove any pages still attached to the inode and wait for any IO that
> * is still in progress before finally destroying the inode.
> *
> - * An inode must already be marked I_FREEING so that we avoid the inode being
> + * An inode must already have an i_count of 0 so that we avoid the inode being
> * moved back onto lists if we race with other code that manipulates the lists
> * (e.g. writeback_single_inode). The caller is responsible for setting this.
> *
> * An inode must already be removed from the LRU list before being evicted from
> - * the cache. This should occur atomically with setting the I_FREEING state
> - * flag, so no inodes here should ever be on the LRU when being evicted.
> + * the cache. This should always be the case as the LRU list holds an i_count
> + * reference on the inode, and we only evict inodes with an i_count of 0.
> */
> static void evict(struct inode *inode)
> {
> const struct super_operations *op = inode->i_sb->s_op;
>
> - BUG_ON(!(inode->i_state & I_FREEING));
> + BUG_ON(icount_read(inode) != 0);
> BUG_ON(!list_empty(&inode->i_lru));
>
> if (!list_empty(&inode->i_io_list))
> @@ -897,8 +897,8 @@ static void evict(struct inode *inode)
> /*
> * Wait for flusher thread to be done with the inode so that filesystem
> * does not start destroying it while writeback is still running. Since
> - * the inode has I_FREEING set, flusher thread won't start new work on
> - * the inode. We just have to wait for running writeback to finish.
> + * the inode has a 0 i_count, flusher thread won't start new work on the
> + * inode. We just have to wait for running writeback to finish.
> */
> inode_wait_for_writeback(inode);
> spin_unlock(&inode->i_lock);
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 51/54] fs: remove I_FREEING|I_WILL_FREE
2025-08-26 15:39 ` [PATCH v2 51/54] fs: remove I_FREEING|I_WILL_FREE Josef Bacik
@ 2025-08-28 12:42 ` Christian Brauner
0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 12:42 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Tue, Aug 26, 2025 at 11:39:51AM -0400, Josef Bacik wrote:
> Now that we're using the i_count reference count as the ultimate arbiter
> of whether or not an inode is live we can remove the I_FREEING and
> I_WILL_FREE flags.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
Very good looking diff.
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 52/54] fs: remove I_REFERENCED
2025-08-26 15:39 ` [PATCH v2 52/54] fs: remove I_REFERENCED Josef Bacik
@ 2025-08-28 12:47 ` Christian Brauner
0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 12:47 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Tue, Aug 26, 2025 at 11:39:52AM -0400, Josef Bacik wrote:
> Because we have referenced inodes on the LRU we've had to change the
> behavior to make sure we remove the inode from the LRU when we reference
> it.
All for this patch but I'm confused by this sentence. Maybe:
"Because inodes must hold a full refcount while on the LRU we've had to
change the behavior to make sure we remove the inode from the LRU when
someone takes another full refcount to it."?
Basically, I_REFERENCED was a way to sidestep the problem that if there
was another access to the inode and it was already on the LRU we had to
do something to acknowledge that.
This is now completely put on its head by just always taking them off
in such cases and afaict we only do that when we take another full
reference (or I guess when it's actually evicted).
>
> We do this to account for the fact that we may end up with an inode on
> the LRU list, and then unlink the inode. We want the last iput() in the
> unlink() to actually evict the inode ideally, so we don't want it to
> stick around on the LRU and be evicted that way.
>
> With that behavior change we no longer need I_REFERENCED, as we're
> always removing the inode from the LRU list on a subsequent access if
> it's on the LRU.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 36 +++++++-------------------------
> include/linux/fs.h | 22 +++++++++----------
> include/trace/events/writeback.h | 1 -
> 3 files changed, 17 insertions(+), 42 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 8f61761ca021..4f77db7aca75 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -591,7 +591,12 @@ static bool inode_del_cached_lru(struct inode *inode)
> return false;
> }
>
> -static void __inode_add_lru(struct inode *inode, bool rotate)
> +/*
> + * Add inode to LRU if needed (inode is unused and clean).
> + *
> + * Needs inode->i_lock held.
> + */
> +void inode_add_lru(struct inode *inode)
> {
> bool need_ref = true;
>
> @@ -614,8 +619,6 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> if (need_ref)
> __iget(inode);
> this_cpu_inc(nr_unused);
> - } else if (rotate) {
> - inode->i_state |= I_REFERENCED;
> }
> }
>
> @@ -630,16 +633,6 @@ struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> }
> EXPORT_SYMBOL(inode_bit_waitqueue);
>
> -/*
> - * Add inode to LRU if needed (inode is unused and clean).
> - *
> - * Needs inode->i_lock held.
> - */
> -void inode_add_lru(struct inode *inode)
> -{
> - __inode_add_lru(inode, false);
> -}
> -
> /*
> * Caller must be holding it's own i_count reference on this inode in order to
> * prevent this being the final iput.
> @@ -1001,14 +994,6 @@ EXPORT_SYMBOL_GPL(evict_inodes);
>
> /*
> * Isolate the inode from the LRU in preparation for freeing it.
> - *
> - * If the inode has the I_REFERENCED flag set, then it means that it has been
> - * used recently - the flag is set in iput_final(). When we encounter such an
> - * inode, clear the flag and move it to the back of the LRU so it gets another
> - * pass through the LRU before it gets reclaimed. This is necessary because of
> - * the fact we are doing lazy LRU updates to minimise lock contention so the
> - * LRU does not have strict ordering. Hence we don't want to reclaim inodes
> - * with this flag set because they are the inodes that are out of order.
> */
> static enum lru_status inode_lru_isolate(struct list_head *item,
> struct list_lru_one *lru, void *arg)
> @@ -1039,13 +1024,6 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> return LRU_REMOVED;
> }
>
> - /* Recently referenced inodes get one more pass */
> - if (inode->i_state & I_REFERENCED) {
> - inode->i_state &= ~I_REFERENCED;
> - spin_unlock(&inode->i_lock);
> - return LRU_ROTATE;
> - }
> -
> /*
> * On highmem systems, mapping_shrinkable() permits dropping
> * page cache in order to free up struct inodes: lowmem might
> @@ -1995,7 +1973,7 @@ static bool maybe_add_lru(struct inode *inode, bool skip_lru)
> if (!(sb->s_flags & SB_ACTIVE))
> return drop;
>
> - __inode_add_lru(inode, true);
> + inode_add_lru(inode);
> return drop;
> }
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2a7e7fc96431..39cde53c1b3b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -715,7 +715,6 @@ is_uncached_acl(struct posix_acl *acl)
> * address once it is done. The bit is also used to pin
> * the inode in memory for flusher thread.
> *
> - * I_REFERENCED Marks the inode as recently references on the LRU list.
> *
> * I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to
> * synchronize competing switching instances and to tell
> @@ -764,17 +763,16 @@ enum inode_state_flags_t {
> I_DIRTY_DATASYNC = (1U << 4),
> I_DIRTY_PAGES = (1U << 5),
> I_CLEAR = (1U << 6),
> - I_REFERENCED = (1U << 7),
> - I_LINKABLE = (1U << 8),
> - I_DIRTY_TIME = (1U << 9),
> - I_WB_SWITCH = (1U << 10),
> - I_OVL_INUSE = (1U << 11),
> - I_CREATING = (1U << 12),
> - I_DONTCACHE = (1U << 13),
> - I_SYNC_QUEUED = (1U << 14),
> - I_PINNING_NETFS_WB = (1U << 15),
> - I_LRU = (1U << 16),
> - I_CACHED_LRU = (1U << 17)
> + I_LINKABLE = (1U << 7),
> + I_DIRTY_TIME = (1U << 8),
> + I_WB_SWITCH = (1U << 9),
> + I_OVL_INUSE = (1U << 10),
> + I_CREATING = (1U << 11),
> + I_DONTCACHE = (1U << 12),
> + I_SYNC_QUEUED = (1U << 13),
> + I_PINNING_NETFS_WB = (1U << 14),
> + I_LRU = (1U << 15),
> + I_CACHED_LRU = (1U << 16)
> };
>
> #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index 58ee61f3d91d..b419b8060dda 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -18,7 +18,6 @@
> {I_CLEAR, "I_CLEAR"}, \
> {I_SYNC, "I_SYNC"}, \
> {I_DIRTY_TIME, "I_DIRTY_TIME"}, \
> - {I_REFERENCED, "I_REFERENCED"}, \
> {I_LINKABLE, "I_LINKABLE"}, \
> {I_WB_SWITCH, "I_WB_SWITCH"}, \
> {I_OVL_INUSE, "I_OVL_INUSE"}, \
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 00/54] fs: rework inode reference counting
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (55 preceding siblings ...)
2025-08-27 11:14 ` (subset) [PATCH v2 00/54] " Christian Brauner
@ 2025-08-28 12:51 ` Christian Brauner
2025-08-28 21:22 ` Josef Bacik
2025-09-02 10:06 ` Mateusz Guzik
57 siblings, 1 reply; 105+ messages in thread
From: Christian Brauner @ 2025-08-28 12:51 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Tue, Aug 26, 2025 at 11:39:00AM -0400, Josef Bacik wrote:
> v1: https://lore.kernel.org/linux-fsdevel/cover.1755806649.git.josef@toxicpanda.com/
>
> v1->v2:
I've been through the series apart from the Documentation so far (I'll
read that later to see how it matches my own understanding). To me this
all looks pretty great. The death of all these flags is amazing and if
we can experiment with the icache removal next that would be very nice.
So I'll wait for some more comments and maybe a final resend but I'm quite
happy. I'm sure I've missed subtleties but testing will hopefully also
shake out a few additional bugs.
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 00/54] fs: rework inode reference counting
2025-08-28 12:51 ` Christian Brauner
@ 2025-08-28 21:22 ` Josef Bacik
0 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-28 21:22 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro, amir73il
On Thu, Aug 28, 2025 at 02:51:23PM +0200, Christian Brauner wrote:
> On Tue, Aug 26, 2025 at 11:39:00AM -0400, Josef Bacik wrote:
> > v1: https://lore.kernel.org/linux-fsdevel/cover.1755806649.git.josef@toxicpanda.com/
> >
> > v1->v2:
>
> I've been through the series apart from the Documentation so far (I'll
> read that later to see how it matches my own understanding). To me this
> all looks pretty great. The death of all these flags is amazing and if
> we can experiment with the icache removal next that would be very nice.
>
> So I'll wait for some more comments and maybe a final resend but I'm quite
> happy. I'm sure I've missed subtleties but testing will hopefully also
> shake out a few additional bugs.
Perfect, I've been fixing things as I've gone along. I'm going to wait to see if
Dave has any other thoughts while I'm asleep, and then I'll resend tomorrow.
Thanks,
Josef
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 23/54] fs: use refcount_inc_not_zero in igrab
2025-08-26 15:39 ` [PATCH v2 23/54] fs: use refcount_inc_not_zero in igrab Josef Bacik
@ 2025-08-28 22:08 ` Eric Biggers
2025-08-29 13:42 ` Josef Bacik
0 siblings, 1 reply; 105+ messages in thread
From: Eric Biggers @ 2025-08-28 22:08 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
On Tue, Aug 26, 2025 at 11:39:23AM -0400, Josef Bacik wrote:
> +static inline struct inode *inode_tryget(struct inode *inode)
> +{
> + /*
> + * We are using inode_tryget() because we're interested in getting a
> + * live reference to the inode, which is ->i_count. Normally we would
> + * grab i_obj_count first, as it is the higher priority reference.
> + * However we're only interested in making sure we have a live inode,
> + * and we know that if we get a reference for i_count then we can safely
> + * acquire i_obj_count because we always drop i_obj_count after dropping
> + * an i_count reference.
> + *
> + * This is meant to be used either in a place where we have an existing
> + * i_obj_count reference on the inode, or under rcu_read_lock() so we
> + * know we're safe in accessing this inode still.
> + */
> + VFS_WARN_ON_ONCE(!iobj_count_read(inode) && !rcu_read_lock_held());
> +
> + if (refcount_inc_not_zero(&inode->i_count)) {
> + iobj_get(inode);
> + return inode;
> + }
> +
> + /*
> + * If we failed to increment the reference count, then the
> + * inode is being freed or has been freed. We return NULL
> + * in this case.
> + */
> + return NULL;
Is there a reason to take one i_obj_count reference per i_count
reference, instead of a single i_obj_count reference associated with
i_count being nonzero? With a single reference owned by i_count != 0,
it wouldn't be necessary to touch i_obj_count when i_count is changed,
except when i_count reaches zero. That would be more efficient.
BTW, fscrypt_master_key::mk_active_refs and
fscrypt_master_key::mk_struct_refs use that solution. For
mk_active_refs != 0, one reference in mk_struct_refs is held.
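
The single-reference scheme described above can be sketched in userspace C11 roughly as follows. This is a minimal model under stated assumptions, not kernel code: struct obj, active_refs, struct_refs and the obj_* helpers are hypothetical stand-ins for i_count, i_obj_count and their accessors, and the freed flag replaces an actual free so the state can be inspected.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's struct inode. */
struct obj {
	atomic_int active_refs;	/* plays the role of i_count */
	atomic_int struct_refs;	/* plays the role of i_obj_count */
	bool freed;		/* set instead of calling free() */
};

static void obj_init(struct obj *o)
{
	/* One active user; all active users share ONE struct reference. */
	atomic_init(&o->active_refs, 1);
	atomic_init(&o->struct_refs, 1);
	o->freed = false;
}

static void obj_get_struct(struct obj *o)
{
	atomic_fetch_add(&o->struct_refs, 1);
}

static void obj_put_struct(struct obj *o)
{
	int old = atomic_fetch_sub(&o->struct_refs, 1);

	assert(old > 0);
	if (old == 1)
		o->freed = true;
}

/* Take an active reference only if the object is still live. */
static bool obj_tryget_active(struct obj *o)
{
	int old = atomic_load(&o->active_refs);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->active_refs, &old, old + 1))
			return true;	/* struct_refs is not touched here */
	}
	return false;
}

static void obj_put_active(struct obj *o)
{
	int old = atomic_fetch_sub(&o->active_refs, 1);

	assert(old > 0);
	/* Only the 1 -> 0 transition releases the shared struct reference. */
	if (old == 1)
		obj_put_struct(o);
}
```

In this model the common get/put path touches a single counter; the second counter is only written when the active count drops to zero or when a holder explicitly wants to keep the memory around.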
- Eric
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 30/54] fs: change evict_dentries_for_decrypted_inodes to use refcount
2025-08-28 12:25 ` Christian Brauner
@ 2025-08-28 22:26 ` Eric Biggers
2025-08-29 7:38 ` Christian Brauner
0 siblings, 1 reply; 105+ messages in thread
From: Eric Biggers @ 2025-08-28 22:26 UTC (permalink / raw)
To: Christian Brauner
Cc: Josef Bacik, linux-fsdevel, linux-btrfs, kernel-team, linux-ext4,
linux-xfs, viro, amir73il
On Thu, Aug 28, 2025 at 02:25:21PM +0200, Christian Brauner wrote:
> On Tue, Aug 26, 2025 at 11:39:30AM -0400, Josef Bacik wrote:
> > Instead of checking for I_WILL_FREE|I_FREEING simply use the refcount to
> > make sure we have a live inode.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> I have no idea how the lifetime of such decrypted inodes are managed.
> I suppose they don't carry a separate reference but are still somehow
> safe to be accessed based on the mk_decrypted_inodes list. In any case
> something must hold an i_obj_count if we want to use igrab() since I
> don't see any relevant rcu protection here.
>
inodes are placed on mk_decrypted_inodes by the filesystem while it's
holding an i_count reference, and they're removed from the list by
->evict_inode shortly after i_count reaches zero. So the corresponding
i_obj_count reference is just the one associated with i_count.
This patch looks correct: we do igrab() while holding
mk_decrypted_inodes_lock. So either i_count is nonzero and we get a
temporary i_count reference, or it's zero and we skip the inode since it
cannot have dentries (and ->evict_inode is coming very soon as well).
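
As a rough illustration of the argument above, here is a userspace C11 sketch of walking a lock-protected list and pinning only the entries whose count is still nonzero. All names (struct node, struct key, node_tryget, pin_live_nodes) are hypothetical, the atomic_flag spinlock merely stands in for mk_decrypted_inodes_lock, and collecting the pinned entries into an array is a simplification rather than what the fscrypt code does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	atomic_int count;	/* plays the role of i_count */
	struct node *next;
};

struct key {
	atomic_flag lock;	/* stand-in for mk_decrypted_inodes_lock */
	struct node *head;
};

static void key_init(struct key *k)
{
	atomic_flag_clear(&k->lock);
	k->head = NULL;
}

static void key_lock(struct key *k)
{
	while (atomic_flag_test_and_set(&k->lock)) {
		/* spin until the holder clears the flag */
	}
}

static void key_unlock(struct key *k)
{
	atomic_flag_clear(&k->lock);
}

/* Increment the count unless it is zero, like refcount_inc_not_zero(). */
static bool node_tryget(struct node *n)
{
	int old = atomic_load(&n->count);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&n->count, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Pin every live node on the list. The lock keeps nodes from being
 * unlinked while we look at them; nodes whose count already hit zero
 * are skipped because their teardown is about to unlink them anyway.
 */
static size_t pin_live_nodes(struct key *k, struct node **out, size_t max)
{
	size_t n = 0;

	key_lock(k);
	for (struct node *p = k->head; p && n < max; p = p->next) {
		if (node_tryget(p))
			out[n++] = p;
	}
	key_unlock(k);
	return n;
}
```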
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
- Eric
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 30/54] fs: change evict_dentries_for_decrypted_inodes to use refcount
2025-08-28 22:26 ` Eric Biggers
@ 2025-08-29 7:38 ` Christian Brauner
0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-08-29 7:38 UTC (permalink / raw)
To: Eric Biggers
Cc: Josef Bacik, linux-fsdevel, linux-btrfs, kernel-team, linux-ext4,
linux-xfs, viro, amir73il
On Thu, Aug 28, 2025 at 10:26:10PM +0000, Eric Biggers wrote:
> On Thu, Aug 28, 2025 at 02:25:21PM +0200, Christian Brauner wrote:
> > On Tue, Aug 26, 2025 at 11:39:30AM -0400, Josef Bacik wrote:
> > > Instead of checking for I_WILL_FREE|I_FREEING simply use the refcount to
> > > make sure we have a live inode.
> > >
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > > ---
> > I have no idea how the lifetime of such decrypted inodes are managed.
> > I suppose they don't carry a separate reference but are still somehow
> > safe to be accessed based on the mk_decrypted_inodes list. In any case
> > something must hold an i_obj_count if we want to use igrab() since I
> > don't see any relevant rcu protection here.
> >
>
> inodes are placed on mk_decrypted_inodes by the filesystem while it's
> holding an i_count reference, and they're removed from the list by
> ->evict_inode shortly after i_count reaches zero. So the corresponding
> i_obj_count reference is just the one associated with i_count.
>
> This patch looks correct: we do igrab() while holding
> mk_decrypted_inodes_lock. So either i_count is nonzero and we get a
> temporary i_count reference, or it's zero and we skip the inode since it
> cannot have dentries (and ->evict_inode is coming very soon as well).
Thanks for the explanation, Eric! That was exactly what I was looking for.
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 23/54] fs: use refcount_inc_not_zero in igrab
2025-08-28 22:08 ` Eric Biggers
@ 2025-08-29 13:42 ` Josef Bacik
0 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-08-29 13:42 UTC (permalink / raw)
To: Eric Biggers
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
On Thu, Aug 28, 2025 at 10:08:06PM +0000, Eric Biggers wrote:
> On Tue, Aug 26, 2025 at 11:39:23AM -0400, Josef Bacik wrote:
> > +static inline struct inode *inode_tryget(struct inode *inode)
> > +{
> > + /*
> > + * We are using inode_tryget() because we're interested in getting a
> > + * live reference to the inode, which is ->i_count. Normally we would
> > + * grab i_obj_count first, as it is the higher priority reference.
> > + * However we're only interested in making sure we have a live inode,
> > + * and we know that if we get a reference for i_count then we can safely
> > + * acquire i_obj_count because we always drop i_obj_count after dropping
> > + * an i_count reference.
> > + *
> > + * This is meant to be used either in a place where we have an existing
> > + * i_obj_count reference on the inode, or under rcu_read_lock() so we
> > + * know we're safe in accessing this inode still.
> > + */
> > + VFS_WARN_ON_ONCE(!iobj_count_read(inode) && !rcu_read_lock_held());
> > +
> > + if (refcount_inc_not_zero(&inode->i_count)) {
> > + iobj_get(inode);
> > + return inode;
> > + }
> > +
> > + /*
> > + * If we failed to increment the reference count, then the
> > + * inode is being freed or has been freed. We return NULL
> > + * in this case.
> > + */
> > + return NULL;
>
> Is there a reason to take one i_obj_count reference per i_count
> reference, instead of a single i_obj_count reference associated with
> i_count being nonzero? With a single reference owned by i_count != 0,
> it wouldn't be necessary to touch i_obj_count when i_count is changed,
> except when i_count reaches zero. That would be more efficient.
>
> BTW, fscrypt_master_key::mk_active_refs and
> fscrypt_master_key::mk_struct_refs use that solution. For
> mk_active_refs != 0, one reference in mk_struct_refs is held.
>
That certainly could be done as well, hell I do that pattern for the writeback
lists and such. I'll discuss with Christian and see what he thinks. Thanks,
Josef
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH] fs: revamp iput()
2025-08-27 16:24 ` [PATCH] fs: revamp iput() Mateusz Guzik
@ 2025-08-30 15:54 ` Mateusz Guzik
2025-09-01 8:50 ` Jan Kara
2025-09-01 10:41 ` Christian Brauner
0 siblings, 2 replies; 105+ messages in thread
From: Mateusz Guzik @ 2025-08-30 15:54 UTC (permalink / raw)
To: brauner
Cc: viro, jack, linux-kernel, linux-fsdevel, josef, kernel-team,
amir73il, linux-btrfs, linux-ext4, linux-xfs
I'm writing a long response to this series, in the meantime I noticed
this bit landed in
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs-6.18.inode.refcount.preliminaries&id=3cba19f6a00675fbc2af0987dfc90e216e6cfb74
but with some whitespace issues in comments -- they are indented with
spaces instead of tabs after the opening line.
I verified the mail I sent does not have it, so I'm guessing this was
copy-pasted?
Tabbing them by hand does the trick; below is my copy-paste as proof,
please indent by hand in your editor ;)
diff --git a/fs/inode.c b/fs/inode.c
index 2db680a37235..fe4868e2a954 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1915,10 +1915,10 @@ void iput(struct inode *inode)
lockdep_assert_not_held(&inode->i_lock);
VFS_BUG_ON_INODE(inode->i_state & I_CLEAR, inode);
/*
- * Note this assert is technically racy as if the count is bogusly
- * equal to one, then two CPUs racing to further drop it can both
- * conclude it's fine.
- */
+ * Note this assert is technically racy as if the count is bogusly
+ * equal to one, then two CPUs racing to further drop it can both
+ * conclude it's fine.
+ */
VFS_BUG_ON_INODE(atomic_read(&inode->i_count) < 1, inode);
if (atomic_add_unless(&inode->i_count, -1, 1))
@@ -1942,9 +1942,9 @@ void iput(struct inode *inode)
}
/*
- * iput_final() drops ->i_lock, we can't assert on it as the inode may
- * be deallocated by the time the call returns.
- */
+ * iput_final() drops ->i_lock, we can't assert on it as the inode may
+ * be deallocated by the time the call returns.
+ */
iput_final(inode);
}
EXPORT_SYMBOL(iput);
While here, vim told me about spaces instead of tabs in 2 more spots
in the file. Again to show the lines:
diff --git a/fs/inode.c b/fs/inode.c
index 2db680a37235..833de5457a06 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -550,11 +550,11 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
struct inode *inode, u32 bit)
{
- void *bit_address;
+ void *bit_address;
- bit_address = inode_state_wait_address(inode, bit);
- init_wait_var_entry(wqe, bit_address, 0);
- return __var_waitqueue(bit_address);
+ bit_address = inode_state_wait_address(inode, bit);
+ init_wait_var_entry(wqe, bit_address, 0);
+ return __var_waitqueue(bit_address);
}
EXPORT_SYMBOL(inode_bit_waitqueue);
@@ -2938,7 +2938,7 @@ EXPORT_SYMBOL(mode_strip_sgid);
*/
void dump_inode(struct inode *inode, const char *reason)
{
- pr_warn("%s encountered for inode %px", reason, inode);
+ pr_warn("%s encountered for inode %px", reason, inode);
}
EXPORT_SYMBOL(dump_inode);
Christian, I think it would be the most expedient if you just made
changes on your own with whatever commit message you see fit. No need
to mention I brought this up. If you insist I can send a patch.
On Wed, Aug 27, 2025 at 6:24 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> The material change is I_DIRTY_TIME handling without a spurious ref
> acquire/release cycle.
>
> While here a bunch of smaller changes:
> 1. predict there is an inode -- bpftrace suggests one is passed vast
> majority of the time
> 2. convert BUG_ON into VFS_BUG_ON_INODE
> 3. assert on ->i_count
> 4. assert ->i_lock is not held
> 5. flip the order of I_DIRTY_TIME and nlink count checks as the former
> is less likely to be true
>
> I verified atomic_read(&inode->i_count) does not show up in asm if
> debug is disabled.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
>
> The routine kept annoying me, so here is a further revised variant.
>
> I verified this compiles, but I still cannot runtime test. I'm sorry for
> that. My signed-off is conditional on a good samaritan making sure it
> works :)
>
> diff compared to the thing I sent "informally":
> - if (unlikely(!inode))
> - asserts
> - slightly reworded iput_final commentary
> - unlikely() on the second I_DIRTY_TIME check
>
> Given the revamp I think it makes sense to attribute the change to me,
> hence a "proper" mail.
>
> The thing surviving from the submission by Josef is:
> + if (atomic_add_unless(&inode->i_count, -1, 1))
> + return;
>
> And of course he is the one who brought up the spurious refcount trip in
> the first place.
>
> I'm happy with Reported-by, Co-developed-by or whatever other credit
> as you guys see fit.
>
> That aside I think it would be nice if NULL inodes passed to iput
> became illegal, but that's a different story for another day.
>
> fs/inode.c | 46 +++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 35 insertions(+), 11 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 01ebdc40021e..01a554e11279 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1908,20 +1908,44 @@ static void iput_final(struct inode *inode)
> */
> void iput(struct inode *inode)
> {
> - if (!inode)
> + if (unlikely(!inode))
> return;
> - BUG_ON(inode->i_state & I_CLEAR);
> +
> retry:
> - if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
> - if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> - atomic_inc(&inode->i_count);
> - spin_unlock(&inode->i_lock);
> - trace_writeback_lazytime_iput(inode);
> - mark_inode_dirty_sync(inode);
> - goto retry;
> - }
> - iput_final(inode);
> + lockdep_assert_not_held(&inode->i_lock);
> + VFS_BUG_ON_INODE(inode->i_state & I_CLEAR, inode);
> + /*
> + * Note this assert is technically racy as if the count is bogusly
> + * equal to one, then two CPUs racing to further drop it can both
> + * conclude it's fine.
> + */
> + VFS_BUG_ON_INODE(atomic_read(&inode->i_count) < 1, inode);
> +
> + if (atomic_add_unless(&inode->i_count, -1, 1))
> + return;
> +
> + if ((inode->i_state & I_DIRTY_TIME) && inode->i_nlink) {
> + trace_writeback_lazytime_iput(inode);
> + mark_inode_dirty_sync(inode);
> + goto retry;
> }
> +
> + spin_lock(&inode->i_lock);
> + if (unlikely((inode->i_state & I_DIRTY_TIME) && inode->i_nlink)) {
> + spin_unlock(&inode->i_lock);
> + goto retry;
> + }
> +
> + if (!atomic_dec_and_test(&inode->i_count)) {
> + spin_unlock(&inode->i_lock);
> + return;
> + }
> +
> + /*
> + * iput_final() drops ->i_lock, we can't assert on it as the inode may
> + * be deallocated by the time the call returns.
> + */
> + iput_final(inode);
> }
> EXPORT_SYMBOL(iput);
>
> --
> 2.43.0
>
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply related [flat|nested] 105+ messages in thread
* Re: [PATCH] fs: revamp iput()
2025-08-30 15:54 ` Mateusz Guzik
@ 2025-09-01 8:50 ` Jan Kara
2025-09-01 10:39 ` Christian Brauner
2025-09-01 10:41 ` Christian Brauner
1 sibling, 1 reply; 105+ messages in thread
From: Jan Kara @ 2025-09-01 8:50 UTC (permalink / raw)
To: Mateusz Guzik
Cc: brauner, viro, jack, linux-kernel, linux-fsdevel, josef,
kernel-team, amir73il, linux-btrfs, linux-ext4, linux-xfs
On Sat 30-08-25 17:54:35, Mateusz Guzik wrote:
> I'm writing a long response to this series, in the meantime I noticed
> this bit landed in
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs-6.18.inode.refcount.preliminaries&id=3cba19f6a00675fbc2af0987dfc90e216e6cfb74
> but with some whitespace issues in comments -- they are indented with
> spaces instead of tabs after the opening line.
Interesting. I didn't see an email about inclusion. Anyway, the change
looks good to me so Christian, feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH] fs: revamp iput()
2025-09-01 8:50 ` Jan Kara
@ 2025-09-01 10:39 ` Christian Brauner
0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-09-01 10:39 UTC (permalink / raw)
To: Jan Kara
Cc: Mateusz Guzik, viro, linux-kernel, linux-fsdevel, josef,
kernel-team, amir73il, linux-btrfs, linux-ext4, linux-xfs
On Mon, Sep 01, 2025 at 10:50:59AM +0200, Jan Kara wrote:
> On Sat 30-08-25 17:54:35, Mateusz Guzik wrote:
> > I'm writing a long response to this series, in the meantime I noticed
> > this bit landed in
> > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs-6.18.inode.refcount.preliminaries&id=3cba19f6a00675fbc2af0987dfc90e216e6cfb74
> > but with some whitespace issues in comments -- they are indented with
> > spaces instead of tabs after the opening line.
>
> Interesting. I didn't see an email about inclusion. Anyway, the change
Sorry, that was my bad. I talked with Josef off-list and told him that
I would apply Mateusz's suggestions with his CdB and SoB added. I forgot
to repeat that on the list.
> looks good to me so Christian, feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
Thanks!
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH] fs: revamp iput()
2025-08-30 15:54 ` Mateusz Guzik
2025-09-01 8:50 ` Jan Kara
@ 2025-09-01 10:41 ` Christian Brauner
1 sibling, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2025-09-01 10:41 UTC (permalink / raw)
To: Mateusz Guzik
Cc: viro, jack, linux-kernel, linux-fsdevel, josef, kernel-team,
amir73il, linux-btrfs, linux-ext4, linux-xfs
> Christian, I think it would be the most expedient if you just made
Ok, thanks.
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir
2025-08-28 11:46 ` Josef Bacik
@ 2025-09-02 1:48 ` Dave Chinner
0 siblings, 0 replies; 105+ messages in thread
From: Dave Chinner @ 2025-09-02 1:48 UTC (permalink / raw)
To: Josef Bacik
Cc: Christian Brauner, linux-fsdevel, linux-btrfs, kernel-team,
linux-ext4, linux-xfs, viro, amir73il
On Thu, Aug 28, 2025 at 07:46:13AM -0400, Josef Bacik wrote:
> On Thu, Aug 28, 2025 at 08:01:39AM +1000, Dave Chinner wrote:
> > On Wed, Aug 27, 2025 at 02:32:49PM +0200, Christian Brauner wrote:
> > > On Tue, Aug 26, 2025 at 11:39:17AM -0400, Josef Bacik wrote:
> > > > We can end up with an inode on the LRU list or the cached list, then at
> > > > some point in the future go to unlink that inode and then still have an
> > > > elevated i_count reference for that inode because it is on one of these
> > > > lists.
> > > >
> > > > The more common case is the cached list. We open a file, write to it,
> > > > truncate some of it which triggers the inode_add_lru code in the
> > > > pagecache, adding it to the cached LRU. Then we unlink this inode, and
> > > > it exists until writeback or reclaim kicks in and removes the inode.
> > > >
> > > > To handle this case, delete the inode from the LRU list when it is
> > > > unlinked, so we have the best case scenario for immediately freeing the
> > > > inode.
> > > >
> > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > > > ---
> > >
> > > I'm not too fond of this particular change I think it's really misplaced
> > > and the correct place is indeed drop_nlink() and clear_nlink().
> >
> > I don't really like putting it in drop_nlink because that then puts
> > the inode LRU in the middle of filesystem transactions when lots of
> > different filesystem locks are held.
> >
> > If the LRU operations are in the VFS, then we know exactly what
> > locks are held when it is performed (current behaviour). However,
> > when done from the filesystem transaction context running
> > drop_nlink, we'll have different sets of locks and/or execution
> > contexts held for each different fs type.
> >
> > > I'm pretty sure that the number of callers that hold i_lock around
> > > drop_nlink() and clear_nlink() is relatively small.
> >
> > I think the calling context problem is wider than the obvious issue
> > with i_lock....
>
> This is an internal LRU, so yes potentially we could have locking issues, but
> right now all LRU operations are nested inside of the i_lock, and this is purely
> about object lifetime. I'm not concerned about this being in the bowels of any
> filesystem because it's purely list manipulation.
Yet it now puts the LRU inside freeze contexts, held nested
inode->i_rwsem contexts, etc. Instead of it being largely outside of
all VFS, filesystem and inode locking, it's now deeply embedded in a
complex lock chain. That may be fine, but there is a non-zero risk
that we overlooked something and it's deadlocks ahoy....
> And if it makes you feel better, the next patchset queued up for after the next
> merge window is deleting the LRU, so you won't have to worry about it for long
> :). Thanks,
Sure, but the risk is that we end up with a release that has
unfixable deadlocks in it, and so is largely unsafe for anyone to
use in production.... :/
I get it that this is already a long patch series, but changing lock
orders like this "just for a short time" isn't something that fills
me with joy. Weird temporary code behaviours like this also makes
for an awful backport experience for anyone trying to maintain a LTS
kernel....
I suspect it would be simpler overall to add the reference counted
cached object list to cover the writeback/mm requirement for the
LRU, then immediately remove the LRU instead of adding reference
counts for the LRU and sprinkling new LRU removal points around to
make the reference counting work correctly in all conditions.
Especially as you plan to remove the LRU pretty much straight
away...
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 16/54] fs: delete the inode from the LRU list on lookup
2025-08-28 11:42 ` Josef Bacik
@ 2025-09-02 4:07 ` Dave Chinner
0 siblings, 0 replies; 105+ messages in thread
From: Dave Chinner @ 2025-09-02 4:07 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
On Thu, Aug 28, 2025 at 07:42:25AM -0400, Josef Bacik wrote:
> On Thu, Aug 28, 2025 at 07:46:56AM +1000, Dave Chinner wrote:
> > On Tue, Aug 26, 2025 at 11:39:16AM -0400, Josef Bacik wrote:
> > > When we move to holding a full reference on the inode when it is on an
> > > LRU list we need to have a mechanism to re-run the LRU add logic. The
> > > use case for this is btrfs's snapshot delete, we will lookup all the
> > > inodes and try to drop them, but if they're on the LRU we will not call
> > > ->drop_inode() because their refcount will be elevated, so we won't know
> > > that we need to drop the inode.
> > >
> > > Fix this by simply removing the inode from its respective LRU list when
> > > we grab a reference to it in a way that we have active users. This will
> > > ensure that the logic to add the inode to the LRU or drop the inode will
> > > be run on the final iput from the user.
> > >
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> >
> > Have you benchmarked this for scalability?
> >
> > The whole point of lazy LRU removal was to remove LRU lock
> > contention from the hot lookup path. I suspect that putting the LRU
> > locks back inside the lookup path is going to cause performance
> > regressions...
> >
> > FWIW, why do we even need the inode LRU anymore?
> >
> > We certainly don't need it anymore to keep the working set in memory
> > because that's what the dentry cache LRU does (i.e. by pinning a
> > reference to the inode whilst the dentry is active).
> >
> > And with the introduction of the cached inode list, we don't need
> > the inode LRU to track unreferenced dirty inodes around whilst
> > they hang out on writeback lists. The inodes on the writeback lists
> > are now referenced and tracked on the cached inode list, so they
> > don't need special hooks in the mm/ code to handle the special
> > transition from "unreferenced writeback" to "unreferenced LRU"
> > anymore, they can just be dropped from the cached inode list....
> >
> > So rather than jumping through hoops to maintain an LRU we likely
> > don't actually need and is likely to re-introduce old scalability
> > issues, why not remove it completely?
>
> That's next on the list, but we're already at 54 patches. This won't be a hot
> path, we're not going to consistently find inodes on the LRU to remove.
IME, there are some workloads that hit the inode cache hard
(typically anything that has a large set of working inodes and
memory pressure is causing reclaim to run all the time). In these
cases, when we find inodes in the inode cache it's because we've
missed the dentry cache (due to reclaim) and so the inode we hit is
unreferenced and on the LRU. i.e. if we don't have lazy LRU removal,
when we hit the inode cache we'll also typically hit the LRU....
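
To make the trade-off concrete, here is a deliberately simplified, single-threaded C model that only counts how often the LRU lock would be taken. Everything in it (struct lru, struct item, lookup_eager, lookup_lazy, reclaim_scan) is hypothetical and is not the kernel's list_lru API; it assumes that with lazy removal a referenced item is unlinked by the next reclaim scan rather than by the lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct lru {
	unsigned long lock_acquisitions;	/* times the LRU lock was taken */
	size_t nr_items;			/* items currently on the LRU */
};

struct item {
	int refs;
	bool on_lru;
};

/* Model of taking the LRU lock: we only count the acquisition. */
static void lru_lock(struct lru *l)
{
	l->lock_acquisitions++;
}

static void lru_unlink(struct lru *l, struct item *it)
{
	assert(it->on_lru && l->nr_items > 0);
	it->on_lru = false;
	l->nr_items--;
}

/* Eager removal: a cache hit on an LRU-resident item takes the LRU lock. */
static void lookup_eager(struct lru *l, struct item *it)
{
	it->refs++;
	if (it->on_lru) {
		lru_lock(l);
		lru_unlink(l, it);
	}
}

/* Lazy removal: a cache hit only bumps the refcount; the item stays listed. */
static void lookup_lazy(struct item *it)
{
	it->refs++;
}

/*
 * Reclaim takes the lock once per scan. Referenced items are simply
 * unlinked; unreferenced ones are unlinked and counted as reclaimed.
 */
static size_t reclaim_scan(struct lru *l, struct item *items, size_t n)
{
	size_t reclaimed = 0;

	lru_lock(l);
	for (size_t i = 0; i < n; i++) {
		if (!items[i].on_lru)
			continue;
		lru_unlink(l, &items[i]);
		if (items[i].refs == 0)
			reclaimed++;	/* would be freed here */
	}
	return reclaimed;
}
```

In the eager variant every hit on an LRU-resident item costs one lock acquisition in the lookup path; in the lazy variant that cost is batched into the reclaim scan.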
> My rough plans are
>
> 1. Get this series merged.
> 2. Let it bake and see if any issues arise.
> 3. Remove the inode LRU completely.
> 4. Remove the i_hash and use an xarray for inode lookups.
>
> The inode LRU removal is going to be a big change,
When I last looked at it, it wasn't a particularly big code change
at all. It was just that mm/ depended on the LRU existing that
prevented it from being easily removable. You've addressed that
dependency by adding the cached inode list for inodes with populated
address spaces....
> and I want it to be separate
> from this work from the LRU work in case we find that we do really need the LRU.
> If that turns out to be the case then we can revisit if this is a scalability
> issue. Thanks,
Understood, but I would prefer to do that the other way around.
i.e. rather than add complexity and potential scalability issues to
the LRU management on the way to removing the LRU at a later date,
we remove the LRU at the earliest possible opportunity.
If we have any sort of perf regression caused by the LRU removal, we
can address those cases via temporary residence on the new cached
inode list...
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: [PATCH v2 00/54] fs: rework inode reference counting
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
` (56 preceding siblings ...)
2025-08-28 12:51 ` Christian Brauner
@ 2025-09-02 10:06 ` Mateusz Guzik
2025-09-02 21:16 ` Josef Bacik
57 siblings, 1 reply; 105+ messages in thread
From: Mateusz Guzik @ 2025-09-02 10:06 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
On Tue, Aug 26, 2025 at 11:39:00AM -0400, Josef Bacik wrote:
Hi Josef,
I read through the entire patchset and I think I got the hang of it.
Bottom line is I disagree with the core idea of the patchset and
majority of the justification raised in the cover letter. :)
I'll be very to the point, trying to be as clear as possible and
consequently lacking in soft-speak. Based on your name I presume you are
also of Slavic descent, hopefully making it fine ;-)
I don't have a vote per se so this is not really a NAK. Instead I'm
making a case to you and VFS maintainers to not include this.
ACHTUNG: this is *really* long and I probably forgot to mention
something.
Frankly the patchset seems to be a way to help btrfs by providing a new
refcount (but not in a generic-friendly manner) while taking issue with
refcount 0 having a "the inode is good to go if need be" meaning. I
provide detailed reasoning below.
It warrants noting there is a lot of plain crap in the VFS layer.
Between the wtf flags, bad docs for them, poor assert coverage,
open-coded & repeated access to stuff (including internal state), I have
to say someone(tm) needs to take a hammer to it.
However, as far as I can tell, modulo the quality of how things are
expressed in the code (so to speak), the crux of what the layer is doing
in terms of inode management follows idiomatic behavior I would expect
to see; it just needs to be done better.
While there are perfectly legitimate reasons to introduce a "hold"
reference counter, I pose the patchset at hand does not justify its
introduction. If anything I will argue it would be a regression to do it
the way it is proposed here, even if some variant of the new counter
will find a use case.
> This series is the first part of a larger body of work geared towards solving a
> variety of scalability issues in the VFS.
>
Elsewhere in the thread it is mentioned that there is a plan to remove
the inode LRU and replace the inode hash with xarray after these changes.
I don't understand how this patchset paves the way for either of those
things.
If anything, per notes from other people, it would probably be best if
the inode LRU got removed first and this patchset got rebased on it (if
it is to land at all).
For the inode hash the real difficulty is not really in terms of
implementing something, but evaluating available options. Even if the
statically-allocated hash should go (it probably should), the hashing
function is not doing a good job (read: the hash is artificially
underperforming) and merely replacing it with something else might not
give an accurate picture whether the new pick for the data structure is
genuinely the right choice (due to skewed comparison as the hash is
gimped, both in terms of hashing func and global locking).
The minor technical problem which is there in the stock kernel and which
remains unaddressed by your patchset is the need to take ->i_lock. Some
of later commentary in this cover letter claims this is sorted out,
but that's only true if someone already has a ref (as in the lock is
only optionally omitted).
In particular, if one was to implement fine-grained locking for the hash
with bitlocks, I'm told the resulting ordering of bitlock -> spinlock
would be problematic on RT kernels as the former type is a hack which
literally only spins and does not support any form of preemption. The
ordering can be swapped around to spinlock -> bitlock thanks to RCU
(e.g., for deletion from the hash you would find the inode using RCU
traversal, lock it, lock the chain and only then delete etc.).
Since your patchset keeps the lock in place, the kernel is in the same
boat in both cases (also if the new thing only uses spinlocks).
As far as I know the other non-fs specific bottlenecks for inode
handling are the super block list and dentry LRU, neither of which
benefit from the patchset either.
So again I don't see how scalability work is facilitated by this patchset.
> We have historically had a variety of foot-guns related to inode freeing. We
> have I_WILL_FREE and I_FREEING flags that indicated when the inode was in the
> different stages of being reclaimed. This led to confusion, and bugs in cases
> where one was checked but the other wasn't. Additionally, it's frankly
> confusing to have both of these flags and to deal with them in practice.
>
Per my opening remark I agree this situation is very poorly handled in
the current code.
If my grep is right the only real consumer of I_WILL_FREE is ocfs2. In
your patchset you just remove the usage. Given that other filesystems
manage without it, I suspect the real solution is to change its
->drop_inode to generic_delete_inode() and handle the write in
->evict_inode.
The doc for the flag is most unhelpful, documenting how the flag is used
but not explaining what for.
If I understood things correctly the flag is only there to prevent
->i_count acquire by other threads while the spin lock is dropped during
inode write out.
Whether your ocfs patch lands or this bit gets reworked as described
above, the flag is gone and we are only left with I_FREEING.
Hiding this behind a proper accessor (letting you know what's up with
the inode) should cover your concern (again see bottom of the e-mail for
a longer explanation).
> However, this exists because we have an odd behavior with inodes, we allow them
> to have a 0 reference count and still be usable. This again is a pretty unfun
> footgun, because generally speaking we want reference counts to be meaningful.
>
This is not an odd behavior. This is in fact the idiomatic handling of
objects which remain cached if there are no active users. I don't know
about the entirety of the Linux kernel, but dentries are also handled
the same way.
I come from the BSD land but I had also seen my share of Solaris and I
can tell you all of these also follow this core idea in places I looked.
If anything deviating from this should raise eyebrows.
I can however agree that the current magic flags + refcount do make for
a buggy combination, but that's not an inherent property of using this
method.
> The problem with the way we reference inodes is the final iput(). The majority
> of file systems do their final truncate of a unlinked inode in their
> ->evict_inode() callback, which happens when the inode is actually being
> evicted. This can be a long process for large inodes, and thus isn't safe to
> happen in a variety of contexts. Btrfs, for example, has an entire delayed iput
> infrastructure to make sure that we do not do the final iput() in a dangerous
> context. We cannot expand the use of this reference count to all the places the
> inode is used, because there are cases where we would need to iput() in an IRQ
> context (end folio writeback) or other unsafe context, which is not allowed.
>
I don't believe ->i_obj_count is needed to facilitate this.
Suppose iput() needs to become callable from any context, just like
fput().
What it can do is atomically drop the ref if it is not the last one, or punt
all of it to task_work/a dedicated task queue.
Basically same thing as fput(), except the ref is expected to be dropped
by the code doing deferred processing if ->i_count == 1.
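A minimal userspace sketch of the fput()-style scheme described above, under
loud assumptions: struct toy_inode, toy_iput() and toy_run_deferred() are
hypothetical names, C11 atomics stand in for the kernel's atomic ops, and the
deferred list is a plain single-threaded list where real code would use
task_work or a proper queue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct inode, not the kernel definition. */
struct toy_inode {
	atomic_int count;                 /* models ->i_count */
	bool evicted;                     /* set where evict() would run */
	struct toy_inode *next_deferred;  /* link on the deferred list */
};

/*
 * Inodes whose possibly-final put was punted. A plain list keeps the
 * sketch short; real code would need a lock-free or locked queue.
 */
static struct toy_inode *toy_deferred;

/* Drop one ref unless it is the last one; true if a ref was dropped. */
static bool toy_put_unless_last(struct toy_inode *inode)
{
	int old = atomic_load(&inode->count);

	while (old > 1) {
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&inode->count, &old, old - 1))
			return true;
	}
	return false;
}

/* Safe in any context in this model: it never tears anything down. */
static void toy_iput(struct toy_inode *inode)
{
	assert(atomic_load(&inode->count) > 0);
	if (toy_put_unless_last(inode))
		return;
	/* Hand the inode over together with the caller's reference. */
	inode->next_deferred = toy_deferred;
	toy_deferred = inode;
}

/* Runs later from a context where blocking is fine. */
static void toy_run_deferred(void)
{
	while (toy_deferred) {
		struct toy_inode *inode = toy_deferred;

		toy_deferred = inode->next_deferred;
		/* Someone may have taken a new ref in the meantime. */
		if (atomic_fetch_sub(&inode->count, 1) == 1)
			inode->evicted = true;
	}
}
```

The point of the split is that the caller's context only ever performs a
compare-and-swap or a list insertion; anything that may sleep happens in
toy_run_deferred().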
Note that with your patchset iput() still takes spinlocks, which
prevents it from being callable from IRQs at least.
But suppose ->i_obj_count makes sense to add. Below I explain why I
disagree with the way it is done.
> To that end, resolve this by introducing a new i_obj_count reference count. This
> will be used to control when we can actually free the inode. We then can use
> this reference count in all the places where we may reference the inode. This
> removes another huge footgun, having ways to access the inode itself without
> having an actual reference to it. The writeback code is one of the main places
> where we see this. Inodes end up on all sorts of lists here without a proper
> reference count. This allows us to protect the inode from being freed by giving
> this and other code mechanisms to protect their access to the inode.
>
I read through writeback vs iput() handling and it is very oddly
written, indeed looking fishy. I don't know the history here, given the
state of the code I 300% believe there were bugs in terms of lifetime
management/racing against iput().
But the crux of what the code is doing is perfectly sane and in fact
what I would expect to happen unless there is a good reason not to.
The crucial point here is setting up the inode for teardown (and thus
preventing new refs from showing up) and stalling it as long as there
are pending consumers. That way they can still safely access everything
they need.
For this to work the code needs a proper handshake (if you will), which
*is* arranged with locking -- writeback (or other code with similar
needs) either wins against teardown and does the write or loses and
pretends the inode is not there (or fails to see it). If writeback wins,
teardown waits. This only needs readable helpers to not pose a problem,
which is not hard to implement.
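As a rough illustration of the handshake described above, here is a userspace
sketch. Everything in it is an assumption for illustration: the struct, the
helper names, and the use of a pthread mutex and condition variable in place
of ->i_lock and the kernel's wait machinery.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model, not the kernel's struct inode. */
struct toy_inode {
	pthread_mutex_t lock;    /* stands in for ->i_lock */
	pthread_cond_t idle;     /* signalled when writeback finishes */
	bool freeing;            /* loosely models I_FREEING */
	bool writeback_active;   /* a writeback pass is in flight */
};

static void toy_inode_init(struct toy_inode *inode)
{
	pthread_mutex_init(&inode->lock, NULL);
	pthread_cond_init(&inode->idle, NULL);
	inode->freeing = false;
	inode->writeback_active = false;
}

/* Writeback side: true means it won and may touch the whole inode. */
static bool toy_writeback_begin(struct toy_inode *inode)
{
	bool won;

	pthread_mutex_lock(&inode->lock);
	won = !inode->freeing;
	if (won)
		inode->writeback_active = true;
	pthread_mutex_unlock(&inode->lock);
	return won;   /* false: pretend the inode is not there */
}

static void toy_writeback_end(struct toy_inode *inode)
{
	pthread_mutex_lock(&inode->lock);
	assert(inode->writeback_active);
	inode->writeback_active = false;
	pthread_cond_broadcast(&inode->idle);
	pthread_mutex_unlock(&inode->lock);
}

/* Teardown side: shut the door first, then wait for whoever got in. */
static void toy_teardown_begin(struct toy_inode *inode)
{
	pthread_mutex_lock(&inode->lock);
	inode->freeing = true;
	while (inode->writeback_active)
		pthread_cond_wait(&inode->idle, &inode->lock);
	pthread_mutex_unlock(&inode->lock);
}
```

Because both sides decide under the same lock, exactly one of them wins: either
writeback sets its flag before teardown marks the inode, and teardown waits, or
writeback observes the mark and backs off.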
Note your patchset does not remove the need to do this, it merely
possibly simplifies clean up after (but see below).
This brings me to the problem with how ->i_obj_count is proposed. In
this patchset it merely gates the actual free of the inode, allowing all
other teardown to progress.
Suppose one was to use ->i_obj_count in writeback to guarantee inode
liveness -- worst case iobj_put() from writeback ends up freeing the
inode.
As mentioned above, the first side of the problem is still there with
your patchset: you still need to synchronize against writeback starting
to work on the inode.
But let's assume the other side -- just the freeing -- is now sorted out
with the count.
The problem with it is the writeback code historically was able to
access the entirety of the inode. With teardown progressing in parallel
this is no longer true and what is no longer accessible depends entirely
on timing. If there are "bad" accesses, you are going to find out the hard
way.
In order to feel safe here one would need to audit the entirety of the
writeback code to make sure it does not do anything wrong here and
probably do quite a bit of fuzzing with KMSAN et al.
Furthermore, imagine some time in the future one would need to add
something which needs to remain valid for the duration of writeback in
progress. Then you are back to the current state vs waiting on writeback
or you need to move more things around after i_obj_count drops to 0.
Or you can make sure iput() can safely wait for a wakeup from writeback
and not worry about a thorough audit of all inode accesses nor any
future work adding more. This is the current approach.
General note is that a hold count merely gating the actual free invites
misuse where consumers race against teardown thinking something is still
accessible and only crapping out when they get unlucky.
The ->i_obj_count refs/puts around hash and super block list
manipulation only serve as overhead. Suppose they are not there. With
the rest of your proposal it is an invariant that i_obj_count is at
least 1 when iput() is being called. Meaning whatever refs are present
or not on super block or the hash literally play no role. In fact, if
they are there, it is an invariant they are not the last refs to drop.
Even in the btrfs case you are just trying to defer actual free of the
inode, which is not necessarily all that safe in the long run given the
remarks above.
But suppose for whatever reason you really want to punt ->evict_inode()
processing.
My suggestion would be the following:
The hooks for ->evict_inode() can start returning -EAGAIN. Then if you
conclude you can't do the work in context you got called from, evict()
can defer you elsewhere and then you get called from a spot where you
CAN do it, after which the rest of evict() is progressing.
Something like:
the_rest_of_evict() {
	if (S_ISCHR(inode->i_mode) && inode->i_cdev)
		cd_forget(inode);
	remove_inode_hash(inode);
	....
}
/* runs from task_work, some task queue or whatever applicable */
evict_deferred() {
	ret = op->evict_inode(inode);
	BUG_ON(ret == -EAGAIN);
	the_rest_of_evict(inode);
}
evict() {
	....
	if (op->evict_inode) {
		ret = op->evict_inode(inode);
		if (ret == -EAGAIN) {
			evict_defer(inode);
			return;
		}
	} else {
		truncate_inode_pages_final(&inode->i_data);
		clear_inode(inode);
	}
	the_rest_of_evict(inode);
}
Optionally the ->evict_inode() func can gain an argument denoting who
is doing the call (evict() or evict_deferred()).
> With this we can separate the concept of the inode being usable, and the inode
> being freed.
[snip]
> With not allowing inodes to hit a refcount of 0, we can take advantage of that
> common pattern of using refcount_inc_not_zero() in all of the lockless places
> where we do inode lookup in cache. From there we can change all the users who
> check I_WILL_FREE or I_FREEING to simply check the i_count. If it is 0 then they
> aren't allowed to do their work, otherwise they can proceed as normal.
But this is already doable, just avoidably open-coded.
In your patchset this is open-coded with icount_read() == 0, which is
also leaking state it should not.
You could hide this behind can_you_grab_a_ref().
On the current kernel the new helper would check the count + flags
instead.
Your consumers which no longer openly do it in this patchset would look
the same.
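To make the helper idea concrete, a small sketch follows. The TOY_* flag
values, struct toy_inode and toy_igrab() are hypothetical, and the
TOY_COUNT_IS_ARBITER switch is only an illustrative way to show that the
policy can change without touching consumers.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values, not the kernel's. */
#define TOY_I_FREEING   (1u << 0)
#define TOY_I_WILL_FREE (1u << 1)

/* Hypothetical model of the two fields the policy looks at. */
struct toy_inode {
	unsigned int count;   /* models ->i_count */
	unsigned int state;   /* models ->i_state */
};

/*
 * The one place that knows what "may I take a reference?" means.
 * Default: flags decide, a zero count is still fine (cached inode).
 * With TOY_COUNT_IS_ARBITER: a zero count means the inode is gone.
 */
static bool can_you_grab_a_ref(const struct toy_inode *inode)
{
#ifdef TOY_COUNT_IS_ARBITER
	return inode->count != 0;
#else
	return !(inode->state & (TOY_I_FREEING | TOY_I_WILL_FREE));
#endif
}

/* A consumer: identical source under either policy. */
static bool toy_igrab(struct toy_inode *inode)
{
	if (!can_you_grab_a_ref(inode))
		return false;
	inode->count++;
	return true;
}
```

Consumers such as toy_igrab() never mention the count or the flags themselves,
which is the property being argued for.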
So here is an outline of what I suggest. First I'm going to talk about
sorting out ->i_state and then about inode transition tracking.
Accesses to ->i_state are open-coded everywhere, some places use
READ_ONCE/WRITE_ONCE while others use plain loads/stores. None of this
validates whether ->i_lock is held and for cases where the caller is
fine with unstable flags, there is no way to validate this is what they
are signing up for (for example maybe the place assumes ->i_lock is in
fact held?).
As an absolute minimum this should hide behind 3 accessors:
1. istate_store, asserting the lock is held. WRITE_ONCE
2. istate_load, asserting the lock is held. READ_ONCE or plain load
3. istate_load_unlocked, no asserts. the consumer explicitly spells out
they understand the value can change from under them. another READ_ONCE
to prevent the compiler from fucking with reloads.
Maybe hide the field behind a struct so that spelled out i_state access
fails to compile (similarly to how atomics are handled).
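A userspace sketch of the three accessors, with assumptions spelled out: the
accessor names come from the list above, while struct toy_istate, struct
toy_inode and the lock_held boolean (standing in for a lockdep-style assert on
->i_lock) are hypothetical. Volatile accesses approximate READ_ONCE/WRITE_ONCE.

```c
#include <assert.h>
#include <stdbool.h>

/* Wrapping the field means a bare "inode->i_state & X" fails to compile. */
struct toy_istate {
	unsigned int raw;
};

/* Hypothetical model, not the kernel's struct inode. */
struct toy_inode {
	bool lock_held;             /* stand-in for "->i_lock is held" */
	struct toy_istate i_state;
};

/* 1. Store: the lock must be held; a single, untorn write. */
static inline void istate_store(struct toy_inode *inode, unsigned int val)
{
	assert(inode->lock_held);
	*(volatile unsigned int *)&inode->i_state.raw = val;
}

/* 2. Stable load: the lock must be held, so a plain load suffices. */
static inline unsigned int istate_load(const struct toy_inode *inode)
{
	assert(inode->lock_held);
	return inode->i_state.raw;
}

/* 3. Unstable load: no assert, the caller accepts a racy snapshot. */
static inline unsigned int istate_load_unlocked(const struct toy_inode *inode)
{
	return *(const volatile unsigned int *)&inode->i_state.raw;
}
```

A caller that picks istate_load_unlocked() documents, at the call site, that it
tolerates the value changing underneath it.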
Suppose the I_WILL_FREE flag got sorted out.
Then the kernel is left with I_NEW, I_CLEAR, I_FREEING and maybe
something extra.
I think this is much more manageable but still primitive.
An equivalent can be done with enums in a way which imo is much more
handy.
Then various spots all over the VFS layer can validate they got a state
which can be legally observed for their usage. Note that the mere refcount
being 0 or not does not provide the same granularity as a collection of
flags or an enum.
For illustrative purposes, suppose:
DEAD -- either hanging out after rcu freed or never used to begin with
UNDER_CONSTRUCTION -- handed out by the allocator, still being created.
invalid (equivalent to I_NEW?)
CONSTRUCTED -- all done (equivalent to no flags?)
DESTROYING -- equivalent to I_FREEING?
With this in place it is handy to validate that for example you are
transitioning from CONSTRUCTED to DESTROYING, but not from CONSTRUCTED
to DEAD.
You can also assert no UNDER_CONSTRUCTION inode escaped into the wild
(this would happen in various vfs primitives, e.g., prior to taking the
inode rwsem)
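The state names above lend themselves to a transition check along these lines.
This is only a sketch: the TOY_ prefix and helper names are hypothetical, and
the edge from UNDER_CONSTRUCTION to DESTROYING (a failed construction) is an
assumption not stated in the text.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_inode_state {
	TOY_DEAD,
	TOY_UNDER_CONSTRUCTION,
	TOY_CONSTRUCTED,
	TOY_DESTROYING,
};

/* Table of legal lifecycle edges. */
static bool toy_transition_ok(enum toy_inode_state from,
			      enum toy_inode_state to)
{
	switch (from) {
	case TOY_DEAD:
		return to == TOY_UNDER_CONSTRUCTION;
	case TOY_UNDER_CONSTRUCTION:
		/* Second edge is an assumption: construction failed. */
		return to == TOY_CONSTRUCTED || to == TOY_DESTROYING;
	case TOY_CONSTRUCTED:
		return to == TOY_DESTROYING;
	case TOY_DESTROYING:
		return to == TOY_DEAD;
	}
	return false;
}

/* Every state change goes through here, so bad edges trip the assert. */
static void toy_set_state(enum toy_inode_state *state,
			  enum toy_inode_state to)
{
	assert(toy_transition_ok(*state, to));
	*state = to;
}

/* For VFS entry points: a half-built inode must not have escaped. */
static bool toy_may_enter_vfs(enum toy_inode_state state)
{
	return state == TOY_CONSTRUCTED;
}
```

With the table in one place, a CONSTRUCTED to DEAD jump is rejected by
toy_transition_ok() rather than discovered later as a use-after-free.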
This is all equivalent to the flag manipulation, except imo clearer.
Suppose the flags are to stay. They can definitely hide behind helpers,
there is no good reason for anyone outside of fs.h or inode.c to know
about their meaning.
I claim the enums *can* escape as they can be easily reasoned about.
So... I don't offer to do any of this, I hope I made a convincing case
against the patchset at least.
Cheers.
* Re: [PATCH v2 00/54] fs: rework inode reference counting
2025-09-02 10:06 ` Mateusz Guzik
@ 2025-09-02 21:16 ` Josef Bacik
0 siblings, 0 replies; 105+ messages in thread
From: Josef Bacik @ 2025-09-02 21:16 UTC (permalink / raw)
To: Mateusz Guzik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro, amir73il
On Tue, Sep 02, 2025 at 12:06:01PM +0200, Mateusz Guzik wrote:
> On Tue, Aug 26, 2025 at 11:39:00AM -0400, Josef Bacik wrote:
>
> Hi Josef,
>
> I read through the entire patchset and I think I got the hang of it.
>
> Bottom line is I disagree with the core idea of the patchset and
> majority of the justification raised in the cover letter. :)
>
> I'll be very to the point, trying to be as clear as possible and
> consequently lacking in soft-speak. Based on your name I presume you are
> also of Slavic descent, hopefully making it fine ;-)
>
> I don't have a vote per se so this is not really a NAK. Instead I'm
> making a case to you and VFS maintainers to not include this.
Mateusz, I always value your feedback and your views. As long as you aren't
personally attacking anybody I have no problems being told that you think I'm
wrong, and I've never seen you be rude or combative, so I wasn't expecting to
see anything in this email I wouldn't want to hear. I did this as an RFC
specifically hoping you would look at this and come up with a solution I hadn't
thought of. Thank you for thoroughly digging through this and giving a quite
thorough and well thought out response.
>
> ACHTUNG: this is *really* long and I probably forgot to mention
> something.
>
> Frankly the patchset seems to be a way to help btrfs by providing a new
> refcount (but not in a generic-friendly manner) while taking issue with
> refcount 0 having a "the inode is good to go if need be" meaning. I
> provide detailed reasoning below.
>
> It warrants noting there is a lot of plain crap in the VFS layer.
> Between the wtf flags, bad docs for them, poor assert coverage,
> open-coded & repeated access to stuff (including internal state), I have
> to say someone(tm) needs to take a hammer to it.
>
> However, as far as I can tell, modulo the quality of how things are
> expressed in the code (so to speak), the crux of what the layer is doing
> in terms of inode management follows idiomatic behavior I would expect
> to see, it just needs to be done better.
>
> While there are perfectly legitimate reasons to introduce a "hold"
> reference counter, I pose the patchset at hand does not justify its
> introduction. If anything I will argue it would be a regression to do it
> the way it is proposed here, even if some variant of the new counter
> will find a use case.
>
> > This series is the first part of a larger body of work geared towards solving a
> > variety of scalability issues in the VFS.
> >
>
> Elsewhere in the thread it is mentioned that there is a plan to remove
> the inode LRU and replace the inode hash with xarray after these changes.
>
> I don't understand how this patchset paves the way for either of those
> things.
>
> If anything, per notes from other people, it would probably be best if
> the inode LRU got removed first and this patchset got rebased on it (if
> it is to land at all).
>
> For the inode hash the real difficulty is not really in terms of
> implementing something, but evaluating available options. Even if the
> statically-allocated hash should go (it probably should), the hashing
> function is not doing a good job (read: the hash is artificially
> underperforming) and merely replacing it with something else might not
> give an accurate picture whether the new pick for the data structure is
> genuinely the right choice (due to skewed comparison as the hash is
> gimped, both in terms of hashing func and global locking).
>
> The minor technical problem which is there in the stock kernel and which
> remains unaddressed by your patchset is the need to take ->i_lock. Some
> of later commentary in this cover letter claims this is sorted out,
> but that's only true if someone already has a ref (as in the lock is
> only optionally omitted).
>
> In particular, if one was to implement fine-grained locking for the hash
> with bitlocks, I'm told the resulting ordering of bitlock -> spinlock
> would be problematic on RT kernels as the former type is a hack which
> literally only spins and does not support any form of preemption. The
> ordering can be swapped around to spinlock -> bitlock thanks to RCU
> (e.g., for deletion from the hash you would find the inode using RCU
> traversal, lock it, lock the chain and only then delete etc.).
>
> Since your patchset keeps the lock in place, the kernel is in the same
> boat in both cases (also if the new thing only uses spinlocks).
>
> As far as I know the other non-fs specific bottlenecks for inode
> handling are the super block list and dentry LRU, neither of which
> benefit from the patchset either.
>
> So again I don't see how scalability work is facilitated by this patchset.
>
Agreed, my wording is misleading at best here.
I'm tackling a wide range of things inside of the VFS. My priorities are
1. Simplify. Make everything easier to reason about. Most of our bugs come from
subtle interactions that are hard to reason about. Case in point, Christian took
2 full days to figure out the state of inode refcounting to be able to review
this code. This is a failure. We need core code to be easier to reason about so
it is harder to introduce regressions.
2. Efficiency. We do so many random things that make no sense. We have 4
different things where we loop through all of the inodes. The i_hash no longer
serves us. The LRU is unneeded overhead.
3. Scalability. I think in addressing the above 2, we can get to this one.
You're correct. This patchset doesn't directly address scalability. But it sets
the stage to do these other things safely.
I do not feel safe changing some of these core parts of VFS without a clearer
view of how inode lifetimes exist.
> > We have historically had a variety of foot-guns related to inode freeing. We
> > have I_WILL_FREE and I_FREEING flags that indicated when the inode was in the
> > different stages of being reclaimed. This led to confusion, and bugs in cases
> > where one was checked but the other wasn't. Additionally, it's frankly
> > confusing to have both of these flags and to deal with them in practice.
> >
>
> Per my opening remark I agree this situation is very poorly handled in
> the current code.
>
> If my grep is right the only real consumer of I_WILL_FREE is ocfs2. In
> your patchset you just remove the usage. Given that other filesystems
> manage without it, I suspect the real solution is to change its
> ->drop_inode to generic_delete_inode() and handle the write in
> ->evict_inode.
>
> The doc for the flag is most unhelpful, documenting how the flag is used
> but not explaining what for.
>
> If I understood things correctly the flag is only there to prevent
> ->i_count acquire by other threads while the spin lock is dropped during
> inode write out.
>
> Whether your ocfs patch lands or this bit gets reworked as described
> above, the flag is gone and we are only left with I_FREEING.
>
> Hiding this behind a proper accessor (letting you know what's up with
> the inode) should cover your concern (again see bottom of the e-mail for
> a longer explanation).
>
> > However, this exists because we have an odd behavior with inodes, we allow them
> > to have a 0 reference count and still be usable. This again is a pretty unfun
> > footgun, because generally speaking we want reference counts to be meaningful.
> >
>
> This is not an odd behavior. This is in fact the idiomatic handling of
> objects which remain cached if there are no active users. I don't know
> about the entirety of the Linux kernel, but dentries are also handled
> the same way.
>
> I come from the BSD land but I had also seen my share of Solaris and I
> can tell you all of these also follow this core idea in places I looked.
>
> If anything deviating from this should raise eyebrows.
>
Yes, it is typical in the dcache and icache. It is not typical in every other
reference counting system. My argument is that 0 == "potentially ok to access
under X circumstances" is a bad paradigm to have. We should strive to stick to 0
== this object cannot be used, because this is a far more common practice WRT
reference counting.
Now I'm not saying we shouldn't ever do something different, but having been in
file systems for a while, I don't think icache is a place where we need to be
special.
> I can however agree that the current magic flags + refcount do make for
> a buggy combination, but that's not an inherent property of using this
> method.
>
> > The problem with the way we reference inodes is the final iput(). The majority
> > of file systems do their final truncate of a unlinked inode in their
> > ->evict_inode() callback, which happens when the inode is actually being
> > evicted. This can be a long process for large inodes, and thus isn't safe to
> > happen in a variety of contexts. Btrfs, for example, has an entire delayed iput
> > infrastructure to make sure that we do not do the final iput() in a dangerous
> > context. We cannot expand the use of this reference count to all the places the
> > inode is used, because there are cases where we would need to iput() in an IRQ
> > context (end folio writeback) or other unsafe context, which is not allowed.
> >
>
> I don't believe ->i_obj_count is needed to facilitate this.
>
> Suppose iput() needs to become callable from any context, just like
> fput().
>
> What it can do is atomically drop the ref if it is not the last one, or punt
> all of it to task_work/a dedicated task queue.
Agreed. Btrfs does this with the delayed iput. I had actually thought of doing
this originally. I was worried that the overhead of adding this would be
unwanted, so I went for the dual refcount solution instead.
I'm totally happy if we want to say delayed iput is the solution and then we can
avoid the second reference count.
However, I think that i_obj_count does provide value in the cases where we want
a lighter-weight refcount for internal tracking. ->s_inodes and the various
writeback lists are where this is used and I think it makes the most sense. But
again, delayed iput also accomplishes the same thing so I'm totally open to this
being the desired solution.
>
> Basically same thing as fput(), except the ref is expected to be dropped
> by the code doing deferred processing if ->i_count == 1.
>
> Note that with your patchset iput() still takes spinlocks, which
> prevents it from being callable from IRQs at least.
>
> But suppose ->i_obj_count makes sense to add. Below I explain why I
> disagree with the way it is done.
>
> > To that end, resolve this by introducing a new i_obj_count reference count. This
> > will be used to control when we can actually free the inode. We then can use
> > this reference count in all the places where we may reference the inode. This
> > removes another huge footgun, having ways to access the inode itself without
> > having an actual reference to it. The writeback code is one of the main places
> > where we see this. Inodes end up on all sorts of lists here without a proper
> > reference count. This allows us to protect the inode from being freed by giving
> > this and other code mechanisms to protect their access to the inode.
> >
>
> I read through writeback vs iput() handling and it is very oddly
> written, indeed looking fishy. I don't know the history here, given the
> state of the code I 300% believe there were bugs in terms of lifetime
> management/racing against iput().
Exactly, this is my main argument, and I didn't do a good job articulating that
in my summary email, my apologies.
It takes a ridiculous amount of effort to reason about what we're doing in
these places. I want to make it simpler to reason about. Because from there we
can start making bigger changes.
>
> But the crux of what the code is doing is perfectly sane and in fact
> what I would expect to happen unless there is a good reason not to.
>
> The crucial point here is setting up the inode for teardown (and thus
> preventing new refs from showing up) and stalling it as long as there
> are pending consumers. That way they can still safely access everything
> they need.
>
> For this to work the code needs a proper handshake (if you will), which
> *is* arranged with locking -- writeback (or other code with similar
> needs) either wins against teardown and does the write or loses and
> pretends the inode is not there (or fails to see it). If writeback wins,
> teardown waits. This only needs readable helpers to not pose a problem,
> which is not hard to implement.
>
> Note your patchset does not remove the need to do this, it merely
> possibly simplifies clean up after (but see below).
Agreed, that is my goal.
>
> This brings me to the problem with how ->i_obj_count is proposed. In
> this patchset it merely gates the actual free of the inode, allowing all
> other teardown to progress.
>
> Suppose one was to use ->i_obj_count in writeback to guarantee inode
> liveness -- worst case iobj_put() from writeback ends up freeing the
> inode.
>
> As mentioned above, the first side of the problem is still there with
> your patchset: you still need to synchronize against writeback starting
> to work on the inode.
>
> But let's assume the other side -- just the freeing -- is now sorted out
> with the count.
>
> The problem with it is the writeback code historically was able to
> access the entirety of the inode. With teardown progressing in parallel
> this is no longer true and what is no longer accessible depends entirely
> on timing. If there are "bad" accesses, you are going to find out the hard
> way.
>
Agreed, but that was also always the case before. Now we at least have
i_obj_count to make sure the object itself doesn't go away.
A file system could always (and still can) redirty an inode while it's going
down and writeback could miss it. These patches do not eliminate this, they just
make sure we are super clear that the object itself will not be deleted.
> In order to feel safe here one would need to audit the entirety of the
> writeback code to make sure it does not do anything wrong here and
> probably do quite a bit of fuzzing with KMSAN et al.
I think that i_obj_count accomplishes this without all of that work.
>
> Furthermore, imagine some time in the future one would need to add
> something which needs to remain valid for the duration of writeback in
> progress. Then you are back to the current state vs waiting on writeback
> or you need to move more things around after i_obj_count drops to 0.
>
> Or you can make sure iput() can safely wait for a wakeup from writeback
> and not worry about a thorough audit of all inode accesses nor any
> future work adding more. This is the current approach.
>
> General note is that a hold count merely gating the actual free invites
> misuse where consumers race against teardown thinking something is still
> accessible and only crapping out when they get unlucky.
>
> The ->i_obj_count refs/puts around hash and super block list
> manipulation only serve as overhead. Suppose they are not there. With
> the rest of your proposal it is an invariant that i_obj_count is at
> least 1 when iput() is being called. Meaning whatever refs are present
> or not on super block or the hash literally play no role. In fact, if
> they are there, it is an invariant they are not the last refs to drop.
>
> Even in the btrfs case you are just trying to defer actual free of the
> inode, which is not necessarily all that safe in the long run given the
> remarks above.
>
> But suppose for whatever reason you really want to punt ->evict_inode()
> processing.
>
> My suggestion would be the following:
>
> The hooks for ->evict_inode() can start returning -EAGAIN. Then if you
> conclude you can't do the work in context you got called from, evict()
> can defer you elsewhere and then you get called from a spot where you
> CAN do it, after which the rest of evict() is progressing.
>
> Something like:
>
> the_rest_of_evict() {
> 	if (S_ISCHR(inode->i_mode) && inode->i_cdev)
> 		cd_forget(inode);
>
> 	remove_inode_hash(inode);
> 	....
> }
>
> /* runs from task_work, some task queue or whatever applicable */
> evict_deferred() {
> 	ret = op->evict_inode(inode);
> 	BUG_ON(ret == -EAGAIN);
> 	the_rest_of_evict(inode);
> }
>
> evict() {
> 	....
> 	if (op->evict_inode) {
> 		ret = op->evict_inode(inode);
> 		if (ret == -EAGAIN) {
> 			evict_defer(inode);
> 			return;
> 		}
> 	} else {
> 		truncate_inode_pages_final(&inode->i_data);
> 		clear_inode(inode);
> 	}
>
> 	the_rest_of_evict(inode);
> }
>
> Optionally the ->evict_inode() func can gain an argument denoting who
> is doing the call (evict() or evict_deferred()).
>
> > With this we can separate the concept of the inode being usable, and the inode
> > being freed.
> [snip]
> > With not allowing inodes to hit a refcount of 0, we can take advantage of that
> > common pattern of using refcount_inc_not_zero() in all of the lockless places
> > where we do inode lookup in cache. From there we can change all the users who
> > check I_WILL_FREE or I_FREEING to simply check the i_count. If it is 0 then they
> > aren't allowed to do their work, otherwise they can proceed as normal.
>
> But this is already doable, just avoidably open-coded.
>
> In your patchset this is open-coded with icount_read() == 0, which is
> also leaking state it should not.
>
> You could hide this behind can_you_grab_a_ref().
>
> On the current kernel the new helper would check the count + flags
> instead.
>
> Your consumers which no longer openly do it in this patchset would look
> the same.
>
> So here is an outline of what I suggest. First I'm going to talk about
> sorting out ->i_state and then about inode transition tracking.
>
> Accesses to ->i_state are open-coded everywhere, some places use
> READ_ONCE/WRITE_ONCE while others use plain loads/stores. None of this
> validates whether ->i_lock is held and for cases where the caller is
> fine with unstable flags, there is no way to validate this is what they
> are signing up for (for example maybe the place assumes ->i_lock is in
> fact held?).
>
> As an absolute minimum this should hide behind 3 accessors:
>
> 1. istate_store, asserting the lock is held. WRITE_ONCE
> 2. istate_load, asserting the lock is held. READ_ONCE or plain load
> 3. istate_load_unlocked, no asserts. the consumer explicitly spells out
> they understand the value can change from under them. another READ_ONCE
> to prevent the compiler from fucking with reloads.
>
> Maybe hide the field behind a struct so that spelled out i_state access
> fails to compile (similarly to how atomics are handled).
>
> Suppose the I_WILL_FREE flag got sorted out.
>
> Then the kernel is left with I_NEW, I_CLEAR, I_FREEING and maybe
> something extra.
>
> I think this is much more manageable but still primitive.
>
> An equivalent can be done with enums in a way which imo is much more
> handy.
>
> Then various spots all over the VFS layer can validate they got a state
> which can be legally observed for their usage. Note that the mere refcount
> being 0 or not does not provide the same granularity as a collection of
> flags or an enum.
>
> For illustrative purposes, suppose:
> DEAD -- either hanging out after rcu freed or never used to begin with
> UNDER_CONSTRUCTION -- handed out by the allocator, still being created.
> invalid (equivalent to I_NEW?)
> CONSTRUCTED -- all done (equivalent to no flags?)
> DESTROYING -- equivalent to I_FREEING?
>
> With this in place it is handy to validate that for example you are
> transitioning from CONSTRUCTED to DESTROYING, but not from CONSTRUCTED
> to DEAD.
>
> You can also assert no UNDER_CONSTRUCTION inode escaped into the wild
> (this would happen in various vfs primitives, e.g., prior to taking the
> inode rwsem)
>
> This is all equivalent to the flag manipulation, except imo clearer.
>
> Suppose the flags are to stay. They can definitely hide behind helpers,
> there is no good reason for anyone outside of fs.h or inode.c to know
> about their meaning.
>
> I claim the enums *can* escape as they can be easily reasoned about.
>
> So... I don't offer to do any of this, I hope I made a convincing case
> against the patchset at least.
Alright I see what you're suggesting. What I want is to have the refcounts be
the ultimate arbiter of the state of the inode. We still need I_NEW and
I_CREATING. I want to separate the dirty flags off to the side so we can use
bitops for I_CREATING and I_NEW. From there we can do simple things about
waiting where we need to, and eliminate i_lock for those accesses. That way
inode lookup becomes xarray walk under RCU,
refcount_inc_not_zero(&inode->i_count), if (unlikely(test_bit(I_NEW))) etc.
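The lookup flow sketched in that paragraph could look roughly like this in
userspace C. Assumptions are heavy here: a fixed array stands in for the
xarray walked under RCU, C11 atomics stand in for refcount_t and bitops, and
every toy_* name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_I_NEW      (1u << 0)   /* illustrative bit, not the kernel value */
#define TOY_TABLE_SIZE 8

/* Hypothetical model, not the kernel's struct inode. */
struct toy_inode {
	unsigned long ino;
	atomic_uint count;   /* models ->i_count; 0 means going away */
	atomic_uint flags;   /* holds TOY_I_NEW while being set up */
};

/* Stand-in for the inode cache index. */
static struct toy_inode *toy_table[TOY_TABLE_SIZE];

/* Take a reference only if the count has not already dropped to zero. */
static bool toy_ref_inc_not_zero(atomic_uint *count)
{
	unsigned int old = atomic_load(count);

	while (old != 0) {
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Lockless lookup: find, try to pin, then check whether setup is done.
 * *is_new tells the caller it would have to wait for construction.
 */
static struct toy_inode *toy_lookup(unsigned long ino, bool *is_new)
{
	struct toy_inode *inode = toy_table[ino % TOY_TABLE_SIZE];

	if (!inode || inode->ino != ino)
		return NULL;
	if (!toy_ref_inc_not_zero(&inode->count))
		return NULL;   /* lost the race against teardown */
	*is_new = (atomic_load(&inode->flags) & TOY_I_NEW) != 0;
	return inode;
}
```

No per-inode lock is taken on this path; a zero count is the only signal that
the inode cannot be used.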
This has all been long and I think I've got the gist of what you're suggesting.
I'm going to restate it here so I'm sure we're on the same page.
1. Don't do the i_obj_count thing.
2. Clean-up all the current weirdness by defining helpers that clearly define
the flow of the inode lifetime.
3. Remove the flags that are no longer necessary.
4. Continue on with my other work to remove i_hash and the i_lru.
I don't disagree with this approach. I would however like to argue that changing
the refcounting rules to be clear accomplishes a lot of the above goals, and
gives us access to refcount_t which allows us to capture all sorts of bad
behavior without needing to duplicate the effort.

As an alternative approach, I could do the following.

1. Pull the delayed iput work from btrfs into the core VFS.
2. In all the places where we have dubious lifetime stuff (aka where I use
   i_obj_count), replace it with more i_count usage, and rely on the delayed iput
   infrastructure to save us here.
3. Change the rules so we never have 0 refcount objects.
4. Convert to refcount_t.
5. Remove the various flags.
6. Continue with my existing plans.
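To make steps 1 to 3 of the list above concrete, here is a rough single-threaded userspace sketch of a deferred final put: a caller that cannot evict drops its reference unless it is the last one, and otherwise hands the inode to a worker that performs the final put. It is only loosely modelled on the idea and is not the btrfs implementation; dl_inode, dl_queue, dl_put_unless_last, dl_delayed_iput and dl_run_delayed_iputs are all hypothetical names, and a real version would need locking around the queue and protection against queueing the same inode twice.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct dl_inode {
	atomic_uint i_count;		/* stands in for i_count */
	struct dl_inode *next_delayed;	/* link on the delayed-put queue */
	bool evicted;			/* set when the final put "evicts" */
};

/* Toy queue: no locking, single-threaded use only. */
struct dl_queue {
	struct dl_inode *head;
};

/* Drop one reference unless it is the last one; true if dropped. */
bool dl_put_unless_last(atomic_uint *count)
{
	unsigned int old = atomic_load(count);

	while (old > 1) {
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return true;
	}
	return false;
}

/* Called from contexts that must not run eviction themselves. */
void dl_delayed_iput(struct dl_queue *q, struct dl_inode *inode)
{
	if (dl_put_unless_last(&inode->i_count))
		return;
	/* Last reference: keep it and let the worker drop it later. */
	inode->next_delayed = q->head;
	q->head = inode;
}

/* Worker side: do the real final puts. Returns how many were evicted. */
size_t dl_run_delayed_iputs(struct dl_queue *q)
{
	size_t evicted = 0;

	while (q->head) {
		struct dl_inode *inode = q->head;

		q->head = inode->next_delayed;
		inode->next_delayed = NULL;
		/* Someone may have taken a new ref meanwhile; only the
		 * 1 -> 0 transition tears the inode down. */
		if (atomic_fetch_sub(&inode->i_count, 1) == 1) {
			inode->evicted = true;
			evicted++;
		}
	}
	return evicted;
}
```

The point of the sketch is that the count never reaches zero outside the worker, so a concurrent lookup either gets a real reference or sees an inode that is genuinely being torn down.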

Does this sound like a reasonable compromise? Do my explanations make sense?
Did I misunderstand something fundamentally in your response?

I'm not married to my work, I want to find a solution we're all happy with. I'm
starting a new job this week so my ability to pay a lot of attention to this is
going to be slightly diminished, so I apologize if I missed something. Thanks,

Josef

Thread overview: 105+ messages
2025-08-26 15:39 [PATCH v2 00/54] fs: rework inode reference counting Josef Bacik
2025-08-26 15:39 ` [PATCH v2 01/54] fs: make the i_state flags an enum Josef Bacik
2025-08-26 15:39 ` [PATCH v2 02/54] fs: add an icount_read helper Josef Bacik
2025-08-26 22:18 ` Mateusz Guzik
2025-08-27 11:25 ` (subset) " Christian Brauner
2025-08-26 15:39 ` [PATCH v2 03/54] fs: rework iput logic Josef Bacik
2025-08-27 12:58 ` Mateusz Guzik
2025-08-27 14:18 ` Mateusz Guzik
2025-08-27 14:54 ` Josef Bacik
2025-08-27 14:57 ` Christian Brauner
2025-08-27 16:24 ` [PATCH] fs: revamp iput() Mateusz Guzik
2025-08-30 15:54 ` Mateusz Guzik
2025-09-01 8:50 ` Jan Kara
2025-09-01 10:39 ` Christian Brauner
2025-09-01 10:41 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 04/54] fs: add an i_obj_count refcount to the inode Josef Bacik
2025-08-26 15:39 ` [PATCH v2 05/54] fs: hold an i_obj_count reference in wait_sb_inodes Josef Bacik
2025-08-26 15:39 ` [PATCH v2 06/54] fs: hold an i_obj_count reference for the i_wb_list Josef Bacik
2025-08-26 15:39 ` [PATCH v2 07/54] fs: hold an i_obj_count reference for the i_io_list Josef Bacik
2025-08-26 15:39 ` [PATCH v2 08/54] fs: hold an i_obj_count reference in writeback_sb_inodes Josef Bacik
2025-08-26 15:39 ` [PATCH v2 09/54] fs: hold an i_obj_count reference while on the hashtable Josef Bacik
2025-08-26 15:39 ` [PATCH v2 10/54] fs: hold an i_obj_count reference while on the LRU list Josef Bacik
2025-08-26 15:39 ` [PATCH v2 11/54] fs: hold an i_obj_count reference while on the sb inode list Josef Bacik
2025-08-26 15:39 ` [PATCH v2 12/54] fs: stop accessing ->i_count directly in f2fs and gfs2 Josef Bacik
2025-08-26 15:39 ` [PATCH v2 13/54] fs: hold an i_obj_count when we have an i_count reference Josef Bacik
2025-08-26 15:39 ` [PATCH v2 14/54] fs: add an I_LRU flag to the inode Josef Bacik
2025-08-26 15:39 ` [PATCH v2 15/54] fs: maintain a list of pinned inodes Josef Bacik
2025-08-27 15:20 ` Christian Brauner
2025-08-27 16:07 ` Josef Bacik
2025-08-28 8:24 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 16/54] fs: delete the inode from the LRU list on lookup Josef Bacik
2025-08-27 21:46 ` Dave Chinner
2025-08-28 11:42 ` Josef Bacik
2025-09-02 4:07 ` Dave Chinner
2025-08-26 15:39 ` [PATCH v2 17/54] fs: remove the inode from the LRU list on unlink/rmdir Josef Bacik
2025-08-27 12:32 ` Christian Brauner
2025-08-27 16:08 ` Josef Bacik
2025-08-27 22:01 ` Dave Chinner
2025-08-28 11:46 ` Josef Bacik
2025-09-02 1:48 ` Dave Chinner
2025-08-28 9:00 ` Christian Brauner
2025-08-28 9:06 ` Christian Brauner
2025-08-28 10:43 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 18/54] fs: change evict_inodes to use iput instead of evict directly Josef Bacik
2025-08-28 10:18 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 19/54] fs: hold a full ref while the inode is on a LRU Josef Bacik
2025-08-28 10:51 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 20/54] fs: disallow 0 reference count inodes Josef Bacik
2025-08-28 11:02 ` Christian Brauner
2025-08-28 11:44 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 21/54] fs: make evict_inodes add to the dispose list under the i_lock Josef Bacik
2025-08-26 15:39 ` [PATCH v2 22/54] fs: convert i_count to refcount_t Josef Bacik
2025-08-28 12:00 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 23/54] fs: use refcount_inc_not_zero in igrab Josef Bacik
2025-08-28 22:08 ` Eric Biggers
2025-08-29 13:42 ` Josef Bacik
2025-08-26 15:39 ` [PATCH v2 24/54] fs: use inode_tryget in find_inode* Josef Bacik
2025-08-26 15:39 ` [PATCH v2 25/54] fs: update find_inode_*rcu to check the i_count count Josef Bacik
2025-08-26 15:39 ` [PATCH v2 26/54] fs: use igrab in insert_inode_locked Josef Bacik
2025-08-28 12:15 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 27/54] fs: remove I_WILL_FREE|I_FREEING check from __inode_add_lru Josef Bacik
2025-08-26 15:39 ` [PATCH v2 28/54] fs: remove I_WILL_FREE|I_FREEING check in inode_pin_lru_isolating Josef Bacik
2025-08-26 15:39 ` [PATCH v2 29/54] fs: use inode_tryget in evict_inodes Josef Bacik
2025-08-26 15:39 ` [PATCH v2 30/54] fs: change evict_dentries_for_decrypted_inodes to use refcount Josef Bacik
2025-08-28 12:25 ` Christian Brauner
2025-08-28 22:26 ` Eric Biggers
2025-08-29 7:38 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 31/54] block: use igrab in sync_bdevs Josef Bacik
2025-08-26 15:39 ` [PATCH v2 32/54] bcachefs: use the refcount instead of I_WILL_FREE|I_FREEING Josef Bacik
2025-08-26 15:39 ` [PATCH v2 33/54] btrfs: don't check I_WILL_FREE|I_FREEING Josef Bacik
2025-08-26 15:39 ` [PATCH v2 34/54] fs: use igrab in drop_pagecache_sb Josef Bacik
2025-08-26 15:39 ` [PATCH v2 35/54] fs: stop checking I_FREEING in d_find_alias_rcu Josef Bacik
2025-08-26 15:39 ` [PATCH v2 36/54] ext4: stop checking I_WILL_FREE|IFREEING in ext4_check_map_extents_env Josef Bacik
2025-08-26 15:39 ` [PATCH v2 37/54] fs: remove I_WILL_FREE|I_FREEING from fs-writeback.c Josef Bacik
2025-08-26 15:39 ` [PATCH v2 38/54] gfs2: remove I_WILL_FREE|I_FREEING usage Josef Bacik
2025-08-26 15:39 ` [PATCH v2 39/54] fs: remove I_WILL_FREE|I_FREEING check from dquot.c Josef Bacik
2025-08-28 12:35 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 40/54] notify: remove I_WILL_FREE|I_FREEING checks in fsnotify_unmount_inodes Josef Bacik
2025-08-26 15:39 ` [PATCH v2 41/54] xfs: remove I_FREEING check Josef Bacik
2025-08-26 15:39 ` [PATCH v2 42/54] landlock: remove I_FREEING|I_WILL_FREE check Josef Bacik
2025-08-26 15:39 ` [PATCH v2 43/54] fs: change inode_is_dirtytime_only to use refcount Josef Bacik
2025-08-26 22:06 ` Mateusz Guzik
2025-08-28 12:38 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 44/54] btrfs: remove references to I_FREEING Josef Bacik
2025-08-26 15:39 ` [PATCH v2 45/54] ext4: remove reference to I_FREEING in inode.c Josef Bacik
2025-08-26 15:39 ` [PATCH v2 46/54] ext4: remove reference to I_FREEING in orphan.c Josef Bacik
2025-08-26 15:39 ` [PATCH v2 47/54] pnfs: use i_count refcount to determine if the inode is going away Josef Bacik
2025-08-26 15:39 ` [PATCH v2 48/54] fs: remove some spurious I_FREEING references in inode.c Josef Bacik
2025-08-28 12:40 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 49/54] xfs: remove reference to I_FREEING|I_WILL_FREE Josef Bacik
2025-08-26 15:39 ` [PATCH v2 50/54] ocfs2: do not set I_WILL_FREE Josef Bacik
2025-08-26 15:39 ` [PATCH v2 51/54] fs: remove I_FREEING|I_WILL_FREE Josef Bacik
2025-08-28 12:42 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 52/54] fs: remove I_REFERENCED Josef Bacik
2025-08-28 12:47 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 53/54] fs: remove I_LRU_ISOLATING flag Josef Bacik
2025-08-28 0:26 ` Dave Chinner
2025-08-28 10:35 ` Christian Brauner
2025-08-26 15:39 ` [PATCH v2 54/54] fs: add documentation explaining the reference count rules for inodes Josef Bacik
2025-08-27 8:03 ` [syzbot ci] Re: fs: rework inode reference counting syzbot ci
2025-08-27 11:14 ` (subset) [PATCH v2 00/54] " Christian Brauner
2025-08-28 12:51 ` Christian Brauner
2025-08-28 21:22 ` Josef Bacik
2025-09-02 10:06 ` Mateusz Guzik
2025-09-02 21:16 ` Josef Bacik