* [PATCH 00/50] fs: rework inode reference counting
@ 2025-08-21 20:18 Josef Bacik
2025-08-21 20:18 ` [PATCH 01/50] fs: add an i_obj_count refcount to the inode Josef Bacik
` (50 more replies)
0 siblings, 51 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Hello,
This series is the first part of a larger body of work geared towards solving a
variety of scalability issues in the VFS.
We have historically had a variety of foot-guns related to inode freeing. We
have I_WILL_FREE and I_FREEING flags that indicated when the inode was in the
different stages of being reclaimed. This lead to confusion, and bugs in cases
where one was checked but the other wasn't. Additionally, it's frankly
confusing to have both of these flags and to deal with them in practice.
However, this exists because we have an odd behavior with inodes, we allow them
to have a 0 reference count and still be usable. This again is a pretty unfun
footgun, because generally speaking we want reference counts to be meaningful.
The problem with the way we reference inodes is the final iput(). The majority
of file systems do their final truncate of a unlinked inode in their
->evict_inode() callback, which happens when the inode is actually being
evicted. This can be a long process for large inodes, and thus isn't safe to
happen in a variety of contexts. Btrfs, for example, has an entire delayed iput
infrastructure to make sure that we do not do the final iput() in a dangerous
context. We cannot expand the use of this reference count to all the places the
inode is used, because there are cases where we would need to iput() in an IRQ
context (end folio writeback) or other unsafe context, which is not allowed.
To that end, resolve this by introducing a new i_obj_count reference count. This
will be used to control when we can actually free the inode. We then can use
this reference count in all the places where we may reference the inode. This
removes another huge footgun, having ways to access the inode itself without
having an actual reference to it. The writeback code is one of the main places
where we see this. Inodes end up on all sorts of lists here without a proper
reference count. This allows us to protect the inode from being freed by giving
this an other code mechanisms to protect their access to the inode.
With this we can separate the concept of the inode being usable, and the inode
being freed. The next part of the patch series is to stop allowing for inodes
to have an i_count of 0 and still be viable. This comes with some warts. The
biggest wart is now if we choose to cache inodes in the LRU list we have to
remove the inode from the LRU list if we access it once it's on the LRU list.
This will result in more contention on the lru list lock, but in practice we
rarely have inodes that do not have a dentry, and if we do that inode is not
long for this world.
With not allowing inodes to hit a refcount of 0, we can take advantage of that
common pattern of using refcount_inc_not_zero() in all of the lockless places
where we do inode lookup in cache. From there we can change all the users who
check I_WILL_FREE or I_FREEING to simply check the i_count. If it is 0 then they
aren't allowed to do their work, othrwise they can proceed as normal.
With all of that in place we can finally remove these two flags.
This is a large series, but it is mostly mechanical. I've kept the patches very
small, to make it easy to review and logic about each change. I have run this
through fstests for btrfs and ext4, xfs is currently going. I wanted to get this
out for review to make sure this big design changes are reasonable to everybody.
The series is based on vfs/vfs.all branch, which is based on 6.9-rc1. Thanks,
Josef
Josef Bacik (50):
fs: add an i_obj_count refcount to the inode
fs: make the i_state flags an enum
fs: hold an i_obj_count reference in wait_sb_inodes
fs: hold an i_obj_count reference for the i_wb_list
fs: hold an i_obj_count reference for the i_io_list
fs: hold an i_obj_count reference in writeback_sb_inodes
fs: hold an i_obj_count reference while on the hashtable
fs: hold an i_obj_count reference while on the LRU list
fs: hold an i_obj_count reference while on the sb inode list
fs: stop accessing ->i_count directly in f2fs and gfs2
fs: hold an i_obj_count when we have an i_count reference
fs: rework iput logic
fs: add an I_LRU flag to the inode
fs: maintain a list of pinned inodes
fs: delete the inode from the LRU list on lookup
fs: change evict_inodes to use iput instead of evict directly
fs: hold a full ref while the inode is on a LRU
fs: disallow 0 reference count inodes
fs: make evict_inodes add to the dispose list under the i_lock
fs: convert i_count to refcount_t
fs: use refcount_inc_not_zero in igrab
fs: use inode_tryget in find_inode*
fs: update find_inode_*rcu to check the i_count count
fs: use igrab in insert_inode_locked
fs: remove I_WILL_FREE|I_FREEING check from __inode_add_lru
fs: remove I_WILL_FREE|I_FREEING check in inode_pin_lru_isolating
fs: use inode_tryget in evict_inodes
fs: change evict_dentries_for_decrypted_inodes to use refcount
block: use igrab in sync_bdevs
bcachefs: use the refcount instead of I_WILL_FREE|I_FREEING
btrfs: don't check I_WILL_FREE|I_FREEING
fs: use igrab in drop_pagecache_sb
fs: stop checking I_FREEING in d_find_alias_rcu
ext4: stop checking I_WILL_FREE|IFREEING in ext4_check_map_extents_env
fs: remove I_WILL_FREE|I_FREEING from fs-writeback.c
gfs2: remove I_WILL_FREE|I_FREEING usage
fs: remove I_WILL_FREE|I_FREEING check from dquot.c
notify: remove I_WILL_FREE|I_FREEING checks in fsnotify_unmount_inodes
xfs: remove I_FREEING check
landlock: remove I_FREEING|I_WILL_FREE check
fs: change inode_is_dirtytime_only to use refcount
btrfs: remove references to I_FREEING
ext4: remove reference to I_FREEING in inode.c
ext4: remove reference to I_FREEING in orphan.c
pnfs: use i_count refcount to determine if the inode is going away
fs: remove some spurious I_FREEING references in inode.c
xfs: remove reference to I_FREEING|I_WILL_FREE
ocfs2: do not set I_WILL_FREE
fs: remove I_FREEING|I_WILL_FREE
fs: add documentation explaining the reference count rules for inodes
Documentation/filesystems/vfs.rst | 23 ++
arch/powerpc/platforms/cell/spufs/file.c | 2 +-
block/bdev.c | 8 +-
fs/bcachefs/fs.c | 3 +-
fs/btrfs/inode.c | 11 +-
fs/ceph/mds_client.c | 2 +-
fs/crypto/keyring.c | 7 +-
fs/dcache.c | 4 +-
fs/drop_caches.c | 11 +-
fs/ext4/ialloc.c | 4 +-
fs/ext4/inode.c | 8 +-
fs/ext4/orphan.c | 6 +-
fs/f2fs/super.c | 4 +-
fs/fs-writeback.c | 105 +++++--
fs/gfs2/ops_fstype.c | 17 +-
fs/hpfs/inode.c | 2 +-
fs/inode.c | 371 ++++++++++++++++-------
fs/internal.h | 1 +
fs/nfs/inode.c | 4 +-
fs/nfs/pnfs.c | 2 +-
fs/notify/fsnotify.c | 26 +-
fs/ocfs2/inode.c | 4 -
fs/quota/dquot.c | 6 +-
fs/super.c | 3 +
fs/ubifs/super.c | 2 +-
fs/xfs/scrub/common.c | 3 +-
fs/xfs/xfs_bmap_util.c | 2 +-
fs/xfs/xfs_inode.c | 2 +-
fs/xfs/xfs_trace.h | 2 +-
include/linux/fs.h | 284 ++++++++++-------
include/trace/events/filelock.h | 2 +-
include/trace/events/writeback.h | 6 +-
security/landlock/fs.c | 22 +-
33 files changed, 607 insertions(+), 352 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 85+ messages in thread
* [PATCH 01/50] fs: add an i_obj_count refcount to the inode
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 02/50] fs: make the i_state flags an enum Josef Bacik
` (49 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Currently the inode's lifetime is controlled by it's main refcount,
i_count. We overload the eviction of the inode (the final unlink) with
the deletion of the in-memory object, which leads to some hilarity when
we iput() in unfortunate places.
In order to make this less of a footgun, we want to separate the notion
of "is this inode in use by a user" and "is this inode object currently
in use", since deleting an inode is a much heavier operation that
deleting the object and cleaning up its memory.
To that end, introduce ->i_obj_count to the inode. This will be used to
control the lifetime of the inode object itself. We will continue to use
the ->i_count refcount as normal to reduce the churn of adding a new
refcount to inode. Subsequent patches will expand the usage of
->i_obj_count for internal uses, and then I will separate out the
tear down operations of the inode, and then finally adjust the refount
rules for ->i_count to be more consistent with all other refcounts.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 20 ++++++++++++++++++++
include/linux/fs.h | 7 +++++++
2 files changed, 27 insertions(+)
diff --git a/fs/inode.c b/fs/inode.c
index cc0f717a140d..454770393fef 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -235,6 +235,7 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
inode->i_flags = 0;
inode->i_state = 0;
atomic64_set(&inode->i_sequence, 0);
+ refcount_set(&inode->i_obj_count, 1);
atomic_set(&inode->i_count, 1);
inode->i_op = &empty_iops;
inode->i_fop = &no_open_fops;
@@ -831,6 +832,11 @@ static void evict(struct inode *inode)
inode_wake_up_bit(inode, __I_NEW);
BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
+ /*
+ * refcount_dec_and_test must be used here to avoid the underflow
+ * warning.
+ */
+ WARN_ON(!refcount_dec_and_test(&inode->i_obj_count));
destroy_inode(inode);
}
@@ -1925,6 +1931,20 @@ void iput(struct inode *inode)
}
EXPORT_SYMBOL(iput);
+/**
+ * iobj_put - put a object reference on an inode
+ * @inode: inode to put
+ *
+ * Puts a object reference on an inode.
+ */
+void iobj_put(struct inode *inode)
+{
+ if (!inode)
+ return;
+ refcount_dec(&inode->i_obj_count);
+}
+EXPORT_SYMBOL(iobj_put);
+
#ifdef CONFIG_BLOCK
/**
* bmap - find a block number in a file
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a346422f5066..9a1ce67eed33 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -755,6 +755,7 @@ struct inode {
#if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
atomic_t i_readcount; /* struct files open RO */
#endif
+ refcount_t i_obj_count;
union {
const struct file_operations *i_fop; /* former ->i_op->default_file_ops */
void (*free_inode)(struct inode *);
@@ -2804,6 +2805,7 @@ extern int current_umask(void);
extern void ihold(struct inode * inode);
extern void iput(struct inode *);
+extern void iobj_put(struct inode *inode);
int inode_update_timestamps(struct inode *inode, int flags);
int generic_update_time(struct inode *, int);
@@ -3359,6 +3361,11 @@ static inline bool is_zero_ino(ino_t ino)
return (u32)ino == 0;
}
+static inline void iobj_get(struct inode *inode)
+{
+ refcount_inc(&inode->i_obj_count);
+}
+
/*
* inode->i_lock must be held
*/
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 02/50] fs: make the i_state flags an enum
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
2025-08-21 20:18 ` [PATCH 01/50] fs: add an i_obj_count refcount to the inode Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-22 11:08 ` Christian Brauner
2025-08-22 11:18 ` Sun YangKai
2025-08-21 20:18 ` [PATCH 03/50] fs: hold an i_obj_count reference in wait_sb_inodes Josef Bacik
` (48 subsequent siblings)
50 siblings, 2 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Adjusting i_state flags always means updating the values manually. Bring
these forward into the 2020's and make a nice clean macro for defining
the i_state values as an enum, providing __ variants for the cases where
we need the bit position instead of the actual value, and leaving the
actual NAME as the 1U << bit value.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
include/linux/fs.h | 234 +++++++++++++++++++++++----------------------
1 file changed, 122 insertions(+), 112 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9a1ce67eed33..e741dc453c2c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -665,6 +665,127 @@ is_uncached_acl(struct posix_acl *acl)
#define IOP_MGTIME 0x0020
#define IOP_CACHED_LINK 0x0040
+/*
+ * Inode state bits. Protected by inode->i_lock
+ *
+ * Four bits determine the dirty state of the inode: I_DIRTY_SYNC,
+ * I_DIRTY_DATASYNC, I_DIRTY_PAGES, and I_DIRTY_TIME.
+ *
+ * Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
+ * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
+ * various stages of removing an inode.
+ *
+ * Two bits are used for locking and completion notification, I_NEW and I_SYNC.
+ *
+ * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
+ * fdatasync() (unless I_DIRTY_DATASYNC is also set).
+ * Timestamp updates are the usual cause.
+ * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of
+ * these changes separately from I_DIRTY_SYNC so that we
+ * don't have to write inode on fdatasync() when only
+ * e.g. the timestamps have changed.
+ * I_DIRTY_PAGES Inode has dirty pages. Inode itself may be clean.
+ * I_DIRTY_TIME The inode itself has dirty timestamps, and the
+ * lazytime mount option is enabled. We keep track of this
+ * separately from I_DIRTY_SYNC in order to implement
+ * lazytime. This gets cleared if I_DIRTY_INODE
+ * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But
+ * I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already
+ * in place because writeback might already be in progress
+ * and we don't want to lose the time update
+ * I_NEW Serves as both a mutex and completion notification.
+ * New inodes set I_NEW. If two processes both create
+ * the same inode, one of them will release its inode and
+ * wait for I_NEW to be released before returning.
+ * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can
+ * also cause waiting on I_NEW, without I_NEW actually
+ * being set. find_inode() uses this to prevent returning
+ * nearly-dead inodes.
+ * I_WILL_FREE Must be set when calling write_inode_now() if i_count
+ * is zero. I_FREEING must be set when I_WILL_FREE is
+ * cleared.
+ * I_FREEING Set when inode is about to be freed but still has dirty
+ * pages or buffers attached or the inode itself is still
+ * dirty.
+ * I_CLEAR Added by clear_inode(). In this state the inode is
+ * clean and can be destroyed. Inode keeps I_FREEING.
+ *
+ * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are
+ * prohibited for many purposes. iget() must wait for
+ * the inode to be completely released, then create it
+ * anew. Other functions will just ignore such inodes,
+ * if appropriate. I_NEW is used for waiting.
+ *
+ * I_SYNC Writeback of inode is running. The bit is set during
+ * data writeback, and cleared with a wakeup on the bit
+ * address once it is done. The bit is also used to pin
+ * the inode in memory for flusher thread.
+ *
+ * I_REFERENCED Marks the inode as recently references on the LRU list.
+ *
+ * I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to
+ * synchronize competing switching instances and to tell
+ * wb stat updates to grab the i_pages lock. See
+ * inode_switch_wbs_work_fn() for details.
+ *
+ * I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper
+ * and work dirs among overlayfs mounts.
+ *
+ * I_CREATING New object's inode in the middle of setting up.
+ *
+ * I_DONTCACHE Evict inode as soon as it is not used anymore.
+ *
+ * I_SYNC_QUEUED Inode is queued in b_io or b_more_io writeback lists.
+ * Used to detect that mark_inode_dirty() should not move
+ * inode between dirty lists.
+ *
+ * I_PINNING_FSCACHE_WB Inode is pinning an fscache object for writeback.
+ *
+ * I_LRU_ISOLATING Inode is pinned being isolated from LRU without holding
+ * i_count.
+ *
+ * Q: What is the difference between I_WILL_FREE and I_FREEING?
+ *
+ * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
+ * upon. There's one free address left.
+ */
+
+/*
+ * As simple macro to define the inode state bits, __NAME will be the bit value
+ * (0, 1, 2, ...), and NAME will be the bit mask (1U << __NAME). The __NAME_SEQ
+ * is used to reset the sequence number so the next name gets the next bit value
+ * in the sequence.
+ */
+#define INODE_BIT(name) \
+ __ ## name, \
+ name = (1U << __ ## name), \
+ __ ## name ## _SEQ = __ ## name
+
+enum inode_state_bits {
+ INODE_BIT(I_NEW),
+ INODE_BIT(I_SYNC),
+ INODE_BIT(I_LRU_ISOLATING),
+ INODE_BIT(I_DIRTY_SYNC),
+ INODE_BIT(I_DIRTY_DATASYNC),
+ INODE_BIT(I_DIRTY_PAGES),
+ INODE_BIT(I_WILL_FREE),
+ INODE_BIT(I_FREEING),
+ INODE_BIT(I_CLEAR),
+ INODE_BIT(I_REFERENCED),
+ INODE_BIT(I_LINKABLE),
+ INODE_BIT(I_DIRTY_TIME),
+ INODE_BIT(I_WB_SWITCH),
+ INODE_BIT(I_OVL_INUSE),
+ INODE_BIT(I_CREATING),
+ INODE_BIT(I_DONTCACHE),
+ INODE_BIT(I_SYNC_QUEUED),
+ INODE_BIT(I_PINNING_NETFS_WB),
+};
+
+#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
+#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
+#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
+
/*
* Keep mostly read-only and often accessed (especially for
* the RCU path lookup and 'stat' data) fields at the beginning
@@ -723,7 +844,7 @@ struct inode {
#endif
/* Misc */
- u32 i_state;
+ enum inode_state_bits i_state;
/* 32-bit hole */
struct rw_semaphore i_rwsem;
@@ -2484,117 +2605,6 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
};
}
-/*
- * Inode state bits. Protected by inode->i_lock
- *
- * Four bits determine the dirty state of the inode: I_DIRTY_SYNC,
- * I_DIRTY_DATASYNC, I_DIRTY_PAGES, and I_DIRTY_TIME.
- *
- * Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
- * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
- * various stages of removing an inode.
- *
- * Two bits are used for locking and completion notification, I_NEW and I_SYNC.
- *
- * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
- * fdatasync() (unless I_DIRTY_DATASYNC is also set).
- * Timestamp updates are the usual cause.
- * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of
- * these changes separately from I_DIRTY_SYNC so that we
- * don't have to write inode on fdatasync() when only
- * e.g. the timestamps have changed.
- * I_DIRTY_PAGES Inode has dirty pages. Inode itself may be clean.
- * I_DIRTY_TIME The inode itself has dirty timestamps, and the
- * lazytime mount option is enabled. We keep track of this
- * separately from I_DIRTY_SYNC in order to implement
- * lazytime. This gets cleared if I_DIRTY_INODE
- * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But
- * I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already
- * in place because writeback might already be in progress
- * and we don't want to lose the time update
- * I_NEW Serves as both a mutex and completion notification.
- * New inodes set I_NEW. If two processes both create
- * the same inode, one of them will release its inode and
- * wait for I_NEW to be released before returning.
- * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can
- * also cause waiting on I_NEW, without I_NEW actually
- * being set. find_inode() uses this to prevent returning
- * nearly-dead inodes.
- * I_WILL_FREE Must be set when calling write_inode_now() if i_count
- * is zero. I_FREEING must be set when I_WILL_FREE is
- * cleared.
- * I_FREEING Set when inode is about to be freed but still has dirty
- * pages or buffers attached or the inode itself is still
- * dirty.
- * I_CLEAR Added by clear_inode(). In this state the inode is
- * clean and can be destroyed. Inode keeps I_FREEING.
- *
- * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are
- * prohibited for many purposes. iget() must wait for
- * the inode to be completely released, then create it
- * anew. Other functions will just ignore such inodes,
- * if appropriate. I_NEW is used for waiting.
- *
- * I_SYNC Writeback of inode is running. The bit is set during
- * data writeback, and cleared with a wakeup on the bit
- * address once it is done. The bit is also used to pin
- * the inode in memory for flusher thread.
- *
- * I_REFERENCED Marks the inode as recently references on the LRU list.
- *
- * I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to
- * synchronize competing switching instances and to tell
- * wb stat updates to grab the i_pages lock. See
- * inode_switch_wbs_work_fn() for details.
- *
- * I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper
- * and work dirs among overlayfs mounts.
- *
- * I_CREATING New object's inode in the middle of setting up.
- *
- * I_DONTCACHE Evict inode as soon as it is not used anymore.
- *
- * I_SYNC_QUEUED Inode is queued in b_io or b_more_io writeback lists.
- * Used to detect that mark_inode_dirty() should not move
- * inode between dirty lists.
- *
- * I_PINNING_FSCACHE_WB Inode is pinning an fscache object for writeback.
- *
- * I_LRU_ISOLATING Inode is pinned being isolated from LRU without holding
- * i_count.
- *
- * Q: What is the difference between I_WILL_FREE and I_FREEING?
- *
- * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
- * upon. There's one free address left.
- */
-#define __I_NEW 0
-#define I_NEW (1 << __I_NEW)
-#define __I_SYNC 1
-#define I_SYNC (1 << __I_SYNC)
-#define __I_LRU_ISOLATING 2
-#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING)
-
-#define I_DIRTY_SYNC (1 << 3)
-#define I_DIRTY_DATASYNC (1 << 4)
-#define I_DIRTY_PAGES (1 << 5)
-#define I_WILL_FREE (1 << 6)
-#define I_FREEING (1 << 7)
-#define I_CLEAR (1 << 8)
-#define I_REFERENCED (1 << 9)
-#define I_LINKABLE (1 << 10)
-#define I_DIRTY_TIME (1 << 11)
-#define I_WB_SWITCH (1 << 12)
-#define I_OVL_INUSE (1 << 13)
-#define I_CREATING (1 << 14)
-#define I_DONTCACHE (1 << 15)
-#define I_SYNC_QUEUED (1 << 16)
-#define I_PINNING_NETFS_WB (1 << 17)
-
-#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
-#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
-#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
-
extern void __mark_inode_dirty(struct inode *, int);
static inline void mark_inode_dirty(struct inode *inode)
{
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 03/50] fs: hold an i_obj_count reference in wait_sb_inodes
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
2025-08-21 20:18 ` [PATCH 01/50] fs: add an i_obj_count refcount to the inode Josef Bacik
2025-08-21 20:18 ` [PATCH 02/50] fs: make the i_state flags an enum Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 04/50] fs: hold an i_obj_count reference for the i_wb_list Josef Bacik
` (47 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
In wait_sb_inodes we need to hold a reference for the inode while we're
waiting on writeback to complete, hold a reference on the inode object
during this operation.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fs-writeback.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a6cc3d305b84..001773e6e95c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2704,6 +2704,7 @@ static void wait_sb_inodes(struct super_block *sb)
continue;
}
__iget(inode);
+ iobj_get(inode);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
@@ -2717,6 +2718,7 @@ static void wait_sb_inodes(struct super_block *sb)
cond_resched();
iput(inode);
+ iobj_put(inode);
rcu_read_lock();
spin_lock_irq(&sb->s_inode_wblist_lock);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 04/50] fs: hold an i_obj_count reference for the i_wb_list
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (2 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 03/50] fs: hold an i_obj_count reference in wait_sb_inodes Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-22 11:27 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 05/50] fs: hold an i_obj_count reference for the i_io_list Josef Bacik
` (46 subsequent siblings)
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
If we're holding the inode on one of the writeback lists we need to have
a reference on that inode. Grab a reference when we add i_wb_list to
something, drop it when it's removed.
This is potentially dangerous, because we remove the inode from the
i_wb_list potentially under IRQ via folio_end_writeback(). This will be
mitigated by making sure all writeback is completed on the final iput,
before the final iobj_put, preventing a potential free under IRQ.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fs-writeback.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 001773e6e95c..c2437e3d320a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1332,6 +1332,7 @@ void sb_mark_inode_writeback(struct inode *inode)
if (list_empty(&inode->i_wb_list)) {
spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
if (list_empty(&inode->i_wb_list)) {
+ iobj_get(inode);
list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
trace_sb_mark_inode_writeback(inode);
}
@@ -1346,15 +1347,26 @@ void sb_clear_inode_writeback(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
unsigned long flags;
+ bool drop = false;
if (!list_empty(&inode->i_wb_list)) {
spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
if (!list_empty(&inode->i_wb_list)) {
+ drop = true;
list_del_init(&inode->i_wb_list);
trace_sb_clear_inode_writeback(inode);
}
spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
}
+
+ /*
+ * This can be called in IRQ context when we're clearing writeback on
+ * the folio. This should not be the last iobj_put() on the inode, we
+ * run all of the writeback before we free the inode in order to avoid
+ * this possibility.
+ */
+ if (drop)
+ iobj_put(inode);
}
/*
@@ -2683,6 +2695,8 @@ static void wait_sb_inodes(struct super_block *sb)
* to preserve consistency between i_wb_list and the mapping
* writeback tag. Writeback completion is responsible to remove
* the inode from either list once the writeback tag is cleared.
+ * At that point the i_obj_count reference will be dropped for
+ * the i_wb_list reference.
*/
list_move_tail(&inode->i_wb_list, &sb->s_inodes_wb);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 05/50] fs: hold an i_obj_count reference for the i_io_list
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (3 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 04/50] fs: hold an i_obj_count reference for the i_wb_list Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 06/50] fs: hold an i_obj_count reference in writeback_sb_inodes Josef Bacik
` (45 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
While the inode is attached to a list with its i_io_list member we need
to hold a reference on the object.
The put is under the i_lock in some cases which could potentially be
unsafe. It isn't currently because we're holding the i_obj_count
throughout the entire lifetime of the inode, so it won't be the last
currently. Subsequent patches will make sure we're holding an
i_obj_count reference while we're manipulating this list to ensure this
continues to be safe.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fs-writeback.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c2437e3d320a..24fccb299de4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -72,6 +72,14 @@ static inline struct inode *wb_inode(struct list_head *head)
return list_entry(head, struct inode, i_io_list);
}
+static inline void inode_delete_from_io_list(struct inode *inode)
+{
+ if (!list_empty(&inode->i_io_list)) {
+ list_del_init(&inode->i_io_list);
+ iobj_put(inode);
+ }
+}
+
/*
* Include the creation of the trace points after defining the
* wb_writeback_work structure and inline functions so that the definition
@@ -123,6 +131,8 @@ static bool inode_io_list_move_locked(struct inode *inode,
assert_spin_locked(&inode->i_lock);
WARN_ON_ONCE(inode->i_state & I_FREEING);
+ if (list_empty(&inode->i_io_list))
+ iobj_get(inode);
list_move(&inode->i_io_list, head);
/* dirty_time doesn't count as dirty_io until expiration */
@@ -310,7 +320,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode,
if (wb != &wb->bdi->wb)
list_move(&inode->i_io_list, &wb->b_attached);
else
- list_del_init(&inode->i_io_list);
+ inode_delete_from_io_list(inode);
wb_io_lists_depopulated(wb);
}
@@ -1200,7 +1210,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode,
WARN_ON_ONCE(inode->i_state & I_FREEING);
inode->i_state &= ~I_SYNC_QUEUED;
- list_del_init(&inode->i_io_list);
+ inode_delete_from_io_list(inode);
wb_io_lists_depopulated(wb);
}
@@ -1308,16 +1318,23 @@ void wb_start_background_writeback(struct bdi_writeback *wb)
void inode_io_list_del(struct inode *inode)
{
struct bdi_writeback *wb;
+ bool drop = false;
wb = inode_to_wb_and_lock_list(inode);
spin_lock(&inode->i_lock);
inode->i_state &= ~I_SYNC_QUEUED;
- list_del_init(&inode->i_io_list);
+ if (!list_empty(&inode->i_io_list)) {
+ drop = true;
+ list_del_init(&inode->i_io_list);
+ }
wb_io_lists_depopulated(wb);
spin_unlock(&inode->i_lock);
spin_unlock(&wb->list_lock);
+
+ if (drop)
+ iobj_put(inode);
}
EXPORT_SYMBOL(inode_io_list_del);
@@ -1389,7 +1406,7 @@ static void redirty_tail_locked(struct inode *inode, struct bdi_writeback *wb)
* trigger assertions in inode_io_list_move_locked().
*/
if (inode->i_state & I_FREEING) {
- list_del_init(&inode->i_io_list);
+ inode_delete_from_io_list(inode);
wb_io_lists_depopulated(wb);
return;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 06/50] fs: hold an i_obj_count reference in writeback_sb_inodes
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (4 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 05/50] fs: hold an i_obj_count reference for the i_io_list Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-22 12:20 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 07/50] fs: hold an i_obj_count reference while on the hashtable Josef Bacik
` (44 subsequent siblings)
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
We drop the wb list_lock while writing back inodes, and we could
manipulate the i_io_list while this is happening and drop our reference
for the inode. Protect this by holding the i_obj_count reference during
the writeback.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fs-writeback.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 24fccb299de4..2b0d26a58a5a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1977,6 +1977,7 @@ static long writeback_sb_inodes(struct super_block *sb,
trace_writeback_sb_inodes_requeue(inode);
continue;
}
+ iobj_get(inode);
spin_unlock(&wb->list_lock);
/*
@@ -1987,6 +1988,7 @@ static long writeback_sb_inodes(struct super_block *sb,
if (inode->i_state & I_SYNC) {
/* Wait for I_SYNC. This function drops i_lock... */
inode_sleep_on_writeback(inode);
+ iobj_put(inode);
/* Inode may be gone, start again */
spin_lock(&wb->list_lock);
continue;
@@ -2035,10 +2037,9 @@ static long writeback_sb_inodes(struct super_block *sb,
inode_sync_complete(inode);
spin_unlock(&inode->i_lock);
- if (unlikely(tmp_wb != wb)) {
- spin_unlock(&tmp_wb->list_lock);
- spin_lock(&wb->list_lock);
- }
+ spin_unlock(&tmp_wb->list_lock);
+ iobj_put(inode);
+ spin_lock(&wb->list_lock);
/*
* bail out to wb_writeback() often enough to check
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 07/50] fs: hold an i_obj_count reference while on the hashtable
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (5 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 06/50] fs: hold an i_obj_count reference in writeback_sb_inodes Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 08/50] fs: hold an i_obj_count reference while on the LRU list Josef Bacik
` (43 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
While the inode is on the hashtable we need to hold a reference to the
object itself.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/fs/inode.c b/fs/inode.c
index 454770393fef..1ff46d9417de 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -667,6 +667,7 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
spin_lock(&inode_hash_lock);
spin_lock(&inode->i_lock);
+ iobj_get(inode);
hlist_add_head_rcu(&inode->i_hash, b);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_hash_lock);
@@ -681,11 +682,16 @@ EXPORT_SYMBOL(__insert_inode_hash);
*/
void __remove_inode_hash(struct inode *inode)
{
+ bool putref;
+
spin_lock(&inode_hash_lock);
spin_lock(&inode->i_lock);
+ putref = !hlist_unhashed(&inode->i_hash) && !hlist_fake(&inode->i_hash);
hlist_del_init_rcu(&inode->i_hash);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_hash_lock);
+ if (putref)
+ iobj_put(inode);
}
EXPORT_SYMBOL(__remove_inode_hash);
@@ -1314,6 +1320,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
* caller is responsible for filling in the contents
*/
spin_lock(&inode->i_lock);
+ iobj_get(inode);
inode->i_state |= I_NEW;
hlist_add_head_rcu(&inode->i_hash, head);
spin_unlock(&inode->i_lock);
@@ -1451,6 +1458,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
if (!old) {
inode->i_ino = ino;
spin_lock(&inode->i_lock);
+ iobj_get(inode);
inode->i_state = I_NEW;
hlist_add_head_rcu(&inode->i_hash, head);
spin_unlock(&inode->i_lock);
@@ -1803,6 +1811,7 @@ int insert_inode_locked(struct inode *inode)
}
if (likely(!old)) {
spin_lock(&inode->i_lock);
+ iobj_get(inode);
inode->i_state |= I_NEW | I_CREATING;
hlist_add_head_rcu(&inode->i_hash, head);
spin_unlock(&inode->i_lock);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 08/50] fs: hold an i_obj_count reference while on the LRU list
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (6 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 07/50] fs: hold an i_obj_count reference while on the hashtable Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 09/50] fs: hold an i_obj_count reference while on the sb inode list Josef Bacik
` (42 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
While on the LRU list we need to make sure the object itself does not
disappear, so hold an i_obj_count reference.
This is a little wonky currently as we're dropping the reference before
we call evict(), because currently we drop the last reference right
before we free the inode. This will be fixed in a future patch when the
freeing of the inode is moved under the control of the i_obj_count
reference.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 1ff46d9417de..7e506050a0bc 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -542,10 +542,12 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
if (!mapping_shrinkable(&inode->i_data))
return;
- if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru))
+ if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
+ iobj_get(inode);
this_cpu_inc(nr_unused);
- else if (rotate)
+ } else if (rotate) {
inode->i_state |= I_REFERENCED;
+ }
}
struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
@@ -571,8 +573,10 @@ void inode_add_lru(struct inode *inode)
static void inode_lru_list_del(struct inode *inode)
{
- if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru))
+ if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
+ iobj_put(inode);
this_cpu_dec(nr_unused);
+ }
}
static void inode_pin_lru_isolating(struct inode *inode)
@@ -861,6 +865,15 @@ static void dispose_list(struct list_head *head)
inode = list_first_entry(head, struct inode, i_lru);
list_del_init(&inode->i_lru);
+ /*
+ * This is going right here for now only because we are
+ * currently not using the i_obj_count reference for anything,
+ * and it needs to hit 0 when we call evict().
+ *
+ * This will be moved when we change the lifetime rules in a
+ * future patch.
+ */
+ iobj_put(inode);
evict(inode);
cond_resched();
}
@@ -897,6 +910,7 @@ void evict_inodes(struct super_block *sb)
}
inode->i_state |= I_FREEING;
+ iobj_get(inode);
inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
list_add(&inode->i_lru, &dispose);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 09/50] fs: hold an i_obj_count reference while on the sb inode list
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (7 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 08/50] fs: hold an i_obj_count reference while on the LRU list Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 10/50] fs: stop accessing ->i_count directly in f2fs and gfs2 Josef Bacik
` (41 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
We are holding this inode on a sb list, make sure we're holding an
i_obj_count reference while it exists on the list.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/inode.c b/fs/inode.c
index 7e506050a0bc..12e2e01aae0c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -630,6 +630,7 @@ void inode_sb_list_add(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
+ iobj_get(inode);
spin_lock(&sb->s_inode_list_lock);
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb->s_inode_list_lock);
@@ -644,6 +645,7 @@ static inline void inode_sb_list_del(struct inode *inode)
spin_lock(&sb->s_inode_list_lock);
list_del_init(&inode->i_sb_list);
spin_unlock(&sb->s_inode_list_lock);
+ iobj_put(inode);
}
}
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 10/50] fs: stop accessing ->i_count directly in f2fs and gfs2
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (8 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 09/50] fs: hold an i_obj_count reference while on the sb inode list Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-22 12:38 ` (subset) " Christian Brauner
2025-08-21 20:18 ` [PATCH 11/50] fs: hold an i_obj_count when we have an i_count reference Josef Bacik
` (40 subsequent siblings)
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Instead of accessing ->i_count directly in these file systems, use the
appropriate __iget and iput helpers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/f2fs/super.c | 4 ++--
fs/gfs2/ops_fstype.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 1db024b20e29..2045642cfe3b 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -1750,7 +1750,7 @@ static int f2fs_drop_inode(struct inode *inode)
if ((!inode_unhashed(inode) && inode->i_state & I_SYNC)) {
if (!inode->i_nlink && !is_bad_inode(inode)) {
/* to avoid evict_inode call simultaneously */
- atomic_inc(&inode->i_count);
+ __iget(inode);
spin_unlock(&inode->i_lock);
/* should remain fi->extent_tree for writepage */
@@ -1769,7 +1769,7 @@ static int f2fs_drop_inode(struct inode *inode)
sb_end_intwrite(inode->i_sb);
spin_lock(&inode->i_lock);
- atomic_dec(&inode->i_count);
+ iput(inode);
}
trace_f2fs_drop_inode(inode, 0);
return 0;
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index efe99b732551..c770006f8889 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1754,7 +1754,7 @@ static void gfs2_evict_inodes(struct super_block *sb)
spin_unlock(&inode->i_lock);
continue;
}
- atomic_inc(&inode->i_count);
+ __iget(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&sb->s_inode_list_lock);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 11/50] fs: hold an i_obj_count when we have an i_count reference
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (9 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 10/50] fs: stop accessing ->i_count directly in f2fs and gfs2 Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 12/50] fs: rework iput logic Josef Bacik
` (39 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
This is the start of the semantic changes of inode lifetimes.
Unfortunately we have to do two things in one patch to be properly safe,
but this is the only case where this happens.
First we take and drop an i_obj_count reference every time we get an
i_count reference. This is because we will be changing the i_count
reference to be the indicator of a "live" inode.
The second thing we do is move the life time of the memory allocation
for the inode under the control of the i_obj_count reference.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/inode.c | 4 +++-
fs/fs-writeback.c | 2 --
fs/inode.c | 29 +++++++++++------------------
include/linux/fs.h | 1 +
4 files changed, 15 insertions(+), 21 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 784bd48b4da9..bbbcd96e8f5c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3418,8 +3418,10 @@ void btrfs_add_delayed_iput(struct btrfs_inode *inode)
struct btrfs_fs_info *fs_info = inode->root->fs_info;
unsigned long flags;
- if (atomic_add_unless(&inode->vfs_inode.i_count, -1, 1))
+ if (atomic_add_unless(&inode->vfs_inode.i_count, -1, 1)) {
+ iobj_put(&inode->vfs_inode);
return;
+ }
WARN_ON_ONCE(test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state));
atomic_inc(&fs_info->nr_delayed_iputs);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2b0d26a58a5a..d2e1fb1a0787 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2736,7 +2736,6 @@ static void wait_sb_inodes(struct super_block *sb)
continue;
}
__iget(inode);
- iobj_get(inode);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
@@ -2750,7 +2749,6 @@ static void wait_sb_inodes(struct super_block *sb)
cond_resched();
iput(inode);
- iobj_put(inode);
rcu_read_lock();
spin_lock_irq(&sb->s_inode_wblist_lock);
diff --git a/fs/inode.c b/fs/inode.c
index 12e2e01aae0c..16acad5583fc 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -527,6 +527,7 @@ static void init_once(void *foo)
*/
void ihold(struct inode *inode)
{
+ iobj_get(inode);
WARN_ON(atomic_inc_return(&inode->i_count) < 2);
}
EXPORT_SYMBOL(ihold);
@@ -843,13 +844,6 @@ static void evict(struct inode *inode)
*/
inode_wake_up_bit(inode, __I_NEW);
BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
-
- /*
- * refcount_dec_and_test must be used here to avoid the underflow
- * warning.
- */
- WARN_ON(!refcount_dec_and_test(&inode->i_obj_count));
- destroy_inode(inode);
}
/*
@@ -867,16 +861,8 @@ static void dispose_list(struct list_head *head)
inode = list_first_entry(head, struct inode, i_lru);
list_del_init(&inode->i_lru);
- /*
- * This is going right here for now only because we are
- * currently not using the i_obj_count reference for anything,
- * and it needs to hit 0 when we call evict().
- *
- * This will be moved when we change the lifetime rules in a
- * future patch.
- */
- iobj_put(inode);
evict(inode);
+ iobj_put(inode);
cond_resched();
}
}
@@ -1945,6 +1931,11 @@ void iput(struct inode *inode)
retry:
if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
+ /*
+ * Increment i_count directly as we still have our
+ * i_obj_count reference still. This is temporary and
+ * will go away in a future patch.
+ */
atomic_inc(&inode->i_count);
spin_unlock(&inode->i_lock);
trace_writeback_lazytime_iput(inode);
@@ -1953,6 +1944,7 @@ void iput(struct inode *inode)
}
iput_final(inode);
}
+ iobj_put(inode);
}
EXPORT_SYMBOL(iput);
@@ -1960,13 +1952,14 @@ EXPORT_SYMBOL(iput);
* iobj_put - put a object reference on an inode
* @inode: inode to put
*
- * Puts a object reference on an inode.
+ * Puts a object reference on an inode, free's it if we get to zero.
*/
void iobj_put(struct inode *inode)
{
if (!inode)
return;
- refcount_dec(&inode->i_obj_count);
+ if (refcount_dec_and_test(&inode->i_obj_count))
+ destroy_inode(inode);
}
EXPORT_SYMBOL(iobj_put);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e741dc453c2c..b2048fd9c300 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3381,6 +3381,7 @@ static inline void iobj_get(struct inode *inode)
*/
static inline void __iget(struct inode *inode)
{
+ iobj_get(inode);
atomic_inc(&inode->i_count);
}
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 12/50] fs: rework iput logic
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (10 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 11/50] fs: hold an i_obj_count when we have an i_count reference Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-22 12:54 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 13/50] fs: add an I_LRU flag to the inode Josef Bacik
` (38 subsequent siblings)
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Currently, if we are the last iput, and we have the I_DIRTY_TIME bit
set, we will grab a reference on the inode again and then mark it dirty
and then redo the put. This is to make sure we delay the time update
for as long as possible.
We can rework this logic to simply dec i_count if it is not 1, and if it
is do the time update while still holding the i_count reference.
Then we can replace the atomic_dec_and_lock with locking the ->i_lock
and doing atomic_dec_and_test, since we did the atomic_add_unless above.
This is preparation for no longer allowing 0 i_count inodes to exist.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 31 ++++++++++++++++---------------
1 file changed, 16 insertions(+), 15 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 16acad5583fc..814c03f5dbb1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1928,22 +1928,23 @@ void iput(struct inode *inode)
if (!inode)
return;
BUG_ON(inode->i_state & I_CLEAR);
-retry:
- if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
- if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
- /*
- * Increment i_count directly as we still have our
- * i_obj_count reference still. This is temporary and
- * will go away in a future patch.
- */
- atomic_inc(&inode->i_count);
- spin_unlock(&inode->i_lock);
- trace_writeback_lazytime_iput(inode);
- mark_inode_dirty_sync(inode);
- goto retry;
- }
- iput_final(inode);
+
+ if (atomic_add_unless(&inode->i_count, -1, 1)) {
+ iobj_put(inode);
+ return;
}
+
+ if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
+ trace_writeback_lazytime_iput(inode);
+ mark_inode_dirty_sync(inode);
+ }
+
+ spin_lock(&inode->i_lock);
+ if (atomic_dec_and_test(&inode->i_count))
+ iput_final(inode);
+ else
+ spin_unlock(&inode->i_lock);
+
iobj_put(inode);
}
EXPORT_SYMBOL(iput);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 13/50] fs: add an I_LRU flag to the inode
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (11 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 12/50] fs: rework iput logic Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 14/50] fs: maintain a list of pinned inodes Josef Bacik
` (37 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
We will be adding another list for the inode to keep track of inodes
that are being cached for other reasons. This is necessary to make sure
we know which list the inode is on, and to differentiate it from the
private dispose lists.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 7 +++++++
include/linux/fs.h | 6 ++++++
include/trace/events/writeback.h | 3 ++-
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/fs/inode.c b/fs/inode.c
index 814c03f5dbb1..94769b356224 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -545,6 +545,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
iobj_get(inode);
+ inode->i_state |= I_LRU;
this_cpu_inc(nr_unused);
} else if (rotate) {
inode->i_state |= I_REFERENCED;
@@ -574,7 +575,11 @@ void inode_add_lru(struct inode *inode)
static void inode_lru_list_del(struct inode *inode)
{
+ if (!(inode->i_state & I_LRU))
+ return;
+
if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
+ inode->i_state &= ~I_LRU;
iobj_put(inode);
this_cpu_dec(nr_unused);
}
@@ -955,6 +960,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
(inode->i_state & ~I_REFERENCED) ||
!mapping_shrinkable(&inode->i_data)) {
list_lru_isolate(lru, &inode->i_lru);
+ inode->i_state &= ~I_LRU;
spin_unlock(&inode->i_lock);
this_cpu_dec(nr_unused);
return LRU_REMOVED;
@@ -991,6 +997,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
+ inode->i_state &= ~I_LRU;
list_lru_isolate_move(lru, &inode->i_lru, freeable);
spin_unlock(&inode->i_lock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b2048fd9c300..509e696a4df0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -744,6 +744,11 @@ is_uncached_acl(struct posix_acl *acl)
* I_LRU_ISOLATING Inode is pinned being isolated from LRU without holding
* i_count.
*
+ * I_LRU Inode is on the LRU list and has an associated LRU
+ * reference count. Used to distinguish inodes where
+ * ->i_lru is on the LRU and those that are using ->i_lru
+ * for some other means.
+ *
* Q: What is the difference between I_WILL_FREE and I_FREEING?
*
* __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
@@ -780,6 +785,7 @@ enum inode_state_bits {
INODE_BIT(I_DONTCACHE),
INODE_BIT(I_SYNC_QUEUED),
INODE_BIT(I_PINNING_NETFS_WB),
+ INODE_BIT(I_LRU),
};
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 1e23919c0da9..486f85aca84d 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -28,7 +28,8 @@
{I_DONTCACHE, "I_DONTCACHE"}, \
{I_SYNC_QUEUED, "I_SYNC_QUEUED"}, \
{I_PINNING_NETFS_WB, "I_PINNING_NETFS_WB"}, \
- {I_LRU_ISOLATING, "I_LRU_ISOLATING"} \
+ {I_LRU_ISOLATING, "I_LRU_ISOLATING"}, \
+ {I_LRU, "I_LRU"} \
)
/* enums need to be exported to user space */
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 14/50] fs: maintain a list of pinned inodes
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (12 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 13/50] fs: add an I_LRU flag to the inode Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-22 14:55 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 15/50] fs: delete the inode from the LRU list on lookup Josef Bacik
` (36 subsequent siblings)
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Currently we have relied on dirty inodes and inodes with cache on them
to simply be left hanging around on the system outside of an LRU. The
only way to make sure these inodes are eventually reclaimed is because
dirty writeback will grab a reference on the inode and then iput it when
it's done, potentially getting it on the LRU. For the cached case the
page cache deletion path will call inode_add_lru when the inode no
longer has cached pages in order to make sure the inode object can be
freed eventually. In the unmount case we walk all inodes and free them
so this all works out fine.
But we want to eliminate 0 i_count objects as a concept, so we need a
mechanism to hold a reference on these pinned inodes. To that end, add a
list to the super block that contains any inodes that are cached for one
reason or another.
When we call inode_add_lru(), if the inode falls into one of these
categories, we will add it to the cached inode list and hold an
i_obj_count reference. If the inode does not fall into one of these
categories it will be moved to the normal LRU, which is already holds an
i_obj_count reference.
The dirty case we will delete it from the LRU if it is on one, and then
the iput after the writeout will make sure it's placed onto the correct
list at that point.
The page cache case will migrate it when it calls inode_add_lru() when
deleting pages from the page cache.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fs-writeback.c | 8 +++
fs/inode.c | 102 +++++++++++++++++++++++++++++--
fs/internal.h | 1 +
fs/super.c | 3 +
include/linux/fs.h | 11 ++++
include/trace/events/writeback.h | 3 +-
6 files changed, 121 insertions(+), 7 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d2e1fb1a0787..111a9d8215bf 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2736,6 +2736,14 @@ static void wait_sb_inodes(struct super_block *sb)
continue;
}
__iget(inode);
+
+ /*
+ * We could have potentially ended up on the cached LRU list, so
+ * remove ourselves from this list now that we have a reference,
+ * the iput will handle placing it back on the appropriate LRU
+ * list if necessary.
+ */
+ inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
diff --git a/fs/inode.c b/fs/inode.c
index 94769b356224..adcba0a4d776 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -319,6 +319,23 @@ void free_inode_nonrcu(struct inode *inode)
}
EXPORT_SYMBOL(free_inode_nonrcu);
+/*
+ * Some inodes need to stay pinned in memory because they are dirty or there are
+ * cached pages that the VM wants to keep around to avoid thrashing. This does
+ * the appropriate checks to see if we want to sheild this inode from periodic
+ * reclaim. Must be called with ->i_lock held.
+ */
+static bool inode_needs_cached(struct inode *inode)
+{
+ lockdep_assert_held(&inode->i_lock);
+
+ if (inode->i_state & (I_DIRTY_ALL | I_SYNC))
+ return true;
+ if (!mapping_shrinkable(&inode->i_data))
+ return true;
+ return false;
+}
+
static void i_callback(struct rcu_head *head)
{
struct inode *inode = container_of(head, struct inode, i_rcu);
@@ -532,20 +549,67 @@ void ihold(struct inode *inode)
}
EXPORT_SYMBOL(ihold);
+static void inode_add_cached_lru(struct inode *inode)
+{
+ lockdep_assert_held(&inode->i_lock);
+
+ if (inode->i_state & I_CACHED_LRU)
+ return;
+ if (!list_empty(&inode->i_lru))
+ return;
+
+ inode->i_state |= I_CACHED_LRU;
+ spin_lock(&inode->i_sb->s_cached_inodes_lock);
+ list_add(&inode->i_lru, &inode->i_sb->s_cached_inodes);
+ spin_unlock(&inode->i_sb->s_cached_inodes_lock);
+ iobj_get(inode);
+}
+
+static bool __inode_del_cached_lru(struct inode *inode)
+{
+ lockdep_assert_held(&inode->i_lock);
+
+ if (!(inode->i_state & I_CACHED_LRU))
+ return false;
+
+ inode->i_state &= ~I_CACHED_LRU;
+ spin_lock(&inode->i_sb->s_cached_inodes_lock);
+ list_del_init(&inode->i_lru);
+ spin_unlock(&inode->i_sb->s_cached_inodes_lock);
+ return true;
+}
+
+static bool inode_del_cached_lru(struct inode *inode)
+{
+ if (__inode_del_cached_lru(inode)) {
+ iobj_put(inode);
+ return true;
+ }
+ return false;
+}
+
static void __inode_add_lru(struct inode *inode, bool rotate)
{
- if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE))
+ bool need_ref = true;
+
+ lockdep_assert_held(&inode->i_lock);
+
+ if (inode->i_state & (I_FREEING | I_WILL_FREE))
return;
if (atomic_read(&inode->i_count))
return;
if (!(inode->i_sb->s_flags & SB_ACTIVE))
return;
- if (!mapping_shrinkable(&inode->i_data))
+ if (inode_needs_cached(inode)) {
+ inode_add_cached_lru(inode);
return;
+ }
+ need_ref = __inode_del_cached_lru(inode) == false;
if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
- iobj_get(inode);
inode->i_state |= I_LRU;
+ if (need_ref)
+ iobj_get(inode);
this_cpu_inc(nr_unused);
} else if (rotate) {
inode->i_state |= I_REFERENCED;
@@ -573,8 +637,19 @@ void inode_add_lru(struct inode *inode)
__inode_add_lru(inode, false);
}
-static void inode_lru_list_del(struct inode *inode)
+/*
+ * Caller must be holding it's own i_count reference on this inode in order to
+ * prevent this being the final iput.
+ *
+ * Needs inode->i_lock held.
+ */
+void inode_lru_list_del(struct inode *inode)
{
+ lockdep_assert_held(&inode->i_lock);
+
+ if (inode_del_cached_lru(inode))
+ return;
+
if (!(inode->i_state & I_LRU))
return;
@@ -950,6 +1025,22 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
if (!spin_trylock(&inode->i_lock))
return LRU_SKIP;
+ /*
+ * This inode is either dirty or has page cache we want to keep around,
+ * so move it to the cached list.
+ *
+ * We drop the extra i_obj_count reference we grab when adding it to the
+ * cached lru.
+ */
+ if (inode_needs_cached(inode)) {
+ list_lru_isolate(lru, &inode->i_lru);
+ inode_add_cached_lru(inode);
+ iobj_put(inode);
+ spin_unlock(&inode->i_lock);
+ this_cpu_dec(nr_unused);
+ return LRU_REMOVED;
+ }
+
/*
* Inodes can get referenced, redirtied, or repopulated while
* they're already on the LRU, and this can make them
@@ -957,8 +1048,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
* sync, or the last page cache deletion will requeue them.
*/
if (atomic_read(&inode->i_count) ||
- (inode->i_state & ~I_REFERENCED) ||
- !mapping_shrinkable(&inode->i_data)) {
+ (inode->i_state & ~I_REFERENCED)) {
list_lru_isolate(lru, &inode->i_lru);
inode->i_state &= ~I_LRU;
spin_unlock(&inode->i_lock);
diff --git a/fs/internal.h b/fs/internal.h
index 38e8aab27bbd..17ecee7056d5 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -207,6 +207,7 @@ extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
int dentry_needs_remove_privs(struct mnt_idmap *, struct dentry *dentry);
bool in_group_or_capable(struct mnt_idmap *idmap,
const struct inode *inode, vfsgid_t vfsgid);
+void inode_lru_list_del(struct inode *inode);
/*
* fs-writeback.c
diff --git a/fs/super.c b/fs/super.c
index a038848e8d1f..bf3e6d9055af 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -364,6 +364,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
spin_lock_init(&s->s_inode_list_lock);
INIT_LIST_HEAD(&s->s_inodes_wb);
spin_lock_init(&s->s_inode_wblist_lock);
+ INIT_LIST_HEAD(&s->s_cached_inodes);
+ spin_lock_init(&s->s_cached_inodes_lock);
s->s_count = 1;
atomic_set(&s->s_active, 1);
@@ -409,6 +411,7 @@ static void __put_super(struct super_block *s)
WARN_ON(s->s_dentry_lru.node);
WARN_ON(s->s_inode_lru.node);
WARN_ON(!list_empty(&s->s_mounts));
+ WARN_ON(!list_empty(&s->s_cached_inodes));
call_rcu(&s->rcu, destroy_super_rcu);
}
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 509e696a4df0..8384ed81a5ad 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -749,6 +749,9 @@ is_uncached_acl(struct posix_acl *acl)
* ->i_lru is on the LRU and those that are using ->i_lru
* for some other means.
*
+ * I_CACHED_LRU Inode is cached because it is dirty or isn't shrinkable,
+ * and thus is on the s_cached_inode_lru list.
+ *
* Q: What is the difference between I_WILL_FREE and I_FREEING?
*
* __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
@@ -786,6 +789,7 @@ enum inode_state_bits {
INODE_BIT(I_SYNC_QUEUED),
INODE_BIT(I_PINNING_NETFS_WB),
INODE_BIT(I_LRU),
+ INODE_BIT(I_CACHED_LRU),
};
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
@@ -1584,6 +1588,13 @@ struct super_block {
spinlock_t s_inode_wblist_lock;
struct list_head s_inodes_wb; /* writeback inodes */
+
+ /*
+ * Cached inodes, any inodes that their reference is held by another
+ * mechanism, such as dirty inodes or unshrinkable inodes.
+ */
+ spinlock_t s_cached_inodes_lock;
+ struct list_head s_cached_inodes;
} __randomize_layout;
static inline struct user_namespace *i_user_ns(const struct inode *inode)
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 486f85aca84d..6949329c744a 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -29,7 +29,8 @@
{I_SYNC_QUEUED, "I_SYNC_QUEUED"}, \
{I_PINNING_NETFS_WB, "I_PINNING_NETFS_WB"}, \
{I_LRU_ISOLATING, "I_LRU_ISOLATING"}, \
- {I_LRU, "I_LRU"} \
+ {I_LRU, "I_LRU"}, \
+ {I_CACHED_LRU, "I_CACHED_LRU"} \
)
/* enums need to be exported to user space */
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 15/50] fs: delete the inode from the LRU list on lookup
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (13 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 14/50] fs: maintain a list of pinned inodes Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-22 15:27 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 16/50] fs: change evict_inodes to use iput instead of evict directly Josef Bacik
` (35 subsequent siblings)
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
When we move to holding a full reference on the inode when it is on an
LRU list we need to have a mechanism to re-run the LRU add logic. The
use case for this is btrfs's snapshot delete, we will lookup all the
inodes and try to drop them, but if they're on the LRU we will not call
->drop_inode() because their refcount will be elevated, so we won't know
that we need to drop the inode.
Fix this by simply removing the inode from it's respective LRU list when
we grab a reference to it in a way that we have active users. This will
ensure that the logic to add the inode to the LRU or drop the inode will
be run on the final iput from the user.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/inode.c b/fs/inode.c
index adcba0a4d776..72981b890ec6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1146,6 +1146,7 @@ static struct inode *find_inode(struct super_block *sb,
return ERR_PTR(-ESTALE);
}
__iget(inode);
+ inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
return inode;
@@ -1187,6 +1188,7 @@ static struct inode *find_inode_fast(struct super_block *sb,
return ERR_PTR(-ESTALE);
}
__iget(inode);
+ inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
return inode;
@@ -1653,6 +1655,7 @@ struct inode *igrab(struct inode *inode)
spin_lock(&inode->i_lock);
if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
__iget(inode);
+ inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
} else {
spin_unlock(&inode->i_lock);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 16/50] fs: change evict_inodes to use iput instead of evict directly
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (14 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 15/50] fs: delete the inode from the LRU list on lookup Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-25 9:07 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 17/50] fs: hold a full ref while the inode is on a LRU Josef Bacik
` (34 subsequent siblings)
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
At evict_inodes() time, we no longer have SB_ACTIVE set, so we can
easily go through the normal iput path to clear any inodes. Update
dispose_list() to check how we need to free the inode, and then grab a
full reference to the inode while we're looping through the remaining
inodes, and simply iput them at the end.
Since we're just calling iput we don't really care about the i_count on
the inode at the current time. Remove the i_count checks and just call
iput on every inode we find.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 26 +++++++++++---------------
1 file changed, 11 insertions(+), 15 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 72981b890ec6..80ad327746a7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -933,7 +933,7 @@ static void evict(struct inode *inode)
* Dispose-list gets a local list with local inodes in it, so it doesn't
* need to worry about list corruption and SMP locks.
*/
-static void dispose_list(struct list_head *head)
+static void dispose_list(struct list_head *head, bool for_lru)
{
while (!list_empty(head)) {
struct inode *inode;
@@ -941,8 +941,12 @@ static void dispose_list(struct list_head *head)
inode = list_first_entry(head, struct inode, i_lru);
list_del_init(&inode->i_lru);
- evict(inode);
- iobj_put(inode);
+ if (for_lru) {
+ evict(inode);
+ iobj_put(inode);
+ } else {
+ iput(inode);
+ }
cond_resched();
}
}
@@ -964,21 +968,13 @@ void evict_inodes(struct super_block *sb)
again:
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- if (atomic_read(&inode->i_count))
- continue;
-
spin_lock(&inode->i_lock);
- if (atomic_read(&inode->i_count)) {
- spin_unlock(&inode->i_lock);
- continue;
- }
if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
spin_unlock(&inode->i_lock);
continue;
}
- inode->i_state |= I_FREEING;
- iobj_get(inode);
+ __iget(inode);
inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
list_add(&inode->i_lru, &dispose);
@@ -991,13 +987,13 @@ void evict_inodes(struct super_block *sb)
if (need_resched()) {
spin_unlock(&sb->s_inode_list_lock);
cond_resched();
- dispose_list(&dispose);
+ dispose_list(&dispose, false);
goto again;
}
}
spin_unlock(&sb->s_inode_list_lock);
- dispose_list(&dispose);
+ dispose_list(&dispose, false);
}
EXPORT_SYMBOL_GPL(evict_inodes);
@@ -1108,7 +1104,7 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
inode_lru_isolate, &freeable);
- dispose_list(&freeable);
+ dispose_list(&freeable, true);
return freed;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 17/50] fs: hold a full ref while the inode is on a LRU
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (15 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 16/50] fs: change evict_inodes to use iput instead of evict directly Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-25 9:20 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 18/50] fs: disallow 0 reference count inodes Josef Bacik
` (33 subsequent siblings)
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
We want to eliminate 0 refcount inodes that can be used. To that end,
make the LRU's hold a full reference on the inode while it is on an LRU
list. From there we can change the eviction code to always just iput the
inode, and the LRU operations will just add or drop a full reference
where appropriate.
We also now must take into account unlink, and drop our LRU reference
when we go to an nlink of 0. We will also avoid adding inodes with a
nlink of 0 as they can be reclaimed immediately.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 105 +++++++++++++++++++++++++++++------------------------
1 file changed, 57 insertions(+), 48 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 80ad327746a7..de0ec791f9a3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -434,8 +434,18 @@ void drop_nlink(struct inode *inode)
{
WARN_ON(inode->i_nlink == 0);
inode->__i_nlink--;
- if (!inode->i_nlink)
+ if (!inode->i_nlink) {
+ /*
+ * LRU's hold a full ref on the inode, but if we've unlinked it
+ * then we want the inode to be freed when the last user goes,
+ * so delete the inode from the LRU list.
+ */
+ spin_lock(&inode->i_lock);
+ inode_lru_list_del(inode);
+ spin_unlock(&inode->i_lock);
+
atomic_long_inc(&inode->i_sb->s_remove_count);
+ }
}
EXPORT_SYMBOL(drop_nlink);
@@ -451,6 +461,12 @@ void clear_nlink(struct inode *inode)
{
if (inode->i_nlink) {
inode->__i_nlink = 0;
+
+ /* See comment in drop_nlink(). */
+ spin_lock(&inode->i_lock);
+ inode_lru_list_del(inode);
+ spin_unlock(&inode->i_lock);
+
atomic_long_inc(&inode->i_sb->s_remove_count);
}
}
@@ -555,6 +571,8 @@ static void inode_add_cached_lru(struct inode *inode)
if (inode->i_state & I_CACHED_LRU)
return;
+ if (inode->__i_nlink == 0)
+ return;
if (!list_empty(&inode->i_lru))
return;
@@ -562,7 +580,7 @@ static void inode_add_cached_lru(struct inode *inode)
spin_lock(&inode->i_sb->s_cached_inodes_lock);
list_add(&inode->i_lru, &inode->i_sb->s_cached_inodes);
spin_unlock(&inode->i_sb->s_cached_inodes_lock);
- iobj_get(inode);
+ __iget(inode);
}
static bool __inode_del_cached_lru(struct inode *inode)
@@ -582,7 +600,7 @@ static bool __inode_del_cached_lru(struct inode *inode)
static bool inode_del_cached_lru(struct inode *inode)
{
if (__inode_del_cached_lru(inode)) {
- iobj_put(inode);
+ iput(inode);
return true;
}
return false;
@@ -598,6 +616,8 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
return;
if (atomic_read(&inode->i_count))
return;
+ if (inode->__i_nlink == 0)
+ return;
if (!(inode->i_sb->s_flags & SB_ACTIVE))
return;
if (inode_needs_cached(inode)) {
@@ -609,7 +629,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
inode->i_state |= I_LRU;
if (need_ref)
- iobj_get(inode);
+ __iget(inode);
this_cpu_inc(nr_unused);
} else if (rotate) {
inode->i_state |= I_REFERENCED;
@@ -655,7 +675,7 @@ void inode_lru_list_del(struct inode *inode)
if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
inode->i_state &= ~I_LRU;
- iobj_put(inode);
+ iput(inode);
this_cpu_dec(nr_unused);
}
}
@@ -926,6 +946,7 @@ static void evict(struct inode *inode)
BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
}
+static void iput_evict(struct inode *inode);
/*
* dispose_list - dispose of the contents of a local list
* @head: the head of the list to free
@@ -933,20 +954,14 @@ static void evict(struct inode *inode)
* Dispose-list gets a local list with local inodes in it, so it doesn't
* need to worry about list corruption and SMP locks.
*/
-static void dispose_list(struct list_head *head, bool for_lru)
+static void dispose_list(struct list_head *head)
{
while (!list_empty(head)) {
struct inode *inode;
inode = list_first_entry(head, struct inode, i_lru);
list_del_init(&inode->i_lru);
-
- if (for_lru) {
- evict(inode);
- iobj_put(inode);
- } else {
- iput(inode);
- }
+ iput_evict(inode);
cond_resched();
}
}
@@ -987,13 +1002,13 @@ void evict_inodes(struct super_block *sb)
if (need_resched()) {
spin_unlock(&sb->s_inode_list_lock);
cond_resched();
- dispose_list(&dispose, false);
+ dispose_list(&dispose);
goto again;
}
}
spin_unlock(&sb->s_inode_list_lock);
- dispose_list(&dispose, false);
+ dispose_list(&dispose);
}
EXPORT_SYMBOL_GPL(evict_inodes);
@@ -1031,22 +1046,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
if (inode_needs_cached(inode)) {
list_lru_isolate(lru, &inode->i_lru);
inode_add_cached_lru(inode);
- iobj_put(inode);
- spin_unlock(&inode->i_lock);
- this_cpu_dec(nr_unused);
- return LRU_REMOVED;
- }
-
- /*
- * Inodes can get referenced, redirtied, or repopulated while
- * they're already on the LRU, and this can make them
- * unreclaimable for a while. Remove them lazily here; iput,
- * sync, or the last page cache deletion will requeue them.
- */
- if (atomic_read(&inode->i_count) ||
- (inode->i_state & ~I_REFERENCED)) {
- list_lru_isolate(lru, &inode->i_lru);
- inode->i_state &= ~I_LRU;
+ iput(inode);
spin_unlock(&inode->i_lock);
this_cpu_dec(nr_unused);
return LRU_REMOVED;
@@ -1082,7 +1082,6 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
}
WARN_ON(inode->i_state & I_NEW);
- inode->i_state |= I_FREEING;
inode->i_state &= ~I_LRU;
list_lru_isolate_move(lru, &inode->i_lru, freeable);
spin_unlock(&inode->i_lock);
@@ -1104,7 +1103,7 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
inode_lru_isolate, &freeable);
- dispose_list(&freeable, true);
+ dispose_list(&freeable);
return freed;
}
@@ -1967,7 +1966,7 @@ EXPORT_SYMBOL(generic_delete_inode);
* in cache if fs is alive, sync and evict if fs is
* shutting down.
*/
-static void iput_final(struct inode *inode)
+static void iput_final(struct inode *inode, bool skip_lru)
{
struct super_block *sb = inode->i_sb;
const struct super_operations *op = inode->i_sb->s_op;
@@ -1981,7 +1980,7 @@ static void iput_final(struct inode *inode)
else
drop = generic_drop_inode(inode);
- if (!drop &&
+ if (!drop && !skip_lru &&
!(inode->i_state & I_DONTCACHE) &&
(sb->s_flags & SB_ACTIVE)) {
__inode_add_lru(inode, true);
@@ -1989,6 +1988,8 @@ static void iput_final(struct inode *inode)
return;
}
+ WARN_ON(!list_empty(&inode->i_lru));
+
state = inode->i_state;
if (!drop) {
WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
@@ -2003,23 +2004,12 @@ static void iput_final(struct inode *inode)
}
WRITE_ONCE(inode->i_state, state | I_FREEING);
- if (!list_empty(&inode->i_lru))
- inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
evict(inode);
}
-/**
- * iput - put an inode
- * @inode: inode to put
- *
- * Puts an inode, dropping its usage count. If the inode use count hits
- * zero, the inode is then freed and may also be destroyed.
- *
- * Consequently, iput() can sleep.
- */
-void iput(struct inode *inode)
+static void __iput(struct inode *inode, bool skip_lru)
{
if (!inode)
return;
@@ -2037,12 +2027,31 @@ void iput(struct inode *inode)
spin_lock(&inode->i_lock);
if (atomic_dec_and_test(&inode->i_count))
- iput_final(inode);
+ iput_final(inode, skip_lru);
else
spin_unlock(&inode->i_lock);
iobj_put(inode);
}
+
+static void iput_evict(struct inode *inode)
+{
+ __iput(inode, true);
+}
+
+/**
+ * iput - put an inode
+ * @inode: inode to put
+ *
+ * Puts an inode, dropping its usage count. If the inode use count hits
+ * zero, the inode is then freed and may also be destroyed.
+ *
+ * Consequently, iput() can sleep.
+ */
+void iput(struct inode *inode)
+{
+ __iput(inode, false);
+}
EXPORT_SYMBOL(iput);
/**
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 18/50] fs: disallow 0 reference count inodes
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (16 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 17/50] fs: hold a full ref while the inode is on a LRU Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-25 10:54 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 19/50] fs: make evict_inodes add to the dispose list under the i_lock Josef Bacik
` (32 subsequent siblings)
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Now that we take a full reference for inodes on the LRU, move the logic
to add the inode to the LRU to before we drop our last reference. This
allows us to ensure that if the inode has a reference count it can be
used, and we no longer hold onto inodes that have a 0 reference count.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 53 +++++++++++++++++++++++++++++++++--------------------
1 file changed, 33 insertions(+), 20 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index de0ec791f9a3..b4145ddbaf8e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -614,7 +614,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
if (inode->i_state & (I_FREEING | I_WILL_FREE))
return;
- if (atomic_read(&inode->i_count))
+ if (atomic_read(&inode->i_count) != 1)
return;
if (inode->__i_nlink == 0)
return;
@@ -1966,28 +1966,11 @@ EXPORT_SYMBOL(generic_delete_inode);
* in cache if fs is alive, sync and evict if fs is
* shutting down.
*/
-static void iput_final(struct inode *inode, bool skip_lru)
+static void iput_final(struct inode *inode, bool drop)
{
- struct super_block *sb = inode->i_sb;
- const struct super_operations *op = inode->i_sb->s_op;
unsigned long state;
- int drop;
WARN_ON(inode->i_state & I_NEW);
-
- if (op->drop_inode)
- drop = op->drop_inode(inode);
- else
- drop = generic_drop_inode(inode);
-
- if (!drop && !skip_lru &&
- !(inode->i_state & I_DONTCACHE) &&
- (sb->s_flags & SB_ACTIVE)) {
- __inode_add_lru(inode, true);
- spin_unlock(&inode->i_lock);
- return;
- }
-
WARN_ON(!list_empty(&inode->i_lru));
state = inode->i_state;
@@ -2009,8 +1992,29 @@ static void iput_final(struct inode *inode, bool skip_lru)
evict(inode);
}
+static bool maybe_add_lru(struct inode *inode, bool skip_lru)
+{
+ const struct super_operations *op = inode->i_sb->s_op;
+ struct super_block *sb = inode->i_sb;
+ bool drop = false;
+
+ if (op->drop_inode)
+ drop = op->drop_inode(inode);
+ else
+ drop = generic_drop_inode(inode);
+
+ if (!drop && !skip_lru &&
+ !(inode->i_state & I_DONTCACHE) &&
+ (sb->s_flags & SB_ACTIVE))
+ __inode_add_lru(inode, true);
+
+ return drop;
+}
+
static void __iput(struct inode *inode, bool skip_lru)
{
+ bool drop;
+
if (!inode)
return;
BUG_ON(inode->i_state & I_CLEAR);
@@ -2026,8 +2030,17 @@ static void __iput(struct inode *inode, bool skip_lru)
}
spin_lock(&inode->i_lock);
+
+ /*
+ * If we want to keep the inode around on an LRU we will grab a ref to
+ * the inode when we add it to the LRU list, so we can safely drop the
+ * callers reference after this. If we didn't add the inode to the LRU
+ * then the refcount will still be 1 and we can do the final iput.
+ */
+ drop = maybe_add_lru(inode, skip_lru);
+
if (atomic_dec_and_test(&inode->i_count))
- iput_final(inode, skip_lru);
+ iput_final(inode, drop);
else
spin_unlock(&inode->i_lock);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 19/50] fs: make evict_inodes add to the dispose list under the i_lock
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (17 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 18/50] fs: disallow 0 reference count inodes Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 20/50] fs: convert i_count to refcount_t Josef Bacik
` (31 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
In the future when we only serialize the freeing of the inode on the
reference count we could potentially be relying on ->i_lru to be
consistent, which means we need it to be consistent under the ->i_lock.
Move the list_add in evict_inodes() to under the ->i_lock to prevent
potential races where we think the inode isn't on a list but is going to
be added to the private dispose list.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/inode.c b/fs/inode.c
index b4145ddbaf8e..07c8edb4b58a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -991,8 +991,8 @@ void evict_inodes(struct super_block *sb)
__iget(inode);
inode_lru_list_del(inode);
- spin_unlock(&inode->i_lock);
list_add(&inode->i_lru, &dispose);
+ spin_unlock(&inode->i_lock);
/*
* We can have a ton of inodes to evict at unmount time given
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 20/50] fs: convert i_count to refcount_t
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (18 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 19/50] fs: make evict_inodes add to the dispose list under the i_lock Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-22 12:10 ` Amir Goldstein
` (2 more replies)
2025-08-21 20:18 ` [PATCH 21/50] fs: use refcount_inc_not_zero in igrab Josef Bacik
` (30 subsequent siblings)
50 siblings, 3 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Now that we do not allow i_count to drop to 0 and be used we can convert
it to a refcount_t and benefit from the protections those helpers add.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
arch/powerpc/platforms/cell/spufs/file.c | 2 +-
fs/btrfs/inode.c | 4 ++--
fs/ceph/mds_client.c | 2 +-
fs/ext4/ialloc.c | 4 ++--
fs/fs-writeback.c | 2 +-
fs/hpfs/inode.c | 2 +-
fs/inode.c | 11 ++++++-----
fs/nfs/inode.c | 4 ++--
fs/notify/fsnotify.c | 2 +-
fs/ubifs/super.c | 2 +-
fs/xfs/xfs_inode.c | 2 +-
fs/xfs/xfs_trace.h | 2 +-
include/linux/fs.h | 4 ++--
include/trace/events/filelock.h | 2 +-
security/landlock/fs.c | 2 +-
15 files changed, 24 insertions(+), 23 deletions(-)
diff --git a/arch/powerpc/platforms/cell/spufs/file.c b/arch/powerpc/platforms/cell/spufs/file.c
index d5a2c77bc908..3f768b003838 100644
--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -1430,7 +1430,7 @@ static int spufs_mfc_open(struct inode *inode, struct file *file)
if (ctx->owner != current->mm)
return -EINVAL;
- if (atomic_read(&inode->i_count) != 1)
+ if (refcount_read(&inode->i_count) != 1)
return -EBUSY;
mutex_lock(&ctx->mapping_lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bbbcd96e8f5c..e85e38df3ea0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3418,7 +3418,7 @@ void btrfs_add_delayed_iput(struct btrfs_inode *inode)
struct btrfs_fs_info *fs_info = inode->root->fs_info;
unsigned long flags;
- if (atomic_add_unless(&inode->vfs_inode.i_count, -1, 1)) {
+ if (refcount_dec_not_one(&inode->vfs_inode.i_count)) {
iobj_put(&inode->vfs_inode);
return;
}
@@ -4559,7 +4559,7 @@ static void btrfs_prune_dentries(struct btrfs_root *root)
inode = btrfs_find_first_inode(root, min_ino);
while (inode) {
- if (atomic_read(&inode->vfs_inode.i_count) > 1)
+ if (refcount_read(&inode->vfs_inode.i_count) > 1)
d_prune_aliases(&inode->vfs_inode);
min_ino = btrfs_ino(inode) + 1;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 0f497c39ff82..ff666d18f6ad 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2221,7 +2221,7 @@ static int trim_caps_cb(struct inode *inode, int mds, void *arg)
int count;
dput(dentry);
d_prune_aliases(inode);
- count = atomic_read(&inode->i_count);
+ count = refcount_read(&inode->i_count);
if (count == 1)
(*remaining)--;
doutc(cl, "%p %llx.%llx cap %p pruned, count now %d\n",
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index df4051613b29..9a3c7f22a57e 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -252,10 +252,10 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
"nonexistent device\n", __func__, __LINE__);
return;
}
- if (atomic_read(&inode->i_count) > 1) {
+ if (refcount_read(&inode->i_count) > 1) {
ext4_msg(sb, KERN_ERR, "%s:%d: inode #%lu: count=%d",
__func__, __LINE__, inode->i_ino,
- atomic_read(&inode->i_count));
+ refcount_read(&inode->i_count));
return;
}
if (inode->i_nlink) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 111a9d8215bf..789c4228412c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1796,7 +1796,7 @@ static int writeback_single_inode(struct inode *inode,
int ret = 0;
spin_lock(&inode->i_lock);
- if (!atomic_read(&inode->i_count))
+ if (!refcount_read(&inode->i_count))
WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
else
WARN_ON(inode->i_state & I_WILL_FREE);
diff --git a/fs/hpfs/inode.c b/fs/hpfs/inode.c
index a59e8fa630db..ee23a941d8f5 100644
--- a/fs/hpfs/inode.c
+++ b/fs/hpfs/inode.c
@@ -184,7 +184,7 @@ void hpfs_write_inode(struct inode *i)
struct hpfs_inode_info *hpfs_inode = hpfs_i(i);
struct inode *parent;
if (i->i_ino == hpfs_sb(i->i_sb)->sb_root) return;
- if (hpfs_inode->i_rddir_off && !atomic_read(&i->i_count)) {
+ if (hpfs_inode->i_rddir_off && !refcount_read(&i->i_count)) {
if (*hpfs_inode->i_rddir_off)
pr_err("write_inode: some position still there\n");
kfree(hpfs_inode->i_rddir_off);
diff --git a/fs/inode.c b/fs/inode.c
index 07c8edb4b58a..28d197731914 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -236,7 +236,7 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
inode->i_state = 0;
atomic64_set(&inode->i_sequence, 0);
refcount_set(&inode->i_obj_count, 1);
- atomic_set(&inode->i_count, 1);
+ refcount_set(&inode->i_count, 1);
inode->i_op = &empty_iops;
inode->i_fop = &no_open_fops;
inode->i_ino = 0;
@@ -561,7 +561,8 @@ static void init_once(void *foo)
void ihold(struct inode *inode)
{
iobj_get(inode);
- WARN_ON(atomic_inc_return(&inode->i_count) < 2);
+ refcount_inc(&inode->i_count);
+ WARN_ON(refcount_read(&inode->i_count) < 2);
}
EXPORT_SYMBOL(ihold);
@@ -614,7 +615,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
if (inode->i_state & (I_FREEING | I_WILL_FREE))
return;
- if (atomic_read(&inode->i_count) != 1)
+ if (refcount_read(&inode->i_count) != 1)
return;
if (inode->__i_nlink == 0)
return;
@@ -2019,7 +2020,7 @@ static void __iput(struct inode *inode, bool skip_lru)
return;
BUG_ON(inode->i_state & I_CLEAR);
- if (atomic_add_unless(&inode->i_count, -1, 1)) {
+ if (refcount_dec_not_one(&inode->i_count)) {
iobj_put(inode);
return;
}
@@ -2039,7 +2040,7 @@ static void __iput(struct inode *inode, bool skip_lru)
*/
drop = maybe_add_lru(inode, skip_lru);
- if (atomic_dec_and_test(&inode->i_count))
+ if (refcount_dec_and_test(&inode->i_count))
iput_final(inode, drop);
else
spin_unlock(&inode->i_lock);
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 338ef77ae423..9cc84f0afa9a 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -608,7 +608,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
inode->i_sb->s_id,
(unsigned long long)NFS_FILEID(inode),
nfs_display_fhandle_hash(fh),
- atomic_read(&inode->i_count));
+ refcount_read(&inode->i_count));
out:
return inode;
@@ -2229,7 +2229,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
dfprintk(VFS, "NFS: %s(%s/%lu fh_crc=0x%08x ct=%d info=0x%llx)\n",
__func__, inode->i_sb->s_id, inode->i_ino,
nfs_display_fhandle_hash(NFS_FH(inode)),
- atomic_read(&inode->i_count), fattr->valid);
+ refcount_read(&inode->i_count), fattr->valid);
if (!(fattr->valid & NFS_ATTR_FATTR_FILEID)) {
/* Only a mounted-on-fileid? Just exit */
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 079b868552c2..0883696f873d 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -66,7 +66,7 @@ static void fsnotify_unmount_inodes(struct super_block *sb)
* removed all zero refcount inodes, in any case. Test to
* be sure.
*/
- if (!atomic_read(&inode->i_count)) {
+ if (!refcount_read(&inode->i_count)) {
spin_unlock(&inode->i_lock);
continue;
}
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index f3e3b2068608..79526f71fa8a 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -358,7 +358,7 @@ static void ubifs_evict_inode(struct inode *inode)
goto out;
dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
- ubifs_assert(c, !atomic_read(&inode->i_count));
+ ubifs_assert(c, !refcount_read(&inode->i_count));
truncate_inode_pages_final(&inode->i_data);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 9c39251961a3..06af749fe5f3 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1035,7 +1035,7 @@ xfs_itruncate_extents_flags(
int error = 0;
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
- if (atomic_read(&VFS_I(ip)->i_count))
+ if (refcount_read(&VFS_I(ip)->i_count))
xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
ASSERT(new_size <= XFS_ISIZE(ip));
ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index ac344e42846c..167d33b8095c 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1152,7 +1152,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
TP_fast_assign(
__entry->dev = VFS_I(ip)->i_sb->s_dev;
__entry->ino = ip->i_ino;
- __entry->count = atomic_read(&VFS_I(ip)->i_count);
+ __entry->count = refcount_read(&VFS_I(ip)->i_count);
__entry->pincount = atomic_read(&ip->i_pincount);
__entry->iflags = ip->i_flags;
__entry->caller_ip = caller_ip;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8384ed81a5ad..34fb40ba8a94 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -880,7 +880,7 @@ struct inode {
};
atomic64_t i_version;
atomic64_t i_sequence; /* see futex */
- atomic_t i_count;
+ refcount_t i_count;
atomic_t i_dio_count;
atomic_t i_writecount;
#if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
@@ -3399,7 +3399,7 @@ static inline void iobj_get(struct inode *inode)
static inline void __iget(struct inode *inode)
{
iobj_get(inode);
- atomic_inc(&inode->i_count);
+ refcount_inc(&inode->i_count);
}
extern void iget_failed(struct inode *);
diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h
index b8d1e00a7982..e745436cfcd2 100644
--- a/include/trace/events/filelock.h
+++ b/include/trace/events/filelock.h
@@ -189,7 +189,7 @@ TRACE_EVENT(generic_add_lease,
__entry->i_ino = inode->i_ino;
__entry->wcount = atomic_read(&inode->i_writecount);
__entry->rcount = atomic_read(&inode->i_readcount);
- __entry->icount = atomic_read(&inode->i_count);
+ __entry->icount = refcount_read(&inode->i_count);
__entry->owner = fl->c.flc_owner;
__entry->flags = fl->c.flc_flags;
__entry->type = fl->c.flc_type;
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index c04f8879ad03..570f851dc469 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -1281,7 +1281,7 @@ static void hook_sb_delete(struct super_block *const sb)
struct landlock_object *object;
/* Only handles referenced inodes. */
- if (!atomic_read(&inode->i_count))
+ if (!refcount_read(&inode->i_count))
continue;
/*
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 21/50] fs: use refcount_inc_not_zero in igrab
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (19 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 20/50] fs: convert i_count to refcount_t Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-25 11:21 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 22/50] fs: use inode_tryget in find_inode* Josef Bacik
` (29 subsequent siblings)
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
We are going to use igrab everywhere we want to acquire a live inode.
Update it to do a refcount_inc_not_zero on the i_count, and if
successful grab an reference to i_obj_count. Add a comment explaining
why we do this and the safety.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 26 +++++++++++++-------------
include/linux/fs.h | 27 +++++++++++++++++++++++++++
2 files changed, 40 insertions(+), 13 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 28d197731914..b9122c1eee1d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1648,20 +1648,20 @@ EXPORT_SYMBOL(iunique);
struct inode *igrab(struct inode *inode)
{
+ lockdep_assert_not_held(&inode->i_lock);
+
+ inode = inode_tryget(inode);
+ if (!inode)
+ return NULL;
+
+ /*
+ * If this inode is on the LRU, take it off so that we can re-run the
+ * LRU logic on the next iput().
+ */
spin_lock(&inode->i_lock);
- if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
- __iget(inode);
- inode_lru_list_del(inode);
- spin_unlock(&inode->i_lock);
- } else {
- spin_unlock(&inode->i_lock);
- /*
- * Handle the case where s_op->clear_inode is not been
- * called yet, and somebody is calling igrab
- * while the inode is getting freed.
- */
- inode = NULL;
- }
+ inode_lru_list_del(inode);
+ spin_unlock(&inode->i_lock);
+
return inode;
}
EXPORT_SYMBOL(igrab);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 34fb40ba8a94..b731224708be 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3393,6 +3393,33 @@ static inline void iobj_get(struct inode *inode)
refcount_inc(&inode->i_obj_count);
}
+static inline struct inode *inode_tryget(struct inode *inode)
+{
+ /*
+ * We are using inode_tryget() because we're interested in getting a
+ * live reference to the inode, which is ->i_count. Normally we would
+ * grab i_obj_count first, as it is the highe priority reference.
+ * However we're only interested in making sure we have a live inode,
+ * and we know that if we get a reference for i_count then we can safely
+ * acquire i_obj_count because we always drop i_obj_count after dropping
+ * an i_count reference.
+ *
+ * This is meant to be used either in a place where we have an existing
+ * i_obj_count reference on the inode, or under rcu_read_lock() so we
+ * know we're safe in accessing this inode still.
+ */
+ if (!refcount_inc_not_zero(&inode->i_count)) {
+ /*
+ * If we failed to increment the reference count, then the
+ * inode is being freed or has been freed. We return NULL
+ * in this case.
+ */
+ return NULL;
+ }
+ iobj_get(inode);
+ return inode;
+}
+
/*
* inode->i_lock must be held
*/
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 22/50] fs: use inode_tryget in find_inode*
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (20 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 21/50] fs: use refcount_inc_not_zero in igrab Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-25 11:26 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 23/50] fs: update find_inode_*rcu to check the i_count count Josef Bacik
` (28 subsequent siblings)
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Now that we never drop the i_count to 0 for valid objects, rework the
logic in the find_inode* helpers to use inode_tryget() to see if they
have a live inode. If this fails we can wait for the inode to be freed
as we know it's currently being evicted.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index b9122c1eee1d..893ac902268b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1109,6 +1109,7 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
}
static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked);
+
/*
* Called with the inode lock held.
*/
@@ -1132,16 +1133,15 @@ static struct inode *find_inode(struct super_block *sb,
if (!test(inode, data))
continue;
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
- __wait_on_freeing_inode(inode, is_inode_hash_locked);
- goto repeat;
- }
if (unlikely(inode->i_state & I_CREATING)) {
spin_unlock(&inode->i_lock);
rcu_read_unlock();
return ERR_PTR(-ESTALE);
}
- __iget(inode);
+ if (!inode_tryget(inode)) {
+ __wait_on_freeing_inode(inode, is_inode_hash_locked);
+ goto repeat;
+ }
inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
@@ -1174,16 +1174,15 @@ static struct inode *find_inode_fast(struct super_block *sb,
if (inode->i_sb != sb)
continue;
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
- __wait_on_freeing_inode(inode, is_inode_hash_locked);
- goto repeat;
- }
if (unlikely(inode->i_state & I_CREATING)) {
spin_unlock(&inode->i_lock);
rcu_read_unlock();
return ERR_PTR(-ESTALE);
}
- __iget(inode);
+ if (!inode_tryget(inode)) {
+ __wait_on_freeing_inode(inode, is_inode_hash_locked);
+ goto repeat;
+ }
inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 23/50] fs: update find_inode_*rcu to check the i_count count
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (21 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 22/50] fs: use inode_tryget in find_inode* Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-25 11:27 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 24/50] fs: use igrab in insert_inode_locked Josef Bacik
` (27 subsequent siblings)
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
These two helpers are always used under the RCU and don't appear to mind
if the inode state changes in between time of check and time of use.
Update them to use the i_count refcount instead of I_WILL_FREE or
I_FREEING.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 893ac902268b..63ccd32fa221 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1839,7 +1839,7 @@ struct inode *find_inode_rcu(struct super_block *sb, unsigned long hashval,
hlist_for_each_entry_rcu(inode, head, i_hash) {
if (inode->i_sb == sb &&
- !(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)) &&
+ refcount_read(&inode->i_count) > 0 &&
test(inode, data))
return inode;
}
@@ -1878,8 +1878,8 @@ struct inode *find_inode_by_ino_rcu(struct super_block *sb,
hlist_for_each_entry_rcu(inode, head, i_hash) {
if (inode->i_ino == ino &&
inode->i_sb == sb &&
- !(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)))
- return inode;
+ refcount_read(&inode->i_count) > 0)
+ return inode;
}
return NULL;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 24/50] fs: use igrab in insert_inode_locked
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (22 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 23/50] fs: update find_inode_*rcu to check the i_count count Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 25/50] fs: remove I_WILL_FREE|I_FREEING check from __inode_add_lru Josef Bacik
` (26 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Follow the same pattern in find_inode*. Instead of checking for
I_WILL_FREE|I_FREEING simply call igrab() and if it succeeds we're done.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 63ccd32fa221..6b772b9883ec 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1899,11 +1899,8 @@ int insert_inode_locked(struct inode *inode)
continue;
if (old->i_sb != sb)
continue;
- spin_lock(&old->i_lock);
- if (old->i_state & (I_FREEING|I_WILL_FREE)) {
- spin_unlock(&old->i_lock);
+ if (!igrab(old))
continue;
- }
break;
}
if (likely(!old)) {
@@ -1915,12 +1912,13 @@ int insert_inode_locked(struct inode *inode)
spin_unlock(&inode_hash_lock);
return 0;
}
+ spin_lock(&old->i_lock);
if (unlikely(old->i_state & I_CREATING)) {
spin_unlock(&old->i_lock);
spin_unlock(&inode_hash_lock);
+ iput(old);
return -EBUSY;
}
- __iget(old);
spin_unlock(&old->i_lock);
spin_unlock(&inode_hash_lock);
wait_on_inode(old);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 25/50] fs: remove I_WILL_FREE|I_FREEING check from __inode_add_lru
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (23 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 24/50] fs: use igrab in insert_inode_locked Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 26/50] fs: remove I_WILL_FREE|I_FREEING check in inode_pin_lru_isolating Josef Bacik
` (25 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
We only want to add to the LRU if the current caller is potentially the
last one dropping a reference, so if our refcount is 0 we're being
deleted, and if the refcount is > 1 then there is another ref holder and
they can add the inode to the LRU list.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 6b772b9883ec..c61400f39876 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -613,8 +613,6 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
lockdep_assert_held(&inode->i_lock);
- if (inode->i_state & (I_FREEING | I_WILL_FREE))
- return;
if (refcount_read(&inode->i_count) != 1)
return;
if (inode->__i_nlink == 0)
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 26/50] fs: remove I_WILL_FREE|I_FREEING check in inode_pin_lru_isolating
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (24 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 25/50] fs: remove I_WILL_FREE|I_FREEING check from __inode_add_lru Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 27/50] fs: use inode_tryget in evict_inodes Josef Bacik
` (24 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
If the inode is on the LRU list then it has a valid reference and we do
not need to check for these flags.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/inode.c b/fs/inode.c
index c61400f39876..a14b3a54c4b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -682,7 +682,7 @@ void inode_lru_list_del(struct inode *inode)
static void inode_pin_lru_isolating(struct inode *inode)
{
lockdep_assert_held(&inode->i_lock);
- WARN_ON(inode->i_state & (I_LRU_ISOLATING | I_FREEING | I_WILL_FREE));
+ WARN_ON(inode->i_state & I_LRU_ISOLATING);
inode->i_state |= I_LRU_ISOLATING;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 27/50] fs: use inode_tryget in evict_inodes
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (25 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 26/50] fs: remove I_WILL_FREE|I_FREEING check in inode_pin_lru_isolating Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-25 11:43 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 28/50] fs: change evict_dentries_for_decrypted_inodes to use refcount Josef Bacik
` (23 subsequent siblings)
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Instead of checking I_WILL_FREE|I_FREEING we can simply use
inode_tryget() to determine if we have a live inode that can be evicted.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index a14b3a54c4b5..4e1eeb0c3889 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -983,12 +983,16 @@ void evict_inodes(struct super_block *sb)
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+ if (inode->i_state & I_NEW) {
+ spin_unlock(&inode->i_lock);
+ continue;
+ }
+
+ if (!inode_tryget(inode)) {
spin_unlock(&inode->i_lock);
continue;
}
- __iget(inode);
inode_lru_list_del(inode);
list_add(&inode->i_lru, &dispose);
spin_unlock(&inode->i_lock);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 28/50] fs: change evict_dentries_for_decrypted_inodes to use refcount
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (26 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 27/50] fs: use inode_tryget in evict_inodes Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 29/50] block: use igrab in sync_bdevs Josef Bacik
` (22 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Instead of checking for I_WILL_FREE|I_FREEING simply use the refcount to
make sure we have a live inode.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/crypto/keyring.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/crypto/keyring.c b/fs/crypto/keyring.c
index 7557f6a88b8f..969db498149a 100644
--- a/fs/crypto/keyring.c
+++ b/fs/crypto/keyring.c
@@ -956,13 +956,16 @@ static void evict_dentries_for_decrypted_inodes(struct fscrypt_master_key *mk)
list_for_each_entry(ci, &mk->mk_decrypted_inodes, ci_master_key_link) {
inode = ci->ci_inode;
+
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
+ if (inode->i_state & I_NEW) {
spin_unlock(&inode->i_lock);
continue;
}
- __iget(inode);
spin_unlock(&inode->i_lock);
+
+ if (!igrab(inode))
+ continue;
spin_unlock(&mk->mk_decrypted_inodes_lock);
shrink_dcache_inode(inode);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 29/50] block: use igrab in sync_bdevs
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (27 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 28/50] fs: change evict_dentries_for_decrypted_inodes to use refcount Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 30/50] bcachefs: use the refcount instead of I_WILL_FREE|I_FREEING Josef Bacik
` (21 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Instead of checking I_WILL_FREE or I_FREEING simply grab a reference to
the inode, as it will only succeed if the inode is live.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
block/bdev.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index b77ddd12dc06..94ffc0b5a68c 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1265,13 +1265,15 @@ void sync_bdevs(bool wait)
struct block_device *bdev;
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
- mapping->nrpages == 0) {
+ if (inode->i_state & I_NEW || mapping->nrpages == 0) {
spin_unlock(&inode->i_lock);
continue;
}
- __iget(inode);
spin_unlock(&inode->i_lock);
+
+ if (!igrab(inode))
+ continue;
+
spin_unlock(&blockdev_superblock->s_inode_list_lock);
/*
* We hold a reference to 'inode' so it couldn't have been
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 30/50] bcachefs: use the refcount instead of I_WILL_FREE|I_FREEING
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (28 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 29/50] block: use igrab in sync_bdevs Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 31/50] btrfs: don't check I_WILL_FREE|I_FREEING Josef Bacik
` (20 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
We can use the refcount to decide if the inode is alive instead of these
flags.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/bcachefs/fs.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index 687af0eea0c2..de1c3c740a9d 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -347,7 +347,7 @@ static struct bch_inode_info *bch2_inode_hash_find(struct bch_fs *c, struct btre
spin_unlock(&inode->v.i_lock);
return NULL;
}
- if ((inode->v.i_state & (I_FREEING|I_WILL_FREE))) {
+ if (!refcount_read(&inode->v.i_count)) {
if (!trans) {
__wait_on_freeing_inode(c, inode, inum);
} else {
@@ -2225,7 +2225,6 @@ void bch2_evict_subvolume_inodes(struct bch_fs *c, snapshot_id_list *s)
continue;
if (!(inode->v.i_state & I_DONTCACHE) &&
- !(inode->v.i_state & I_FREEING) &&
igrab(&inode->v)) {
this_pass_clean = false;
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 31/50] btrfs: don't check I_WILL_FREE|I_FREEING
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (29 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 30/50] bcachefs: use the refcount instead of I_WILL_FREE|I_FREEING Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 32/50] fs: use igrab in drop_pagecache_sb Josef Bacik
` (19 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
btrfs has it's own per-root inode list for snapshot uses, and it has a
sanity check to make sure we're not overwriting a live inode when we add
one to the root's xarray. Change this to check the refcount to validate
it's not a live inode.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e85e38df3ea0..03fc3cbdb4af 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3860,7 +3860,7 @@ static int btrfs_add_inode_to_root(struct btrfs_inode *inode, bool prealloc)
ASSERT(ret != -ENOMEM);
return ret;
} else if (existing) {
- WARN_ON(!(existing->vfs_inode.i_state & (I_WILL_FREE | I_FREEING)));
+ WARN_ON(!refcount_read(&existing->vfs_inode.i_count));
}
return 0;
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 32/50] fs: use igrab in drop_pagecache_sb
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (30 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 31/50] btrfs: don't check I_WILL_FREE|I_FREEING Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 33/50] fs: stop checking I_FREEING in d_find_alias_rcu Josef Bacik
` (18 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Just use igrab to see if the inode is valid instead of checking
I_FREEING|I_WILL_FREE.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/drop_caches.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 019a8b4eaaf9..852ccf8e84cb 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -23,18 +23,15 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
- /*
- * We must skip inodes in unusual state. We may also skip
- * inodes without pages but we deliberately won't in case
- * we need to reschedule to avoid softlockups.
- */
- if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+ if ((inode->i_state & I_NEW) ||
(mapping_empty(inode->i_mapping) && !need_resched())) {
spin_unlock(&inode->i_lock);
continue;
}
- __iget(inode);
spin_unlock(&inode->i_lock);
+
+ if (!igrab(inode))
+ continue;
spin_unlock(&sb->s_inode_list_lock);
invalidate_mapping_pages(inode->i_mapping, 0, -1);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 33/50] fs: stop checking I_FREEING in d_find_alias_rcu
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (31 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 32/50] fs: use igrab in drop_pagecache_sb Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 34/50] ext4: stop checking I_WILL_FREE|IFREEING in ext4_check_map_extents_env Josef Bacik
` (17 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Instead of checking for I_FREEING, check the refcount of the inode to
see if it is alive.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/dcache.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 60046ae23d51..fa72a7922517 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1072,8 +1072,8 @@ struct dentry *d_find_alias_rcu(struct inode *inode)
spin_lock(&inode->i_lock);
// ->i_dentry and ->i_rcu are colocated, but the latter won't be
- // used without having I_FREEING set, which means no aliases left
- if (likely(!(inode->i_state & I_FREEING) && !hlist_empty(l))) {
+ // used without having an i_count reference, which means no aliases left
+ if (likely(refcount_read(&inode->i_count) && !hlist_empty(l))) {
if (S_ISDIR(inode->i_mode)) {
de = hlist_entry(l->first, struct dentry, d_u.d_alias);
} else {
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 34/50] ext4: stop checking I_WILL_FREE|IFREEING in ext4_check_map_extents_env
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (32 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 33/50] fs: stop checking I_FREEING in d_find_alias_rcu Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 35/50] fs: remove I_WILL_FREE|I_FREEING from fs-writeback.c Josef Bacik
` (16 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Instead check the refcount to see if the inode is alive.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/ext4/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5b7a15db4953..7674c1f614b1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -425,7 +425,7 @@ void ext4_check_map_extents_env(struct inode *inode)
if (!S_ISREG(inode->i_mode) ||
IS_NOQUOTA(inode) || IS_VERITY(inode) ||
is_special_ino(inode->i_sb, inode->i_ino) ||
- (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) ||
+ ((inode->i_state & I_NEW) || !refcount_read(&inode->i_count)) ||
ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE) ||
ext4_verity_in_progress(inode))
return;
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 35/50] fs: remove I_WILL_FREE|I_FREEING from fs-writeback.c
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (33 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 34/50] ext4: stop checking I_WILL_FREE|IFREEING in ext4_check_map_extents_env Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-25 11:46 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 36/50] gfs2: remove I_WILL_FREE|I_FREEING usage Josef Bacik
` (15 subsequent siblings)
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Now that we have the reference count to indicate live inodes, we can
remove all of the uses of I_WILL_FREE and I_FREEING in fs-writeback.c
and use the appropriate reference count checks.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fs-writeback.c | 47 ++++++++++++++++++++++++++---------------------
1 file changed, 26 insertions(+), 21 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 789c4228412c..0ea6d0e86791 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -129,7 +129,7 @@ static bool inode_io_list_move_locked(struct inode *inode,
{
assert_spin_locked(&wb->list_lock);
assert_spin_locked(&inode->i_lock);
- WARN_ON_ONCE(inode->i_state & I_FREEING);
+ WARN_ON_ONCE(!refcount_read(&inode->i_count));
if (list_empty(&inode->i_io_list))
iobj_get(inode);
@@ -314,7 +314,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode,
{
assert_spin_locked(&wb->list_lock);
assert_spin_locked(&inode->i_lock);
- WARN_ON_ONCE(inode->i_state & I_FREEING);
+ WARN_ON_ONCE(!refcount_read(&inode->i_count));
inode->i_state &= ~I_SYNC_QUEUED;
if (wb != &wb->bdi->wb)
@@ -415,10 +415,10 @@ static bool inode_do_switch_wbs(struct inode *inode,
xa_lock_irq(&mapping->i_pages);
/*
- * Once I_FREEING or I_WILL_FREE are visible under i_lock, the eviction
- * path owns the inode and we shouldn't modify ->i_io_list.
+ * Once the refcount is 0, the eviction path owns the inode and we
+ * shouldn't modify ->i_io_list.
*/
- if (unlikely(inode->i_state & (I_FREEING | I_WILL_FREE)))
+ if (unlikely(!refcount_read(&inode->i_count)))
goto skip_switch;
trace_inode_switch_wbs(inode, old_wb, new_wb);
@@ -570,13 +570,16 @@ static bool inode_prepare_wbs_switch(struct inode *inode,
/* while holding I_WB_SWITCH, no one else can update the association */
spin_lock(&inode->i_lock);
if (!(inode->i_sb->s_flags & SB_ACTIVE) ||
- inode->i_state & (I_WB_SWITCH | I_FREEING | I_WILL_FREE) ||
+ inode->i_state & I_WB_SWITCH ||
inode_to_wb(inode) == new_wb) {
spin_unlock(&inode->i_lock);
return false;
}
+ if (!inode_tryget(inode)) {
+ spin_unlock(&inode->i_lock);
+ return false;
+ }
inode->i_state |= I_WB_SWITCH;
- __iget(inode);
spin_unlock(&inode->i_lock);
return true;
@@ -1207,7 +1210,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode,
{
assert_spin_locked(&wb->list_lock);
assert_spin_locked(&inode->i_lock);
- WARN_ON_ONCE(inode->i_state & I_FREEING);
+ WARN_ON_ONCE(!refcount_read(&inode->i_count));
inode->i_state &= ~I_SYNC_QUEUED;
inode_delete_from_io_list(inode);
@@ -1405,7 +1408,7 @@ static void redirty_tail_locked(struct inode *inode, struct bdi_writeback *wb)
* tracking. Flush worker will ignore this inode anyway and it will
* trigger assertions in inode_io_list_move_locked().
*/
- if (inode->i_state & I_FREEING) {
+ if (!refcount_read(&inode->i_count)) {
inode_delete_from_io_list(inode);
wb_io_lists_depopulated(wb);
return;
@@ -1621,7 +1624,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
struct writeback_control *wbc,
unsigned long dirtied_before)
{
- if (inode->i_state & I_FREEING)
+ if (!refcount_read(&inode->i_count))
return;
/*
@@ -1787,7 +1790,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* whether it is a data-integrity sync (%WB_SYNC_ALL) or not (%WB_SYNC_NONE).
*
* To prevent the inode from going away, either the caller must have a reference
- * to the inode, or the inode must have I_WILL_FREE or I_FREEING set.
+ * to the inode, or the inode must have a zero refcount.
*/
static int writeback_single_inode(struct inode *inode,
struct writeback_control *wbc)
@@ -1797,9 +1800,7 @@ static int writeback_single_inode(struct inode *inode,
spin_lock(&inode->i_lock);
if (!refcount_read(&inode->i_count))
- WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
- else
- WARN_ON(inode->i_state & I_WILL_FREE);
+ WARN_ON(inode->i_state & I_NEW);
if (inode->i_state & I_SYNC) {
/*
@@ -1837,7 +1838,7 @@ static int writeback_single_inode(struct inode *inode,
* If the inode is freeing, its i_io_list shoudn't be updated
* as it can be finally deleted at this moment.
*/
- if (!(inode->i_state & I_FREEING)) {
+ if (refcount_read(&inode->i_count)) {
/*
* If the inode is now fully clean, then it can be safely
* removed from its writeback list (if any). Otherwise the
@@ -1957,7 +1958,7 @@ static long writeback_sb_inodes(struct super_block *sb,
* kind writeout is handled by the freer.
*/
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+ if ((inode->i_state & I_NEW) || !refcount_read(&inode->i_count)) {
redirty_tail_locked(inode, wb);
spin_unlock(&inode->i_lock);
continue;
@@ -2615,7 +2616,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if (inode_unhashed(inode))
goto out_unlock;
}
- if (inode->i_state & I_FREEING)
+ if (!refcount_read(&inode->i_count))
goto out_unlock;
/*
@@ -2729,13 +2730,17 @@ static void wait_sb_inodes(struct super_block *sb)
spin_unlock_irq(&sb->s_inode_wblist_lock);
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+ if (inode->i_state & I_NEW) {
+ spin_unlock(&inode->i_lock);
+ spin_lock_irq(&sb->s_inode_wblist_lock);
+ continue;
+ }
+
+ if (!inode_tryget(inode)) {
spin_unlock(&inode->i_lock);
-
spin_lock_irq(&sb->s_inode_wblist_lock);
continue;
}
- __iget(inode);
/*
* We could have potentially ended up on the cached LRU list, so
@@ -2886,7 +2891,7 @@ EXPORT_SYMBOL(sync_inodes_sb);
* This function commits an inode to disk immediately if it is dirty. This is
* primarily needed by knfsd.
*
- * The caller must either have a ref on the inode or must have set I_WILL_FREE.
+ * The caller must either have a ref on the inode or the inode must have a zero refcount.
*/
int write_inode_now(struct inode *inode, int sync)
{
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 36/50] gfs2: remove I_WILL_FREE|I_FREEING usage
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (34 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 35/50] fs: remove I_WILL_FREE|I_FREEING from fs-writeback.c Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 37/50] fs: remove I_WILL_FREE|I_FREEING check from dquot.c Josef Bacik
` (14 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Now that we have the reference count to check if the inode is live, use
that instead of checking I_WILL_FREE|I_FREEING.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/gfs2/ops_fstype.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index c770006f8889..2b481fdc903d 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1745,17 +1745,26 @@ static void gfs2_evict_inodes(struct super_block *sb)
struct gfs2_sbd *sdp = sb->s_fs_info;
set_bit(SDF_EVICTING, &sdp->sd_flags);
-
+again:
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
- if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) &&
- !need_resched()) {
+ if ((inode->i_state & I_NEW) && !need_resched()) {
spin_unlock(&inode->i_lock);
continue;
}
- __iget(inode);
spin_unlock(&inode->i_lock);
+
+ if (!igrab(inode)) {
+ if (need_resched()) {
+ spin_unlock(&sb->s_inode_list_lock);
+ iput(toput_inode);
+ toput_inode = NULL;
+ cond_resched();
+ goto again;
+ }
+ continue;
+ }
spin_unlock(&sb->s_inode_list_lock);
iput(toput_inode);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 37/50] fs: remove I_WILL_FREE|I_FREEING check from dquot.c
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (35 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 36/50] gfs2: remove I_WILL_FREE|I_FREEING usage Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 38/50] notify: remove I_WILL_FREE|I_FREEING checks in fsnotify_unmount_inodes Josef Bacik
` (13 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
We can use the reference count to see if the inode is live.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/quota/dquot.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index df4a9b348769..90e69653c261 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -1030,14 +1030,16 @@ static int add_dquot_ref(struct super_block *sb, int type)
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
- if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+ if ((inode->i_state & I_NEW) ||
!atomic_read(&inode->i_writecount) ||
!dqinit_needed(inode, type)) {
spin_unlock(&inode->i_lock);
continue;
}
- __iget(inode);
spin_unlock(&inode->i_lock);
+
+ if (!igrab(inode))
+ continue;
spin_unlock(&sb->s_inode_list_lock);
#ifdef CONFIG_QUOTA_DEBUG
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 38/50] notify: remove I_WILL_FREE|I_FREEING checks in fsnotify_unmount_inodes
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (36 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 37/50] fs: remove I_WILL_FREE|I_FREEING check from dquot.c Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 39/50] xfs: remove I_FREEING check Josef Bacik
` (12 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
We can now just use igrab() to make sure we've got a live inode and
remove the I_WILL_FREE|I_FREEING checks.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/notify/fsnotify.c | 26 ++++----------------------
1 file changed, 4 insertions(+), 22 deletions(-)
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 0883696f873d..25996ad2a130 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -46,33 +46,15 @@ static void fsnotify_unmount_inodes(struct super_block *sb)
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- /*
- * We cannot __iget() an inode in state I_FREEING,
- * I_WILL_FREE, or I_NEW which is fine because by that point
- * the inode cannot have any associated watches.
- */
spin_lock(&inode->i_lock);
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+ if (inode->i_state & I_NEW) {
spin_unlock(&inode->i_lock);
continue;
}
-
- /*
- * If i_count is zero, the inode cannot have any watches and
- * doing an __iget/iput with SB_ACTIVE clear would actually
- * evict all inodes with zero i_count from icache which is
- * unnecessarily violent and may in fact be illegal to do.
- * However, we should have been called /after/ evict_inodes
- * removed all zero refcount inodes, in any case. Test to
- * be sure.
- */
- if (!refcount_read(&inode->i_count)) {
- spin_unlock(&inode->i_lock);
- continue;
- }
-
- __iget(inode);
spin_unlock(&inode->i_lock);
+
+ if (!igrab(inode))
+ continue;
spin_unlock(&sb->s_inode_list_lock);
iput(iput_inode);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 39/50] xfs: remove I_FREEING check
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (37 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 38/50] notify: remove I_WILL_FREE|I_FREEING checks in fsnotify_unmount_inodes Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 40/50] landlock: remove I_FREEING|I_WILL_FREE check Josef Bacik
` (11 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
We can simply use the reference count to see if this inode is alive.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/xfs/xfs_bmap_util.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 06ca11731e43..a32b5e9f0183 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -514,7 +514,7 @@ xfs_can_free_eofblocks(
* Caller must either hold the exclusive io lock; or be inactivating
* the inode, which guarantees there are no other users of the inode.
*/
- if (!(VFS_I(ip)->i_state & I_FREEING))
+ if (refcount_read(&VFS_I(ip)->i_count))
xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
/* prealloc/delalloc exists only on regular files */
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 40/50] landlock: remove I_FREEING|I_WILL_FREE check
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (38 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 39/50] xfs: remove I_FREEING check Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 41/50] fs: change inode_is_dirtytime_only to use refcount Josef Bacik
` (10 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
We have the reference count that we can use to see if the inode is
alive.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
security/landlock/fs.c | 22 ++++------------------
1 file changed, 4 insertions(+), 18 deletions(-)
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index 570f851dc469..fc7e577b56e1 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -1280,23 +1280,8 @@ static void hook_sb_delete(struct super_block *const sb)
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
struct landlock_object *object;
- /* Only handles referenced inodes. */
- if (!refcount_read(&inode->i_count))
- continue;
-
- /*
- * Protects against concurrent modification of inode (e.g.
- * from get_inode_object()).
- */
spin_lock(&inode->i_lock);
- /*
- * Checks I_FREEING and I_WILL_FREE to protect against a race
- * condition when release_inode() just called iput(), which
- * could lead to a NULL dereference of inode->security or a
- * second call to iput() for the same Landlock object. Also
- * checks I_NEW because such inode cannot be tied to an object.
- */
- if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
+ if (inode->i_state & I_NEW) {
spin_unlock(&inode->i_lock);
continue;
}
@@ -1308,10 +1293,11 @@ static void hook_sb_delete(struct super_block *const sb)
spin_unlock(&inode->i_lock);
continue;
}
- /* Keeps a reference to this inode until the next loop walk. */
- __iget(inode);
spin_unlock(&inode->i_lock);
+ if (!igrab(inode))
+ continue;
+
/*
* If there is no concurrent release_inode() ongoing, then we
* are in charge of calling iput() on this inode, otherwise we
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 41/50] fs: change inode_is_dirtytime_only to use refcount
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (39 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 40/50] landlock: remove I_FREEING|I_WILL_FREE check Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 42/50] btrfs: remove references to I_FREEING Josef Bacik
` (9 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
We don't need the I_WILL_FREE|I_FREEING check, we can use the refcount
to see if the inode is valid.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
include/linux/fs.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b731224708be..9d9acbea6433 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2644,8 +2644,8 @@ static inline void mark_inode_dirty_sync(struct inode *inode)
*/
static inline bool inode_is_dirtytime_only(struct inode *inode)
{
- return (inode->i_state & (I_DIRTY_TIME | I_NEW |
- I_FREEING | I_WILL_FREE)) == I_DIRTY_TIME;
+ return (inode->i_state & (I_DIRTY_TIME | I_NEW)) == I_DIRTY_TIME &&
+ refcount_read(&inode->i_count);
}
extern void inc_nlink(struct inode *inode);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 42/50] btrfs: remove references to I_FREEING
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (40 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 41/50] fs: change inode_is_dirtytime_only to use refcount Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 43/50] ext4: remove reference to I_FREEING in inode.c Josef Bacik
` (8 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
We do not need the BUG_ON in our evict truncate code, and in the
invalidate path we can simply check for the i_count == 0 to see if we're
evicting the inode.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/inode.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 03fc3cbdb4af..191ded69ab38 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5338,7 +5338,6 @@ static void evict_inode_truncate_pages(struct inode *inode)
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
struct rb_node *node;
- ASSERT(inode->i_state & I_FREEING);
truncate_inode_pages_final(&inode->i_data);
btrfs_drop_extent_map_range(BTRFS_I(inode), 0, (u64)-1, false);
@@ -7448,7 +7447,7 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset,
u64 page_start = folio_pos(folio);
u64 page_end = page_start + folio_size(folio) - 1;
u64 cur;
- int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
+ int inode_evicting = refcount_read(&inode->vfs_inode.i_count) == 0;
/*
* We have folio locked so no new ordered extent can be created on this
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 43/50] ext4: remove reference to I_FREEING in inode.c
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (41 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 42/50] btrfs: remove references to I_FREEING Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 44/50] ext4: remove reference to I_FREEING in orphan.c Josef Bacik
` (7 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Instead of checking I_FREEING, simply check the i_count reference to see
if this inode is going away.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/ext4/inode.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7674c1f614b1..3950e19cf862 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -199,8 +199,8 @@ void ext4_evict_inode(struct inode *inode)
* For inodes with journalled data, transaction commit could have
* dirtied the inode. And for inodes with dioread_nolock, unwritten
* extents converting worker could merge extents and also have dirtied
- * the inode. Flush worker is ignoring it because of I_FREEING flag but
- * we still need to remove the inode from the writeback lists.
+ * the inode. Flush worker is ignoring it because the of the 0 i_count
+ * but we still need to remove the inode from the writeback lists.
*/
if (!list_empty_careful(&inode->i_io_list))
inode_io_list_del(inode);
@@ -4581,7 +4581,7 @@ int ext4_truncate(struct inode *inode)
* or it's a completely new inode. In those cases we might not
* have i_rwsem locked because it's not necessary.
*/
- if (!(inode->i_state & (I_NEW|I_FREEING)))
+ if (!(inode->i_state & I_NEW) && refcount_read(&inode->i_count) > 0)
WARN_ON(!inode_is_locked(inode));
trace_ext4_truncate_enter(inode);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 44/50] ext4: remove reference to I_FREEING in orphan.c
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (42 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 43/50] ext4: remove reference to I_FREEING in inode.c Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 45/50] pnfs: use i_count refcount to determine if the inode is going away Josef Bacik
` (6 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
We can use the i_count refcount to see if this inode is being freed.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/ext4/orphan.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/orphan.c b/fs/ext4/orphan.c
index 524d4658fa40..40d3eac7b003 100644
--- a/fs/ext4/orphan.c
+++ b/fs/ext4/orphan.c
@@ -107,7 +107,8 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
if (!sbi->s_journal || is_bad_inode(inode))
return 0;
- WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
+ WARN_ON_ONCE(!(inode->i_state & I_NEW) &&
+ refcount_read(&inode->i_count) > 0 &&
!inode_is_locked(inode));
/*
* Inode orphaned in orphan file or in orphan list?
@@ -236,7 +237,8 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
return 0;
- WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
+ WARN_ON_ONCE(!(inode->i_state & I_NEW) &&
+ refcount_read(&inode->i_count) > 0 &&
!inode_is_locked(inode));
if (ext4_test_inode_state(inode, EXT4_STATE_ORPHAN_FILE))
return ext4_orphan_file_del(handle, inode);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 45/50] pnfs: use i_count refcount to determine if the inode is going away
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (43 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 44/50] ext4: remove reference to I_FREEING in orphan.c Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 46/50] fs: remove some spurious I_FREEING references in inode.c Josef Bacik
` (5 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Remove the I_FREEING and I_CLEAR check in PNFS and replace it with a
i_count reference check, which will indicate that the inode is going
away.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/nfs/pnfs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index a3135b5af7ee..042a5f152068 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -317,7 +317,7 @@ pnfs_put_layout_hdr(struct pnfs_layout_hdr *lo)
WARN_ONCE(1, "NFS: BUG unfreed layout segments.\n");
pnfs_detach_layout_hdr(lo);
/* Notify pnfs_destroy_layout_final() that we're done */
- if (inode->i_state & (I_FREEING | I_CLEAR))
+ if (refcount_read(&inode->i_count) == 0)
wake_up_var_locked(lo, &inode->i_lock);
spin_unlock(&inode->i_lock);
pnfs_free_layout_hdr(lo);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 46/50] fs: remove some spurious I_FREEING references in inode.c
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (44 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 45/50] pnfs: use i_count refcount to determine if the inode is going away Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 47/50] xfs: remove reference to I_FREEING|I_WILL_FREE Josef Bacik
` (4 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Now that we have the i_count reference count rules set so that we only
go into these evict paths with a 0 count, update the sanity checks to
check that instead of I_FREEING.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 4e1eeb0c3889..f715504778d2 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -874,7 +874,7 @@ void clear_inode(struct inode *inode)
*/
xa_unlock_irq(&inode->i_data.i_pages);
BUG_ON(!list_empty(&inode->i_data.i_private_list));
- BUG_ON(!(inode->i_state & I_FREEING));
+ BUG_ON(refcount_read(&inode->i_count) != 0);
BUG_ON(inode->i_state & I_CLEAR);
BUG_ON(!list_empty(&inode->i_wb_list));
/* don't need i_lock here, no concurrent mods to i_state */
@@ -887,19 +887,19 @@ EXPORT_SYMBOL(clear_inode);
* to. We remove any pages still attached to the inode and wait for any IO that
* is still in progress before finally destroying the inode.
*
- * An inode must already be marked I_FREEING so that we avoid the inode being
+ * An inode must already have an i_count of 0 so that we avoid the inode being
* moved back onto lists if we race with other code that manipulates the lists
* (e.g. writeback_single_inode). The caller is responsible for setting this.
*
* An inode must already be removed from the LRU list before being evicted from
- * the cache. This should occur atomically with setting the I_FREEING state
- * flag, so no inodes here should ever be on the LRU when being evicted.
+ * the cache. This should always be the case as the LRU list holds an i_count
+ * reference on the inode, and we only evict inodes with an i_count of 0.
*/
static void evict(struct inode *inode)
{
const struct super_operations *op = inode->i_sb->s_op;
- BUG_ON(!(inode->i_state & I_FREEING));
+ BUG_ON(refcount_read(&inode->i_count) != 0);
BUG_ON(!list_empty(&inode->i_lru));
if (!list_empty(&inode->i_io_list))
@@ -913,8 +913,8 @@ static void evict(struct inode *inode)
/*
* Wait for flusher thread to be done with the inode so that filesystem
* does not start destroying it while writeback is still running. Since
- * the inode has I_FREEING set, flusher thread won't start new work on
- * the inode. We just have to wait for running writeback to finish.
+ * the inode has a 0 i_count, flusher thread won't start new work on the
+ * inode. We just have to wait for running writeback to finish.
*/
inode_wait_for_writeback(inode);
spin_unlock(&inode->i_lock);
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 47/50] xfs: remove reference to I_FREEING|I_WILL_FREE
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (45 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 46/50] fs: remove some spurious I_FREEING references in inode.c Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 48/50] ocfs2: do not set I_WILL_FREE Josef Bacik
` (3 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
xfs already is using igrab which will do the correct thing, simply
remove the reference to these flags.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/xfs/scrub/common.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 2ef7742be7d3..b0bb40490493 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -1086,8 +1086,7 @@ xchk_install_handle_inode(
/*
* Install an already-referenced inode for scrubbing. Get our own reference to
- * the inode to make disposal simpler. The inode must not be in I_FREEING or
- * I_WILL_FREE state!
+ * the inode to make disposal simpler. The inode must be live.
*/
int
xchk_install_live_inode(
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 48/50] ocfs2: do not set I_WILL_FREE
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (46 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 47/50] xfs: remove reference to I_FREEING|I_WILL_FREE Josef Bacik
@ 2025-08-21 20:18 ` Josef Bacik
2025-08-21 20:19 ` [PATCH 49/50] fs: remove I_FREEING|I_WILL_FREE Josef Bacik
` (2 subsequent siblings)
50 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:18 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
This is a subtle behavior change. Before this change ocfs2 would keep
this inode from being discovered and used while it was doing this
because of I_WILL_FREE being set. However now we call ->drop_inode()
before we drop the last i_count refcount, so we could potentially race
here with somebody else and grab a reference to this inode.
This isn't bad, the inode is still live and concurrent accesses will be
safe. But we could potentially end up writing this inode multiple times
if there are concurrent accesses while we're trying to drop the inode.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/ocfs2/inode.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 14bf440ea4df..d3c79d9a9635 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -1306,13 +1306,9 @@ int ocfs2_drop_inode(struct inode *inode)
trace_ocfs2_drop_inode((unsigned long long)oi->ip_blkno,
inode->i_nlink, oi->ip_flags);
- assert_spin_locked(&inode->i_lock);
- inode->i_state |= I_WILL_FREE;
spin_unlock(&inode->i_lock);
write_inode_now(inode, 1);
spin_lock(&inode->i_lock);
- WARN_ON(inode->i_state & I_NEW);
- inode->i_state &= ~I_WILL_FREE;
return 1;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 49/50] fs: remove I_FREEING|I_WILL_FREE
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (47 preceding siblings ...)
2025-08-21 20:18 ` [PATCH 48/50] ocfs2: do not set I_WILL_FREE Josef Bacik
@ 2025-08-21 20:19 ` Josef Bacik
2025-08-25 11:53 ` Christian Brauner
2025-08-21 20:19 ` [PATCH 50/50] fs: add documentation explaining the reference count rules for inodes Josef Bacik
2025-08-22 10:51 ` [PATCH 00/50] fs: rework inode reference counting Christian Brauner
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:19 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Now that we're using the i_count reference count as the ultimate arbiter
of whether or not an inode is life we can remove the I_FREEING and
I_WILL_FREE flags.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/inode.c | 8 ++------
include/linux/fs.h | 32 +++++++++++---------------------
include/trace/events/writeback.h | 2 --
3 files changed, 13 insertions(+), 29 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index f715504778d2..1bb528405b3d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -878,7 +878,7 @@ void clear_inode(struct inode *inode)
BUG_ON(inode->i_state & I_CLEAR);
BUG_ON(!list_empty(&inode->i_wb_list));
/* don't need i_lock here, no concurrent mods to i_state */
- inode->i_state = I_FREEING | I_CLEAR;
+ inode->i_state = I_CLEAR;
}
EXPORT_SYMBOL(clear_inode);
@@ -942,7 +942,7 @@ static void evict(struct inode *inode)
* This also means we don't need any fences for the call below.
*/
inode_wake_up_bit(inode, __I_NEW);
- BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
+ BUG_ON(inode->i_state != I_CLEAR);
}
static void iput_evict(struct inode *inode);
@@ -1975,7 +1975,6 @@ static void iput_final(struct inode *inode, bool drop)
state = inode->i_state;
if (!drop) {
- WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
spin_unlock(&inode->i_lock);
write_inode_now(inode, 1);
@@ -1983,10 +1982,7 @@ static void iput_final(struct inode *inode, bool drop)
spin_lock(&inode->i_lock);
state = inode->i_state;
WARN_ON(state & I_NEW);
- state &= ~I_WILL_FREE;
}
-
- WRITE_ONCE(inode->i_state, state | I_FREEING);
spin_unlock(&inode->i_lock);
evict(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9d9acbea6433..0599faef0d6a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -672,8 +672,8 @@ is_uncached_acl(struct posix_acl *acl)
* I_DIRTY_DATASYNC, I_DIRTY_PAGES, and I_DIRTY_TIME.
*
* Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
- * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
- * various stages of removing an inode.
+ * until that flag is cleared. I_CLEAR is set when the inode is clean and ready
+ * to be freed.
*
* Two bits are used for locking and completion notification, I_NEW and I_SYNC.
*
@@ -697,24 +697,18 @@ is_uncached_acl(struct posix_acl *acl)
* New inodes set I_NEW. If two processes both create
* the same inode, one of them will release its inode and
* wait for I_NEW to be released before returning.
- * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can
- * also cause waiting on I_NEW, without I_NEW actually
- * being set. find_inode() uses this to prevent returning
+ * Inodes with an i_count == 0 or I_CLEAR state can also
+ * cause waiting on I_NEW, without I_NEW actually being
+ * set. find_inode() uses this to prevent returning
* nearly-dead inodes.
- * I_WILL_FREE Must be set when calling write_inode_now() if i_count
- * is zero. I_FREEING must be set when I_WILL_FREE is
- * cleared.
- * I_FREEING Set when inode is about to be freed but still has dirty
- * pages or buffers attached or the inode itself is still
- * dirty.
* I_CLEAR Added by clear_inode(). In this state the inode is
- * clean and can be destroyed. Inode keeps I_FREEING.
+ * clean and can be destroyed.
*
- * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are
- * prohibited for many purposes. iget() must wait for
- * the inode to be completely released, then create it
- * anew. Other functions will just ignore such inodes,
- * if appropriate. I_NEW is used for waiting.
+ * Inodes that have i_count == 0 or I_CLEAR are prohibited
+ * for many purposes. iget() must wait for the inode to be
+ * completely released, then create it anew. Other
+ * functions will just ignore such inodes, if appropriate.
+ * I_NEW is used for waiting.
*
* I_SYNC Writeback of inode is running. The bit is set during
* data writeback, and cleared with a wakeup on the bit
@@ -752,8 +746,6 @@ is_uncached_acl(struct posix_acl *acl)
* I_CACHED_LRU Inode is cached because it is dirty or isn't shrinkable,
* and thus is on the s_cached_inode_lru list.
*
- * Q: What is the difference between I_WILL_FREE and I_FREEING?
- *
* __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
* upon. There's one free address left.
*/
@@ -776,8 +768,6 @@ enum inode_state_bits {
INODE_BIT(I_DIRTY_SYNC),
INODE_BIT(I_DIRTY_DATASYNC),
INODE_BIT(I_DIRTY_PAGES),
- INODE_BIT(I_WILL_FREE),
- INODE_BIT(I_FREEING),
INODE_BIT(I_CLEAR),
INODE_BIT(I_REFERENCED),
INODE_BIT(I_LINKABLE),
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 6949329c744a..58ee61f3d91d 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -15,8 +15,6 @@
{I_DIRTY_DATASYNC, "I_DIRTY_DATASYNC"}, \
{I_DIRTY_PAGES, "I_DIRTY_PAGES"}, \
{I_NEW, "I_NEW"}, \
- {I_WILL_FREE, "I_WILL_FREE"}, \
- {I_FREEING, "I_FREEING"}, \
{I_CLEAR, "I_CLEAR"}, \
{I_SYNC, "I_SYNC"}, \
{I_DIRTY_TIME, "I_DIRTY_TIME"}, \
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 50/50] fs: add documentation explaining the reference count rules for inodes
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (48 preceding siblings ...)
2025-08-21 20:19 ` [PATCH 49/50] fs: remove I_FREEING|I_WILL_FREE Josef Bacik
@ 2025-08-21 20:19 ` Josef Bacik
2025-08-25 11:56 ` Christian Brauner
2025-08-22 10:51 ` [PATCH 00/50] fs: rework inode reference counting Christian Brauner
50 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-21 20:19 UTC (permalink / raw)
To: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
Now that we've made these changes to the inode, document the reference
count rules in the vfs documentation.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
Documentation/filesystems/vfs.rst | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 229eb90c96f2..5bfe7863a5de 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -457,6 +457,29 @@ The Inode Object
An inode object represents an object within the filesystem.
+Reference counting rules
+------------------------
+
+The inode is reference counted in two distinct ways, an i_obj_count refcount and
+an i_count refcount. These control two different lifetimes of the inode. The
+i_obj_count is the simplest, think of it as a reference count on the object
+itself. When the i_obj_count reaches zero, the inode is freed. Inode freeing
+happens in the RCU context, so the inode is not freed immediately, but rather
+after a grace period.
+
+The i_count reference is the indicator that the inode is "alive". That is to
+say, it is available for use by all the ways that a user can access the inode.
+Once this count reaches zero, we begin the process of evicting the inode. This
+is where the final truncate of an unlinked inode will normally occur. Once
+i_count has reached 0, only the final iput() is allowed to do things like
+writeback, truncate, etc. All users that want to do these style of operation
+must use igrab() or, in very rare and specific circumstances, use
+inode_tryget().
+
+Every access to an inode must include one of these two references. Generally
+i_obj_count is reserved for internal VFS references, the s_inode_list for
+example. All file systems should use igrab()/lookup() to get a live reference on
+the inode, with very few exceptions.
struct inode_operations
-----------------------
--
2.49.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH 00/50] fs: rework inode reference counting
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
` (49 preceding siblings ...)
2025-08-21 20:19 ` [PATCH 50/50] fs: add documentation explaining the reference count rules for inodes Josef Bacik
@ 2025-08-22 10:51 ` Christian Brauner
2025-08-22 13:30 ` Josef Bacik
50 siblings, 1 reply; 85+ messages in thread
From: Christian Brauner @ 2025-08-22 10:51 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:11PM -0400, Josef Bacik wrote:
> Hello,
>
> This series is the first part of a larger body of work geared towards solving a
> variety of scalability issues in the VFS.
>
> We have historically had a variety of foot-guns related to inode freeing. We
> have I_WILL_FREE and I_FREEING flags that indicated when the inode was in the
> different stages of being reclaimed. This lead to confusion, and bugs in cases
> where one was checked but the other wasn't. Additionally, it's frankly
> confusing to have both of these flags and to deal with them in practice.
Agreed.
> However, this exists because we have an odd behavior with inodes, we allow them
> to have a 0 reference count and still be usable. This again is a pretty unfun
> footgun, because generally speaking we want reference counts to be meaningful.
Agreed.
> The problem with the way we reference inodes is the final iput(). The majority
> of file systems do their final truncate of a unlinked inode in their
> ->evict_inode() callback, which happens when the inode is actually being
> evicted. This can be a long process for large inodes, and thus isn't safe to
> happen in a variety of contexts. Btrfs, for example, has an entire delayed iput
> infrastructure to make sure that we do not do the final iput() in a dangerous
> context. We cannot expand the use of this reference count to all the places the
> inode is used, because there are cases where we would need to iput() in an IRQ
> context (end folio writeback) or other unsafe context, which is not allowed.
>
> To that end, resolve this by introducing a new i_obj_count reference count. This
> will be used to control when we can actually free the inode. We then can use
> this reference count in all the places where we may reference the inode. This
> removes another huge footgun, having ways to access the inode itself without
> having an actual reference to it. The writeback code is one of the main places
> where we see this. Inodes end up on all sorts of lists here without a proper
> reference count. This allows us to protect the inode from being freed by giving
> this an other code mechanisms to protect their access to the inode.
>
> With this we can separate the concept of the inode being usable, and the inode
> being freed. The next part of the patch series is to stop allowing for inodes
> to have an i_count of 0 and still be viable. This comes with some warts. The
> biggest wart is now if we choose to cache inodes in the LRU list we have to
> remove the inode from the LRU list if we access it once it's on the LRU list.
> This will result in more contention on the lru list lock, but in practice we
> rarely have inodes that do not have a dentry, and if we do that inode is not
> long for this world.
>
> With not allowing inodes to hit a refcount of 0, we can take advantage of that
> common pattern of using refcount_inc_not_zero() in all of the lockless places
> where we do inode lookup in cache. From there we can change all the users who
> check I_WILL_FREE or I_FREEING to simply check the i_count. If it is 0 then they
> aren't allowed to do their work, othrwise they can proceed as normal.
>
> With all of that in place we can finally remove these two flags.
>
> This is a large series, but it is mostly mechanical. I've kept the patches very
> small, to make it easy to review and logic about each change. I have run this
> through fstests for btrfs and ext4, xfs is currently going. I wanted to get this
> out for review to make sure this big design changes are reasonable to everybody.
>
> The series is based on vfs/vfs.all branch, which is based on 6.9-rc1. Thanks,
I so hope you meant 6.17-rc1 because otherwise I did something very very
wrong. :)
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 02/50] fs: make the i_state flags an enum
2025-08-21 20:18 ` [PATCH 02/50] fs: make the i_state flags an enum Josef Bacik
@ 2025-08-22 11:08 ` Christian Brauner
2025-08-22 13:31 ` Josef Bacik
2025-08-22 14:36 ` David Sterba
2025-08-22 11:18 ` Sun YangKai
1 sibling, 2 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-22 11:08 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:13PM -0400, Josef Bacik wrote:
> Adjusting i_state flags always means updating the values manually. Bring
> these forward into the 2020's and make a nice clean macro for defining
> the i_state values as an enum, providing __ variants for the cases where
> we need the bit position instead of the actual value, and leaving the
> actual NAME as the 1U << bit value.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> include/linux/fs.h | 234 +++++++++++++++++++++++----------------------
> 1 file changed, 122 insertions(+), 112 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9a1ce67eed33..e741dc453c2c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -665,6 +665,127 @@ is_uncached_acl(struct posix_acl *acl)
> #define IOP_MGTIME 0x0020
> #define IOP_CACHED_LINK 0x0040
>
> +/*
> + * Inode state bits. Protected by inode->i_lock
> + *
> + * Four bits determine the dirty state of the inode: I_DIRTY_SYNC,
> + * I_DIRTY_DATASYNC, I_DIRTY_PAGES, and I_DIRTY_TIME.
> + *
> + * Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
> + * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
> + * various stages of removing an inode.
> + *
> + * Two bits are used for locking and completion notification, I_NEW and I_SYNC.
> + *
> + * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
> + * fdatasync() (unless I_DIRTY_DATASYNC is also set).
> + * Timestamp updates are the usual cause.
> + * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of
> + * these changes separately from I_DIRTY_SYNC so that we
> + * don't have to write inode on fdatasync() when only
> + * e.g. the timestamps have changed.
> + * I_DIRTY_PAGES Inode has dirty pages. Inode itself may be clean.
> + * I_DIRTY_TIME The inode itself has dirty timestamps, and the
> + * lazytime mount option is enabled. We keep track of this
> + * separately from I_DIRTY_SYNC in order to implement
> + * lazytime. This gets cleared if I_DIRTY_INODE
> + * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But
> + * I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already
> + * in place because writeback might already be in progress
> + * and we don't want to lose the time update
> + * I_NEW Serves as both a mutex and completion notification.
> + * New inodes set I_NEW. If two processes both create
> + * the same inode, one of them will release its inode and
> + * wait for I_NEW to be released before returning.
> + * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can
> + * also cause waiting on I_NEW, without I_NEW actually
> + * being set. find_inode() uses this to prevent returning
> + * nearly-dead inodes.
> + * I_WILL_FREE Must be set when calling write_inode_now() if i_count
> + * is zero. I_FREEING must be set when I_WILL_FREE is
> + * cleared.
> + * I_FREEING Set when inode is about to be freed but still has dirty
> + * pages or buffers attached or the inode itself is still
> + * dirty.
> + * I_CLEAR Added by clear_inode(). In this state the inode is
> + * clean and can be destroyed. Inode keeps I_FREEING.
> + *
> + * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are
> + * prohibited for many purposes. iget() must wait for
> + * the inode to be completely released, then create it
> + * anew. Other functions will just ignore such inodes,
> + * if appropriate. I_NEW is used for waiting.
> + *
> + * I_SYNC Writeback of inode is running. The bit is set during
> + * data writeback, and cleared with a wakeup on the bit
> + * address once it is done. The bit is also used to pin
> + * the inode in memory for flusher thread.
> + *
> + * I_REFERENCED Marks the inode as recently references on the LRU list.
> + *
> + * I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to
> + * synchronize competing switching instances and to tell
> + * wb stat updates to grab the i_pages lock. See
> + * inode_switch_wbs_work_fn() for details.
> + *
> + * I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper
> + * and work dirs among overlayfs mounts.
> + *
> + * I_CREATING New object's inode in the middle of setting up.
> + *
> + * I_DONTCACHE Evict inode as soon as it is not used anymore.
> + *
> + * I_SYNC_QUEUED Inode is queued in b_io or b_more_io writeback lists.
> + * Used to detect that mark_inode_dirty() should not move
> + * inode between dirty lists.
> + *
> + * I_PINNING_FSCACHE_WB Inode is pinning an fscache object for writeback.
> + *
> + * I_LRU_ISOLATING Inode is pinned being isolated from LRU without holding
> + * i_count.
> + *
> + * Q: What is the difference between I_WILL_FREE and I_FREEING?
> + *
> + * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
> + * upon. There's one free address left.
> + */
> +
> +/*
> + * As simple macro to define the inode state bits, __NAME will be the bit value
> + * (0, 1, 2, ...), and NAME will be the bit mask (1U << __NAME). The __NAME_SEQ
> + * is used to reset the sequence number so the next name gets the next bit value
> + * in the sequence.
> + */
> +#define INODE_BIT(name) \
> + __ ## name, \
> + name = (1U << __ ## name), \
> + __ ## name ## _SEQ = __ ## name
I'm not sure if this is the future we want :D
I think it's harder to parse than what we have now.
> +
> +enum inode_state_bits {
> + INODE_BIT(I_NEW),
> + INODE_BIT(I_SYNC),
> + INODE_BIT(I_LRU_ISOLATING),
> + INODE_BIT(I_DIRTY_SYNC),
> + INODE_BIT(I_DIRTY_DATASYNC),
> + INODE_BIT(I_DIRTY_PAGES),
> + INODE_BIT(I_WILL_FREE),
> + INODE_BIT(I_FREEING),
> + INODE_BIT(I_CLEAR),
> + INODE_BIT(I_REFERENCED),
> + INODE_BIT(I_LINKABLE),
> + INODE_BIT(I_DIRTY_TIME),
> + INODE_BIT(I_WB_SWITCH),
> + INODE_BIT(I_OVL_INUSE),
> + INODE_BIT(I_CREATING),
> + INODE_BIT(I_DONTCACHE),
> + INODE_BIT(I_SYNC_QUEUED),
> + INODE_BIT(I_PINNING_NETFS_WB),
> +};
Good idea but I really dislike this macro indirection.
Can't we just do the really boring?
enum inode_state_bits {
__I_BIT_NEW = 0U
__I_BIT_SYNC = 1U
__I_BIT_LRU_ISOLATING = 2U
}
enum inode_state_flags_t {
I_NEW = (1U << __I_BIT_NEW)
I_SYNC = (1U << __I_BIT_SYNC)
I_LRU_ISOLATING = (1U << __I_BIT_LRU_ISOLATING)
I_DIRTY_SYNC = (1U << 3)
I_DIRTY_DATASYNC = (1U << 4)
I_DIRTY_PAGES = (1U << 5)
I_WILL_FREE = (1U << 6)
I_FREEING = (1U << 7)
I_CLEAR = (1U << 8)
I_REFERENCED = (1U << 9)
I_LINKABLE = (1U << 10)
I_DIRTY_TIME = (1U << 11)
I_WB_SWITCH = (1U << 12)
I_OVL_INUSE = (1U << 13)
I_CREATING = (1U << 14)
I_DONTCACHE = (1U << 15)
I_SYNC_QUEUED = (1U << 16)
I_PINNING_NETFS_WB = (1U << 17)
};
Note that inode_state_wait_address() and that only works on four bits so
we can't really use higher bits anyway without switching back to a
scheme where we have to use unsigned long and waste for bytes for
nothing on 64 bit.
With that out of the way,
Reviewed-by: Christian Brauner <brauner@kernel.org>
> +#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
> +#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
> +#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
> +
> /*
> * Keep mostly read-only and often accessed (especially for
> * the RCU path lookup and 'stat' data) fields at the beginning
> @@ -723,7 +844,7 @@ struct inode {
> #endif
>
> /* Misc */
> - u32 i_state;
> + enum inode_state_bits i_state;
> /* 32-bit hole */
> struct rw_semaphore i_rwsem;
>
> @@ -2484,117 +2605,6 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
> };
> }
>
> -/*
> - * Inode state bits. Protected by inode->i_lock
> - *
> - * Four bits determine the dirty state of the inode: I_DIRTY_SYNC,
> - * I_DIRTY_DATASYNC, I_DIRTY_PAGES, and I_DIRTY_TIME.
> - *
> - * Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
> - * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
> - * various stages of removing an inode.
> - *
> - * Two bits are used for locking and completion notification, I_NEW and I_SYNC.
> - *
> - * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
> - * fdatasync() (unless I_DIRTY_DATASYNC is also set).
> - * Timestamp updates are the usual cause.
> - * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of
> - * these changes separately from I_DIRTY_SYNC so that we
> - * don't have to write inode on fdatasync() when only
> - * e.g. the timestamps have changed.
> - * I_DIRTY_PAGES Inode has dirty pages. Inode itself may be clean.
> - * I_DIRTY_TIME The inode itself has dirty timestamps, and the
> - * lazytime mount option is enabled. We keep track of this
> - * separately from I_DIRTY_SYNC in order to implement
> - * lazytime. This gets cleared if I_DIRTY_INODE
> - * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But
> - * I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already
> - * in place because writeback might already be in progress
> - * and we don't want to lose the time update
> - * I_NEW Serves as both a mutex and completion notification.
> - * New inodes set I_NEW. If two processes both create
> - * the same inode, one of them will release its inode and
> - * wait for I_NEW to be released before returning.
> - * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can
> - * also cause waiting on I_NEW, without I_NEW actually
> - * being set. find_inode() uses this to prevent returning
> - * nearly-dead inodes.
> - * I_WILL_FREE Must be set when calling write_inode_now() if i_count
> - * is zero. I_FREEING must be set when I_WILL_FREE is
> - * cleared.
> - * I_FREEING Set when inode is about to be freed but still has dirty
> - * pages or buffers attached or the inode itself is still
> - * dirty.
> - * I_CLEAR Added by clear_inode(). In this state the inode is
> - * clean and can be destroyed. Inode keeps I_FREEING.
> - *
> - * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are
> - * prohibited for many purposes. iget() must wait for
> - * the inode to be completely released, then create it
> - * anew. Other functions will just ignore such inodes,
> - * if appropriate. I_NEW is used for waiting.
> - *
> - * I_SYNC Writeback of inode is running. The bit is set during
> - * data writeback, and cleared with a wakeup on the bit
> - * address once it is done. The bit is also used to pin
> - * the inode in memory for flusher thread.
> - *
> - * I_REFERENCED Marks the inode as recently references on the LRU list.
> - *
> - * I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to
> - * synchronize competing switching instances and to tell
> - * wb stat updates to grab the i_pages lock. See
> - * inode_switch_wbs_work_fn() for details.
> - *
> - * I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper
> - * and work dirs among overlayfs mounts.
> - *
> - * I_CREATING New object's inode in the middle of setting up.
> - *
> - * I_DONTCACHE Evict inode as soon as it is not used anymore.
> - *
> - * I_SYNC_QUEUED Inode is queued in b_io or b_more_io writeback lists.
> - * Used to detect that mark_inode_dirty() should not move
> - * inode between dirty lists.
> - *
> - * I_PINNING_FSCACHE_WB Inode is pinning an fscache object for writeback.
> - *
> - * I_LRU_ISOLATING Inode is pinned being isolated from LRU without holding
> - * i_count.
> - *
> - * Q: What is the difference between I_WILL_FREE and I_FREEING?
> - *
> - * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
> - * upon. There's one free address left.
> - */
> -#define __I_NEW 0
> -#define I_NEW (1 << __I_NEW)
> -#define __I_SYNC 1
> -#define I_SYNC (1 << __I_SYNC)
> -#define __I_LRU_ISOLATING 2
> -#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING)
> -
> -#define I_DIRTY_SYNC (1 << 3)
> -#define I_DIRTY_DATASYNC (1 << 4)
> -#define I_DIRTY_PAGES (1 << 5)
> -#define I_WILL_FREE (1 << 6)
> -#define I_FREEING (1 << 7)
> -#define I_CLEAR (1 << 8)
> -#define I_REFERENCED (1 << 9)
> -#define I_LINKABLE (1 << 10)
> -#define I_DIRTY_TIME (1 << 11)
> -#define I_WB_SWITCH (1 << 12)
> -#define I_OVL_INUSE (1 << 13)
> -#define I_CREATING (1 << 14)
> -#define I_DONTCACHE (1 << 15)
> -#define I_SYNC_QUEUED (1 << 16)
> -#define I_PINNING_NETFS_WB (1 << 17)
> -
> -#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
> -#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
> -#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
> -
> extern void __mark_inode_dirty(struct inode *, int);
> static inline void mark_inode_dirty(struct inode *inode)
> {
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 02/50] fs: make the i_state flags an enum
2025-08-21 20:18 ` [PATCH 02/50] fs: make the i_state flags an enum Josef Bacik
2025-08-22 11:08 ` Christian Brauner
@ 2025-08-22 11:18 ` Sun YangKai
2025-08-22 11:42 ` [PATCH 02/50] " Alan Huang
2025-08-22 14:40 ` [PATCH 02/50] fs: " Josef Bacik
1 sibling, 2 replies; 85+ messages in thread
From: Sun YangKai @ 2025-08-22 11:18 UTC (permalink / raw)
To: josef
Cc: brauner, kernel-team, linux-btrfs, linux-ext4, linux-fsdevel,
linux-xfs, viro
[-- Attachment #1: Type: text/plain, Size: 3396 bytes --]
Hi Josef,
Sorry for the bothering, and I hope this isn't too far off-topic for the
current patch series discussion.
I recently learned about the x-macro trick and was wondering if it might be
suitable for use in this context since we are rewriting this. I'd appreciate
any thoughts or feedback on whether this approach could be applied here.
Thanks in advance for your insights!
Below is the patch for reference:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d7ab4f96d705..6a766aaa457e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2576,28 +2576,36 @@ static inline void kiocb_clone(struct kiocb *kiocb,
struct kiocb *kiocb_src,
* __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
* upon. There's one free address left.
*/
-#define __I_NEW 0
-#define I_NEW (1 << __I_NEW)
-#define __I_SYNC 1
-#define I_SYNC (1 << __I_SYNC)
-#define __I_LRU_ISOLATING 2
-#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING)
-
-#define I_DIRTY_SYNC (1 << 3)
-#define I_DIRTY_DATASYNC (1 << 4)
-#define I_DIRTY_PAGES (1 << 5)
-#define I_WILL_FREE (1 << 6)
-#define I_FREEING (1 << 7)
-#define I_CLEAR (1 << 8)
-#define I_REFERENCED (1 << 9)
-#define I_LINKABLE (1 << 10)
-#define I_DIRTY_TIME (1 << 11)
-#define I_WB_SWITCH (1 << 12)
-#define I_OVL_INUSE (1 << 13)
-#define I_CREATING (1 << 14)
-#define I_DONTCACHE (1 << 15)
-#define I_SYNC_QUEUED (1 << 16)
-#define I_PINNING_NETFS_WB (1 << 17)
+#define INODE_STATE(X) \
+ X(I_NEW), \
+ X(I_SYNC), \
+ X(I_LRU_ISOLATING), \
+ X(I_DIRTY_SYNC), \
+ X(I_DIRTY_DATASYNC), \
+ X(I_DIRTY_PAGES), \
+ X(I_WILL_FREE), \
+ X(I_FREEING), \
+ X(I_CLEAR), \
+ X(I_REFERENCED), \
+ X(I_LINKABLE), \
+ X(I_DIRTY_TIME), \
+ X(I_WB_SWITCH), \
+ X(I_OVL_INUSE), \
+ X(I_CREATING), \
+ X(I_DONTCACHE), \
+ X(I_SYNC_QUEUED), \
+ X(I_PINNING_NETFS_WB)
+
+enum __inode_state_bits {
+ #define X(state) __##state
+ INODE_STATE(X)
+ #undef X
+};
+enum inode_state_bits {
+ #define X(state) state = (1 << __##state)
+ INODE_STATE(X)
+ #undef X
+};
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
diff --git a/include/trace/events/writeback.h b/include/trace/events/
writeback.h
index 1e23919c0da9..4c545c72c40a 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -9,26 +9,10 @@
#include <linux/backing-dev.h>
#include <linux/writeback.h>
-#define show_inode_state(state) \
- __print_flags(state, "|", \
- {I_DIRTY_SYNC, "I_DIRTY_SYNC"}, \
- {I_DIRTY_DATASYNC, "I_DIRTY_DATASYNC"}, \
- {I_DIRTY_PAGES, "I_DIRTY_PAGES"}, \
- {I_NEW, "I_NEW"}, \
- {I_WILL_FREE, "I_WILL_FREE"}, \
- {I_FREEING, "I_FREEING"}, \
- {I_CLEAR, "I_CLEAR"}, \
- {I_SYNC, "I_SYNC"}, \
- {I_DIRTY_TIME, "I_DIRTY_TIME"}, \
- {I_REFERENCED, "I_REFERENCED"}, \
- {I_LINKABLE, "I_LINKABLE"}, \
- {I_WB_SWITCH, "I_WB_SWITCH"}, \
- {I_OVL_INUSE, "I_OVL_INUSE"}, \
- {I_CREATING, "I_CREATING"}, \
- {I_DONTCACHE, "I_DONTCACHE"}, \
- {I_SYNC_QUEUED, "I_SYNC_QUEUED"}, \
- {I_PINNING_NETFS_WB, "I_PINNING_NETFS_WB"}, \
- {I_LRU_ISOLATING, "I_LRU_ISOLATING"} \
+#define inode_state_name(s) { s, #s }
+#define show_inode_state(state) \
+ __print_flags(state, "|", \
+ INODE_STATE(inode_state_name) \
)
/* enums need to be exported to user space */
Best regards,
Sun YangKai
[-- Attachment #2: x-macro.patch --]
[-- Type: text/x-patch, Size: 2946 bytes --]
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d7ab4f96d705..6a766aaa457e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2576,28 +2576,36 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
* __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
* upon. There's one free address left.
*/
-#define __I_NEW 0
-#define I_NEW (1 << __I_NEW)
-#define __I_SYNC 1
-#define I_SYNC (1 << __I_SYNC)
-#define __I_LRU_ISOLATING 2
-#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING)
-
-#define I_DIRTY_SYNC (1 << 3)
-#define I_DIRTY_DATASYNC (1 << 4)
-#define I_DIRTY_PAGES (1 << 5)
-#define I_WILL_FREE (1 << 6)
-#define I_FREEING (1 << 7)
-#define I_CLEAR (1 << 8)
-#define I_REFERENCED (1 << 9)
-#define I_LINKABLE (1 << 10)
-#define I_DIRTY_TIME (1 << 11)
-#define I_WB_SWITCH (1 << 12)
-#define I_OVL_INUSE (1 << 13)
-#define I_CREATING (1 << 14)
-#define I_DONTCACHE (1 << 15)
-#define I_SYNC_QUEUED (1 << 16)
-#define I_PINNING_NETFS_WB (1 << 17)
+#define INODE_STATE(X) \
+ X(I_NEW), \
+ X(I_SYNC), \
+ X(I_LRU_ISOLATING), \
+ X(I_DIRTY_SYNC), \
+ X(I_DIRTY_DATASYNC), \
+ X(I_DIRTY_PAGES), \
+ X(I_WILL_FREE), \
+ X(I_FREEING), \
+ X(I_CLEAR), \
+ X(I_REFERENCED), \
+ X(I_LINKABLE), \
+ X(I_DIRTY_TIME), \
+ X(I_WB_SWITCH), \
+ X(I_OVL_INUSE), \
+ X(I_CREATING), \
+ X(I_DONTCACHE), \
+ X(I_SYNC_QUEUED), \
+ X(I_PINNING_NETFS_WB)
+
+enum __inode_state_bits {
+ #define X(state) __##state
+ INODE_STATE(X)
+ #undef X
+};
+enum inode_state_bits {
+ #define X(state) state = (1 << __##state)
+ INODE_STATE(X)
+ #undef X
+};
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 1e23919c0da9..4c545c72c40a 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -9,26 +9,10 @@
#include <linux/backing-dev.h>
#include <linux/writeback.h>
-#define show_inode_state(state) \
- __print_flags(state, "|", \
- {I_DIRTY_SYNC, "I_DIRTY_SYNC"}, \
- {I_DIRTY_DATASYNC, "I_DIRTY_DATASYNC"}, \
- {I_DIRTY_PAGES, "I_DIRTY_PAGES"}, \
- {I_NEW, "I_NEW"}, \
- {I_WILL_FREE, "I_WILL_FREE"}, \
- {I_FREEING, "I_FREEING"}, \
- {I_CLEAR, "I_CLEAR"}, \
- {I_SYNC, "I_SYNC"}, \
- {I_DIRTY_TIME, "I_DIRTY_TIME"}, \
- {I_REFERENCED, "I_REFERENCED"}, \
- {I_LINKABLE, "I_LINKABLE"}, \
- {I_WB_SWITCH, "I_WB_SWITCH"}, \
- {I_OVL_INUSE, "I_OVL_INUSE"}, \
- {I_CREATING, "I_CREATING"}, \
- {I_DONTCACHE, "I_DONTCACHE"}, \
- {I_SYNC_QUEUED, "I_SYNC_QUEUED"}, \
- {I_PINNING_NETFS_WB, "I_PINNING_NETFS_WB"}, \
- {I_LRU_ISOLATING, "I_LRU_ISOLATING"} \
+#define inode_state_name(s) { s, #s }
+#define show_inode_state(state) \
+ __print_flags(state, "|", \
+ INODE_STATE(inode_state_name) \
)
/* enums need to be exported to user space */
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH 04/50] fs: hold an i_obj_count reference for the i_wb_list
2025-08-21 20:18 ` [PATCH 04/50] fs: hold an i_obj_count reference for the i_wb_list Josef Bacik
@ 2025-08-22 11:27 ` Christian Brauner
0 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-22 11:27 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:15PM -0400, Josef Bacik wrote:
> If we're holding the inode on one of the writeback lists we need to have
> a reference on that inode. Grab a reference when we add i_wb_list to
> something, drop it when it's removed.
>
> This is potentially dangerous, because we remove the inode from the
> i_wb_list potentially under IRQ via folio_end_writeback(). This will be
> mitigated by making sure all writeback is completed on the final iput,
> before the final iobj_put, preventing a potential free under IRQ.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/fs-writeback.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 001773e6e95c..c2437e3d320a 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1332,6 +1332,7 @@ void sb_mark_inode_writeback(struct inode *inode)
> if (list_empty(&inode->i_wb_list)) {
> spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
> if (list_empty(&inode->i_wb_list)) {
> + iobj_get(inode);
> list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
> trace_sb_mark_inode_writeback(inode);
> }
> @@ -1346,15 +1347,26 @@ void sb_clear_inode_writeback(struct inode *inode)
> {
> struct super_block *sb = inode->i_sb;
> unsigned long flags;
> + bool drop = false;
>
> if (!list_empty(&inode->i_wb_list)) {
> spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
> if (!list_empty(&inode->i_wb_list)) {
> + drop = true;
> list_del_init(&inode->i_wb_list);
> trace_sb_clear_inode_writeback(inode);
> }
> spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
> }
> +
> + /*
> + * This can be called in IRQ context when we're clearing writeback on
> + * the folio. This should not be the last iobj_put() on the inode, we
> + * run all of the writeback before we free the inode in order to avoid
> + * this possibility.
> + */
> + if (drop)
> + iobj_put(inode);
In that case it might be valuable to have a:
VFS_WARN_ON_ONCE(refcount_read(&inode->i_obj_count) < 2);
before calling iobj_put() here? It'll compile out without
CONFIG_VFS_DEBUG set.
Btw, you should also be able to write this as removing the condition.
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2e10cc2f955f..cfdb2c2793cb 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1366,13 +1366,13 @@ void sb_mark_inode_writeback(struct inode *inode)
void sb_clear_inode_writeback(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
+ struct inode *drop = NULL;
unsigned long flags;
- bool drop = false;
if (!list_empty(&inode->i_wb_list)) {
spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
if (!list_empty(&inode->i_wb_list)) {
- drop = true;
+ drop = inode;
list_del_init(&inode->i_wb_list);
trace_sb_clear_inode_writeback(inode);
}
@@ -1385,8 +1385,7 @@ void sb_clear_inode_writeback(struct inode *inode)
* run all of the writeback before we free the inode in order to avoid
* this possibility.
*/
- if (drop)
- iobj_put(inode);
+ iobj_put(drop);
}
> }
>
> /*
> @@ -2683,6 +2695,8 @@ static void wait_sb_inodes(struct super_block *sb)
> * to preserve consistency between i_wb_list and the mapping
> * writeback tag. Writeback completion is responsible to remove
> * the inode from either list once the writeback tag is cleared.
> + * At that point the i_obj_count reference will be dropped for
> + * the i_wb_list reference.
> */
> list_move_tail(&inode->i_wb_list, &sb->s_inodes_wb);
>
> --
> 2.49.0
>
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH 02/50] make the i_state flags an enum
2025-08-22 11:18 ` Sun YangKai
@ 2025-08-22 11:42 ` Alan Huang
2025-08-22 12:11 ` Sun YangKai
2025-08-22 14:40 ` [PATCH 02/50] fs: " Josef Bacik
1 sibling, 1 reply; 85+ messages in thread
From: Alan Huang @ 2025-08-22 11:42 UTC (permalink / raw)
To: Sun YangKai
Cc: josef, brauner, kernel-team, linux-btrfs, linux-ext4,
linux-fsdevel, linux-xfs, viro
On Aug 22, 2025, at 19:18, Sun YangKai <sunk67188@gmail.com> wrote:
>
> Hi Josef,
>
> Sorry for the bothering, and I hope this isn't too far off-topic for the
> current patch series discussion.
>
> I recently learned about the x-macro trick and was wondering if it might be
> suitable for use in this context since we are rewriting this. I'd appreciate
> any thoughts or feedback on whether this approach could be applied here.
I think x-macro is easy to write, but hard to read or grep.
>
> Thanks in advance for your insights!
>
> Below is the patch for reference:
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index d7ab4f96d705..6a766aaa457e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2576,28 +2576,36 @@ static inline void kiocb_clone(struct kiocb *kiocb,
> struct kiocb *kiocb_src,
> * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
> * upon. There's one free address left.
> */
> -#define __I_NEW 0
> -#define I_NEW (1 << __I_NEW)
> -#define __I_SYNC 1
> -#define I_SYNC (1 << __I_SYNC)
> -#define __I_LRU_ISOLATING 2
> -#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING)
> -
> -#define I_DIRTY_SYNC (1 << 3)
> -#define I_DIRTY_DATASYNC (1 << 4)
> -#define I_DIRTY_PAGES (1 << 5)
> -#define I_WILL_FREE (1 << 6)
> -#define I_FREEING (1 << 7)
> -#define I_CLEAR (1 << 8)
> -#define I_REFERENCED (1 << 9)
> -#define I_LINKABLE (1 << 10)
> -#define I_DIRTY_TIME (1 << 11)
> -#define I_WB_SWITCH (1 << 12)
> -#define I_OVL_INUSE (1 << 13)
> -#define I_CREATING (1 << 14)
> -#define I_DONTCACHE (1 << 15)
> -#define I_SYNC_QUEUED (1 << 16)
> -#define I_PINNING_NETFS_WB (1 << 17)
> +#define INODE_STATE(X) \
And it should be INODE_STATE().
> + X(I_NEW), \
> + X(I_SYNC), \
> + X(I_LRU_ISOLATING), \
> + X(I_DIRTY_SYNC), \
> + X(I_DIRTY_DATASYNC), \
> + X(I_DIRTY_PAGES), \
> + X(I_WILL_FREE), \
> + X(I_FREEING), \
> + X(I_CLEAR), \
> + X(I_REFERENCED), \
> + X(I_LINKABLE), \
> + X(I_DIRTY_TIME), \
> + X(I_WB_SWITCH), \
> + X(I_OVL_INUSE), \
> + X(I_CREATING), \
> + X(I_DONTCACHE), \
> + X(I_SYNC_QUEUED), \
> + X(I_PINNING_NETFS_WB)
> +
> +enum __inode_state_bits {
> + #define X(state) __##state
> + INODE_STATE(X)
> + #undef X
> +};
> +enum inode_state_bits {
> + #define X(state) state = (1 << __##state)
> + INODE_STATE(X)
> + #undef X
> +};
>
> #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
> #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
> diff --git a/include/trace/events/writeback.h b/include/trace/events/
> writeback.h
> index 1e23919c0da9..4c545c72c40a 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -9,26 +9,10 @@
> #include <linux/backing-dev.h>
> #include <linux/writeback.h>
>
> -#define show_inode_state(state) \
> - __print_flags(state, "|", \
> - {I_DIRTY_SYNC, "I_DIRTY_SYNC"}, \
> - {I_DIRTY_DATASYNC, "I_DIRTY_DATASYNC"}, \
> - {I_DIRTY_PAGES, "I_DIRTY_PAGES"}, \
> - {I_NEW, "I_NEW"}, \
> - {I_WILL_FREE, "I_WILL_FREE"}, \
> - {I_FREEING, "I_FREEING"}, \
> - {I_CLEAR, "I_CLEAR"}, \
> - {I_SYNC, "I_SYNC"}, \
> - {I_DIRTY_TIME, "I_DIRTY_TIME"}, \
> - {I_REFERENCED, "I_REFERENCED"}, \
> - {I_LINKABLE, "I_LINKABLE"}, \
> - {I_WB_SWITCH, "I_WB_SWITCH"}, \
> - {I_OVL_INUSE, "I_OVL_INUSE"}, \
> - {I_CREATING, "I_CREATING"}, \
> - {I_DONTCACHE, "I_DONTCACHE"}, \
> - {I_SYNC_QUEUED, "I_SYNC_QUEUED"}, \
> - {I_PINNING_NETFS_WB, "I_PINNING_NETFS_WB"}, \
> - {I_LRU_ISOLATING, "I_LRU_ISOLATING"} \
> +#define inode_state_name(s) { s, #s }
> +#define show_inode_state(state) \
> + __print_flags(state, "|", \
> + INODE_STATE(inode_state_name) \
> )
>
> /* enums need to be exported to user space */
>
> Best regards,
> Sun YangKai
> <x-macro.patch>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 20/50] fs: convert i_count to refcount_t
2025-08-21 20:18 ` [PATCH 20/50] fs: convert i_count to refcount_t Josef Bacik
@ 2025-08-22 12:10 ` Amir Goldstein
2025-08-22 13:56 ` kernel test robot
2025-08-25 11:03 ` Christian Brauner
2 siblings, 0 replies; 85+ messages in thread
From: Amir Goldstein @ 2025-08-22 12:10 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
brauner, viro
On Fri, Aug 22, 2025 at 12:54 AM Josef Bacik <josef@toxicpanda.com> wrote:
>
> Now that we do not allow i_count to drop to 0 and be used we can convert
> it to a refcount_t and benefit from the protections those helpers add.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> arch/powerpc/platforms/cell/spufs/file.c | 2 +-
> fs/btrfs/inode.c | 4 ++--
> fs/ceph/mds_client.c | 2 +-
> fs/ext4/ialloc.c | 4 ++--
> fs/fs-writeback.c | 2 +-
> fs/hpfs/inode.c | 2 +-
> fs/inode.c | 11 ++++++-----
> fs/nfs/inode.c | 4 ++--
> fs/notify/fsnotify.c | 2 +-
> fs/ubifs/super.c | 2 +-
> fs/xfs/xfs_inode.c | 2 +-
> fs/xfs/xfs_trace.h | 2 +-
> include/linux/fs.h | 4 ++--
> include/trace/events/filelock.h | 2 +-
> security/landlock/fs.c | 2 +-
> 15 files changed, 24 insertions(+), 23 deletions(-)
>
You missed a spot in fs/smb/client/inode.c
that is using inode->i_count.counter directly.
Thanks,
Amir.
> diff --git a/arch/powerpc/platforms/cell/spufs/file.c b/arch/powerpc/platforms/cell/spufs/file.c
> index d5a2c77bc908..3f768b003838 100644
> --- a/arch/powerpc/platforms/cell/spufs/file.c
> +++ b/arch/powerpc/platforms/cell/spufs/file.c
> @@ -1430,7 +1430,7 @@ static int spufs_mfc_open(struct inode *inode, struct file *file)
> if (ctx->owner != current->mm)
> return -EINVAL;
>
> - if (atomic_read(&inode->i_count) != 1)
> + if (refcount_read(&inode->i_count) != 1)
> return -EBUSY;
>
> mutex_lock(&ctx->mapping_lock);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index bbbcd96e8f5c..e85e38df3ea0 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3418,7 +3418,7 @@ void btrfs_add_delayed_iput(struct btrfs_inode *inode)
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> unsigned long flags;
>
> - if (atomic_add_unless(&inode->vfs_inode.i_count, -1, 1)) {
> + if (refcount_dec_not_one(&inode->vfs_inode.i_count)) {
> iobj_put(&inode->vfs_inode);
> return;
> }
> @@ -4559,7 +4559,7 @@ static void btrfs_prune_dentries(struct btrfs_root *root)
>
> inode = btrfs_find_first_inode(root, min_ino);
> while (inode) {
> - if (atomic_read(&inode->vfs_inode.i_count) > 1)
> + if (refcount_read(&inode->vfs_inode.i_count) > 1)
> d_prune_aliases(&inode->vfs_inode);
>
> min_ino = btrfs_ino(inode) + 1;
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 0f497c39ff82..ff666d18f6ad 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2221,7 +2221,7 @@ static int trim_caps_cb(struct inode *inode, int mds, void *arg)
> int count;
> dput(dentry);
> d_prune_aliases(inode);
> - count = atomic_read(&inode->i_count);
> + count = refcount_read(&inode->i_count);
> if (count == 1)
> (*remaining)--;
> doutc(cl, "%p %llx.%llx cap %p pruned, count now %d\n",
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index df4051613b29..9a3c7f22a57e 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -252,10 +252,10 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
> "nonexistent device\n", __func__, __LINE__);
> return;
> }
> - if (atomic_read(&inode->i_count) > 1) {
> + if (refcount_read(&inode->i_count) > 1) {
> ext4_msg(sb, KERN_ERR, "%s:%d: inode #%lu: count=%d",
> __func__, __LINE__, inode->i_ino,
> - atomic_read(&inode->i_count));
> + refcount_read(&inode->i_count));
> return;
> }
> if (inode->i_nlink) {
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 111a9d8215bf..789c4228412c 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1796,7 +1796,7 @@ static int writeback_single_inode(struct inode *inode,
> int ret = 0;
>
> spin_lock(&inode->i_lock);
> - if (!atomic_read(&inode->i_count))
> + if (!refcount_read(&inode->i_count))
> WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
> else
> WARN_ON(inode->i_state & I_WILL_FREE);
> diff --git a/fs/hpfs/inode.c b/fs/hpfs/inode.c
> index a59e8fa630db..ee23a941d8f5 100644
> --- a/fs/hpfs/inode.c
> +++ b/fs/hpfs/inode.c
> @@ -184,7 +184,7 @@ void hpfs_write_inode(struct inode *i)
> struct hpfs_inode_info *hpfs_inode = hpfs_i(i);
> struct inode *parent;
> if (i->i_ino == hpfs_sb(i->i_sb)->sb_root) return;
> - if (hpfs_inode->i_rddir_off && !atomic_read(&i->i_count)) {
> + if (hpfs_inode->i_rddir_off && !refcount_read(&i->i_count)) {
> if (*hpfs_inode->i_rddir_off)
> pr_err("write_inode: some position still there\n");
> kfree(hpfs_inode->i_rddir_off);
> diff --git a/fs/inode.c b/fs/inode.c
> index 07c8edb4b58a..28d197731914 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -236,7 +236,7 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
> inode->i_state = 0;
> atomic64_set(&inode->i_sequence, 0);
> refcount_set(&inode->i_obj_count, 1);
> - atomic_set(&inode->i_count, 1);
> + refcount_set(&inode->i_count, 1);
> inode->i_op = &empty_iops;
> inode->i_fop = &no_open_fops;
> inode->i_ino = 0;
> @@ -561,7 +561,8 @@ static void init_once(void *foo)
> void ihold(struct inode *inode)
> {
> iobj_get(inode);
> - WARN_ON(atomic_inc_return(&inode->i_count) < 2);
> + refcount_inc(&inode->i_count);
> + WARN_ON(refcount_read(&inode->i_count) < 2);
> }
> EXPORT_SYMBOL(ihold);
>
> @@ -614,7 +615,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
>
> if (inode->i_state & (I_FREEING | I_WILL_FREE))
> return;
> - if (atomic_read(&inode->i_count) != 1)
> + if (refcount_read(&inode->i_count) != 1)
> return;
> if (inode->__i_nlink == 0)
> return;
> @@ -2019,7 +2020,7 @@ static void __iput(struct inode *inode, bool skip_lru)
> return;
> BUG_ON(inode->i_state & I_CLEAR);
>
> - if (atomic_add_unless(&inode->i_count, -1, 1)) {
> + if (refcount_dec_not_one(&inode->i_count)) {
> iobj_put(inode);
> return;
> }
> @@ -2039,7 +2040,7 @@ static void __iput(struct inode *inode, bool skip_lru)
> */
> drop = maybe_add_lru(inode, skip_lru);
>
> - if (atomic_dec_and_test(&inode->i_count))
> + if (refcount_dec_and_test(&inode->i_count))
> iput_final(inode, drop);
> else
> spin_unlock(&inode->i_lock);
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 338ef77ae423..9cc84f0afa9a 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -608,7 +608,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
> inode->i_sb->s_id,
> (unsigned long long)NFS_FILEID(inode),
> nfs_display_fhandle_hash(fh),
> - atomic_read(&inode->i_count));
> + refcount_read(&inode->i_count));
>
> out:
> return inode;
> @@ -2229,7 +2229,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
> dfprintk(VFS, "NFS: %s(%s/%lu fh_crc=0x%08x ct=%d info=0x%llx)\n",
> __func__, inode->i_sb->s_id, inode->i_ino,
> nfs_display_fhandle_hash(NFS_FH(inode)),
> - atomic_read(&inode->i_count), fattr->valid);
> + refcount_read(&inode->i_count), fattr->valid);
>
> if (!(fattr->valid & NFS_ATTR_FATTR_FILEID)) {
> /* Only a mounted-on-fileid? Just exit */
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 079b868552c2..0883696f873d 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -66,7 +66,7 @@ static void fsnotify_unmount_inodes(struct super_block *sb)
> * removed all zero refcount inodes, in any case. Test to
> * be sure.
> */
> - if (!atomic_read(&inode->i_count)) {
> + if (!refcount_read(&inode->i_count)) {
> spin_unlock(&inode->i_lock);
> continue;
> }
> diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
> index f3e3b2068608..79526f71fa8a 100644
> --- a/fs/ubifs/super.c
> +++ b/fs/ubifs/super.c
> @@ -358,7 +358,7 @@ static void ubifs_evict_inode(struct inode *inode)
> goto out;
>
> dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
> - ubifs_assert(c, !atomic_read(&inode->i_count));
> + ubifs_assert(c, !refcount_read(&inode->i_count));
>
> truncate_inode_pages_final(&inode->i_data);
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 9c39251961a3..06af749fe5f3 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1035,7 +1035,7 @@ xfs_itruncate_extents_flags(
> int error = 0;
>
> xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
> - if (atomic_read(&VFS_I(ip)->i_count))
> + if (refcount_read(&VFS_I(ip)->i_count))
> xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
> ASSERT(new_size <= XFS_ISIZE(ip));
> ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index ac344e42846c..167d33b8095c 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1152,7 +1152,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
> TP_fast_assign(
> __entry->dev = VFS_I(ip)->i_sb->s_dev;
> __entry->ino = ip->i_ino;
> - __entry->count = atomic_read(&VFS_I(ip)->i_count);
> + __entry->count = refcount_read(&VFS_I(ip)->i_count);
> __entry->pincount = atomic_read(&ip->i_pincount);
> __entry->iflags = ip->i_flags;
> __entry->caller_ip = caller_ip;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 8384ed81a5ad..34fb40ba8a94 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -880,7 +880,7 @@ struct inode {
> };
> atomic64_t i_version;
> atomic64_t i_sequence; /* see futex */
> - atomic_t i_count;
> + refcount_t i_count;
> atomic_t i_dio_count;
> atomic_t i_writecount;
> #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
> @@ -3399,7 +3399,7 @@ static inline void iobj_get(struct inode *inode)
> static inline void __iget(struct inode *inode)
> {
> iobj_get(inode);
> - atomic_inc(&inode->i_count);
> + refcount_inc(&inode->i_count);
> }
>
> extern void iget_failed(struct inode *);
> diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h
> index b8d1e00a7982..e745436cfcd2 100644
> --- a/include/trace/events/filelock.h
> +++ b/include/trace/events/filelock.h
> @@ -189,7 +189,7 @@ TRACE_EVENT(generic_add_lease,
> __entry->i_ino = inode->i_ino;
> __entry->wcount = atomic_read(&inode->i_writecount);
> __entry->rcount = atomic_read(&inode->i_readcount);
> - __entry->icount = atomic_read(&inode->i_count);
> + __entry->icount = refcount_read(&inode->i_count);
> __entry->owner = fl->c.flc_owner;
> __entry->flags = fl->c.flc_flags;
> __entry->type = fl->c.flc_type;
> diff --git a/security/landlock/fs.c b/security/landlock/fs.c
> index c04f8879ad03..570f851dc469 100644
> --- a/security/landlock/fs.c
> +++ b/security/landlock/fs.c
> @@ -1281,7 +1281,7 @@ static void hook_sb_delete(struct super_block *const sb)
> struct landlock_object *object;
>
> /* Only handles referenced inodes. */
> - if (!atomic_read(&inode->i_count))
> + if (!refcount_read(&inode->i_count))
> continue;
>
> /*
> --
> 2.49.0
>
>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 02/50] make the i_state flags an enum
2025-08-22 11:42 ` [PATCH 02/50] " Alan Huang
@ 2025-08-22 12:11 ` Sun YangKai
0 siblings, 0 replies; 85+ messages in thread
From: Sun YangKai @ 2025-08-22 12:11 UTC (permalink / raw)
To: mmpgouride
Cc: brauner, josef, kernel-team, linux-btrfs, linux-ext4,
linux-fsdevel, linux-xfs, sunk67188, viro
> On Aug 22, 2025, at 19:18, Sun YangKai <sunk67188@gmail.com> wrote:
> > Hi Josef,
> >
> > Sorry for the bothering, and I hope this isn't too far off-topic for the
> > current patch series discussion.
> >
> > I recently learned about the x-macro trick and was wondering if it might
> > be
> > suitable for use in this context since we are rewriting this. I'd
> > appreciate any thoughts or feedback on whether this approach could be
> > applied here.
> I think x-macro is easy to write, but hard to read or grep.
I agree.
However, in this particular case, it's not very difficult to read and it does
help reduce some copy-pasting. So it really comes down to what we value more
in our codebase: straightforward implementation, or adhering to DRY
principles.
>
> > Thanks in advance for your insights!
> >
> > Below is the patch for reference:
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index d7ab4f96d705..6a766aaa457e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2576,28 +2576,36 @@ static inline void kiocb_clone(struct kiocb
> > *kiocb,
> > struct kiocb *kiocb_src,
> >
> > * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to
> > wait
> > * upon. There's one free address left.
> > */
> >
> > -#define __I_NEW 0
> > -#define I_NEW (1 << __I_NEW)
> > -#define __I_SYNC 1
> > -#define I_SYNC (1 << __I_SYNC)
> > -#define __I_LRU_ISOLATING 2
> > -#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING)
> > -
> > -#define I_DIRTY_SYNC (1 << 3)
> > -#define I_DIRTY_DATASYNC (1 << 4)
> > -#define I_DIRTY_PAGES (1 << 5)
> > -#define I_WILL_FREE (1 << 6)
> > -#define I_FREEING (1 << 7)
> > -#define I_CLEAR (1 << 8)
> > -#define I_REFERENCED (1 << 9)
> > -#define I_LINKABLE (1 << 10)
> > -#define I_DIRTY_TIME (1 << 11)
> > -#define I_WB_SWITCH (1 << 12)
> > -#define I_OVL_INUSE (1 << 13)
> > -#define I_CREATING (1 << 14)
> > -#define I_DONTCACHE (1 << 15)
> > -#define I_SYNC_QUEUED (1 << 16)
> > -#define I_PINNING_NETFS_WB (1 << 17)
> > +#define INODE_STATE(X) \
>
> And it should be INODE_STATE().
Yes, but I defined X as a parameter so that other names can be used as well.
This is particularly important in the file /include/trace/events/writeback.h,
where the macro X must be consistently defined wherever show_inode_state is
used. Therefore, it’s better to use a more meaningful name—such as
inode_state_name—instead of X. The situation is slightly more complex than a
typical x-macro, since we cannot undefine the helper macro (whether it's called
X or, in this case, inode_state_name).
>
> > + X(I_NEW), \
> > + X(I_SYNC), \
> > + X(I_LRU_ISOLATING), \
> > + X(I_DIRTY_SYNC), \
> > + X(I_DIRTY_DATASYNC), \
> > + X(I_DIRTY_PAGES), \
> > + X(I_WILL_FREE), \
> > + X(I_FREEING), \
> > + X(I_CLEAR), \
> > + X(I_REFERENCED), \
> > + X(I_LINKABLE), \
> > + X(I_DIRTY_TIME), \
> > + X(I_WB_SWITCH), \
> > + X(I_OVL_INUSE), \
> > + X(I_CREATING), \
> > + X(I_DONTCACHE), \
> > + X(I_SYNC_QUEUED), \
> > + X(I_PINNING_NETFS_WB)
> > +
> > +enum __inode_state_bits {
> > + #define X(state) __##state
> > + INODE_STATE(X)
> > + #undef X
> > +};
> > +enum inode_state_bits {
> > + #define X(state) state = (1 << __##state)
> > + INODE_STATE(X)
> > + #undef X
> > +};
> >
> > #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
> > #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
> > diff --git a/include/trace/events/writeback.h b/include/trace/events/
> > writeback.h
> > index 1e23919c0da9..4c545c72c40a 100644
> > --- a/include/trace/events/writeback.h
> > +++ b/include/trace/events/writeback.h
> > @@ -9,26 +9,10 @@
> > #include <linux/backing-dev.h>
> > #include <linux/writeback.h>
> >
> > -#define show_inode_state(state) \
> > - __print_flags(state, "|", \
> > - {I_DIRTY_SYNC, "I_DIRTY_SYNC"}, \
> > - {I_DIRTY_DATASYNC, "I_DIRTY_DATASYNC"}, \
> > - {I_DIRTY_PAGES, "I_DIRTY_PAGES"}, \
> > - {I_NEW, "I_NEW"}, \
> > - {I_WILL_FREE, "I_WILL_FREE"}, \
> > - {I_FREEING, "I_FREEING"}, \
> > - {I_CLEAR, "I_CLEAR"}, \
> > - {I_SYNC, "I_SYNC"}, \
> > - {I_DIRTY_TIME, "I_DIRTY_TIME"}, \
> > - {I_REFERENCED, "I_REFERENCED"}, \
> > - {I_LINKABLE, "I_LINKABLE"}, \
> > - {I_WB_SWITCH, "I_WB_SWITCH"}, \
> > - {I_OVL_INUSE, "I_OVL_INUSE"}, \
> > - {I_CREATING, "I_CREATING"}, \
> > - {I_DONTCACHE, "I_DONTCACHE"}, \
> > - {I_SYNC_QUEUED, "I_SYNC_QUEUED"}, \
> > - {I_PINNING_NETFS_WB, "I_PINNING_NETFS_WB"}, \
> > - {I_LRU_ISOLATING, "I_LRU_ISOLATING"} \
> > +#define inode_state_name(s) { s, #s }
> > +#define show_inode_state(state) \
> > + __print_flags(state, "|", \
> > + INODE_STATE(inode_state_name) \
> > )
> >
> > /* enums need to be exported to user space */
> >
> > Best regards,
> > Sun YangKai
> > <x-macro.patch>
Thanks,
Sun YangKai
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 06/50] fs: hold an i_obj_count reference in writeback_sb_inodes
2025-08-21 20:18 ` [PATCH 06/50] fs: hold an i_obj_count reference in writeback_sb_inodes Josef Bacik
@ 2025-08-22 12:20 ` Christian Brauner
0 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-22 12:20 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:17PM -0400, Josef Bacik wrote:
> We drop the wb list_lock while writing back inodes, and we could
> manipulate the i_io_list while this is happening and drop our reference
> for the inode. Protect this by holding the i_obj_count reference during
> the writeback.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/fs-writeback.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 24fccb299de4..2b0d26a58a5a 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1977,6 +1977,7 @@ static long writeback_sb_inodes(struct super_block *sb,
> trace_writeback_sb_inodes_requeue(inode);
> continue;
> }
> + iobj_get(inode);
> spin_unlock(&wb->list_lock);
>
> /*
> @@ -1987,6 +1988,7 @@ static long writeback_sb_inodes(struct super_block *sb,
> if (inode->i_state & I_SYNC) {
> /* Wait for I_SYNC. This function drops i_lock... */
> inode_sleep_on_writeback(inode);
> + iobj_put(inode);
> /* Inode may be gone, start again */
> spin_lock(&wb->list_lock);
> continue;
> @@ -2035,10 +2037,9 @@ static long writeback_sb_inodes(struct super_block *sb,
> inode_sync_complete(inode);
> spin_unlock(&inode->i_lock);
>
> - if (unlikely(tmp_wb != wb)) {
> - spin_unlock(&tmp_wb->list_lock);
> - spin_lock(&wb->list_lock);
> - }
> + spin_unlock(&tmp_wb->list_lock);
> + iobj_put(inode);
> + spin_lock(&wb->list_lock);
So if tmp_wb == wb then you unlock and immediately relock dropping the
reference in between and if tmp_wb != wb then you unlock tmp_wb and the
context implies that @wb became unlocked and can be relocked again.
Seems sane, thanks. More contention on @wb->list_lock. I have no
intuition how bad that is and I know you mentioned it in your cover
letter. If it matters then I suspect the reference count would matter as
well. But let's not worry about it yet.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: (subset) [PATCH 10/50] fs: stop accessing ->i_count directly in f2fs and gfs2
2025-08-21 20:18 ` [PATCH 10/50] fs: stop accessing ->i_count directly in f2fs and gfs2 Josef Bacik
@ 2025-08-22 12:38 ` Christian Brauner
0 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-22 12:38 UTC (permalink / raw)
To: Josef Bacik
Cc: Christian Brauner, linux-fsdevel, linux-btrfs, kernel-team,
linux-ext4, linux-xfs, viro
On Thu, 21 Aug 2025 16:18:21 -0400, Josef Bacik wrote:
> Instead of accessing ->i_count directly in these file systems, use the
> appropriate __iget and iput helpers.
>
>
I'll take that as a fix.
---
Applied to the vfs.fixes branch of the vfs/vfs.git tree.
Patches in the vfs.fixes branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.fixes
[10/50] fs: stop accessing ->i_count directly in f2fs and gfs2
https://git.kernel.org/vfs/vfs/c/2c2be4cb82c2
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 12/50] fs: rework iput logic
2025-08-21 20:18 ` [PATCH 12/50] fs: rework iput logic Josef Bacik
@ 2025-08-22 12:54 ` Christian Brauner
0 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-22 12:54 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:23PM -0400, Josef Bacik wrote:
> Currently, if we are the last iput, and we have the I_DIRTY_TIME bit
> set, we will grab a reference on the inode again and then mark it dirty
> and then redo the put. This is to make sure we delay the time update
> for as long as possible.
>
> We can rework this logic to simply dec i_count if it is not 1, and if it
> is do the time update while still holding the i_count reference.
>
> Then we can replace the atomic_dec_and_lock with locking the ->i_lock
> and doing atomic_dec_and_test, since we did the atomic_add_unless above.
>
> This is preparation for no longer allowing 0 i_count inodes to exist.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 31 ++++++++++++++++---------------
> 1 file changed, 16 insertions(+), 15 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 16acad5583fc..814c03f5dbb1 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1928,22 +1928,23 @@ void iput(struct inode *inode)
> if (!inode)
> return;
> BUG_ON(inode->i_state & I_CLEAR);
> -retry:
> - if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
> - if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> - /*
> - * Increment i_count directly as we still have our
> - * i_obj_count reference still. This is temporary and
> - * will go away in a future patch.
> - */
> - atomic_inc(&inode->i_count);
> - spin_unlock(&inode->i_lock);
> - trace_writeback_lazytime_iput(inode);
> - mark_inode_dirty_sync(inode);
> - goto retry;
> - }
> - iput_final(inode);
> +
> + if (atomic_add_unless(&inode->i_count, -1, 1)) {
> + iobj_put(inode);
> + return;
> }
> +
> + if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> + trace_writeback_lazytime_iput(inode);
> + mark_inode_dirty_sync(inode);
> + }
> +
> + spin_lock(&inode->i_lock);
> + if (atomic_dec_and_test(&inode->i_count))
> + iput_final(inode);
Personally, I'd add a
// drops i_lock
comment behind iput_final() but that's a matter of taste tbf.
> + else
> + spin_unlock(&inode->i_lock);
> +
> iobj_put(inode);
> }
> EXPORT_SYMBOL(iput);
This looks a lot less magical than the current variant! We should maybe
split this patch in two. A cleanup patch that removes the questionable
"drop to zero, take lock then increment from zero again" logic and then
in a separate patch add in the iobj_put(). So the cleanup can go to the
front of the series.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 00/50] fs: rework inode reference counting
2025-08-22 10:51 ` [PATCH 00/50] fs: rework inode reference counting Christian Brauner
@ 2025-08-22 13:30 ` Josef Bacik
0 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-22 13:30 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Fri, Aug 22, 2025 at 12:51:29PM +0200, Christian Brauner wrote:
> On Thu, Aug 21, 2025 at 04:18:11PM -0400, Josef Bacik wrote:
> > Hello,
> >
> > This series is the first part of a larger body of work geared towards solving a
> > variety of scalability issues in the VFS.
> >
> > We have historically had a variety of foot-guns related to inode freeing. We
> > have I_WILL_FREE and I_FREEING flags that indicated when the inode was in the
> > different stages of being reclaimed. This lead to confusion, and bugs in cases
> > where one was checked but the other wasn't. Additionally, it's frankly
> > confusing to have both of these flags and to deal with them in practice.
>
> Agreed.
>
> > However, this exists because we have an odd behavior with inodes, we allow them
> > to have a 0 reference count and still be usable. This again is a pretty unfun
> > footgun, because generally speaking we want reference counts to be meaningful.
>
> Agreed.
>
> > The problem with the way we reference inodes is the final iput(). The majority
> > of file systems do their final truncate of a unlinked inode in their
> > ->evict_inode() callback, which happens when the inode is actually being
> > evicted. This can be a long process for large inodes, and thus isn't safe to
> > happen in a variety of contexts. Btrfs, for example, has an entire delayed iput
> > infrastructure to make sure that we do not do the final iput() in a dangerous
> > context. We cannot expand the use of this reference count to all the places the
> > inode is used, because there are cases where we would need to iput() in an IRQ
> > context (end folio writeback) or other unsafe context, which is not allowed.
> >
> > To that end, resolve this by introducing a new i_obj_count reference count. This
> > will be used to control when we can actually free the inode. We then can use
> > this reference count in all the places where we may reference the inode. This
> > removes another huge footgun, having ways to access the inode itself without
> > having an actual reference to it. The writeback code is one of the main places
> > where we see this. Inodes end up on all sorts of lists here without a proper
> > reference count. This allows us to protect the inode from being freed by giving
> > this an other code mechanisms to protect their access to the inode.
> >
> > With this we can separate the concept of the inode being usable, and the inode
> > being freed. The next part of the patch series is to stop allowing for inodes
> > to have an i_count of 0 and still be viable. This comes with some warts. The
> > biggest wart is now if we choose to cache inodes in the LRU list we have to
> > remove the inode from the LRU list if we access it once it's on the LRU list.
> > This will result in more contention on the lru list lock, but in practice we
> > rarely have inodes that do not have a dentry, and if we do that inode is not
> > long for this world.
> >
> > With not allowing inodes to hit a refcount of 0, we can take advantage of that
> > common pattern of using refcount_inc_not_zero() in all of the lockless places
> > where we do inode lookup in cache. From there we can change all the users who
> > check I_WILL_FREE or I_FREEING to simply check the i_count. If it is 0 then they
> > aren't allowed to do their work, othrwise they can proceed as normal.
> >
> > With all of that in place we can finally remove these two flags.
> >
> > This is a large series, but it is mostly mechanical. I've kept the patches very
> > small, to make it easy to review and logic about each change. I have run this
> > through fstests for btrfs and ext4, xfs is currently going. I wanted to get this
> > out for review to make sure this big design changes are reasonable to everybody.
> >
> > The series is based on vfs/vfs.all branch, which is based on 6.9-rc1. Thanks,
>
> I so hope you meant 6.17-rc1 because otherwise I did something very very
> wrong. :)
Stupid AI hallucination...
Josef
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 02/50] fs: make the i_state flags an enum
2025-08-22 11:08 ` Christian Brauner
@ 2025-08-22 13:31 ` Josef Bacik
2025-08-22 14:36 ` David Sterba
1 sibling, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-22 13:31 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Fri, Aug 22, 2025 at 01:08:07PM +0200, Christian Brauner wrote:
> On Thu, Aug 21, 2025 at 04:18:13PM -0400, Josef Bacik wrote:
> > Adjusting i_state flags always means updating the values manually. Bring
> > these forward into the 2020's and make a nice clean macro for defining
> > the i_state values as an enum, providing __ variants for the cases where
> > we need the bit position instead of the actual value, and leaving the
> > actual NAME as the 1U << bit value.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> > include/linux/fs.h | 234 +++++++++++++++++++++++----------------------
> > 1 file changed, 122 insertions(+), 112 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 9a1ce67eed33..e741dc453c2c 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -665,6 +665,127 @@ is_uncached_acl(struct posix_acl *acl)
> > #define IOP_MGTIME 0x0020
> > #define IOP_CACHED_LINK 0x0040
> >
> > +/*
> > + * Inode state bits. Protected by inode->i_lock
> > + *
> > + * Four bits determine the dirty state of the inode: I_DIRTY_SYNC,
> > + * I_DIRTY_DATASYNC, I_DIRTY_PAGES, and I_DIRTY_TIME.
> > + *
> > + * Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
> > + * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
> > + * various stages of removing an inode.
> > + *
> > + * Two bits are used for locking and completion notification, I_NEW and I_SYNC.
> > + *
> > + * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
> > + * fdatasync() (unless I_DIRTY_DATASYNC is also set).
> > + * Timestamp updates are the usual cause.
> > + * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of
> > + * these changes separately from I_DIRTY_SYNC so that we
> > + * don't have to write inode on fdatasync() when only
> > + * e.g. the timestamps have changed.
> > + * I_DIRTY_PAGES Inode has dirty pages. Inode itself may be clean.
> > + * I_DIRTY_TIME The inode itself has dirty timestamps, and the
> > + * lazytime mount option is enabled. We keep track of this
> > + * separately from I_DIRTY_SYNC in order to implement
> > + * lazytime. This gets cleared if I_DIRTY_INODE
> > + * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But
> > + * I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already
> > + * in place because writeback might already be in progress
> > + * and we don't want to lose the time update
> > + * I_NEW Serves as both a mutex and completion notification.
> > + * New inodes set I_NEW. If two processes both create
> > + * the same inode, one of them will release its inode and
> > + * wait for I_NEW to be released before returning.
> > + * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can
> > + * also cause waiting on I_NEW, without I_NEW actually
> > + * being set. find_inode() uses this to prevent returning
> > + * nearly-dead inodes.
> > + * I_WILL_FREE Must be set when calling write_inode_now() if i_count
> > + * is zero. I_FREEING must be set when I_WILL_FREE is
> > + * cleared.
> > + * I_FREEING Set when inode is about to be freed but still has dirty
> > + * pages or buffers attached or the inode itself is still
> > + * dirty.
> > + * I_CLEAR Added by clear_inode(). In this state the inode is
> > + * clean and can be destroyed. Inode keeps I_FREEING.
> > + *
> > + * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are
> > + * prohibited for many purposes. iget() must wait for
> > + * the inode to be completely released, then create it
> > + * anew. Other functions will just ignore such inodes,
> > + * if appropriate. I_NEW is used for waiting.
> > + *
> > + * I_SYNC Writeback of inode is running. The bit is set during
> > + * data writeback, and cleared with a wakeup on the bit
> > + * address once it is done. The bit is also used to pin
> > + * the inode in memory for flusher thread.
> > + *
> > + * I_REFERENCED Marks the inode as recently references on the LRU list.
> > + *
> > + * I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to
> > + * synchronize competing switching instances and to tell
> > + * wb stat updates to grab the i_pages lock. See
> > + * inode_switch_wbs_work_fn() for details.
> > + *
> > + * I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper
> > + * and work dirs among overlayfs mounts.
> > + *
> > + * I_CREATING New object's inode in the middle of setting up.
> > + *
> > + * I_DONTCACHE Evict inode as soon as it is not used anymore.
> > + *
> > + * I_SYNC_QUEUED Inode is queued in b_io or b_more_io writeback lists.
> > + * Used to detect that mark_inode_dirty() should not move
> > + * inode between dirty lists.
> > + *
> > + * I_PINNING_FSCACHE_WB Inode is pinning an fscache object for writeback.
> > + *
> > + * I_LRU_ISOLATING Inode is pinned being isolated from LRU without holding
> > + * i_count.
> > + *
> > + * Q: What is the difference between I_WILL_FREE and I_FREEING?
> > + *
> > + * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
> > + * upon. There's one free address left.
> > + */
> > +
> > +/*
> > + * As simple macro to define the inode state bits, __NAME will be the bit value
> > + * (0, 1, 2, ...), and NAME will be the bit mask (1U << __NAME). The __NAME_SEQ
> > + * is used to reset the sequence number so the next name gets the next bit value
> > + * in the sequence.
> > + */
> > +#define INODE_BIT(name) \
> > + __ ## name, \
> > + name = (1U << __ ## name), \
> > + __ ## name ## _SEQ = __ ## name
>
> I'm not sure if this is the future we want :D
> I think it's harder to parse than what we have now.
>
> > +
> > +enum inode_state_bits {
> > + INODE_BIT(I_NEW),
> > + INODE_BIT(I_SYNC),
> > + INODE_BIT(I_LRU_ISOLATING),
> > + INODE_BIT(I_DIRTY_SYNC),
> > + INODE_BIT(I_DIRTY_DATASYNC),
> > + INODE_BIT(I_DIRTY_PAGES),
> > + INODE_BIT(I_WILL_FREE),
> > + INODE_BIT(I_FREEING),
> > + INODE_BIT(I_CLEAR),
> > + INODE_BIT(I_REFERENCED),
> > + INODE_BIT(I_LINKABLE),
> > + INODE_BIT(I_DIRTY_TIME),
> > + INODE_BIT(I_WB_SWITCH),
> > + INODE_BIT(I_OVL_INUSE),
> > + INODE_BIT(I_CREATING),
> > + INODE_BIT(I_DONTCACHE),
> > + INODE_BIT(I_SYNC_QUEUED),
> > + INODE_BIT(I_PINNING_NETFS_WB),
> > +};
>
> Good idea but I really dislike this macro indirection.
> Can't we just do the really boring?
>
> enum inode_state_bits {
> __I_BIT_NEW = 0U
> __I_BIT_SYNC = 1U
> __I_BIT_LRU_ISOLATING = 2U
> }
>
> enum inode_state_flags_t {
> I_NEW = (1U << __I_BIT_NEW)
> I_SYNC = (1U << __I_BIT_SYNC)
> I_LRU_ISOLATING = (1U << __I_BIT_LRU_ISOLATING)
> I_DIRTY_SYNC = (1U << 3)
> I_DIRTY_DATASYNC = (1U << 4)
> I_DIRTY_PAGES = (1U << 5)
> I_WILL_FREE = (1U << 6)
> I_FREEING = (1U << 7)
> I_CLEAR = (1U << 8)
> I_REFERENCED = (1U << 9)
> I_LINKABLE = (1U << 10)
> I_DIRTY_TIME = (1U << 11)
> I_WB_SWITCH = (1U << 12)
> I_OVL_INUSE = (1U << 13)
> I_CREATING = (1U << 14)
> I_DONTCACHE = (1U << 15)
> I_SYNC_QUEUED = (1U << 16)
> I_PINNING_NETFS_WB = (1U << 17)
> };
>
> Note that inode_state_wait_address() and that only works on four bits so
> we can't really use higher bits anyway without switching back to a
> scheme where we have to use unsigned long and waste for bytes for
> nothing on 64 bit.
>
> With that out of the way,
>
> Reviewed-by: Christian Brauner <brauner@kernel.org>
Yup totally, I'll fix this and add your RB. Thanks!
Josef
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 20/50] fs: convert i_count to refcount_t
2025-08-21 20:18 ` [PATCH 20/50] fs: convert i_count to refcount_t Josef Bacik
2025-08-22 12:10 ` Amir Goldstein
@ 2025-08-22 13:56 ` kernel test robot
2025-08-25 11:03 ` Christian Brauner
2 siblings, 0 replies; 85+ messages in thread
From: kernel test robot @ 2025-08-22 13:56 UTC (permalink / raw)
To: Josef Bacik, linux-fsdevel, linux-btrfs, kernel-team, linux-ext4,
linux-xfs, brauner, viro
Cc: llvm, oe-kbuild-all
Hi Josef,
kernel test robot noticed the following build errors:
[auto build test ERROR on brauner-vfs/vfs.all]
[also build test ERROR on kdave/for-next trace/for-next xfs-linux/for-next linus/master v6.17-rc2 next-20250822]
[cannot apply to tytso-ext4/dev]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Josef-Bacik/fs-add-an-i_obj_count-refcount-to-the-inode/20250822-045428
base: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.all
patch link: https://lore.kernel.org/r/6a12e35a078d765b50bc7ced7030d6cd98065528.1755806649.git.josef%40toxicpanda.com
patch subject: [PATCH 20/50] fs: convert i_count to refcount_t
config: x86_64-randconfig-003-20250822 (https://download.01.org/0day-ci/archive/20250822/202508222128.KZDs9G7n-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250822/202508222128.KZDs9G7n-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508222128.KZDs9G7n-lkp@intel.com/
All errors (new ones prefixed by >>):
>> fs/smb/client/inode.c:2782:37: error: no member named 'counter' in 'struct refcount_struct'
2782 | full_path, inode, inode->i_count.counter,
| ~~~~~~~~~~~~~~ ^
fs/smb/client/cifs_debug.h:77:36: note: expanded from macro 'cifs_dbg'
77 | cifs_dbg_func(once, type, fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
fs/smb/client/cifs_debug.h:66:23: note: expanded from macro 'cifs_dbg_func'
66 | __FILE__, ##__VA_ARGS__); \
| ^~~~~~~~~~~
include/linux/printk.h:693:38: note: expanded from macro 'pr_debug_once'
693 | no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
| ^~~~~~~~~~~
include/linux/printk.h:135:18: note: expanded from macro 'no_printk'
135 | _printk(fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
>> fs/smb/client/inode.c:2782:37: error: no member named 'counter' in 'struct refcount_struct'
2782 | full_path, inode, inode->i_count.counter,
| ~~~~~~~~~~~~~~ ^
fs/smb/client/cifs_debug.h:77:36: note: expanded from macro 'cifs_dbg'
77 | cifs_dbg_func(once, type, fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
fs/smb/client/cifs_debug.h:68:38: note: expanded from macro 'cifs_dbg_func'
68 | pr_err_ ## ratefunc("VFS: " fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
include/linux/printk.h:670:38: note: expanded from macro 'pr_err_once'
670 | printk_once(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
| ^~~~~~~~~~~
note: (skipping 2 expansions in backtrace; use -fmacro-backtrace-limit=0 to see all)
include/linux/once_lite.h:31:9: note: expanded from macro 'DO_ONCE_LITE_IF'
31 | func(__VA_ARGS__); \
| ^~~~~~~~~~~
include/linux/printk.h:514:60: note: expanded from macro 'printk'
514 | #define printk(fmt, ...) printk_index_wrap(_printk, fmt, ##__VA_ARGS__)
| ^~~~~~~~~~~
include/linux/printk.h:486:19: note: expanded from macro 'printk_index_wrap'
486 | _p_func(_fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
>> fs/smb/client/inode.c:2782:37: error: no member named 'counter' in 'struct refcount_struct'
2782 | full_path, inode, inode->i_count.counter,
| ~~~~~~~~~~~~~~ ^
fs/smb/client/cifs_debug.h:77:36: note: expanded from macro 'cifs_dbg'
77 | cifs_dbg_func(once, type, fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
fs/smb/client/cifs_debug.h:70:32: note: expanded from macro 'cifs_dbg_func'
70 | pr_debug_ ## ratefunc(fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
include/linux/printk.h:693:38: note: expanded from macro 'pr_debug_once'
693 | no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
| ^~~~~~~~~~~
include/linux/printk.h:135:18: note: expanded from macro 'no_printk'
135 | _printk(fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
>> fs/smb/client/inode.c:2782:37: error: no member named 'counter' in 'struct refcount_struct'
2782 | full_path, inode, inode->i_count.counter,
| ~~~~~~~~~~~~~~ ^
fs/smb/client/cifs_debug.h:79:43: note: expanded from macro 'cifs_dbg'
79 | cifs_dbg_func(ratelimited, type, fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
fs/smb/client/cifs_debug.h:66:23: note: expanded from macro 'cifs_dbg_func'
66 | __FILE__, ##__VA_ARGS__); \
| ^~~~~~~~~~~
include/linux/printk.h:758:38: note: expanded from macro 'pr_debug_ratelimited'
758 | no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
| ^~~~~~~~~~~
include/linux/printk.h:135:18: note: expanded from macro 'no_printk'
135 | _printk(fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
>> fs/smb/client/inode.c:2782:37: error: no member named 'counter' in 'struct refcount_struct'
2782 | full_path, inode, inode->i_count.counter,
| ~~~~~~~~~~~~~~ ^
fs/smb/client/cifs_debug.h:79:43: note: expanded from macro 'cifs_dbg'
79 | cifs_dbg_func(ratelimited, type, fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
fs/smb/client/cifs_debug.h:68:38: note: expanded from macro 'cifs_dbg_func'
68 | pr_err_ ## ratefunc("VFS: " fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
include/linux/printk.h:722:45: note: expanded from macro 'pr_err_ratelimited'
722 | printk_ratelimited(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
| ^~~~~~~~~~~
include/linux/printk.h:708:17: note: expanded from macro 'printk_ratelimited'
708 | printk(fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
include/linux/printk.h:514:60: note: expanded from macro 'printk'
514 | #define printk(fmt, ...) printk_index_wrap(_printk, fmt, ##__VA_ARGS__)
| ^~~~~~~~~~~
include/linux/printk.h:486:19: note: expanded from macro 'printk_index_wrap'
486 | _p_func(_fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
>> fs/smb/client/inode.c:2782:37: error: no member named 'counter' in 'struct refcount_struct'
2782 | full_path, inode, inode->i_count.counter,
| ~~~~~~~~~~~~~~ ^
fs/smb/client/cifs_debug.h:79:43: note: expanded from macro 'cifs_dbg'
79 | cifs_dbg_func(ratelimited, type, fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
fs/smb/client/cifs_debug.h:70:32: note: expanded from macro 'cifs_dbg_func'
70 | pr_debug_ ## ratefunc(fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
include/linux/printk.h:758:38: note: expanded from macro 'pr_debug_ratelimited'
758 | no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
| ^~~~~~~~~~~
include/linux/printk.h:135:18: note: expanded from macro 'no_printk'
135 | _printk(fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~
6 errors generated.
vim +2782 fs/smb/client/inode.c
abab095d1fd259 fs/cifs/inode.c Jeff Layton 2010-02-12 2755
6feb9891da4f8b fs/cifs/inode.c Pavel Shilovsky 2011-04-07 2756 int cifs_revalidate_dentry_attr(struct dentry *dentry)
df2cf170c823ba fs/cifs/inode.c Jeff Layton 2010-02-12 2757 {
6d5786a34d98bf fs/cifs/inode.c Pavel Shilovsky 2012-06-20 2758 unsigned int xid;
df2cf170c823ba fs/cifs/inode.c Jeff Layton 2010-02-12 2759 int rc = 0;
2b0143b5c986be fs/cifs/inode.c David Howells 2015-03-17 2760 struct inode *inode = d_inode(dentry);
df2cf170c823ba fs/cifs/inode.c Jeff Layton 2010-02-12 2761 struct super_block *sb = dentry->d_sb;
f6a9bc336b600e fs/cifs/inode.c Al Viro 2021-03-05 2762 const char *full_path;
f6a9bc336b600e fs/cifs/inode.c Al Viro 2021-03-05 2763 void *page;
fc513fac56e1b6 fs/cifs/inode.c Ronnie Sahlberg 2020-02-19 2764 int count = 0;
df2cf170c823ba fs/cifs/inode.c Jeff Layton 2010-02-12 2765
df2cf170c823ba fs/cifs/inode.c Jeff Layton 2010-02-12 2766 if (inode == NULL)
df2cf170c823ba fs/cifs/inode.c Jeff Layton 2010-02-12 2767 return -ENOENT;
^1da177e4c3f41 fs/cifs/inode.c Linus Torvalds 2005-04-16 2768
ed8561fa1d12b4 fs/cifs/inode.c Ronnie Sahlberg 2021-03-09 2769 if (!cifs_dentry_needs_reval(dentry))
6feb9891da4f8b fs/cifs/inode.c Pavel Shilovsky 2011-04-07 2770 return rc;
6feb9891da4f8b fs/cifs/inode.c Pavel Shilovsky 2011-04-07 2771
6d5786a34d98bf fs/cifs/inode.c Pavel Shilovsky 2012-06-20 2772 xid = get_xid();
^1da177e4c3f41 fs/cifs/inode.c Linus Torvalds 2005-04-16 2773
f6a9bc336b600e fs/cifs/inode.c Al Viro 2021-03-05 2774 page = alloc_dentry_path();
f6a9bc336b600e fs/cifs/inode.c Al Viro 2021-03-05 2775 full_path = build_path_from_dentry(dentry, page);
f6a9bc336b600e fs/cifs/inode.c Al Viro 2021-03-05 2776 if (IS_ERR(full_path)) {
f6a9bc336b600e fs/cifs/inode.c Al Viro 2021-03-05 2777 rc = PTR_ERR(full_path);
6feb9891da4f8b fs/cifs/inode.c Pavel Shilovsky 2011-04-07 2778 goto out;
^1da177e4c3f41 fs/cifs/inode.c Linus Torvalds 2005-04-16 2779 }
^1da177e4c3f41 fs/cifs/inode.c Linus Torvalds 2005-04-16 2780
f96637be081141 fs/cifs/inode.c Joe Perches 2013-05-04 2781 cifs_dbg(FYI, "Update attributes: %s inode 0x%p count %d dentry: 0x%p d_time %ld jiffies %ld\n",
f96637be081141 fs/cifs/inode.c Joe Perches 2013-05-04 @2782 full_path, inode, inode->i_count.counter,
a00be0e31f8df4 fs/cifs/inode.c Miklos Szeredi 2016-09-16 2783 dentry, cifs_get_time(dentry), jiffies);
^1da177e4c3f41 fs/cifs/inode.c Linus Torvalds 2005-04-16 2784
fc513fac56e1b6 fs/cifs/inode.c Ronnie Sahlberg 2020-02-19 2785 again:
102466f303ffcd fs/smb/client/inode.c Paulo Alcantara 2023-11-25 2786 if (cifs_sb_master_tcon(CIFS_SB(sb))->posix_extensions) {
102466f303ffcd fs/smb/client/inode.c Paulo Alcantara 2023-11-25 2787 rc = smb311_posix_get_inode_info(&inode, full_path,
102466f303ffcd fs/smb/client/inode.c Paulo Alcantara 2023-11-25 2788 NULL, sb, xid);
102466f303ffcd fs/smb/client/inode.c Paulo Alcantara 2023-11-25 2789 } else if (cifs_sb_master_tcon(CIFS_SB(sb))->unix_ext) {
df2cf170c823ba fs/cifs/inode.c Jeff Layton 2010-02-12 2790 rc = cifs_get_inode_info_unix(&inode, full_path, sb, xid);
102466f303ffcd fs/smb/client/inode.c Paulo Alcantara 2023-11-25 2791 } else {
df2cf170c823ba fs/cifs/inode.c Jeff Layton 2010-02-12 2792 rc = cifs_get_inode_info(&inode, full_path, NULL, sb,
df2cf170c823ba fs/cifs/inode.c Jeff Layton 2010-02-12 2793 xid, NULL);
102466f303ffcd fs/smb/client/inode.c Paulo Alcantara 2023-11-25 2794 }
fc513fac56e1b6 fs/cifs/inode.c Ronnie Sahlberg 2020-02-19 2795 if (rc == -EAGAIN && count++ < 10)
fc513fac56e1b6 fs/cifs/inode.c Ronnie Sahlberg 2020-02-19 2796 goto again;
6feb9891da4f8b fs/cifs/inode.c Pavel Shilovsky 2011-04-07 2797 out:
f6a9bc336b600e fs/cifs/inode.c Al Viro 2021-03-05 2798 free_dentry_path(page);
6d5786a34d98bf fs/cifs/inode.c Pavel Shilovsky 2012-06-20 2799 free_xid(xid);
fc513fac56e1b6 fs/cifs/inode.c Ronnie Sahlberg 2020-02-19 2800
^1da177e4c3f41 fs/cifs/inode.c Linus Torvalds 2005-04-16 2801 return rc;
^1da177e4c3f41 fs/cifs/inode.c Linus Torvalds 2005-04-16 2802 }
^1da177e4c3f41 fs/cifs/inode.c Linus Torvalds 2005-04-16 2803
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 02/50] fs: make the i_state flags an enum
2025-08-22 11:08 ` Christian Brauner
2025-08-22 13:31 ` Josef Bacik
@ 2025-08-22 14:36 ` David Sterba
1 sibling, 0 replies; 85+ messages in thread
From: David Sterba @ 2025-08-22 14:36 UTC (permalink / raw)
To: Christian Brauner
Cc: Josef Bacik, linux-fsdevel, linux-btrfs, kernel-team, linux-ext4,
linux-xfs, viro
On Fri, Aug 22, 2025 at 01:08:07PM +0200, Christian Brauner wrote:
> On Thu, Aug 21, 2025 at 04:18:13PM -0400, Josef Bacik wrote:
> > +/*
> > + * As simple macro to define the inode state bits, __NAME will be the bit value
> > + * (0, 1, 2, ...), and NAME will be the bit mask (1U << __NAME). The __NAME_SEQ
> > + * is used to reset the sequence number so the next name gets the next bit value
> > + * in the sequence.
> > + */
> > +#define INODE_BIT(name) \
> > + __ ## name, \
> > + name = (1U << __ ## name), \
> > + __ ## name ## _SEQ = __ ## name
>
> I'm not sure if this is the future we want :D
> I think it's harder to parse than what we have now.
If it would be a generic sounding name like ENUM_BIT
https://elixir.bootlin.com/linux/v6.16.2/source/fs/btrfs/misc.h#L16
one does not have to parse anything and the meaning is understandable in
the context of enum. We've converted everything to that in btrfs and
it's convenient to change just one line when adding or removing entries,
not caring about the exact values.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 02/50] fs: make the i_state flags an enum
2025-08-22 11:18 ` Sun YangKai
2025-08-22 11:42 ` [PATCH 02/50] " Alan Huang
@ 2025-08-22 14:40 ` Josef Bacik
1 sibling, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-22 14:40 UTC (permalink / raw)
To: Sun YangKai
Cc: brauner, kernel-team, linux-btrfs, linux-ext4, linux-fsdevel,
linux-xfs, viro
On Fri, Aug 22, 2025 at 07:18:26PM +0800, Sun YangKai wrote:
> Hi Josef,
>
> Sorry for the bothering, and I hope this isn't too far off-topic for the
> current patch series discussion.
>
> I recently learned about the x-macro trick and was wondering if it might be
> suitable for use in this context since we are rewriting this. I'd appreciate
> any thoughts or feedback on whether this approach could be applied here.
>
> Thanks in advance for your insights!
That's super useful, thanks for that! Christian wants me to do it a different
way so I'm going to do that. But I'll definitely keep this in mind for code he
can't see ;). Thanks,
Josef
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 14/50] fs: maintain a list of pinned inodes
2025-08-21 20:18 ` [PATCH 14/50] fs: maintain a list of pinned inodes Josef Bacik
@ 2025-08-22 14:55 ` Christian Brauner
0 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-22 14:55 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:25PM -0400, Josef Bacik wrote:
> Currently we have relied on dirty inodes and inodes with cache on them
> to simply be left hanging around on the system outside of an LRU. The
> only way to make sure these inodes are eventually reclaimed is because
> dirty writeback will grab a reference on the inode and then iput it when
> it's done, potentially getting it on the LRU. For the cached case the
> page cache deletion path will call inode_add_lru when the inode no
> longer has cached pages in order to make sure the inode object can be
> freed eventually. In the unmount case we walk all inodes and free them
> so this all works out fine.
>
> But we want to eliminate 0 i_count objects as a concept, so we need a
> mechanism to hold a reference on these pinned inodes. To that end, add a
> list to the super block that contains any inodes that are cached for one
> reason or another.
>
> When we call inode_add_lru(), if the inode falls into one of these
> categories, we will add it to the cached inode list and hold an
> i_obj_count reference. If the inode does not fall into one of these
> categories it will be moved to the normal LRU, which is already holds an
> i_obj_count reference.
>
> The dirty case we will delete it from the LRU if it is on one, and then
> the iput after the writeout will make sure it's placed onto the correct
> list at that point.
>
> The page cache case will migrate it when it calls inode_add_lru() when
> deleting pages from the page cache.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/fs-writeback.c | 8 +++
> fs/inode.c | 102 +++++++++++++++++++++++++++++--
> fs/internal.h | 1 +
> fs/super.c | 3 +
> include/linux/fs.h | 11 ++++
> include/trace/events/writeback.h | 3 +-
> 6 files changed, 121 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index d2e1fb1a0787..111a9d8215bf 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -2736,6 +2736,14 @@ static void wait_sb_inodes(struct super_block *sb)
> continue;
> }
> __iget(inode);
> +
> + /*
> + * We could have potentially ended up on the cached LRU list, so
> + * remove ourselves from this list now that we have a reference,
> + * the iput will handle placing it back on the appropriate LRU
> + * list if necessary.
> + */
> + inode_lru_list_del(inode);
> spin_unlock(&inode->i_lock);
> rcu_read_unlock();
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 94769b356224..adcba0a4d776 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -319,6 +319,23 @@ void free_inode_nonrcu(struct inode *inode)
> }
> EXPORT_SYMBOL(free_inode_nonrcu);
>
> +/*
> + * Some inodes need to stay pinned in memory because they are dirty or there are
> + * cached pages that the VM wants to keep around to avoid thrashing. This does
> + * the appropriate checks to see if we want to sheild this inode from periodic
> + * reclaim. Must be called with ->i_lock held.
> + */
> +static bool inode_needs_cached(struct inode *inode)
> +{
> + lockdep_assert_held(&inode->i_lock);
> +
> + if (inode->i_state & (I_DIRTY_ALL | I_SYNC))
> + return true;
> + if (!mapping_shrinkable(&inode->i_data))
> + return true;
> + return false;
> +}
> +
> static void i_callback(struct rcu_head *head)
> {
> struct inode *inode = container_of(head, struct inode, i_rcu);
> @@ -532,20 +549,67 @@ void ihold(struct inode *inode)
> }
> EXPORT_SYMBOL(ihold);
>
> +static void inode_add_cached_lru(struct inode *inode)
> +{
> + lockdep_assert_held(&inode->i_lock);
> +
> + if (inode->i_state & I_CACHED_LRU)
> + return;
> + if (!list_empty(&inode->i_lru))
> + return;
> +
> + inode->i_state |= I_CACHED_LRU;
> + spin_lock(&inode->i_sb->s_cached_inodes_lock);
> + list_add(&inode->i_lru, &inode->i_sb->s_cached_inodes);
> + spin_unlock(&inode->i_sb->s_cached_inodes_lock);
> + iobj_get(inode);
> +}
For mere correctness you likely want the iobj_get() before you're adding
it to the list so it ends up on the list with the i_obj_count already
bumped.
> +
> +static bool __inode_del_cached_lru(struct inode *inode)
> +{
> + lockdep_assert_held(&inode->i_lock);
> +
> + if (!(inode->i_state & I_CACHED_LRU))
> + return false;
> +
> + inode->i_state &= ~I_CACHED_LRU;
> + spin_lock(&inode->i_sb->s_cached_inodes_lock);
> + list_del_init(&inode->i_lru);
> + spin_unlock(&inode->i_sb->s_cached_inodes_lock);
> + return true;
> +}
> +
> +static bool inode_del_cached_lru(struct inode *inode)
> +{
> + if (__inode_del_cached_lru(inode)) {
> + iobj_put(inode);
> + return true;
> + }
> + return false;
> +}
> +
> static void __inode_add_lru(struct inode *inode, bool rotate)
> {
> - if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE))
> + bool need_ref = true;
> +
> + lockdep_assert_held(&inode->i_lock);
> +
> + if (inode->i_state & (I_FREEING | I_WILL_FREE))
> return;
> if (atomic_read(&inode->i_count))
Btw, we have a bunch of atomic_read() calls on i_count. We should really
have a helper for that just like we do for files. So we should add
icount() or something similarly named. Accessing reference counts
directly should ideally never happen...
> return;
> if (!(inode->i_sb->s_flags & SB_ACTIVE))
> return;
> - if (!mapping_shrinkable(&inode->i_data))
> + if (inode_needs_cached(inode)) {
> + inode_add_cached_lru(inode);
> return;
> + }
>
> + need_ref = __inode_del_cached_lru(inode) == false;
> if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
> - iobj_get(inode);
> inode->i_state |= I_LRU;
> + if (need_ref)
> + iobj_get(inode);
> this_cpu_inc(nr_unused);
> } else if (rotate) {
> inode->i_state |= I_REFERENCED;
> @@ -573,8 +637,19 @@ void inode_add_lru(struct inode *inode)
> __inode_add_lru(inode, false);
> }
>
> -static void inode_lru_list_del(struct inode *inode)
> +/*
> + * Caller must be holding it's own i_count reference on this inode in order to
> + * prevent this being the final iput.
> + *
> + * Needs inode->i_lock held.
> + */
> +void inode_lru_list_del(struct inode *inode)
> {
> + lockdep_assert_held(&inode->i_lock);
> +
> + if (inode_del_cached_lru(inode))
> + return;
> +
> if (!(inode->i_state & I_LRU))
> return;
>
> @@ -950,6 +1025,22 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> if (!spin_trylock(&inode->i_lock))
> return LRU_SKIP;
>
> + /*
> + * This inode is either dirty or has page cache we want to keep around,
> + * so move it to the cached list.
> + *
> + * We drop the extra i_obj_count reference we grab when adding it to the
> + * cached lru.
> + */
> + if (inode_needs_cached(inode)) {
> + list_lru_isolate(lru, &inode->i_lru);
> + inode_add_cached_lru(inode);
> + iobj_put(inode);
> + spin_unlock(&inode->i_lock);
> + this_cpu_dec(nr_unused);
> + return LRU_REMOVED;
> + }
> +
> /*
> * Inodes can get referenced, redirtied, or repopulated while
> * they're already on the LRU, and this can make them
> @@ -957,8 +1048,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> * sync, or the last page cache deletion will requeue them.
> */
> if (atomic_read(&inode->i_count) ||
> - (inode->i_state & ~I_REFERENCED) ||
> - !mapping_shrinkable(&inode->i_data)) {
> + (inode->i_state & ~I_REFERENCED)) {
> list_lru_isolate(lru, &inode->i_lru);
> inode->i_state &= ~I_LRU;
> spin_unlock(&inode->i_lock);
> diff --git a/fs/internal.h b/fs/internal.h
> index 38e8aab27bbd..17ecee7056d5 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -207,6 +207,7 @@ extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
> int dentry_needs_remove_privs(struct mnt_idmap *, struct dentry *dentry);
> bool in_group_or_capable(struct mnt_idmap *idmap,
> const struct inode *inode, vfsgid_t vfsgid);
> +void inode_lru_list_del(struct inode *inode);
>
> /*
> * fs-writeback.c
> diff --git a/fs/super.c b/fs/super.c
> index a038848e8d1f..bf3e6d9055af 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -364,6 +364,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> spin_lock_init(&s->s_inode_list_lock);
> INIT_LIST_HEAD(&s->s_inodes_wb);
> spin_lock_init(&s->s_inode_wblist_lock);
> + INIT_LIST_HEAD(&s->s_cached_inodes);
> + spin_lock_init(&s->s_cached_inodes_lock);
>
> s->s_count = 1;
> atomic_set(&s->s_active, 1);
> @@ -409,6 +411,7 @@ static void __put_super(struct super_block *s)
> WARN_ON(s->s_dentry_lru.node);
> WARN_ON(s->s_inode_lru.node);
> WARN_ON(!list_empty(&s->s_mounts));
> + WARN_ON(!list_empty(&s->s_cached_inodes));
> call_rcu(&s->rcu, destroy_super_rcu);
> }
> }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 509e696a4df0..8384ed81a5ad 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -749,6 +749,9 @@ is_uncached_acl(struct posix_acl *acl)
> * ->i_lru is on the LRU and those that are using ->i_lru
> * for some other means.
> *
> + * I_CACHED_LRU Inode is cached because it is dirty or isn't shrinkable,
> + * and thus is on the s_cached_inode_lru list.
> + *
> * Q: What is the difference between I_WILL_FREE and I_FREEING?
> *
> * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
> @@ -786,6 +789,7 @@ enum inode_state_bits {
> INODE_BIT(I_SYNC_QUEUED),
> INODE_BIT(I_PINNING_NETFS_WB),
> INODE_BIT(I_LRU),
> + INODE_BIT(I_CACHED_LRU),
> };
>
> #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
> @@ -1584,6 +1588,13 @@ struct super_block {
>
> spinlock_t s_inode_wblist_lock;
> struct list_head s_inodes_wb; /* writeback inodes */
> +
> + /*
> + * Cached inodes, any inodes that their reference is held by another
> + * mechanism, such as dirty inodes or unshrinkable inodes.
> + */
> + spinlock_t s_cached_inodes_lock;
> + struct list_head s_cached_inodes;
> } __randomize_layout;
>
> static inline struct user_namespace *i_user_ns(const struct inode *inode)
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index 486f85aca84d..6949329c744a 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -29,7 +29,8 @@
> {I_SYNC_QUEUED, "I_SYNC_QUEUED"}, \
> {I_PINNING_NETFS_WB, "I_PINNING_NETFS_WB"}, \
> {I_LRU_ISOLATING, "I_LRU_ISOLATING"}, \
> - {I_LRU, "I_LRU"} \
> + {I_LRU, "I_LRU"}, \
> + {I_CACHED_LRU, "I_CACHED_LRU"} \
> )
>
> /* enums need to be exported to user space */
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 15/50] fs: delete the inode from the LRU list on lookup
2025-08-21 20:18 ` [PATCH 15/50] fs: delete the inode from the LRU list on lookup Josef Bacik
@ 2025-08-22 15:27 ` Christian Brauner
0 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-22 15:27 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:26PM -0400, Josef Bacik wrote:
> When we move to holding a full reference on the inode when it is on an
> LRU list we need to have a mechanism to re-run the LRU add logic. The
> use case for this is btrfs's snapshot delete, we will lookup all the
> inodes and try to drop them, but if they're on the LRU we will not call
> ->drop_inode() because their refcount will be elevated, so we won't know
> that we need to drop the inode.
>
> Fix this by simply removing the inode from it's respective LRU list when
> we grab a reference to it in a way that we have active users. This will
> ensure that the logic to add the inode to the LRU or drop the inode will
> be run on the final iput from the user.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index adcba0a4d776..72981b890ec6 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1146,6 +1146,7 @@ static struct inode *find_inode(struct super_block *sb,
> return ERR_PTR(-ESTALE);
> }
> __iget(inode);
> + inode_lru_list_del(inode);
> spin_unlock(&inode->i_lock);
> rcu_read_unlock();
> return inode;
> @@ -1187,6 +1188,7 @@ static struct inode *find_inode_fast(struct super_block *sb,
> return ERR_PTR(-ESTALE);
> }
> __iget(inode);
> + inode_lru_list_del(inode);
> spin_unlock(&inode->i_lock);
> rcu_read_unlock();
> return inode;
> @@ -1653,6 +1655,7 @@ struct inode *igrab(struct inode *inode)
> spin_lock(&inode->i_lock);
> if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
> __iget(inode);
> + inode_lru_list_del(inode);
> spin_unlock(&inode->i_lock);
> } else {
> spin_unlock(&inode->i_lock);
Interesting, so the previous behavior implies that igrab(),
find_inode(), find_inode_fast() are called on inodes that are hashed and
on an LRU. None of them even raise I_REFERENCED.
I would think that this means that there are callers that grab very
temporary references to inodes that they immediately drop without
wanting to prevent reclaim.
Oh, because btrfs subvolume delete is effectively a recursive directory
removal and that's why that happens? I wonder if there are other users.
So if this regresses someone it would regress btrfs I guess. :)
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 16/50] fs: change evict_inodes to use iput instead of evict directly
2025-08-21 20:18 ` [PATCH 16/50] fs: change evict_inodes to use iput instead of evict directly Josef Bacik
@ 2025-08-25 9:07 ` Christian Brauner
2025-08-25 19:35 ` Josef Bacik
0 siblings, 1 reply; 85+ messages in thread
From: Christian Brauner @ 2025-08-25 9:07 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:27PM -0400, Josef Bacik wrote:
> At evict_inodes() time, we no longer have SB_ACTIVE set, so we can
> easily go through the normal iput path to clear any inodes. Update
I'm a bit lost why SB_ACTIVE is used here as a justification to call
iput(). I think it's because iput_final() would somehow add it back to
the LRU if SB_ACTIVE was still set and the filesystem somehow would
indicate it wouldn't want to drop the inode.
I'm confused where that would even happen. IOW, which filesystem would
indicate "don't drop the inode" even though it's about to vanish. But
anyway, that's probably not important because...
> dispose_list() to check how we need to free the inode, and then grab a
> full reference to the inode while we're looping through the remaining
> inodes, and simply iput them at the end.
>
> Since we're just calling iput we don't really care about the i_count on
> the inode at the current time. Remove the i_count checks and just call
> iput on every inode we find.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 26 +++++++++++---------------
> 1 file changed, 11 insertions(+), 15 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 72981b890ec6..80ad327746a7 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -933,7 +933,7 @@ static void evict(struct inode *inode)
> * Dispose-list gets a local list with local inodes in it, so it doesn't
> * need to worry about list corruption and SMP locks.
> */
> -static void dispose_list(struct list_head *head)
> +static void dispose_list(struct list_head *head, bool for_lru)
> {
> while (!list_empty(head)) {
> struct inode *inode;
> @@ -941,8 +941,12 @@ static void dispose_list(struct list_head *head)
> inode = list_first_entry(head, struct inode, i_lru);
> list_del_init(&inode->i_lru);
>
> - evict(inode);
> - iobj_put(inode);
> + if (for_lru) {
> + evict(inode);
> + iobj_put(inode);
> + } else {
> + iput(inode);
> + }
... Afaict, if we end up in dispose_list() we came from one of two
locations:
(1) prune_icache_sb()
In which case inode_lru_isolate() will have only returned inodes
that prior to your changes would have inode->i_count zero.
(2) evict_inodes()
Similar story, this only hits inodes with inode->i_count zero.
With your change you're adding an increment from zero for (2) via
__iget() so that you always end up with a full refcount, and that is
backing your changes to dispose_list() later.
I don't see the same done for (1) though and so your later call to
iput() drops the reference below zero? It's accidently benign because
iiuc atomic_dec_and_test() will simply tell you that reference count
didn't go to zero and so iput() will back off. But still this should be
fixed if I'm right.
The conversion to iput() is introducing a lot of subtlety in the middle
of the series. If I'm right then the iput() is a always a nop because in
all cases it was an increment from zero. But it isn't really a nop
because we still do stuff like call ->drop_inode() again. Maybe it's
fine because no filesystem would have issues with this but I wouldn't
count on it and also it feels rather unclean to do it this way.
So, under the assumption, that after the increment from zero you did, we
really only have a blatant zombie inode on our hands and we only need to
get rid of the i_count we took make that explicit and do:
if (for_lru) {
evict(inode);
iobj_put(inode);
} else {
/* This inode was always incremented from zero.
* Get rid of that reference without doing anything else.
*/
WARN_ON_ONCE(!atomic_dec_and_test(&inode->i_count));
}
Btw, for the iobj_put() above, I assume that we're not guaranteed that
i_obj_count == 1?
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 17/50] fs: hold a full ref while the inode is on a LRU
2025-08-21 20:18 ` [PATCH 17/50] fs: hold a full ref while the inode is on a LRU Josef Bacik
@ 2025-08-25 9:20 ` Christian Brauner
2025-08-25 10:40 ` Christian Brauner
0 siblings, 1 reply; 85+ messages in thread
From: Christian Brauner @ 2025-08-25 9:20 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:28PM -0400, Josef Bacik wrote:
> We want to eliminate 0 refcount inodes that can be used. To that end,
> make the LRU's hold a full reference on the inode while it is on an LRU
> list. From there we can change the eviction code to always just iput the
> inode, and the LRU operations will just add or drop a full reference
> where appropriate.
>
> We also now must take into account unlink, and drop our LRU reference
> when we go to an nlink of 0. We will also avoid adding inodes with a
> nlink of 0 as they can be reclaimed immediately.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 105 +++++++++++++++++++++++++++++------------------------
> 1 file changed, 57 insertions(+), 48 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 80ad327746a7..de0ec791f9a3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -434,8 +434,18 @@ void drop_nlink(struct inode *inode)
> {
> WARN_ON(inode->i_nlink == 0);
> inode->__i_nlink--;
> - if (!inode->i_nlink)
> + if (!inode->i_nlink) {
> + /*
> + * LRU's hold a full ref on the inode, but if we've unlinked it
> + * then we want the inode to be freed when the last user goes,
> + * so delete the inode from the LRU list.
> + */
> + spin_lock(&inode->i_lock);
> + inode_lru_list_del(inode);
> + spin_unlock(&inode->i_lock);
> +
> atomic_long_inc(&inode->i_sb->s_remove_count);
> + }
As written this doesn't work because you can have callers that have
already acquired inode->i_lock(). For example, afs:
new_inode = d_inode(new_dentry);
if (new_inode) {
spin_lock(&new_inode->i_lock);
if (S_ISDIR(new_inode->i_mode))
clear_nlink(new_inode);
else if (new_inode->i_nlink > 0)
drop_nlink(new_inode);
spin_unlock(&new_inode->i_lock);
}
> }
> EXPORT_SYMBOL(drop_nlink);
>
> @@ -451,6 +461,12 @@ void clear_nlink(struct inode *inode)
> {
> if (inode->i_nlink) {
> inode->__i_nlink = 0;
> +
> + /* See comment in drop_nlink(). */
> + spin_lock(&inode->i_lock);
> + inode_lru_list_del(inode);
> + spin_unlock(&inode->i_lock);
> +
> atomic_long_inc(&inode->i_sb->s_remove_count);
> }
> }
> @@ -555,6 +571,8 @@ static void inode_add_cached_lru(struct inode *inode)
>
> if (inode->i_state & I_CACHED_LRU)
> return;
> + if (inode->__i_nlink == 0)
> + return;
> if (!list_empty(&inode->i_lru))
> return;
>
> @@ -562,7 +580,7 @@ static void inode_add_cached_lru(struct inode *inode)
> spin_lock(&inode->i_sb->s_cached_inodes_lock);
> list_add(&inode->i_lru, &inode->i_sb->s_cached_inodes);
> spin_unlock(&inode->i_sb->s_cached_inodes_lock);
> - iobj_get(inode);
> + __iget(inode);
> }
>
> static bool __inode_del_cached_lru(struct inode *inode)
> @@ -582,7 +600,7 @@ static bool __inode_del_cached_lru(struct inode *inode)
> static bool inode_del_cached_lru(struct inode *inode)
> {
> if (__inode_del_cached_lru(inode)) {
> - iobj_put(inode);
> + iput(inode);
> return true;
> }
> return false;
> @@ -598,6 +616,8 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> return;
> if (atomic_read(&inode->i_count))
> return;
> + if (inode->__i_nlink == 0)
> + return;
> if (!(inode->i_sb->s_flags & SB_ACTIVE))
> return;
> if (inode_needs_cached(inode)) {
> @@ -609,7 +629,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
> inode->i_state |= I_LRU;
> if (need_ref)
> - iobj_get(inode);
> + __iget(inode);
> this_cpu_inc(nr_unused);
> } else if (rotate) {
> inode->i_state |= I_REFERENCED;
> @@ -655,7 +675,7 @@ void inode_lru_list_del(struct inode *inode)
>
> if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
> inode->i_state &= ~I_LRU;
> - iobj_put(inode);
> + iput(inode);
> this_cpu_dec(nr_unused);
> }
> }
> @@ -926,6 +946,7 @@ static void evict(struct inode *inode)
> BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
> }
>
> +static void iput_evict(struct inode *inode);
> /*
> * dispose_list - dispose of the contents of a local list
> * @head: the head of the list to free
> @@ -933,20 +954,14 @@ static void evict(struct inode *inode)
> * Dispose-list gets a local list with local inodes in it, so it doesn't
> * need to worry about list corruption and SMP locks.
> */
> -static void dispose_list(struct list_head *head, bool for_lru)
> +static void dispose_list(struct list_head *head)
> {
> while (!list_empty(head)) {
> struct inode *inode;
>
> inode = list_first_entry(head, struct inode, i_lru);
> list_del_init(&inode->i_lru);
> -
> - if (for_lru) {
> - evict(inode);
> - iobj_put(inode);
> - } else {
> - iput(inode);
> - }
> + iput_evict(inode);
> cond_resched();
> }
> }
> @@ -987,13 +1002,13 @@ void evict_inodes(struct super_block *sb)
> if (need_resched()) {
> spin_unlock(&sb->s_inode_list_lock);
> cond_resched();
> - dispose_list(&dispose, false);
> + dispose_list(&dispose);
> goto again;
> }
> }
> spin_unlock(&sb->s_inode_list_lock);
>
> - dispose_list(&dispose, false);
> + dispose_list(&dispose);
> }
> EXPORT_SYMBOL_GPL(evict_inodes);
>
> @@ -1031,22 +1046,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> if (inode_needs_cached(inode)) {
> list_lru_isolate(lru, &inode->i_lru);
> inode_add_cached_lru(inode);
> - iobj_put(inode);
> - spin_unlock(&inode->i_lock);
> - this_cpu_dec(nr_unused);
> - return LRU_REMOVED;
> - }
> -
> - /*
> - * Inodes can get referenced, redirtied, or repopulated while
> - * they're already on the LRU, and this can make them
> - * unreclaimable for a while. Remove them lazily here; iput,
> - * sync, or the last page cache deletion will requeue them.
> - */
> - if (atomic_read(&inode->i_count) ||
> - (inode->i_state & ~I_REFERENCED)) {
> - list_lru_isolate(lru, &inode->i_lru);
> - inode->i_state &= ~I_LRU;
> + iput(inode);
> spin_unlock(&inode->i_lock);
> this_cpu_dec(nr_unused);
> return LRU_REMOVED;
> @@ -1082,7 +1082,6 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> }
>
> WARN_ON(inode->i_state & I_NEW);
> - inode->i_state |= I_FREEING;
> inode->i_state &= ~I_LRU;
> list_lru_isolate_move(lru, &inode->i_lru, freeable);
> spin_unlock(&inode->i_lock);
> @@ -1104,7 +1103,7 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
>
> freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
> inode_lru_isolate, &freeable);
> - dispose_list(&freeable, true);
> + dispose_list(&freeable);
> return freed;
> }
>
> @@ -1967,7 +1966,7 @@ EXPORT_SYMBOL(generic_delete_inode);
> * in cache if fs is alive, sync and evict if fs is
> * shutting down.
> */
> -static void iput_final(struct inode *inode)
> +static void iput_final(struct inode *inode, bool skip_lru)
> {
> struct super_block *sb = inode->i_sb;
> const struct super_operations *op = inode->i_sb->s_op;
> @@ -1981,7 +1980,7 @@ static void iput_final(struct inode *inode)
> else
> drop = generic_drop_inode(inode);
>
> - if (!drop &&
> + if (!drop && !skip_lru &&
> !(inode->i_state & I_DONTCACHE) &&
> (sb->s_flags & SB_ACTIVE)) {
> __inode_add_lru(inode, true);
> @@ -1989,6 +1988,8 @@ static void iput_final(struct inode *inode)
> return;
> }
>
> + WARN_ON(!list_empty(&inode->i_lru));
> +
> state = inode->i_state;
> if (!drop) {
> WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
> @@ -2003,23 +2004,12 @@ static void iput_final(struct inode *inode)
> }
>
> WRITE_ONCE(inode->i_state, state | I_FREEING);
> - if (!list_empty(&inode->i_lru))
> - inode_lru_list_del(inode);
> spin_unlock(&inode->i_lock);
>
> evict(inode);
> }
>
> -/**
> - * iput - put an inode
> - * @inode: inode to put
> - *
> - * Puts an inode, dropping its usage count. If the inode use count hits
> - * zero, the inode is then freed and may also be destroyed.
> - *
> - * Consequently, iput() can sleep.
> - */
> -void iput(struct inode *inode)
> +static void __iput(struct inode *inode, bool skip_lru)
> {
> if (!inode)
> return;
> @@ -2037,12 +2027,31 @@ void iput(struct inode *inode)
>
> spin_lock(&inode->i_lock);
> if (atomic_dec_and_test(&inode->i_count))
> - iput_final(inode);
> + iput_final(inode, skip_lru);
> else
> spin_unlock(&inode->i_lock);
>
> iobj_put(inode);
> }
> +
> +static void iput_evict(struct inode *inode)
> +{
> + __iput(inode, true);
> +}
> +
> +/**
> + * iput - put an inode
> + * @inode: inode to put
> + *
> + * Puts an inode, dropping its usage count. If the inode use count hits
> + * zero, the inode is then freed and may also be destroyed.
> + *
> + * Consequently, iput() can sleep.
> + */
> +void iput(struct inode *inode)
> +{
> + __iput(inode, false);
> +}
> EXPORT_SYMBOL(iput);
>
> /**
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 17/50] fs: hold a full ref while the inode is on a LRU
2025-08-25 9:20 ` Christian Brauner
@ 2025-08-25 10:40 ` Christian Brauner
0 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-25 10:40 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Mon, Aug 25, 2025 at 11:20:07AM +0200, Christian Brauner wrote:
> On Thu, Aug 21, 2025 at 04:18:28PM -0400, Josef Bacik wrote:
> > We want to eliminate 0 refcount inodes that can be used. To that end,
> > make the LRU's hold a full reference on the inode while it is on an LRU
> > list. From there we can change the eviction code to always just iput the
> > inode, and the LRU operations will just add or drop a full reference
> > where appropriate.
> >
> > We also now must take into account unlink, and drop our LRU reference
> > when we go to an nlink of 0. We will also avoid adding inodes with a
> > nlink of 0 as they can be reclaimed immediately.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> > fs/inode.c | 105 +++++++++++++++++++++++++++++------------------------
> > 1 file changed, 57 insertions(+), 48 deletions(-)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 80ad327746a7..de0ec791f9a3 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -434,8 +434,18 @@ void drop_nlink(struct inode *inode)
> > {
> > WARN_ON(inode->i_nlink == 0);
> > inode->__i_nlink--;
> > - if (!inode->i_nlink)
> > + if (!inode->i_nlink) {
> > + /*
> > + * LRU's hold a full ref on the inode, but if we've unlinked it
> > + * then we want the inode to be freed when the last user goes,
> > + * so delete the inode from the LRU list.
> > + */
> > + spin_lock(&inode->i_lock);
> > + inode_lru_list_del(inode);
> > + spin_unlock(&inode->i_lock);
> > +
> > atomic_long_inc(&inode->i_sb->s_remove_count);
> > + }
>
> As written this doesn't work because you can have callers that have
> already acquired inode->i_lock(). For example, afs:
>
> new_inode = d_inode(new_dentry);
> if (new_inode) {
> spin_lock(&new_inode->i_lock);
> if (S_ISDIR(new_inode->i_mode))
> clear_nlink(new_inode);
> else if (new_inode->i_nlink > 0)
> drop_nlink(new_inode);
> spin_unlock(&new_inode->i_lock);
> }
I think it should be possible to do LRU list deletion locklessly so you
don't need to hold i_lock. The plain lru lists can already be walked
locklessly and removal is also possible without locks.
So we only seem to take i_lock because of i_state. If we can get rid of
a bunch i_lock grabs for the sake of i_state by just using atomic bit
operations on i_state I'm willing to sacrifice my hard-won 32bits and
make i_state an unsigned long again...
If I'm right then this would allow us to make lru list removal lockless
and then you can avoid this problem here in clear_nlink()/drop_nlink().
Let me know if that's not working or you have other ideas.
>
> > }
> > EXPORT_SYMBOL(drop_nlink);
> >
> > @@ -451,6 +461,12 @@ void clear_nlink(struct inode *inode)
> > {
> > if (inode->i_nlink) {
> > inode->__i_nlink = 0;
> > +
> > + /* See comment in drop_nlink(). */
> > + spin_lock(&inode->i_lock);
> > + inode_lru_list_del(inode);
> > + spin_unlock(&inode->i_lock);
> > +
> > atomic_long_inc(&inode->i_sb->s_remove_count);
> > }
> > }
> > @@ -555,6 +571,8 @@ static void inode_add_cached_lru(struct inode *inode)
> >
> > if (inode->i_state & I_CACHED_LRU)
> > return;
> > + if (inode->__i_nlink == 0)
> > + return;
> > if (!list_empty(&inode->i_lru))
> > return;
> >
> > @@ -562,7 +580,7 @@ static void inode_add_cached_lru(struct inode *inode)
> > spin_lock(&inode->i_sb->s_cached_inodes_lock);
> > list_add(&inode->i_lru, &inode->i_sb->s_cached_inodes);
> > spin_unlock(&inode->i_sb->s_cached_inodes_lock);
> > - iobj_get(inode);
> > + __iget(inode);
> > }
> >
> > static bool __inode_del_cached_lru(struct inode *inode)
> > @@ -582,7 +600,7 @@ static bool __inode_del_cached_lru(struct inode *inode)
> > static bool inode_del_cached_lru(struct inode *inode)
> > {
> > if (__inode_del_cached_lru(inode)) {
> > - iobj_put(inode);
> > + iput(inode);
> > return true;
> > }
> > return false;
> > @@ -598,6 +616,8 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> > return;
> > if (atomic_read(&inode->i_count))
> > return;
> > + if (inode->__i_nlink == 0)
> > + return;
> > if (!(inode->i_sb->s_flags & SB_ACTIVE))
> > return;
> > if (inode_needs_cached(inode)) {
> > @@ -609,7 +629,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> > if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
> > inode->i_state |= I_LRU;
> > if (need_ref)
> > - iobj_get(inode);
> > + __iget(inode);
> > this_cpu_inc(nr_unused);
> > } else if (rotate) {
> > inode->i_state |= I_REFERENCED;
> > @@ -655,7 +675,7 @@ void inode_lru_list_del(struct inode *inode)
> >
> > if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) {
> > inode->i_state &= ~I_LRU;
> > - iobj_put(inode);
> > + iput(inode);
> > this_cpu_dec(nr_unused);
> > }
> > }
> > @@ -926,6 +946,7 @@ static void evict(struct inode *inode)
> > BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
> > }
> >
> > +static void iput_evict(struct inode *inode);
> > /*
> > * dispose_list - dispose of the contents of a local list
> > * @head: the head of the list to free
> > @@ -933,20 +954,14 @@ static void evict(struct inode *inode)
> > * Dispose-list gets a local list with local inodes in it, so it doesn't
> > * need to worry about list corruption and SMP locks.
> > */
> > -static void dispose_list(struct list_head *head, bool for_lru)
> > +static void dispose_list(struct list_head *head)
> > {
> > while (!list_empty(head)) {
> > struct inode *inode;
> >
> > inode = list_first_entry(head, struct inode, i_lru);
> > list_del_init(&inode->i_lru);
> > -
> > - if (for_lru) {
> > - evict(inode);
> > - iobj_put(inode);
> > - } else {
> > - iput(inode);
> > - }
> > + iput_evict(inode);
> > cond_resched();
> > }
> > }
> > @@ -987,13 +1002,13 @@ void evict_inodes(struct super_block *sb)
> > if (need_resched()) {
> > spin_unlock(&sb->s_inode_list_lock);
> > cond_resched();
> > - dispose_list(&dispose, false);
> > + dispose_list(&dispose);
> > goto again;
> > }
> > }
> > spin_unlock(&sb->s_inode_list_lock);
> >
> > - dispose_list(&dispose, false);
> > + dispose_list(&dispose);
> > }
> > EXPORT_SYMBOL_GPL(evict_inodes);
> >
> > @@ -1031,22 +1046,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> > if (inode_needs_cached(inode)) {
> > list_lru_isolate(lru, &inode->i_lru);
> > inode_add_cached_lru(inode);
> > - iobj_put(inode);
> > - spin_unlock(&inode->i_lock);
> > - this_cpu_dec(nr_unused);
> > - return LRU_REMOVED;
> > - }
> > -
> > - /*
> > - * Inodes can get referenced, redirtied, or repopulated while
> > - * they're already on the LRU, and this can make them
> > - * unreclaimable for a while. Remove them lazily here; iput,
> > - * sync, or the last page cache deletion will requeue them.
> > - */
> > - if (atomic_read(&inode->i_count) ||
> > - (inode->i_state & ~I_REFERENCED)) {
> > - list_lru_isolate(lru, &inode->i_lru);
> > - inode->i_state &= ~I_LRU;
> > + iput(inode);
> > spin_unlock(&inode->i_lock);
> > this_cpu_dec(nr_unused);
> > return LRU_REMOVED;
> > @@ -1082,7 +1082,6 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> > }
> >
> > WARN_ON(inode->i_state & I_NEW);
> > - inode->i_state |= I_FREEING;
> > inode->i_state &= ~I_LRU;
> > list_lru_isolate_move(lru, &inode->i_lru, freeable);
> > spin_unlock(&inode->i_lock);
> > @@ -1104,7 +1103,7 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
> >
> > freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
> > inode_lru_isolate, &freeable);
> > - dispose_list(&freeable, true);
> > + dispose_list(&freeable);
> > return freed;
> > }
> >
> > @@ -1967,7 +1966,7 @@ EXPORT_SYMBOL(generic_delete_inode);
> > * in cache if fs is alive, sync and evict if fs is
> > * shutting down.
> > */
> > -static void iput_final(struct inode *inode)
> > +static void iput_final(struct inode *inode, bool skip_lru)
> > {
> > struct super_block *sb = inode->i_sb;
> > const struct super_operations *op = inode->i_sb->s_op;
> > @@ -1981,7 +1980,7 @@ static void iput_final(struct inode *inode)
> > else
> > drop = generic_drop_inode(inode);
> >
> > - if (!drop &&
> > + if (!drop && !skip_lru &&
> > !(inode->i_state & I_DONTCACHE) &&
> > (sb->s_flags & SB_ACTIVE)) {
> > __inode_add_lru(inode, true);
> > @@ -1989,6 +1988,8 @@ static void iput_final(struct inode *inode)
> > return;
> > }
> >
> > + WARN_ON(!list_empty(&inode->i_lru));
> > +
> > state = inode->i_state;
> > if (!drop) {
> > WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
> > @@ -2003,23 +2004,12 @@ static void iput_final(struct inode *inode)
> > }
> >
> > WRITE_ONCE(inode->i_state, state | I_FREEING);
> > - if (!list_empty(&inode->i_lru))
> > - inode_lru_list_del(inode);
> > spin_unlock(&inode->i_lock);
> >
> > evict(inode);
> > }
> >
> > -/**
> > - * iput - put an inode
> > - * @inode: inode to put
> > - *
> > - * Puts an inode, dropping its usage count. If the inode use count hits
> > - * zero, the inode is then freed and may also be destroyed.
> > - *
> > - * Consequently, iput() can sleep.
> > - */
> > -void iput(struct inode *inode)
> > +static void __iput(struct inode *inode, bool skip_lru)
> > {
> > if (!inode)
> > return;
> > @@ -2037,12 +2027,31 @@ void iput(struct inode *inode)
> >
> > spin_lock(&inode->i_lock);
> > if (atomic_dec_and_test(&inode->i_count))
> > - iput_final(inode);
> > + iput_final(inode, skip_lru);
> > else
> > spin_unlock(&inode->i_lock);
> >
> > iobj_put(inode);
> > }
> > +
> > +static void iput_evict(struct inode *inode)
> > +{
> > + __iput(inode, true);
> > +}
> > +
> > +/**
> > + * iput - put an inode
> > + * @inode: inode to put
> > + *
> > + * Puts an inode, dropping its usage count. If the inode use count hits
> > + * zero, the inode is then freed and may also be destroyed.
> > + *
> > + * Consequently, iput() can sleep.
> > + */
> > +void iput(struct inode *inode)
> > +{
> > + __iput(inode, false);
> > +}
> > EXPORT_SYMBOL(iput);
> >
> > /**
> > --
> > 2.49.0
> >
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 18/50] fs: disallow 0 reference count inodes
2025-08-21 20:18 ` [PATCH 18/50] fs: disallow 0 reference count inodes Josef Bacik
@ 2025-08-25 10:54 ` Christian Brauner
2025-08-25 19:26 ` Josef Bacik
0 siblings, 1 reply; 85+ messages in thread
From: Christian Brauner @ 2025-08-25 10:54 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:29PM -0400, Josef Bacik wrote:
> Now that we take a full reference for inodes on the LRU, move the logic
> to add the inode to the LRU to before we drop our last reference. This
> allows us to ensure that if the inode has a reference count it can be
> used, and we no longer hold onto inodes that have a 0 reference count.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 53 +++++++++++++++++++++++++++++++++--------------------
> 1 file changed, 33 insertions(+), 20 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index de0ec791f9a3..b4145ddbaf8e 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -614,7 +614,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
>
> if (inode->i_state & (I_FREEING | I_WILL_FREE))
> return;
> - if (atomic_read(&inode->i_count))
> + if (atomic_read(&inode->i_count) != 1)
> return;
> if (inode->__i_nlink == 0)
> return;
> @@ -1966,28 +1966,11 @@ EXPORT_SYMBOL(generic_delete_inode);
> * in cache if fs is alive, sync and evict if fs is
> * shutting down.
> */
> -static void iput_final(struct inode *inode, bool skip_lru)
> +static void iput_final(struct inode *inode, bool drop)
> {
> - struct super_block *sb = inode->i_sb;
> - const struct super_operations *op = inode->i_sb->s_op;
> unsigned long state;
> - int drop;
>
> WARN_ON(inode->i_state & I_NEW);
> -
> - if (op->drop_inode)
> - drop = op->drop_inode(inode);
> - else
> - drop = generic_drop_inode(inode);
> -
> - if (!drop && !skip_lru &&
> - !(inode->i_state & I_DONTCACHE) &&
> - (sb->s_flags & SB_ACTIVE)) {
> - __inode_add_lru(inode, true);
> - spin_unlock(&inode->i_lock);
> - return;
> - }
> -
> WARN_ON(!list_empty(&inode->i_lru));
>
> state = inode->i_state;
> @@ -2009,8 +1992,29 @@ static void iput_final(struct inode *inode, bool skip_lru)
> evict(inode);
> }
>
> +static bool maybe_add_lru(struct inode *inode, bool skip_lru)
> +{
> + const struct super_operations *op = inode->i_sb->s_op;
> + struct super_block *sb = inode->i_sb;
> + bool drop = false;
> +
> + if (op->drop_inode)
> + drop = op->drop_inode(inode);
> + else
> + drop = generic_drop_inode(inode);
> +
> + if (!drop && !skip_lru &&
> + !(inode->i_state & I_DONTCACHE) &&
> + (sb->s_flags & SB_ACTIVE))
> + __inode_add_lru(inode, true);
> +
> + return drop;
> +}
Can we rewrite this as:
static bool maybe_add_lru(struct inode *inode, bool skip_lru)
{
const struct super_operations *op = inode->i_sb->s_op;
const struct super_block *sb = inode->i_sb;
bool drop = false;
if (op->drop_inode)
drop = op->drop_inode(inode);
else
drop = generic_drop_inode(inode);
if (drop)
return drop;
if (skip_lru)
return drop;
if (inode->i_state & I_DONTCACHE)
return drop;
if (!(sb->s_flags & SB_ACTIVE))
return drop;
__inode_add_lru(inode, true);
return drop;
}
so it's a lot easier to follow. I really dislike munging conditions
together with a bunch of ands and negations mixed in.
And btw for both I_DONTCACHE and !SB_ACTIVE it seems that returning
anything other than false from op->drop_inode() would be a bug probably
a technicality but I find it pretty odd.
Maybe we add a VFS_WARN_ON_ONCE() at least in your local testing to see
whether you see anything that ever hits this case.
> +
> static void __iput(struct inode *inode, bool skip_lru)
> {
> + bool drop;
> +
> if (!inode)
> return;
> BUG_ON(inode->i_state & I_CLEAR);
> @@ -2026,8 +2030,17 @@ static void __iput(struct inode *inode, bool skip_lru)
> }
>
> spin_lock(&inode->i_lock);
> +
> + /*
> + * If we want to keep the inode around on an LRU we will grab a ref to
> + * the inode when we add it to the LRU list, so we can safely drop the
> + * callers reference after this. If we didn't add the inode to the LRU
> + * then the refcount will still be 1 and we can do the final iput.
> + */
> + drop = maybe_add_lru(inode, skip_lru);
> +
> if (atomic_dec_and_test(&inode->i_count))
> - iput_final(inode, skip_lru);
> + iput_final(inode, drop);
> else
> spin_unlock(&inode->i_lock);
>
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 20/50] fs: convert i_count to refcount_t
2025-08-21 20:18 ` [PATCH 20/50] fs: convert i_count to refcount_t Josef Bacik
2025-08-22 12:10 ` Amir Goldstein
2025-08-22 13:56 ` kernel test robot
@ 2025-08-25 11:03 ` Christian Brauner
2 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-25 11:03 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:31PM -0400, Josef Bacik wrote:
> Now that we do not allow i_count to drop to 0 and be used we can convert
> it to a refcount_t and benefit from the protections those helpers add.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
I've commented on that earlier in the series:
We should have no plain accesses to ->i_count anywhere outside of core
vfs. We should always use accessors. So please add icount_read() and
iobj_count_read() or something similar like we have file_ref_read().
> arch/powerpc/platforms/cell/spufs/file.c | 2 +-
> fs/btrfs/inode.c | 4 ++--
> fs/ceph/mds_client.c | 2 +-
> fs/ext4/ialloc.c | 4 ++--
> fs/fs-writeback.c | 2 +-
> fs/hpfs/inode.c | 2 +-
> fs/inode.c | 11 ++++++-----
> fs/nfs/inode.c | 4 ++--
> fs/notify/fsnotify.c | 2 +-
> fs/ubifs/super.c | 2 +-
> fs/xfs/xfs_inode.c | 2 +-
> fs/xfs/xfs_trace.h | 2 +-
> include/linux/fs.h | 4 ++--
> include/trace/events/filelock.h | 2 +-
> security/landlock/fs.c | 2 +-
> 15 files changed, 24 insertions(+), 23 deletions(-)
>
> diff --git a/arch/powerpc/platforms/cell/spufs/file.c b/arch/powerpc/platforms/cell/spufs/file.c
> index d5a2c77bc908..3f768b003838 100644
> --- a/arch/powerpc/platforms/cell/spufs/file.c
> +++ b/arch/powerpc/platforms/cell/spufs/file.c
> @@ -1430,7 +1430,7 @@ static int spufs_mfc_open(struct inode *inode, struct file *file)
> if (ctx->owner != current->mm)
> return -EINVAL;
>
> - if (atomic_read(&inode->i_count) != 1)
> + if (refcount_read(&inode->i_count) != 1)
> return -EBUSY;
>
> mutex_lock(&ctx->mapping_lock);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index bbbcd96e8f5c..e85e38df3ea0 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3418,7 +3418,7 @@ void btrfs_add_delayed_iput(struct btrfs_inode *inode)
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> unsigned long flags;
>
> - if (atomic_add_unless(&inode->vfs_inode.i_count, -1, 1)) {
> + if (refcount_dec_not_one(&inode->vfs_inode.i_count)) {
> iobj_put(&inode->vfs_inode);
> return;
> }
> @@ -4559,7 +4559,7 @@ static void btrfs_prune_dentries(struct btrfs_root *root)
>
> inode = btrfs_find_first_inode(root, min_ino);
> while (inode) {
> - if (atomic_read(&inode->vfs_inode.i_count) > 1)
> + if (refcount_read(&inode->vfs_inode.i_count) > 1)
> d_prune_aliases(&inode->vfs_inode);
>
> min_ino = btrfs_ino(inode) + 1;
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 0f497c39ff82..ff666d18f6ad 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2221,7 +2221,7 @@ static int trim_caps_cb(struct inode *inode, int mds, void *arg)
> int count;
> dput(dentry);
> d_prune_aliases(inode);
> - count = atomic_read(&inode->i_count);
> + count = refcount_read(&inode->i_count);
> if (count == 1)
> (*remaining)--;
> doutc(cl, "%p %llx.%llx cap %p pruned, count now %d\n",
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index df4051613b29..9a3c7f22a57e 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -252,10 +252,10 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
> "nonexistent device\n", __func__, __LINE__);
> return;
> }
> - if (atomic_read(&inode->i_count) > 1) {
> + if (refcount_read(&inode->i_count) > 1) {
> ext4_msg(sb, KERN_ERR, "%s:%d: inode #%lu: count=%d",
> __func__, __LINE__, inode->i_ino,
> - atomic_read(&inode->i_count));
> + refcount_read(&inode->i_count));
> return;
> }
> if (inode->i_nlink) {
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 111a9d8215bf..789c4228412c 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1796,7 +1796,7 @@ static int writeback_single_inode(struct inode *inode,
> int ret = 0;
>
> spin_lock(&inode->i_lock);
> - if (!atomic_read(&inode->i_count))
> + if (!refcount_read(&inode->i_count))
> WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
> else
> WARN_ON(inode->i_state & I_WILL_FREE);
> diff --git a/fs/hpfs/inode.c b/fs/hpfs/inode.c
> index a59e8fa630db..ee23a941d8f5 100644
> --- a/fs/hpfs/inode.c
> +++ b/fs/hpfs/inode.c
> @@ -184,7 +184,7 @@ void hpfs_write_inode(struct inode *i)
> struct hpfs_inode_info *hpfs_inode = hpfs_i(i);
> struct inode *parent;
> if (i->i_ino == hpfs_sb(i->i_sb)->sb_root) return;
> - if (hpfs_inode->i_rddir_off && !atomic_read(&i->i_count)) {
> + if (hpfs_inode->i_rddir_off && !refcount_read(&i->i_count)) {
> if (*hpfs_inode->i_rddir_off)
> pr_err("write_inode: some position still there\n");
> kfree(hpfs_inode->i_rddir_off);
> diff --git a/fs/inode.c b/fs/inode.c
> index 07c8edb4b58a..28d197731914 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -236,7 +236,7 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
> inode->i_state = 0;
> atomic64_set(&inode->i_sequence, 0);
> refcount_set(&inode->i_obj_count, 1);
> - atomic_set(&inode->i_count, 1);
> + refcount_set(&inode->i_count, 1);
> inode->i_op = &empty_iops;
> inode->i_fop = &no_open_fops;
> inode->i_ino = 0;
> @@ -561,7 +561,8 @@ static void init_once(void *foo)
> void ihold(struct inode *inode)
> {
> iobj_get(inode);
> - WARN_ON(atomic_inc_return(&inode->i_count) < 2);
> + refcount_inc(&inode->i_count);
> + WARN_ON(refcount_read(&inode->i_count) < 2);
> }
> EXPORT_SYMBOL(ihold);
>
> @@ -614,7 +615,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
>
> if (inode->i_state & (I_FREEING | I_WILL_FREE))
> return;
> - if (atomic_read(&inode->i_count) != 1)
> + if (refcount_read(&inode->i_count) != 1)
> return;
> if (inode->__i_nlink == 0)
> return;
> @@ -2019,7 +2020,7 @@ static void __iput(struct inode *inode, bool skip_lru)
> return;
> BUG_ON(inode->i_state & I_CLEAR);
>
> - if (atomic_add_unless(&inode->i_count, -1, 1)) {
> + if (refcount_dec_not_one(&inode->i_count)) {
> iobj_put(inode);
> return;
> }
> @@ -2039,7 +2040,7 @@ static void __iput(struct inode *inode, bool skip_lru)
> */
> drop = maybe_add_lru(inode, skip_lru);
>
> - if (atomic_dec_and_test(&inode->i_count))
> + if (refcount_dec_and_test(&inode->i_count))
> iput_final(inode, drop);
> else
> spin_unlock(&inode->i_lock);
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 338ef77ae423..9cc84f0afa9a 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -608,7 +608,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
> inode->i_sb->s_id,
> (unsigned long long)NFS_FILEID(inode),
> nfs_display_fhandle_hash(fh),
> - atomic_read(&inode->i_count));
> + refcount_read(&inode->i_count));
>
> out:
> return inode;
> @@ -2229,7 +2229,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
> dfprintk(VFS, "NFS: %s(%s/%lu fh_crc=0x%08x ct=%d info=0x%llx)\n",
> __func__, inode->i_sb->s_id, inode->i_ino,
> nfs_display_fhandle_hash(NFS_FH(inode)),
> - atomic_read(&inode->i_count), fattr->valid);
> + refcount_read(&inode->i_count), fattr->valid);
>
> if (!(fattr->valid & NFS_ATTR_FATTR_FILEID)) {
> /* Only a mounted-on-fileid? Just exit */
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 079b868552c2..0883696f873d 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -66,7 +66,7 @@ static void fsnotify_unmount_inodes(struct super_block *sb)
> * removed all zero refcount inodes, in any case. Test to
> * be sure.
> */
> - if (!atomic_read(&inode->i_count)) {
> + if (!refcount_read(&inode->i_count)) {
> spin_unlock(&inode->i_lock);
> continue;
> }
> diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
> index f3e3b2068608..79526f71fa8a 100644
> --- a/fs/ubifs/super.c
> +++ b/fs/ubifs/super.c
> @@ -358,7 +358,7 @@ static void ubifs_evict_inode(struct inode *inode)
> goto out;
>
> dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
> - ubifs_assert(c, !atomic_read(&inode->i_count));
> + ubifs_assert(c, !refcount_read(&inode->i_count));
>
> truncate_inode_pages_final(&inode->i_data);
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 9c39251961a3..06af749fe5f3 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1035,7 +1035,7 @@ xfs_itruncate_extents_flags(
> int error = 0;
>
> xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
> - if (atomic_read(&VFS_I(ip)->i_count))
> + if (refcount_read(&VFS_I(ip)->i_count))
> xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
> ASSERT(new_size <= XFS_ISIZE(ip));
> ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index ac344e42846c..167d33b8095c 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1152,7 +1152,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
> TP_fast_assign(
> __entry->dev = VFS_I(ip)->i_sb->s_dev;
> __entry->ino = ip->i_ino;
> - __entry->count = atomic_read(&VFS_I(ip)->i_count);
> + __entry->count = refcount_read(&VFS_I(ip)->i_count);
> __entry->pincount = atomic_read(&ip->i_pincount);
> __entry->iflags = ip->i_flags;
> __entry->caller_ip = caller_ip;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 8384ed81a5ad..34fb40ba8a94 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -880,7 +880,7 @@ struct inode {
> };
> atomic64_t i_version;
> atomic64_t i_sequence; /* see futex */
> - atomic_t i_count;
> + refcount_t i_count;
> atomic_t i_dio_count;
> atomic_t i_writecount;
> #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
> @@ -3399,7 +3399,7 @@ static inline void iobj_get(struct inode *inode)
> static inline void __iget(struct inode *inode)
> {
> iobj_get(inode);
> - atomic_inc(&inode->i_count);
> + refcount_inc(&inode->i_count);
> }
>
> extern void iget_failed(struct inode *);
> diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h
> index b8d1e00a7982..e745436cfcd2 100644
> --- a/include/trace/events/filelock.h
> +++ b/include/trace/events/filelock.h
> @@ -189,7 +189,7 @@ TRACE_EVENT(generic_add_lease,
> __entry->i_ino = inode->i_ino;
> __entry->wcount = atomic_read(&inode->i_writecount);
> __entry->rcount = atomic_read(&inode->i_readcount);
> - __entry->icount = atomic_read(&inode->i_count);
> + __entry->icount = refcount_read(&inode->i_count);
> __entry->owner = fl->c.flc_owner;
> __entry->flags = fl->c.flc_flags;
> __entry->type = fl->c.flc_type;
> diff --git a/security/landlock/fs.c b/security/landlock/fs.c
> index c04f8879ad03..570f851dc469 100644
> --- a/security/landlock/fs.c
> +++ b/security/landlock/fs.c
> @@ -1281,7 +1281,7 @@ static void hook_sb_delete(struct super_block *const sb)
> struct landlock_object *object;
>
> /* Only handles referenced inodes. */
> - if (!atomic_read(&inode->i_count))
> + if (!refcount_read(&inode->i_count))
> continue;
>
> /*
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 21/50] fs: use refcount_inc_not_zero in igrab
2025-08-21 20:18 ` [PATCH 21/50] fs: use refcount_inc_not_zero in igrab Josef Bacik
@ 2025-08-25 11:21 ` Christian Brauner
0 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-25 11:21 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:32PM -0400, Josef Bacik wrote:
> We are going to use igrab everywhere we want to acquire a live inode.
> Update it to do a refcount_inc_not_zero on the i_count, and if
> successful grab an reference to i_obj_count. Add a comment explaining
> why we do this and the safety.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 26 +++++++++++++-------------
> include/linux/fs.h | 27 +++++++++++++++++++++++++++
> 2 files changed, 40 insertions(+), 13 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 28d197731914..b9122c1eee1d 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1648,20 +1648,20 @@ EXPORT_SYMBOL(iunique);
>
> struct inode *igrab(struct inode *inode)
> {
> + lockdep_assert_not_held(&inode->i_lock);
> +
> + inode = inode_tryget(inode);
> + if (!inode)
> + return NULL;
> +
> + /*
> + * If this inode is on the LRU, take it off so that we can re-run the
> + * LRU logic on the next iput().
> + */
> spin_lock(&inode->i_lock);
> - if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
> - __iget(inode);
> - inode_lru_list_del(inode);
> - spin_unlock(&inode->i_lock);
> - } else {
> - spin_unlock(&inode->i_lock);
> - /*
> - * Handle the case where s_op->clear_inode is not been
> - * called yet, and somebody is calling igrab
> - * while the inode is getting freed.
> - */
> - inode = NULL;
> - }
> + inode_lru_list_del(inode);
> + spin_unlock(&inode->i_lock);
> +
> return inode;
> }
> EXPORT_SYMBOL(igrab);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 34fb40ba8a94..b731224708be 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3393,6 +3393,33 @@ static inline void iobj_get(struct inode *inode)
> refcount_inc(&inode->i_obj_count);
> }
>
> +static inline struct inode *inode_tryget(struct inode *inode)
> +{
> + /*
> + * We are using inode_tryget() because we're interested in getting a
> + * live reference to the inode, which is ->i_count. Normally we would
> + * grab i_obj_count first, as it is the highe priority reference.
> + * However we're only interested in making sure we have a live inode,
> + * and we know that if we get a reference for i_count then we can safely
> + * acquire i_obj_count because we always drop i_obj_count after dropping
> + * an i_count reference.
> + *
> + * This is meant to be used either in a place where we have an existing
> + * i_obj_count reference on the inode, or under rcu_read_lock() so we
> + * know we're safe in accessing this inode still.
Maybe add a debug assert to that effect?
VFS_WAR_ON_ONCE(!icount_read(inode) && !rcu_read_lock_held());
> + */
> + if (!refcount_inc_not_zero(&inode->i_count)) {
> + /*
> + * If we failed to increment the reference count, then the
> + * inode is being freed or has been freed. We return NULL
> + * in this case.
> + */
> + return NULL;
> + }
I would invert the logic here?
if (refcount_inc_not_zero()) {
iobj_get(inode);
return inode;
}
/*
* If we failed to increment the reference count, then the
* inode is being freed or has been freed. We return NULL
* in this case.
*/
return NULL;
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 22/50] fs: use inode_tryget in find_inode*
2025-08-21 20:18 ` [PATCH 22/50] fs: use inode_tryget in find_inode* Josef Bacik
@ 2025-08-25 11:26 ` Christian Brauner
0 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-25 11:26 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:33PM -0400, Josef Bacik wrote:
> Now that we never drop the i_count to 0 for valid objects, rework the
> logic in the find_inode* helpers to use inode_tryget() to see if they
> have a live inode. If this fails we can wait for the inode to be freed
> as we know it's currently being evicted.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 19 +++++++++----------
> 1 file changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index b9122c1eee1d..893ac902268b 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1109,6 +1109,7 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
> }
>
> static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked);
> +
> /*
> * Called with the inode lock held.
> */
> @@ -1132,16 +1133,15 @@ static struct inode *find_inode(struct super_block *sb,
> if (!test(inode, data))
> continue;
> spin_lock(&inode->i_lock);
> - if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
> - __wait_on_freeing_inode(inode, is_inode_hash_locked);
> - goto repeat;
> - }
> if (unlikely(inode->i_state & I_CREATING)) {
> spin_unlock(&inode->i_lock);
> rcu_read_unlock();
> return ERR_PTR(-ESTALE);
> }
> - __iget(inode);
> + if (!inode_tryget(inode)) {
> + __wait_on_freeing_inode(inode, is_inode_hash_locked);
> + goto repeat;
> + }
> inode_lru_list_del(inode);
> spin_unlock(&inode->i_lock);
> rcu_read_unlock();
> @@ -1174,16 +1174,15 @@ static struct inode *find_inode_fast(struct super_block *sb,
> if (inode->i_sb != sb)
> continue;
> spin_lock(&inode->i_lock);
> - if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
Oh yes, this is such a beautiful change...
> - __wait_on_freeing_inode(inode, is_inode_hash_locked);
> - goto repeat;
> - }
> if (unlikely(inode->i_state & I_CREATING)) {
> spin_unlock(&inode->i_lock);
> rcu_read_unlock();
> return ERR_PTR(-ESTALE);
> }
> - __iget(inode);
> + if (!inode_tryget(inode)) {
> + __wait_on_freeing_inode(inode, is_inode_hash_locked);
> + goto repeat;
> + }
> inode_lru_list_del(inode);
> spin_unlock(&inode->i_lock);
> rcu_read_unlock();
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 23/50] fs: update find_inode_*rcu to check the i_count count
2025-08-21 20:18 ` [PATCH 23/50] fs: update find_inode_*rcu to check the i_count count Josef Bacik
@ 2025-08-25 11:27 ` Christian Brauner
0 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-25 11:27 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:34PM -0400, Josef Bacik wrote:
> These two helpers are always used under the RCU and don't appear to mind
> if the inode state changes in between time of check and time of use.
> Update them to use the i_count refcount instead of I_WILL_FREE or
> I_FREEING.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 893ac902268b..63ccd32fa221 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1839,7 +1839,7 @@ struct inode *find_inode_rcu(struct super_block *sb, unsigned long hashval,
>
> hlist_for_each_entry_rcu(inode, head, i_hash) {
> if (inode->i_sb == sb &&
> - !(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)) &&
> + refcount_read(&inode->i_count) > 0 &&
No direct i_count access, otherwise another nice change. Not having to
look at these terribly named flags anymore is wonderful.
> test(inode, data))
> return inode;
> }
> @@ -1878,8 +1878,8 @@ struct inode *find_inode_by_ino_rcu(struct super_block *sb,
> hlist_for_each_entry_rcu(inode, head, i_hash) {
> if (inode->i_ino == ino &&
> inode->i_sb == sb &&
> - !(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)))
> - return inode;
> + refcount_read(&inode->i_count) > 0)
> + return inode;
> }
> return NULL;
> }
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 27/50] fs: use inode_tryget in evict_inodes
2025-08-21 20:18 ` [PATCH 27/50] fs: use inode_tryget in evict_inodes Josef Bacik
@ 2025-08-25 11:43 ` Christian Brauner
2025-08-25 18:22 ` Josef Bacik
0 siblings, 1 reply; 85+ messages in thread
From: Christian Brauner @ 2025-08-25 11:43 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:38PM -0400, Josef Bacik wrote:
> Instead of checking I_WILL_FREE|I_FREEING we can simply use
> inode_tryget() to determine if we have a live inode that can be evicted.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/inode.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index a14b3a54c4b5..4e1eeb0c3889 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -983,12 +983,16 @@ void evict_inodes(struct super_block *sb)
> spin_lock(&sb->s_inode_list_lock);
> list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> spin_lock(&inode->i_lock);
> - if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
> + if (inode->i_state & I_NEW) {
> + spin_unlock(&inode->i_lock);
> + continue;
> + }
> +
> + if (!inode_tryget(inode)) {
So it reads like if we fail to take a reference count on @inode then
someone else is already evicting it. I get that.
But what's confusing to me is that the __iget() call you're removing
was an increment from zero earlier in your series because evict_inodes()
was only callable on inodes that had a zero i_count.
Oh, ok, I forgot, you mandate that for an inode to be on an LRU they
must now hold an i_count reference not just an i_obj_count reference.
So in the prior scheme i_count was zero and wouldn't go back up from
zero. In your scheme is i_count guaranteed to be one and after you've
grabbed another reference and it's gone up to 2 is that the max it can
reach or is it possible that i_count can be grabbed by others somehow?
> spin_unlock(&inode->i_lock);
> continue;
> }
>
> - __iget(inode);
> inode_lru_list_del(inode);
> list_add(&inode->i_lru, &dispose);
> spin_unlock(&inode->i_lock);
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 35/50] fs: remove I_WILL_FREE|I_FREEING from fs-writeback.c
2025-08-21 20:18 ` [PATCH 35/50] fs: remove I_WILL_FREE|I_FREEING from fs-writeback.c Josef Bacik
@ 2025-08-25 11:46 ` Christian Brauner
0 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-25 11:46 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:18:46PM -0400, Josef Bacik wrote:
> Now that we have the reference count to indicate live inodes, we can
> remove all of the uses of I_WILL_FREE and I_FREEING in fs-writeback.c
> and use the appropriate reference count checks.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
Very nice change and it will be even better without the plain accesses
to i_count. ;)
> fs/fs-writeback.c | 47 ++++++++++++++++++++++++++---------------------
> 1 file changed, 26 insertions(+), 21 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 789c4228412c..0ea6d0e86791 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -129,7 +129,7 @@ static bool inode_io_list_move_locked(struct inode *inode,
> {
> assert_spin_locked(&wb->list_lock);
> assert_spin_locked(&inode->i_lock);
> - WARN_ON_ONCE(inode->i_state & I_FREEING);
> + WARN_ON_ONCE(!refcount_read(&inode->i_count));
>
> if (list_empty(&inode->i_io_list))
> iobj_get(inode);
> @@ -314,7 +314,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode,
> {
> assert_spin_locked(&wb->list_lock);
> assert_spin_locked(&inode->i_lock);
> - WARN_ON_ONCE(inode->i_state & I_FREEING);
> + WARN_ON_ONCE(!refcount_read(&inode->i_count));
>
> inode->i_state &= ~I_SYNC_QUEUED;
> if (wb != &wb->bdi->wb)
> @@ -415,10 +415,10 @@ static bool inode_do_switch_wbs(struct inode *inode,
> xa_lock_irq(&mapping->i_pages);
>
> /*
> - * Once I_FREEING or I_WILL_FREE are visible under i_lock, the eviction
> - * path owns the inode and we shouldn't modify ->i_io_list.
> + * Once the refcount is 0, the eviction path owns the inode and we
> + * shouldn't modify ->i_io_list.
> */
> - if (unlikely(inode->i_state & (I_FREEING | I_WILL_FREE)))
> + if (unlikely(!refcount_read(&inode->i_count)))
> goto skip_switch;
>
> trace_inode_switch_wbs(inode, old_wb, new_wb);
> @@ -570,13 +570,16 @@ static bool inode_prepare_wbs_switch(struct inode *inode,
> /* while holding I_WB_SWITCH, no one else can update the association */
> spin_lock(&inode->i_lock);
> if (!(inode->i_sb->s_flags & SB_ACTIVE) ||
> - inode->i_state & (I_WB_SWITCH | I_FREEING | I_WILL_FREE) ||
> + inode->i_state & I_WB_SWITCH ||
> inode_to_wb(inode) == new_wb) {
> spin_unlock(&inode->i_lock);
> return false;
> }
> + if (!inode_tryget(inode)) {
> + spin_unlock(&inode->i_lock);
> + return false;
> + }
> inode->i_state |= I_WB_SWITCH;
> - __iget(inode);
> spin_unlock(&inode->i_lock);
>
> return true;
> @@ -1207,7 +1210,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode,
> {
> assert_spin_locked(&wb->list_lock);
> assert_spin_locked(&inode->i_lock);
> - WARN_ON_ONCE(inode->i_state & I_FREEING);
> + WARN_ON_ONCE(!refcount_read(&inode->i_count));
>
> inode->i_state &= ~I_SYNC_QUEUED;
> inode_delete_from_io_list(inode);
> @@ -1405,7 +1408,7 @@ static void redirty_tail_locked(struct inode *inode, struct bdi_writeback *wb)
> * tracking. Flush worker will ignore this inode anyway and it will
> * trigger assertions in inode_io_list_move_locked().
> */
> - if (inode->i_state & I_FREEING) {
> + if (!refcount_read(&inode->i_count)) {
> inode_delete_from_io_list(inode);
> wb_io_lists_depopulated(wb);
> return;
> @@ -1621,7 +1624,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
> struct writeback_control *wbc,
> unsigned long dirtied_before)
> {
> - if (inode->i_state & I_FREEING)
> + if (!refcount_read(&inode->i_count))
> return;
>
> /*
> @@ -1787,7 +1790,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> * whether it is a data-integrity sync (%WB_SYNC_ALL) or not (%WB_SYNC_NONE).
> *
> * To prevent the inode from going away, either the caller must have a reference
> - * to the inode, or the inode must have I_WILL_FREE or I_FREEING set.
> + * to the inode, or the inode must have a zero refcount.
> */
> static int writeback_single_inode(struct inode *inode,
> struct writeback_control *wbc)
> @@ -1797,9 +1800,7 @@ static int writeback_single_inode(struct inode *inode,
>
> spin_lock(&inode->i_lock);
> if (!refcount_read(&inode->i_count))
> - WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
> - else
> - WARN_ON(inode->i_state & I_WILL_FREE);
> + WARN_ON(inode->i_state & I_NEW);
>
> if (inode->i_state & I_SYNC) {
> /*
> @@ -1837,7 +1838,7 @@ static int writeback_single_inode(struct inode *inode,
> * If the inode is freeing, its i_io_list shoudn't be updated
> * as it can be finally deleted at this moment.
> */
> - if (!(inode->i_state & I_FREEING)) {
> + if (refcount_read(&inode->i_count)) {
> /*
> * If the inode is now fully clean, then it can be safely
> * removed from its writeback list (if any). Otherwise the
> @@ -1957,7 +1958,7 @@ static long writeback_sb_inodes(struct super_block *sb,
> * kind writeout is handled by the freer.
> */
> spin_lock(&inode->i_lock);
> - if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
> + if ((inode->i_state & I_NEW) || !refcount_read(&inode->i_count)) {
> redirty_tail_locked(inode, wb);
> spin_unlock(&inode->i_lock);
> continue;
> @@ -2615,7 +2616,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> if (inode_unhashed(inode))
> goto out_unlock;
> }
> - if (inode->i_state & I_FREEING)
> + if (!refcount_read(&inode->i_count))
> goto out_unlock;
>
> /*
> @@ -2729,13 +2730,17 @@ static void wait_sb_inodes(struct super_block *sb)
> spin_unlock_irq(&sb->s_inode_wblist_lock);
>
> spin_lock(&inode->i_lock);
> - if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> + if (inode->i_state & I_NEW) {
> + spin_unlock(&inode->i_lock);
> + spin_lock_irq(&sb->s_inode_wblist_lock);
> + continue;
> + }
> +
> + if (!inode_tryget(inode)) {
> spin_unlock(&inode->i_lock);
> -
> spin_lock_irq(&sb->s_inode_wblist_lock);
> continue;
> }
> - __iget(inode);
>
> /*
> * We could have potentially ended up on the cached LRU list, so
> @@ -2886,7 +2891,7 @@ EXPORT_SYMBOL(sync_inodes_sb);
> * This function commits an inode to disk immediately if it is dirty. This is
> * primarily needed by knfsd.
> *
> - * The caller must either have a ref on the inode or must have set I_WILL_FREE.
> + * The caller must either have a ref on the inode or the inode must have a zero refcount.
> */
> int write_inode_now(struct inode *inode, int sync)
> {
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 49/50] fs: remove I_FREEING|I_WILL_FREE
2025-08-21 20:19 ` [PATCH 49/50] fs: remove I_FREEING|I_WILL_FREE Josef Bacik
@ 2025-08-25 11:53 ` Christian Brauner
0 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-25 11:53 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:19:00PM -0400, Josef Bacik wrote:
> Now that we're using the i_count reference count as the ultimate arbiter
> of whether or not an inode is life we can remove the I_FREEING and
> I_WILL_FREE flags.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
Very nice.
> fs/inode.c | 8 ++------
> include/linux/fs.h | 32 +++++++++++---------------------
> include/trace/events/writeback.h | 2 --
> 3 files changed, 13 insertions(+), 29 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index f715504778d2..1bb528405b3d 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -878,7 +878,7 @@ void clear_inode(struct inode *inode)
> BUG_ON(inode->i_state & I_CLEAR);
> BUG_ON(!list_empty(&inode->i_wb_list));
> /* don't need i_lock here, no concurrent mods to i_state */
> - inode->i_state = I_FREEING | I_CLEAR;
> + inode->i_state = I_CLEAR;
> }
> EXPORT_SYMBOL(clear_inode);
>
> @@ -942,7 +942,7 @@ static void evict(struct inode *inode)
> * This also means we don't need any fences for the call below.
> */
> inode_wake_up_bit(inode, __I_NEW);
> - BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
> + BUG_ON(inode->i_state != I_CLEAR);
> }
>
> static void iput_evict(struct inode *inode);
> @@ -1975,7 +1975,6 @@ static void iput_final(struct inode *inode, bool drop)
>
> state = inode->i_state;
> if (!drop) {
> - WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
> spin_unlock(&inode->i_lock);
>
> write_inode_now(inode, 1);
> @@ -1983,10 +1982,7 @@ static void iput_final(struct inode *inode, bool drop)
> spin_lock(&inode->i_lock);
> state = inode->i_state;
> WARN_ON(state & I_NEW);
> - state &= ~I_WILL_FREE;
> }
> -
> - WRITE_ONCE(inode->i_state, state | I_FREEING);
> spin_unlock(&inode->i_lock);
>
> evict(inode);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9d9acbea6433..0599faef0d6a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -672,8 +672,8 @@ is_uncached_acl(struct posix_acl *acl)
> * I_DIRTY_DATASYNC, I_DIRTY_PAGES, and I_DIRTY_TIME.
> *
> * Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
> - * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
> - * various stages of removing an inode.
> + * until that flag is cleared. I_CLEAR is set when the inode is clean and ready
> + * to be freed.
> *
> * Two bits are used for locking and completion notification, I_NEW and I_SYNC.
> *
> @@ -697,24 +697,18 @@ is_uncached_acl(struct posix_acl *acl)
> * New inodes set I_NEW. If two processes both create
> * the same inode, one of them will release its inode and
> * wait for I_NEW to be released before returning.
> - * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can
> - * also cause waiting on I_NEW, without I_NEW actually
> - * being set. find_inode() uses this to prevent returning
> + * Inodes with an i_count == 0 or I_CLEAR state can also
> + * cause waiting on I_NEW, without I_NEW actually being
> + * set. find_inode() uses this to prevent returning
> * nearly-dead inodes.
> - * I_WILL_FREE Must be set when calling write_inode_now() if i_count
> - * is zero. I_FREEING must be set when I_WILL_FREE is
> - * cleared.
> - * I_FREEING Set when inode is about to be freed but still has dirty
> - * pages or buffers attached or the inode itself is still
> - * dirty.
> * I_CLEAR Added by clear_inode(). In this state the inode is
> - * clean and can be destroyed. Inode keeps I_FREEING.
> + * clean and can be destroyed.
> *
> - * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are
> - * prohibited for many purposes. iget() must wait for
> - * the inode to be completely released, then create it
> - * anew. Other functions will just ignore such inodes,
> - * if appropriate. I_NEW is used for waiting.
> + * Inodes that have i_count == 0 or I_CLEAR are prohibited
> + * for many purposes. iget() must wait for the inode to be
> + * completely released, then create it anew. Other
> + * functions will just ignore such inodes, if appropriate.
> + * I_NEW is used for waiting.
> *
> * I_SYNC Writeback of inode is running. The bit is set during
> * data writeback, and cleared with a wakeup on the bit
> @@ -752,8 +746,6 @@ is_uncached_acl(struct posix_acl *acl)
> * I_CACHED_LRU Inode is cached because it is dirty or isn't shrinkable,
> * and thus is on the s_cached_inode_lru list.
> *
> - * Q: What is the difference between I_WILL_FREE and I_FREEING?
> - *
> * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
> * upon. There's one free address left.
> */
> @@ -776,8 +768,6 @@ enum inode_state_bits {
> INODE_BIT(I_DIRTY_SYNC),
> INODE_BIT(I_DIRTY_DATASYNC),
> INODE_BIT(I_DIRTY_PAGES),
> - INODE_BIT(I_WILL_FREE),
> - INODE_BIT(I_FREEING),
> INODE_BIT(I_CLEAR),
> INODE_BIT(I_REFERENCED),
> INODE_BIT(I_LINKABLE),
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index 6949329c744a..58ee61f3d91d 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -15,8 +15,6 @@
> {I_DIRTY_DATASYNC, "I_DIRTY_DATASYNC"}, \
> {I_DIRTY_PAGES, "I_DIRTY_PAGES"}, \
> {I_NEW, "I_NEW"}, \
> - {I_WILL_FREE, "I_WILL_FREE"}, \
> - {I_FREEING, "I_FREEING"}, \
> {I_CLEAR, "I_CLEAR"}, \
> {I_SYNC, "I_SYNC"}, \
> {I_DIRTY_TIME, "I_DIRTY_TIME"}, \
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 50/50] fs: add documentation explaining the reference count rules for inodes
2025-08-21 20:19 ` [PATCH 50/50] fs: add documentation explaining the reference count rules for inodes Josef Bacik
@ 2025-08-25 11:56 ` Christian Brauner
0 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-25 11:56 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Thu, Aug 21, 2025 at 04:19:01PM -0400, Josef Bacik wrote:
> Now that we've made these changes to the inode, document the reference
> count rules in the vfs documentation.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> Documentation/filesystems/vfs.rst | 23 +++++++++++++++++++++++
> 1 file changed, 23 insertions(+)
>
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 229eb90c96f2..5bfe7863a5de 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -457,6 +457,29 @@ The Inode Object
>
> An inode object represents an object within the filesystem.
>
> +Reference counting rules
> +------------------------
> +
> +The inode is reference counted in two distinct ways, an i_obj_count refcount and
> +an i_count refcount. These control two different lifetimes of the inode. The
> +i_obj_count is the simplest, think of it as a reference count on the object
> +itself. When the i_obj_count reaches zero, the inode is freed. Inode freeing
> +happens in the RCU context, so the inode is not freed immediately, but rather
> +after a grace period.
> +
> +The i_count reference is the indicator that the inode is "alive". That is to
> +say, it is available for use by all the ways that a user can access the inode.
> +Once this count reaches zero, we begin the process of evicting the inode. This
> +is where the final truncate of an unlinked inode will normally occur. Once
> +i_count has reached 0, only the final iput() is allowed to do things like
> +writeback, truncate, etc. All users that want to do these style of operation
> +must use igrab() or, in very rare and specific circumstances, use
> +inode_tryget().
> +
> +Every access to an inode must include one of these two references. Generally
> +i_obj_count is reserved for internal VFS references, the s_inode_list for
> +example. All file systems should use igrab()/lookup() to get a live reference on
> +the inode, with very few exceptions.
It would be awesome if you could document in more detail how LRU
handling and reference counts works.
I_FREEING | I_WILL_FREE was tightly interconnected with LRU handling and
it was frankly a mess to understand so having the new stuff clearly
documented would help us all a lot. Thanks!
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 27/50] fs: use inode_tryget in evict_inodes
2025-08-25 11:43 ` Christian Brauner
@ 2025-08-25 18:22 ` Josef Bacik
0 siblings, 0 replies; 85+ messages in thread
From: Josef Bacik @ 2025-08-25 18:22 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Mon, Aug 25, 2025 at 01:43:57PM +0200, Christian Brauner wrote:
> On Thu, Aug 21, 2025 at 04:18:38PM -0400, Josef Bacik wrote:
> > Instead of checking I_WILL_FREE|I_FREEING we can simply use
> > inode_tryget() to determine if we have a live inode that can be evicted.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> > fs/inode.c | 8 ++++++--
> > 1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index a14b3a54c4b5..4e1eeb0c3889 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -983,12 +983,16 @@ void evict_inodes(struct super_block *sb)
> > spin_lock(&sb->s_inode_list_lock);
> > list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > spin_lock(&inode->i_lock);
> > - if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
> > + if (inode->i_state & I_NEW) {
> > + spin_unlock(&inode->i_lock);
> > + continue;
> > + }
> > +
> > + if (!inode_tryget(inode)) {
>
> So it reads like if we fail to take a reference count on @inode then
> someone else is already evicting it. I get that.
>
> But what's confusing to me is that the __iget() call you're removing
> was an increment from zero earlier in your series because evict_inodes()
> was only callable on inodes that had a zero i_count.
>
> Oh, ok, I forgot, you mandate that for an inode to be on an LRU they
> must now hold an i_count reference not just an i_obj_count reference.
>
> So in the prior scheme i_count was zero and wouldn't go back up from
> zero. In your scheme is i_count guaranteed to be one and after you've
> grabbed another reference and it's gone up to 2 is that the max it can
> reach or is it possible that i_count can be grabbed by others somehow?
It can be grabbed by others now.
The idea here is that we're drastically simplifying the logic. We no longer care
to only operate on inodes that are truly dead. If we can grab a reference to the
inode then it is live by some other means (LRU, someone holding a file open,
etc). We remove it from the LRU and then we drop our reference. At this point
becasue S_ACTIVE is not set we know that we won't be adding inodes to the LRU
anymore, and this should free the inode.
However if there's some bug in the filesystem or elsewhere and we have an
elevated refcount then we could still leak the inode. But we just don't care
about that here. Before we wouldn't even bother to touch the inode, now we
uncondtionally process all the inodes, and if there's still inodes left then
there's a bug. Thanks,
Josef
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 18/50] fs: disallow 0 reference count inodes
2025-08-25 10:54 ` Christian Brauner
@ 2025-08-25 19:26 ` Josef Bacik
2025-08-26 9:28 ` Christian Brauner
0 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-25 19:26 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Mon, Aug 25, 2025 at 12:54:01PM +0200, Christian Brauner wrote:
> On Thu, Aug 21, 2025 at 04:18:29PM -0400, Josef Bacik wrote:
> > Now that we take a full reference for inodes on the LRU, move the logic
> > to add the inode to the LRU to before we drop our last reference. This
> > allows us to ensure that if the inode has a reference count it can be
> > used, and we no longer hold onto inodes that have a 0 reference count.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> > fs/inode.c | 53 +++++++++++++++++++++++++++++++++--------------------
> > 1 file changed, 33 insertions(+), 20 deletions(-)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index de0ec791f9a3..b4145ddbaf8e 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -614,7 +614,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> >
> > if (inode->i_state & (I_FREEING | I_WILL_FREE))
> > return;
> > - if (atomic_read(&inode->i_count))
> > + if (atomic_read(&inode->i_count) != 1)
> > return;
> > if (inode->__i_nlink == 0)
> > return;
> > @@ -1966,28 +1966,11 @@ EXPORT_SYMBOL(generic_delete_inode);
> > * in cache if fs is alive, sync and evict if fs is
> > * shutting down.
> > */
> > -static void iput_final(struct inode *inode, bool skip_lru)
> > +static void iput_final(struct inode *inode, bool drop)
> > {
> > - struct super_block *sb = inode->i_sb;
> > - const struct super_operations *op = inode->i_sb->s_op;
> > unsigned long state;
> > - int drop;
> >
> > WARN_ON(inode->i_state & I_NEW);
> > -
> > - if (op->drop_inode)
> > - drop = op->drop_inode(inode);
> > - else
> > - drop = generic_drop_inode(inode);
> > -
> > - if (!drop && !skip_lru &&
> > - !(inode->i_state & I_DONTCACHE) &&
> > - (sb->s_flags & SB_ACTIVE)) {
> > - __inode_add_lru(inode, true);
> > - spin_unlock(&inode->i_lock);
> > - return;
> > - }
> > -
> > WARN_ON(!list_empty(&inode->i_lru));
> >
> > state = inode->i_state;
> > @@ -2009,8 +1992,29 @@ static void iput_final(struct inode *inode, bool skip_lru)
> > evict(inode);
> > }
> >
> > +static bool maybe_add_lru(struct inode *inode, bool skip_lru)
> > +{
> > + const struct super_operations *op = inode->i_sb->s_op;
> > + struct super_block *sb = inode->i_sb;
> > + bool drop = false;
> > +
> > + if (op->drop_inode)
> > + drop = op->drop_inode(inode);
> > + else
> > + drop = generic_drop_inode(inode);
> > +
> > + if (!drop && !skip_lru &&
> > + !(inode->i_state & I_DONTCACHE) &&
> > + (sb->s_flags & SB_ACTIVE))
> > + __inode_add_lru(inode, true);
> > +
> > + return drop;
> > +}
>
> Can we rewrite this as:
>
> static bool maybe_add_lru(struct inode *inode, bool skip_lru)
> {
> const struct super_operations *op = inode->i_sb->s_op;
> const struct super_block *sb = inode->i_sb;
> bool drop = false;
>
> if (op->drop_inode)
> drop = op->drop_inode(inode);
> else
> drop = generic_drop_inode(inode);
>
> if (drop)
> return drop;
>
> if (skip_lru)
> return drop;
>
> if (inode->i_state & I_DONTCACHE)
> return drop;
>
> if (!(sb->s_flags & SB_ACTIVE))
> return drop;
>
> __inode_add_lru(inode, true);
> return drop;
> }
>
> so it's a lot easier to follow. I really dislike munging conditions
> together with a bunch of ands and negations mixed in.
>
> And btw for both I_DONTCACHE and !SB_ACTIVE it seems that returning
> anything other than false from op->drop_inode() would be a bug probably
> a technicality but I find it pretty odd.
Not necsessarily, maybe we had some delayed iput (*cough* btrfs *cough*) that
didn't run until umount time and now we have true coming from ->drop_inode()
with SB_ACTIVE turned off. That would be completely valid. Thanks,
Josef
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 16/50] fs: change evict_inodes to use iput instead of evict directly
2025-08-25 9:07 ` Christian Brauner
@ 2025-08-25 19:35 ` Josef Bacik
2025-08-26 9:56 ` Christian Brauner
0 siblings, 1 reply; 85+ messages in thread
From: Josef Bacik @ 2025-08-25 19:35 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Mon, Aug 25, 2025 at 11:07:55AM +0200, Christian Brauner wrote:
> On Thu, Aug 21, 2025 at 04:18:27PM -0400, Josef Bacik wrote:
> > At evict_inodes() time, we no longer have SB_ACTIVE set, so we can
> > easily go through the normal iput path to clear any inodes. Update
>
> I'm a bit lost why SB_ACTIVE is used here as a justification to call
> iput(). I think it's because iput_final() would somehow add it back to
> the LRU if SB_ACTIVE was still set and the filesystem somehow would
> indicate it wouldn't want to drop the inode.
>
> I'm confused where that would even happen. IOW, which filesystem would
> indicate "don't drop the inode" even though it's about to vanish. But
> anyway, that's probably not important because...
>
> > dispose_list() to check how we need to free the inode, and then grab a
> > full reference to the inode while we're looping through the remaining
> > inodes, and simply iput them at the end.
> >
> > Since we're just calling iput we don't really care about the i_count on
> > the inode at the current time. Remove the i_count checks and just call
> > iput on every inode we find.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> > fs/inode.c | 26 +++++++++++---------------
> > 1 file changed, 11 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 72981b890ec6..80ad327746a7 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -933,7 +933,7 @@ static void evict(struct inode *inode)
> > * Dispose-list gets a local list with local inodes in it, so it doesn't
> > * need to worry about list corruption and SMP locks.
> > */
> > -static void dispose_list(struct list_head *head)
> > +static void dispose_list(struct list_head *head, bool for_lru)
> > {
> > while (!list_empty(head)) {
> > struct inode *inode;
> > @@ -941,8 +941,12 @@ static void dispose_list(struct list_head *head)
> > inode = list_first_entry(head, struct inode, i_lru);
> > list_del_init(&inode->i_lru);
> >
> > - evict(inode);
> > - iobj_put(inode);
> > + if (for_lru) {
> > + evict(inode);
> > + iobj_put(inode);
> > + } else {
> > + iput(inode);
> > + }
>
> ... Afaict, if we end up in dispose_list() we came from one of two
> locations:
>
> (1) prune_icache_sb()
> In which case inode_lru_isolate() will have only returned inodes
> that prior to your changes would have inode->i_count zero.
>
> (2) evict_inodes()
> Similar story, this only hits inodes with inode->i_count zero.
>
> With your change you're adding an increment from zero for (2) via
> __iget() so that you always end up with a full refcount, and that is
> backing your changes to dispose_list() later.
>
> I don't see the same done for (1) though and so your later call to
> iput() drops the reference below zero? It's accidently benign because
> iiuc atomic_dec_and_test() will simply tell you that reference count
> didn't go to zero and so iput() will back off. But still this should be
> fixed if I'm right.
Because (1) at this point doesn't have a full reference, it only has an
i_obj_count reference. The next patch converts this, and removes this bit. I did
it this way to clearly mark the change in behavior.
prune_icache_sb() will call dispose_list(&list, true), which will do the
evict(inode) and iobj_put(inode). This is correct because the inodes on the list
from prune_icache_sb() will have an i_count == and have I_WILL_FREE set, so it
will never have it's i_count increased to 1.
The change here is to change evict_inodes() to simply call iput(), as it calls
dispose_list(&list, false). We will increase the i_count to 1 from zero via
__iget(), which at this point in the series is completely correct behavior. Then
we will call iput() which will drop the i_count back to zero, and then call
iput_final, and since SB_ACTIVE is not set, it will call evict(inode) and clean
everything up properly.
>
> The conversion to iput() is introducing a lot of subtlety in the middle
> of the series. If I'm right then the iput() is a always a nop because in
> all cases it was an increment from zero. But it isn't really a nop
> because we still do stuff like call ->drop_inode() again. Maybe it's
> fine because no filesystem would have issues with this but I wouldn't
> count on it and also it feels rather unclean to do it this way.
So I'm definitely introducing another call to ->drop_inode() here, but
->drop_inode() has always been a "do we want to keep this inode on the LRU"
call, calling it again doesn't really change anything.
That being said it is a subtle functional change. I put it here specifically
because it is a functional change. If it bites us in the ass in some unforseen
way we'll be able to bisect it down to here and then we can all laugh at Josef
because he missed something.
>
> So, under the assumption, that after the increment from zero you did, we
> really only have a blatant zombie inode on our hands and we only need to
> get rid of the i_count we took make that explicit and do:
>
> if (for_lru) {
> evict(inode);
> iobj_put(inode);
> } else {
> /* This inode was always incremented from zero.
> * Get rid of that reference without doing anything else.
> */
> WARN_ON_ONCE(!atomic_dec_and_test(&inode->i_count));
> }
We still need the evict() to actually free the inode. We're just getting there
via iput_final() now instead of directly calling evict().
>
> Btw, for the iobj_put() above, I assume that we're not guaranteed that
> i_obj_count == 1?
Right, it's purely dropping the LRU list i_obj_count reference. Thanks,
Josef
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 18/50] fs: disallow 0 reference count inodes
2025-08-25 19:26 ` Josef Bacik
@ 2025-08-26 9:28 ` Christian Brauner
0 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-26 9:28 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Mon, Aug 25, 2025 at 03:26:10PM -0400, Josef Bacik wrote:
> On Mon, Aug 25, 2025 at 12:54:01PM +0200, Christian Brauner wrote:
> > On Thu, Aug 21, 2025 at 04:18:29PM -0400, Josef Bacik wrote:
> > > Now that we take a full reference for inodes on the LRU, move the logic
> > > to add the inode to the LRU to before we drop our last reference. This
> > > allows us to ensure that if the inode has a reference count it can be
> > > used, and we no longer hold onto inodes that have a 0 reference count.
> > >
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > > ---
> > > fs/inode.c | 53 +++++++++++++++++++++++++++++++++--------------------
> > > 1 file changed, 33 insertions(+), 20 deletions(-)
> > >
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index de0ec791f9a3..b4145ddbaf8e 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -614,7 +614,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> > >
> > > if (inode->i_state & (I_FREEING | I_WILL_FREE))
> > > return;
> > > - if (atomic_read(&inode->i_count))
> > > + if (atomic_read(&inode->i_count) != 1)
> > > return;
> > > if (inode->__i_nlink == 0)
> > > return;
> > > @@ -1966,28 +1966,11 @@ EXPORT_SYMBOL(generic_delete_inode);
> > > * in cache if fs is alive, sync and evict if fs is
> > > * shutting down.
> > > */
> > > -static void iput_final(struct inode *inode, bool skip_lru)
> > > +static void iput_final(struct inode *inode, bool drop)
> > > {
> > > - struct super_block *sb = inode->i_sb;
> > > - const struct super_operations *op = inode->i_sb->s_op;
> > > unsigned long state;
> > > - int drop;
> > >
> > > WARN_ON(inode->i_state & I_NEW);
> > > -
> > > - if (op->drop_inode)
> > > - drop = op->drop_inode(inode);
> > > - else
> > > - drop = generic_drop_inode(inode);
> > > -
> > > - if (!drop && !skip_lru &&
> > > - !(inode->i_state & I_DONTCACHE) &&
> > > - (sb->s_flags & SB_ACTIVE)) {
> > > - __inode_add_lru(inode, true);
> > > - spin_unlock(&inode->i_lock);
> > > - return;
> > > - }
> > > -
> > > WARN_ON(!list_empty(&inode->i_lru));
> > >
> > > state = inode->i_state;
> > > @@ -2009,8 +1992,29 @@ static void iput_final(struct inode *inode, bool skip_lru)
> > > evict(inode);
> > > }
> > >
> > > +static bool maybe_add_lru(struct inode *inode, bool skip_lru)
> > > +{
> > > + const struct super_operations *op = inode->i_sb->s_op;
> > > + struct super_block *sb = inode->i_sb;
> > > + bool drop = false;
> > > +
> > > + if (op->drop_inode)
> > > + drop = op->drop_inode(inode);
> > > + else
> > > + drop = generic_drop_inode(inode);
> > > +
> > > + if (!drop && !skip_lru &&
> > > + !(inode->i_state & I_DONTCACHE) &&
> > > + (sb->s_flags & SB_ACTIVE))
> > > + __inode_add_lru(inode, true);
> > > +
> > > + return drop;
> > > +}
> >
> > Can we rewrite this as:
> >
> > static bool maybe_add_lru(struct inode *inode, bool skip_lru)
> > {
> > const struct super_operations *op = inode->i_sb->s_op;
> > const struct super_block *sb = inode->i_sb;
> > bool drop = false;
> >
> > if (op->drop_inode)
> > drop = op->drop_inode(inode);
> > else
> > drop = generic_drop_inode(inode);
> >
> > if (drop)
> > return drop;
> >
> > if (skip_lru)
> > return drop;
> >
> > if (inode->i_state & I_DONTCACHE)
> > return drop;
> >
> > if (!(sb->s_flags & SB_ACTIVE))
> > return drop;
> >
> > __inode_add_lru(inode, true);
> > return drop;
> > }
> >
> > so it's a lot easier to follow. I really dislike munging conditions
> > together with a bunch of ands and negations mixed in.
> >
> > And btw for both I_DONTCACHE and !SB_ACTIVE it seems that returning
> > anything other than false from op->drop_inode() would be a bug probably
> > a technicality but I find it pretty odd.
>
> Not necsessarily, maybe we had some delayed iput (*cough* btrfs *cough*) that
> didn't run until umount time and now we have true coming from ->drop_inode()
> with SB_ACTIVE turned off. That would be completely valid. Thanks,
Ah, right, thanks! Yeah, that's seems legit.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 16/50] fs: change evict_inodes to use iput instead of evict directly
2025-08-25 19:35 ` Josef Bacik
@ 2025-08-26 9:56 ` Christian Brauner
0 siblings, 0 replies; 85+ messages in thread
From: Christian Brauner @ 2025-08-26 9:56 UTC (permalink / raw)
To: Josef Bacik
Cc: linux-fsdevel, linux-btrfs, kernel-team, linux-ext4, linux-xfs,
viro
On Mon, Aug 25, 2025 at 03:35:02PM -0400, Josef Bacik wrote:
> On Mon, Aug 25, 2025 at 11:07:55AM +0200, Christian Brauner wrote:
> > On Thu, Aug 21, 2025 at 04:18:27PM -0400, Josef Bacik wrote:
> > > At evict_inodes() time, we no longer have SB_ACTIVE set, so we can
> > > easily go through the normal iput path to clear any inodes. Update
> >
> > I'm a bit lost why SB_ACTIVE is used here as a justification to call
> > iput(). I think it's because iput_final() would somehow add it back to
> > the LRU if SB_ACTIVE was still set and the filesystem somehow would
> > indicate it wouldn't want to drop the inode.
> >
> > I'm confused where that would even happen. IOW, which filesystem would
> > indicate "don't drop the inode" even though it's about to vanish. But
> > anyway, that's probably not important because...
> >
> > > dispose_list() to check how we need to free the inode, and then grab a
> > > full reference to the inode while we're looping through the remaining
> > > inodes, and simply iput them at the end.
> > >
> > > Since we're just calling iput we don't really care about the i_count on
> > > the inode at the current time. Remove the i_count checks and just call
> > > iput on every inode we find.
> > >
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > > ---
> > > fs/inode.c | 26 +++++++++++---------------
> > > 1 file changed, 11 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index 72981b890ec6..80ad327746a7 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -933,7 +933,7 @@ static void evict(struct inode *inode)
> > > * Dispose-list gets a local list with local inodes in it, so it doesn't
> > > * need to worry about list corruption and SMP locks.
> > > */
> > > -static void dispose_list(struct list_head *head)
> > > +static void dispose_list(struct list_head *head, bool for_lru)
> > > {
> > > while (!list_empty(head)) {
> > > struct inode *inode;
> > > @@ -941,8 +941,12 @@ static void dispose_list(struct list_head *head)
> > > inode = list_first_entry(head, struct inode, i_lru);
> > > list_del_init(&inode->i_lru);
> > >
> > > - evict(inode);
> > > - iobj_put(inode);
> > > + if (for_lru) {
> > > + evict(inode);
> > > + iobj_put(inode);
> > > + } else {
> > > + iput(inode);
> > > + }
> >
> > ... Afaict, if we end up in dispose_list() we came from one of two
> > locations:
> >
> > (1) prune_icache_sb()
> > In which case inode_lru_isolate() will have only returned inodes
> > that prior to your changes would have inode->i_count zero.
> >
> > (2) evict_inodes()
> > Similar story, this only hits inodes with inode->i_count zero.
> >
> > With your change you're adding an increment from zero for (2) via
> > __iget() so that you always end up with a full refcount, and that is
> > backing your changes to dispose_list() later.
> >
> > I don't see the same done for (1) though and so your later call to
> > iput() drops the reference below zero? It's accidently benign because
> > iiuc atomic_dec_and_test() will simply tell you that reference count
> > didn't go to zero and so iput() will back off. But still this should be
> > fixed if I'm right.
>
> Because (1) at this point doesn't have a full reference, it only has an
> i_obj_count reference. The next patch converts this, and removes this bit. I did
> it this way to clearly mark the change in behavior.
Ah, right, I forgot to take the the boolean argument into account.
You're passing that as true in (1). Sure and then sorry about the noise.
> prune_icache_sb() will call dispose_list(&list, true), which will do the
> evict(inode) and iobj_put(inode). This is correct because the inodes on the list
> from prune_icache_sb() will have an i_count == and have I_WILL_FREE set, so it
> will never have it's i_count increased to 1.
>
> The change here is to change evict_inodes() to simply call iput(), as it calls
> dispose_list(&list, false). We will increase the i_count to 1 from zero via
> __iget(), which at this point in the series is completely correct behavior. Then
> we will call iput() which will drop the i_count back to zero, and then call
> iput_final, and since SB_ACTIVE is not set, it will call evict(inode) and clean
> everything up properly.
>
> >
> > The conversion to iput() is introducing a lot of subtlety in the middle
> > of the series. If I'm right then the iput() is a always a nop because in
> > all cases it was an increment from zero. But it isn't really a nop
> > because we still do stuff like call ->drop_inode() again. Maybe it's
> > fine because no filesystem would have issues with this but I wouldn't
> > count on it and also it feels rather unclean to do it this way.
>
> So I'm definitely introducing another call to ->drop_inode() here, but
> ->drop_inode() has always been a "do we want to keep this inode on the LRU"
> call, calling it again doesn't really change anything.
Right, the first time the inode reference count drops to zero we call
iput_final() and then ->drop_inode() which tells us to put it on the
LRU. And we leave it otherwise in tact. This is the part I forgot.
Once you add it to the LRU you'll have incremented i_obj_count either
when you added it to the cached LRU and moving it from there to the
"actual" LRU or you'll take a new one.
And then dispose_list() for (1) will just need to put the i_obj_count.
Thanks!
> That being said it is a subtle functional change. I put it here specifically
> because it is a functional change. If it bites us in the ass in some unforseen
> way we'll be able to bisect it down to here and then we can all laugh at Josef
> because he missed something.
Yeah, it's good. It's just very confusing because for a brief time we
have to understand two reference counts that are first coupled almost
1:1 and then slowly decoupled which makes this particularly nasty...
> > So, under the assumption, that after the increment from zero you did, we
> > really only have a blatant zombie inode on our hands and we only need to
> > get rid of the i_count we took make that explicit and do:
> >
> > if (for_lru) {
> > evict(inode);
> > iobj_put(inode);
> > } else {
> > /* This inode was always incremented from zero.
> > * Get rid of that reference without doing anything else.
> > */
> > WARN_ON_ONCE(!atomic_dec_and_test(&inode->i_count));
> > }
>
> We still need the evict() to actually free the inode. We're just getting there
> via iput_final() now instead of directly calling evict().
Right, thanks!
>
> >
> > Btw, for the iobj_put() above, I assume that we're not guaranteed that
> > i_obj_count == 1?
>
> Right, it's purely dropping the LRU list i_obj_count reference. Thanks,
Thanks for the comments!
^ permalink raw reply [flat|nested] 85+ messages in thread
end of thread, other threads:[~2025-08-26 9:56 UTC | newest]
Thread overview: 85+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-21 20:18 [PATCH 00/50] fs: rework inode reference counting Josef Bacik
2025-08-21 20:18 ` [PATCH 01/50] fs: add an i_obj_count refcount to the inode Josef Bacik
2025-08-21 20:18 ` [PATCH 02/50] fs: make the i_state flags an enum Josef Bacik
2025-08-22 11:08 ` Christian Brauner
2025-08-22 13:31 ` Josef Bacik
2025-08-22 14:36 ` David Sterba
2025-08-22 11:18 ` Sun YangKai
2025-08-22 11:42 ` [PATCH 02/50] " Alan Huang
2025-08-22 12:11 ` Sun YangKai
2025-08-22 14:40 ` [PATCH 02/50] fs: " Josef Bacik
2025-08-21 20:18 ` [PATCH 03/50] fs: hold an i_obj_count reference in wait_sb_inodes Josef Bacik
2025-08-21 20:18 ` [PATCH 04/50] fs: hold an i_obj_count reference for the i_wb_list Josef Bacik
2025-08-22 11:27 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 05/50] fs: hold an i_obj_count reference for the i_io_list Josef Bacik
2025-08-21 20:18 ` [PATCH 06/50] fs: hold an i_obj_count reference in writeback_sb_inodes Josef Bacik
2025-08-22 12:20 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 07/50] fs: hold an i_obj_count reference while on the hashtable Josef Bacik
2025-08-21 20:18 ` [PATCH 08/50] fs: hold an i_obj_count reference while on the LRU list Josef Bacik
2025-08-21 20:18 ` [PATCH 09/50] fs: hold an i_obj_count reference while on the sb inode list Josef Bacik
2025-08-21 20:18 ` [PATCH 10/50] fs: stop accessing ->i_count directly in f2fs and gfs2 Josef Bacik
2025-08-22 12:38 ` (subset) " Christian Brauner
2025-08-21 20:18 ` [PATCH 11/50] fs: hold an i_obj_count when we have an i_count reference Josef Bacik
2025-08-21 20:18 ` [PATCH 12/50] fs: rework iput logic Josef Bacik
2025-08-22 12:54 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 13/50] fs: add an I_LRU flag to the inode Josef Bacik
2025-08-21 20:18 ` [PATCH 14/50] fs: maintain a list of pinned inodes Josef Bacik
2025-08-22 14:55 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 15/50] fs: delete the inode from the LRU list on lookup Josef Bacik
2025-08-22 15:27 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 16/50] fs: change evict_inodes to use iput instead of evict directly Josef Bacik
2025-08-25 9:07 ` Christian Brauner
2025-08-25 19:35 ` Josef Bacik
2025-08-26 9:56 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 17/50] fs: hold a full ref while the inode is on a LRU Josef Bacik
2025-08-25 9:20 ` Christian Brauner
2025-08-25 10:40 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 18/50] fs: disallow 0 reference count inodes Josef Bacik
2025-08-25 10:54 ` Christian Brauner
2025-08-25 19:26 ` Josef Bacik
2025-08-26 9:28 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 19/50] fs: make evict_inodes add to the dispose list under the i_lock Josef Bacik
2025-08-21 20:18 ` [PATCH 20/50] fs: convert i_count to refcount_t Josef Bacik
2025-08-22 12:10 ` Amir Goldstein
2025-08-22 13:56 ` kernel test robot
2025-08-25 11:03 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 21/50] fs: use refcount_inc_not_zero in igrab Josef Bacik
2025-08-25 11:21 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 22/50] fs: use inode_tryget in find_inode* Josef Bacik
2025-08-25 11:26 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 23/50] fs: update find_inode_*rcu to check the i_count count Josef Bacik
2025-08-25 11:27 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 24/50] fs: use igrab in insert_inode_locked Josef Bacik
2025-08-21 20:18 ` [PATCH 25/50] fs: remove I_WILL_FREE|I_FREEING check from __inode_add_lru Josef Bacik
2025-08-21 20:18 ` [PATCH 26/50] fs: remove I_WILL_FREE|I_FREEING check in inode_pin_lru_isolating Josef Bacik
2025-08-21 20:18 ` [PATCH 27/50] fs: use inode_tryget in evict_inodes Josef Bacik
2025-08-25 11:43 ` Christian Brauner
2025-08-25 18:22 ` Josef Bacik
2025-08-21 20:18 ` [PATCH 28/50] fs: change evict_dentries_for_decrypted_inodes to use refcount Josef Bacik
2025-08-21 20:18 ` [PATCH 29/50] block: use igrab in sync_bdevs Josef Bacik
2025-08-21 20:18 ` [PATCH 30/50] bcachefs: use the refcount instead of I_WILL_FREE|I_FREEING Josef Bacik
2025-08-21 20:18 ` [PATCH 31/50] btrfs: don't check I_WILL_FREE|I_FREEING Josef Bacik
2025-08-21 20:18 ` [PATCH 32/50] fs: use igrab in drop_pagecache_sb Josef Bacik
2025-08-21 20:18 ` [PATCH 33/50] fs: stop checking I_FREEING in d_find_alias_rcu Josef Bacik
2025-08-21 20:18 ` [PATCH 34/50] ext4: stop checking I_WILL_FREE|IFREEING in ext4_check_map_extents_env Josef Bacik
2025-08-21 20:18 ` [PATCH 35/50] fs: remove I_WILL_FREE|I_FREEING from fs-writeback.c Josef Bacik
2025-08-25 11:46 ` Christian Brauner
2025-08-21 20:18 ` [PATCH 36/50] gfs2: remove I_WILL_FREE|I_FREEING usage Josef Bacik
2025-08-21 20:18 ` [PATCH 37/50] fs: remove I_WILL_FREE|I_FREEING check from dquot.c Josef Bacik
2025-08-21 20:18 ` [PATCH 38/50] notify: remove I_WILL_FREE|I_FREEING checks in fsnotify_unmount_inodes Josef Bacik
2025-08-21 20:18 ` [PATCH 39/50] xfs: remove I_FREEING check Josef Bacik
2025-08-21 20:18 ` [PATCH 40/50] landlock: remove I_FREEING|I_WILL_FREE check Josef Bacik
2025-08-21 20:18 ` [PATCH 41/50] fs: change inode_is_dirtytime_only to use refcount Josef Bacik
2025-08-21 20:18 ` [PATCH 42/50] btrfs: remove references to I_FREEING Josef Bacik
2025-08-21 20:18 ` [PATCH 43/50] ext4: remove reference to I_FREEING in inode.c Josef Bacik
2025-08-21 20:18 ` [PATCH 44/50] ext4: remove reference to I_FREEING in orphan.c Josef Bacik
2025-08-21 20:18 ` [PATCH 45/50] pnfs: use i_count refcount to determine if the inode is going away Josef Bacik
2025-08-21 20:18 ` [PATCH 46/50] fs: remove some spurious I_FREEING references in inode.c Josef Bacik
2025-08-21 20:18 ` [PATCH 47/50] xfs: remove reference to I_FREEING|I_WILL_FREE Josef Bacik
2025-08-21 20:18 ` [PATCH 48/50] ocfs2: do not set I_WILL_FREE Josef Bacik
2025-08-21 20:19 ` [PATCH 49/50] fs: remove I_FREEING|I_WILL_FREE Josef Bacik
2025-08-25 11:53 ` Christian Brauner
2025-08-21 20:19 ` [PATCH 50/50] fs: add documentation explaining the reference count rules for inodes Josef Bacik
2025-08-25 11:56 ` Christian Brauner
2025-08-22 10:51 ` [PATCH 00/50] fs: rework inode reference counting Christian Brauner
2025-08-22 13:30 ` Josef Bacik
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).