* [PATCH RFC 0/4] fs: Deferred inode reclaim
@ 2026-04-29 18:00 Jan Kara
2026-04-29 18:00 ` [PATCH 1/4] fs: Avoid inode dirtying on last iput Jan Kara
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: Jan Kara @ 2026-04-29 18:00 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm, Matthew Wilcox, Jan Kara
Hello,
here are patches implementing deferred inode reclaim to deal with MM warnings
due to GFP_NOFAIL allocations from reclaim paths. The patches are only very
lightly tested and are meant mainly as a starting point for the discussion we
are going to have at the LSF/MM/BPF summit (yay for conference-driven
development). The first patch, dealing with lazy timestamp updates, is somewhat
standalone as I've decided to handle that case in the writeback infrastructure
instead, but it is also a case where we happen to do GFP_NOFAIL allocations
from the reclaim path. There are obviously other filesystems that need similar
treatment to ext4, but let's handle that once the infrastructure is settled.
Comments are welcome.
Honza
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 1/4] fs: Avoid inode dirtying on last iput
2026-04-29 18:00 [PATCH RFC 0/4] fs: Deferred inode reclaim Jan Kara
@ 2026-04-29 18:00 ` Jan Kara
2026-04-29 18:00 ` [PATCH 2/4] fs: Basic infrastructure for offloading inode reclaim Jan Kara
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2026-04-29 18:00 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm, Matthew Wilcox, Jan Kara
When an inode has dirty timestamps, we currently call sync_lazytime() on
the last iput. This is done because an inode with any dirty bit set is not
inserted into the LRU and dirty timestamps expire only after many (12 by
default) hours, so these inodes would be sitting outside of LRU aging for
a really long time. However, this can result in doing IO and consequently
GFP_NOFAIL allocations from dentry reclaim, making MM complain. A sample
trace for ext4 is:
prune_dcache_sb
shrink_dentry_list
__dentry_kill
iput
sync_lazytime
__mark_inode_dirty
ext4_dirty_inode
__ext4_mark_inode_dirty
ext4_reserve_inode_write
ext4_get_inode_loc
bdev_getblk
__filemap_get_folio_mpol
Avoid this dirtying on last iput by reshuffling unused inodes to the
beginning of the b_dirty_time list and clobbering dirtied_time_when instead,
so that they get written out during the next periodic writeback.
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/fs-writeback.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
fs/inode.c | 15 +++++++--------
fs/internal.h | 1 +
3 files changed, 53 insertions(+), 8 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e1fbdf9ee769..acc27fbe4230 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2729,6 +2729,51 @@ void __mark_inode_dirty(struct inode *inode, int flags)
}
EXPORT_SYMBOL(__mark_inode_dirty);
+/*
+ * If the inode has dirty timestamps to write out, make sure the flush worker
+ * writes them out during its next periodic writeback pass.
+ */
+void queue_dirtytime_writeback(struct inode *inode)
+{
+ struct bdi_writeback *wb;
+ unsigned long new_time;
+
+ lockdep_assert_held(&inode->i_lock);
+
+ if (!(inode_state_read(inode) & I_DIRTY_TIME))
+ return;
+
+ wb = locked_inode_to_wb_and_lock_list(inode);
+ spin_lock(&inode->i_lock);
+ /*
+ * If inode writeback is already queued or inode got dirty, we have
+ * nothing to do and we mustn't touch writeback lists anyway.
+ */
+ if (inode_state_read(inode) & (I_SYNC_QUEUED | I_DIRTY))
+ goto out_wb_lock;
+ /* Written back while we dropped i_lock? */
+ if (!(inode_state_read(inode) & I_DIRTY_TIME))
+ goto out_wb_lock;
+
+ /*
+ * Move inode to the beginning of dirty queue and clobber dirtied time
+ * so that it gets written out during the next periodic writeback.
+ */
+ new_time = jiffies - dirtytime_expire_interval * HZ;
+ if (!list_empty(&wb->b_dirty_time)) {
+ struct inode *first = wb_inode(wb->b_dirty_time.prev);
+ unsigned long first_time = READ_ONCE(first->dirtied_time_when);
+
+ if (time_before(first_time, new_time))
+ new_time = first_time;
+ }
+ inode->dirtied_when = new_time;
+ inode->dirtied_time_when = new_time;
+ list_move_tail(&inode->i_io_list, &wb->b_dirty_time);
+out_wb_lock:
+ spin_unlock(&wb->list_lock);
+}
+
/*
* The @s_sync_lock is used to serialise concurrent sync operations
* to avoid lock contention problems with concurrent wait_sb_inodes() calls.
diff --git a/fs/inode.c b/fs/inode.c
index 6a3cbc7dcd28..276debcd3e20 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1975,7 +1975,6 @@ void iput(struct inode *inode)
if (unlikely(!inode))
return;
-retry:
lockdep_assert_not_held(&inode->i_lock);
VFS_BUG_ON_INODE(inode_state_read_once(inode) & (I_FREEING | I_CLEAR), inode);
/*
@@ -1988,14 +1987,14 @@ void iput(struct inode *inode)
if (atomic_add_unless(&inode->i_count, -1, 1))
return;
- if (inode->i_nlink && sync_lazytime(inode))
- goto retry;
-
spin_lock(&inode->i_lock);
- if (unlikely((inode_state_read(inode) & I_DIRTY_TIME) && inode->i_nlink)) {
- spin_unlock(&inode->i_lock);
- goto retry;
- }
+ /*
+ * If inode has timestamp updates pending, queue flushing them now as
+ * otherwise the dirtiness could be preventing the inode from entering
+ * LRU for hours.
+ */
+ if (inode->i_nlink)
+ queue_dirtytime_writeback(inode);
if (!atomic_dec_and_test(&inode->i_count)) {
spin_unlock(&inode->i_lock);
diff --git a/fs/internal.h b/fs/internal.h
index d77578d66d42..7c8f452d28c6 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -219,6 +219,7 @@ bool in_group_or_capable(struct mnt_idmap *idmap,
*/
long get_nr_dirty_inodes(void);
bool sync_lazytime(struct inode *inode);
+void queue_dirtytime_writeback(struct inode *inode);
/*
* dcache.c
--
2.51.0
* [PATCH 2/4] fs: Basic infrastructure for offloading inode reclaim
2026-04-29 18:00 [PATCH RFC 0/4] fs: Deferred inode reclaim Jan Kara
2026-04-29 18:00 ` [PATCH 1/4] fs: Avoid inode dirtying on last iput Jan Kara
@ 2026-04-29 18:00 ` Jan Kara
2026-04-29 18:00 ` [PATCH 3/4] fs: Add throttling to deferred " Jan Kara
2026-04-29 18:00 ` [PATCH 4/4] ext4: Defer inode reclaim if it has preallocations Jan Kara
3 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2026-04-29 18:00 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm, Matthew Wilcox, Jan Kara
Reclaim of some inodes is rather complex, requiring running transactions
or doing other IO. Consequently, filesystems end up doing GFP_NOFAIL
allocations from kswapd or even direct reclaim, which is problematic
because forward progress of these allocations isn't guaranteed. Add
infrastructure for marking inodes whose reclaim is difficult and offload
reclaim of such inodes to a workqueue so that kswapd does not get blocked
by difficult inode reclaim.
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/inode.c | 89 +++++++++++++++++++++++++++++++---
fs/super.c | 5 ++
include/linux/fs.h | 5 +-
include/linux/fs/super_types.h | 7 +++
4 files changed, 99 insertions(+), 7 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 276debcd3e20..448e3d7ee48e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -938,6 +938,11 @@ void evict_inodes(struct super_block *sb)
}
EXPORT_SYMBOL_GPL(evict_inodes);
+struct inodes_to_prune {
+ struct list_head freeable;
+ struct list_head deferred;
+};
+
/*
* Isolate the inode from the LRU in preparation for freeing it.
*
@@ -952,7 +957,7 @@ EXPORT_SYMBOL_GPL(evict_inodes);
static enum lru_status inode_lru_isolate(struct list_head *item,
struct list_lru_one *lru, void *arg)
{
- struct list_head *freeable = arg;
+ struct inodes_to_prune *lists = arg;
struct inode *inode = container_of(item, struct inode, i_lru);
/*
@@ -969,7 +974,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
* sync, or the last page cache deletion will requeue them.
*/
if (icount_read(inode) ||
- (inode_state_read(inode) & ~I_REFERENCED) ||
+ inode_state_read(inode) & ~(I_REFERENCED | I_DEFER_RECLAIM) ||
!mapping_shrinkable(&inode->i_data)) {
list_lru_isolate(lru, &inode->i_lru);
spin_unlock(&inode->i_lock);
@@ -1007,7 +1012,11 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
WARN_ON(inode_state_read(inode) & I_NEW);
inode_state_set(inode, I_FREEING);
- list_lru_isolate_move(lru, &inode->i_lru, freeable);
+ /* Inode will take long time to cleanup. Offload that to worker. */
+ if (inode_state_read(inode) & I_DEFER_RECLAIM)
+ list_lru_isolate_move(lru, &inode->i_lru, &lists->deferred);
+ else
+ list_lru_isolate_move(lru, &inode->i_lru, &lists->freeable);
spin_unlock(&inode->i_lock);
this_cpu_dec(nr_unused);
@@ -1022,15 +1031,83 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
*/
long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
{
- LIST_HEAD(freeable);
+ struct inodes_to_prune lists = {
+ .freeable = LIST_HEAD_INIT(lists.freeable),
+ .deferred = LIST_HEAD_INIT(lists.deferred),
+ };
long freed;
freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
- inode_lru_isolate, &freeable);
- dispose_list(&freeable);
+ inode_lru_isolate, &lists);
+ dispose_list(&lists.freeable);
+ if (!list_empty(&lists.deferred)) {
+ struct inode_deferred_reclaim *reclaim =
+ READ_ONCE(sb->s_inode_reclaim);
+
+ if (WARN_ON_ONCE(!reclaim)) {
+ dispose_list(&lists.deferred);
+ return freed;
+ }
+ spin_lock(&reclaim->lock);
+ if (list_empty(&reclaim->list))
+ queue_work(system_dfl_wq, &reclaim->work);
+ list_splice_tail(&lists.deferred, &reclaim->list);
+ spin_unlock(&reclaim->lock);
+ }
return freed;
}
+static void inode_reclaim_deferred(struct work_struct *work)
+{
+ struct inode_deferred_reclaim *reclaim =
+ container_of(work, struct inode_deferred_reclaim, work);
+ struct inode *inode;
+
+ spin_lock(&reclaim->lock);
+ while (!list_empty(&reclaim->list)) {
+ inode = list_first_entry(&reclaim->list, struct inode, i_lru);
+ list_del_init(&inode->i_lru);
+ spin_unlock(&reclaim->lock);
+ evict(inode);
+ cond_resched();
+ spin_lock(&reclaim->lock);
+ }
+ spin_unlock(&reclaim->lock);
+}
+
+static struct inode_deferred_reclaim *inode_deferred_reclaim_alloc(
+ struct super_block *sb)
+{
+ struct inode_deferred_reclaim *reclaim;
+
+ reclaim = kzalloc_obj(*reclaim, GFP_KERNEL | __GFP_NOFAIL);
+ INIT_LIST_HEAD(&reclaim->list);
+ INIT_WORK(&reclaim->work, inode_reclaim_deferred);
+ spin_lock_init(&reclaim->lock);
+ /* Someone installed new struct before us? */
+ if (cmpxchg(&sb->s_inode_reclaim, NULL, reclaim))
+ kfree(reclaim);
+
+ return sb->s_inode_reclaim;
+}
+
+void mark_inode_reclaim_deferred(struct inode *inode)
+{
+ struct inode_deferred_reclaim *reclaim;
+
+ if (inode_state_read_once(inode) & I_DEFER_RECLAIM)
+ return;
+
+ reclaim = READ_ONCE(inode->i_sb->s_inode_reclaim);
+ if (!reclaim)
+ reclaim = inode_deferred_reclaim_alloc(inode->i_sb);
+
+ spin_lock(&inode->i_lock);
+ inode_state_set(inode, I_DEFER_RECLAIM);
+ spin_unlock(&inode->i_lock);
+}
+EXPORT_SYMBOL_GPL(mark_inode_reclaim_deferred);
+
static void __wait_on_freeing_inode(struct inode *inode, bool hash_locked, bool rcu_locked);
/*
diff --git a/fs/super.c b/fs/super.c
index 378e81efe643..c35bfb3f7785 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -645,6 +645,11 @@ void generic_shutdown_super(struct super_block *sb)
if (sop->put_super)
sop->put_super(sb);
+ if (sb->s_inode_reclaim) {
+ cancel_work_sync(&sb->s_inode_reclaim->work);
+ kfree(sb->s_inode_reclaim);
+ }
+
/*
* Now that all potentially-encrypted inodes have been evicted,
* the fscrypt keyring can be destroyed.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..2a20cbffc87c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -745,7 +745,8 @@ enum inode_state_flags_enum {
I_CREATING = (1U << 15),
I_DONTCACHE = (1U << 16),
I_SYNC_QUEUED = (1U << 17),
- I_PINNING_NETFS_WB = (1U << 18)
+ I_PINNING_NETFS_WB = (1U << 18),
+ I_DEFER_RECLAIM = (1U << 19),
};
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
@@ -2218,6 +2219,8 @@ static inline void mark_inode_dirty_sync(struct inode *inode)
__mark_inode_dirty(inode, I_DIRTY_SYNC);
}
+void mark_inode_reclaim_deferred(struct inode *inode);
+
static inline int icount_read(const struct inode *inode)
{
return atomic_read(&inode->i_count);
diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h
index 383050e7fdf5..00744ae5be18 100644
--- a/include/linux/fs/super_types.h
+++ b/include/linux/fs/super_types.h
@@ -129,6 +129,12 @@ struct super_operations {
void (*report_error)(const struct fserror_event *event);
};
+struct inode_deferred_reclaim {
+ struct list_head list;
+ struct work_struct work;
+ spinlock_t lock;
+};
+
struct super_block {
struct list_head s_list; /* Keep this first */
dev_t s_dev; /* search index; _not_ kdev_t */
@@ -254,6 +260,7 @@ struct super_block {
*/
struct list_lru s_dentry_lru;
struct list_lru s_inode_lru;
+ struct inode_deferred_reclaim *s_inode_reclaim;
struct rcu_head rcu;
struct work_struct destroy_work;
--
2.51.0
* [PATCH 3/4] fs: Add throttling to deferred inode reclaim
2026-04-29 18:00 [PATCH RFC 0/4] fs: Deferred inode reclaim Jan Kara
2026-04-29 18:00 ` [PATCH 1/4] fs: Avoid inode dirtying on last iput Jan Kara
2026-04-29 18:00 ` [PATCH 2/4] fs: Basic infrastructure for offloading inode reclaim Jan Kara
@ 2026-04-29 18:00 ` Jan Kara
2026-04-29 18:00 ` [PATCH 4/4] ext4: Defer inode reclaim if it has preallocations Jan Kara
3 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2026-04-29 18:00 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm, Matthew Wilcox, Jan Kara
Deferring difficult inode reclaim from prune_icache_sb() to a workqueue
removes the natural feedback loop of blocking tasks in direct reclaim
until they make space for new allocations. This can result in the list
of deferred inodes growing without bound and possibly push the
machine into a reclaim storm or OOM.
Add a throttling mechanism slowing down tasks in
mark_inode_reclaim_deferred() if the list of deferred inodes to reclaim
grows over a limit. We measure the average time it takes to reclaim an
inode on the deferred list and block tasks proportionally to that.
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/inode.c | 94 +++++++++++++++++++++++++++++---
include/linux/fs/super_types.h | 2 +
include/trace/events/writeback.h | 51 +++++++++++++++++
3 files changed, 139 insertions(+), 8 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 448e3d7ee48e..fe39f96fbc80 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -941,6 +941,7 @@ EXPORT_SYMBOL_GPL(evict_inodes);
struct inodes_to_prune {
struct list_head freeable;
struct list_head deferred;
+ int deferred_count;
};
/*
@@ -1013,9 +1014,10 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
WARN_ON(inode_state_read(inode) & I_NEW);
inode_state_set(inode, I_FREEING);
/* Inode will take long time to cleanup. Offload that to worker. */
- if (inode_state_read(inode) & I_DEFER_RECLAIM)
+ if (inode_state_read(inode) & I_DEFER_RECLAIM) {
list_lru_isolate_move(lru, &inode->i_lru, &lists->deferred);
- else
+ lists->deferred_count++;
+ } else
list_lru_isolate_move(lru, &inode->i_lru, &lists->freeable);
spin_unlock(&inode->i_lock);
@@ -1052,27 +1054,58 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
if (list_empty(&reclaim->list))
queue_work(system_dfl_wq, &reclaim->work);
list_splice_tail(&lists.deferred, &reclaim->list);
+ reclaim->len += lists.deferred_count;
spin_unlock(&reclaim->lock);
}
return freed;
}
+static void inode_reclaim_update_stat(struct inode_deferred_reclaim *reclaim,
+ struct super_block *sb, unsigned int n,
+ u64 start)
+{
+ u64 end = ktime_get_ns();
+ u32 delay;
+
+ delay = div_u64(end - start, n);
+ /* Smooth delay updates with exponential moving average */
+ reclaim->delay = (63 * (u64)reclaim->delay + delay) / 64;
+
+ trace_inode_reclaim_update_stat(sb, n, delay, reclaim->delay);
+}
+
static void inode_reclaim_deferred(struct work_struct *work)
{
struct inode_deferred_reclaim *reclaim =
container_of(work, struct inode_deferred_reclaim, work);
+ struct super_block *sb = NULL;
struct inode *inode;
+ u64 start;
+ unsigned int batch = 0;
spin_lock(&reclaim->lock);
while (!list_empty(&reclaim->list)) {
inode = list_first_entry(&reclaim->list, struct inode, i_lru);
list_del_init(&inode->i_lru);
+ reclaim->len--;
spin_unlock(&reclaim->lock);
+ if (!sb)
+ sb = inode->i_sb;
+ if (!batch)
+ start = ktime_get_ns();
evict(inode);
+ batch++;
+ /* Batch stat updates to avoid excessive computations */
+ if (batch >= 64 || need_resched()) {
+ inode_reclaim_update_stat(reclaim, sb, batch, start);
+ batch = 0;
+ }
cond_resched();
spin_lock(&reclaim->lock);
}
spin_unlock(&reclaim->lock);
+ if (batch)
+ inode_reclaim_update_stat(reclaim, sb, batch, start);
}
static struct inode_deferred_reclaim *inode_deferred_reclaim_alloc(
@@ -1091,20 +1124,65 @@ static struct inode_deferred_reclaim *inode_deferred_reclaim_alloc(
return sb->s_inode_reclaim;
}
+/*
+ * Size of deferred reclaim list from which we start throttling tasks creating
+ * inodes marked for deferred reclaim.
+ */
+#define INODE_DEFERRED_RECLAIM_LIMIT 8192
+
+static void throttle_inode_deferred_reclaim(struct inode *inode)
+{
+ unsigned int len;
+ struct inode_deferred_reclaim *reclaim =
+ READ_ONCE(inode->i_sb->s_inode_reclaim);
+
+ if (!reclaim)
+ reclaim = inode_deferred_reclaim_alloc(inode->i_sb);
+
+ /*
+ * If inodes with deferred reclaim are accumulating too much, slow down
+ * tasks creating them. This doesn't provide any kind of guarantee on
+ * the length of the deferred list since lots of inodes with
+ * I_DEFER_RECLAIM can be already present in the inode cache and we
+ * have no control when they reach the deferred list. But if the
+ * pressure on the deferred list is sustained, the balance should
+ * eventually be established.
+ */
+ len = READ_ONCE(reclaim->len);
+ if (len > INODE_DEFERRED_RECLAIM_LIMIT) {
+ u64 delay = READ_ONCE(reclaim->delay);
+
+ if (!delay)
+ return;
+ /*
+ * Scale the delay based on how much we exceed the limit. Wait
+ * at most 4x as long as estimated time to reclaim the inode.
+ */
+ len = min(len, 5 * INODE_DEFERRED_RECLAIM_LIMIT);
+ delay = div_u64(delay * (len - INODE_DEFERRED_RECLAIM_LIMIT),
+ INODE_DEFERRED_RECLAIM_LIMIT);
+ trace_mark_inode_reclaim_deferred_throttle(inode, len, delay);
+
+ schedule_timeout_killable(nsecs_to_jiffies(delay));
+ }
+}
+
void mark_inode_reclaim_deferred(struct inode *inode)
{
- struct inode_deferred_reclaim *reclaim;
+ bool throttle = false;
if (inode_state_read_once(inode) & I_DEFER_RECLAIM)
return;
- reclaim = READ_ONCE(inode->i_sb->s_inode_reclaim);
- if (!reclaim)
- reclaim = inode_deferred_reclaim_alloc(inode->i_sb);
-
spin_lock(&inode->i_lock);
- inode_state_set(inode, I_DEFER_RECLAIM);
+ if (!(inode_state_read(inode) & I_DEFER_RECLAIM)) {
+ inode_state_set(inode, I_DEFER_RECLAIM);
+ throttle = true;
+ }
spin_unlock(&inode->i_lock);
+
+ if (throttle)
+ throttle_inode_deferred_reclaim(inode);
}
EXPORT_SYMBOL_GPL(mark_inode_reclaim_deferred);
diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h
index 00744ae5be18..533256892550 100644
--- a/include/linux/fs/super_types.h
+++ b/include/linux/fs/super_types.h
@@ -133,6 +133,8 @@ struct inode_deferred_reclaim {
struct list_head list;
struct work_struct work;
spinlock_t lock;
+ unsigned int len;
+ u32 delay;
};
struct super_block {
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index bdac0d685a98..c0ae39b4dc7b 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -879,6 +879,57 @@ DEFINE_EVENT(writeback_inode_template, sb_clear_inode_writeback,
TP_ARGS(inode)
);
+TRACE_EVENT(inode_reclaim_update_stat,
+ TP_PROTO(
+ struct super_block *sb,
+ unsigned int n,
+ u32 batch_delay,
+ u32 avg_delay
+ ),
+ TP_ARGS(sb, n, batch_delay, avg_delay),
+
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(unsigned int, n)
+ __field(u32, batch_delay)
+ __field(u32, avg_delay)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = sb->s_dev;
+ __entry->n = n;
+ __entry->batch_delay = batch_delay;
+ __entry->avg_delay = avg_delay;
+ ),
+
+ TP_printk("dev %d,%d batch size %u batch delay %u ns avg delay %u ns",
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->n,
+ __entry->batch_delay, __entry->avg_delay)
+);
+
+TRACE_EVENT(mark_inode_reclaim_deferred_throttle,
+ TP_PROTO(struct inode *inode, unsigned int len, u64 delay),
+ TP_ARGS(inode, len, delay),
+
+ TP_STRUCT__entry(
+ __field(u64, ino)
+ __field(dev_t, dev)
+ __field(unsigned int, len)
+ __field(u64, delay)
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->len = len;
+ __entry->delay = delay;
+ ),
+
+ TP_printk("dev %d,%d ino %llu deferred list len %u delay %llu ns",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ino, __entry->len, __entry->delay)
+);
+
#endif /* _TRACE_WRITEBACK_H */
/* This part must be outside protection */
--
2.51.0
* [PATCH 4/4] ext4: Defer inode reclaim if it has preallocations
2026-04-29 18:00 [PATCH RFC 0/4] fs: Deferred inode reclaim Jan Kara
` (2 preceding siblings ...)
2026-04-29 18:00 ` [PATCH 3/4] fs: Add throttling to deferred " Jan Kara
@ 2026-04-29 18:00 ` Jan Kara
3 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2026-04-29 18:00 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm, Matthew Wilcox, Jan Kara
When we have to free preallocations during inode eviction, we need to
load block bitmaps and run a transaction to modify them. This takes time
and also requires GFP_NOFAIL allocations. Mark inodes with preallocated
blocks as needing their reclaim offloaded to a workqueue so that we
don't block reclaim for long and potentially deadlock the MM subsystem.
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/ext4/mballoc.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index ed1bd00e11cd..4de3cfa4e648 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3114,6 +3114,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
if (ac->ac_prefetch_nr)
ext4_mb_prefetch_fini(sb, ac->ac_prefetch_grp, ac->ac_prefetch_nr);
+ /*
+ * Freeing preallocations requires loading bitmaps and running
+ * transactions. Defer inode reclaim to a workqueue.
+ */
+ if (!RB_EMPTY_ROOT(&EXT4_I(ac->ac_inode)->i_prealloc_node))
+ mark_inode_reclaim_deferred(ac->ac_inode);
+
return err;
}
--
2.51.0