public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] ext4: avoid __GFP_NOFAIL in __ext4_get_inode_loc allocation
@ 2026-04-27 22:23 Chao Shi
  2026-04-28  1:28 ` Theodore Tso
  0 siblings, 1 reply; 2+ messages in thread
From: Chao Shi @ 2026-04-27 22:23 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, Chao Shi, Sungwoo Kim, Dave Tian,
	Weidong Zhu

When kswapd shrinks the dcache, the last iput() on an ext4 inode can
trigger ext4_orphan_del(), which calls ext4_reserve_inode_write() and
ultimately __ext4_get_inode_loc().  That function calls sb_getblk(),
which wraps __getblk() and carries implicit __GFP_NOFAIL.  Because
kswapd runs with PF_MEMALLOC set, combining NOFAIL with a non-reclaimable
context trips WARN_ON_ONCE(current->flags & PF_MEMALLOC) inside
__alloc_pages_slowpath(), producing a spurious splat even though the
allocation could simply fail and return -ENOMEM to the caller.

Switch both sb_getblk() call sites in __ext4_get_inode_loc() to
sb_getblk_gfp() with the same flags minus __GFP_NOFAIL
(mapping_gfp_constraint(~__GFP_FS) | __GFP_MOVABLE), computing the gfp
value once and reusing it for the optional bitmap_bh optimisation fetch.
All callers of __ext4_get_inode_loc() -- reached via ext4_get_inode_loc(),
__ext4_get_inode_loc_noinmem(), and ext4_get_fc_inode_loc() -- already
propagate a non-zero return as an error without aborting the filesystem.
Both sb_getblk() call sites in __ext4_get_inode_loc() are converted; the
bitmap_bh fetch already falls back to make_io on NULL, so allowing it to
fail is a no-op there.

Reproduced under syzkaller+FEMU based fuzz tool (FuzzNvme) on x86_64 QEMU,
based on mainline 894009e2ef10:

  WARNING: CPU: 0 PID: 55 at mm/page_alloc.c:4722
                                    __alloc_pages_slowpath
  Comm: kswapd0 Not tainted 6.19.0+ #14
  Call Trace:
   __alloc_pages_slowpath
   alloc_pages_mpol
   folio_alloc_noprof
   filemap_alloc_folio_noprof
   __filemap_get_folio
   grow_dev_folio
   grow_buffers
   __getblk_slow
   bdev_getblk
   __ext4_get_inode_loc
   ext4_get_inode_loc
   ext4_reserve_inode_write
   ext4_orphan_del
   ext4_evict_inode
   evict
   iput
   dentry_unlink_inode
   __dentry_kill
   shrink_dentry_list
   prune_dcache_sb
   super_cache_scan
   do_shrink_slab
   shrink_slab
   shrink_node
   balance_pgdat
   kswapd
   kthread
   ret_from_fork

Related: see d8b90e6387a ("ext4: add ext4_sb_bread_nofail() helper
function for ext4_free_branches()") for the same strategy applied to
the read path in ext4_free_branches().
Link: https://lore.kernel.org/all/?q=PF_MEMALLOC+nofail+ext4+iput

Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---
 fs/ext4/inode.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3..1b2a7bd59b8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4859,6 +4859,7 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
 	ext4_fsblk_t		block;
 	struct blk_plug		plug;
 	int			inodes_per_block, inode_offset;
+	gfp_t			gfp;
 
 	iloc->bh = NULL;
 	if (ino < EXT4_ROOT_INO ||
@@ -4887,7 +4888,14 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
 	}
 	block += (inode_offset / inodes_per_block);
 
-	bh = sb_getblk(sb, block);
+	/*
+	 * No __GFP_NOFAIL: this can run from reclaim context (kswapd
+	 * shrinker -> iput -> ext4_orphan_del path) where NOFAIL trips
+	 * WARN_ON_ONCE in __alloc_pages_slowpath().
+	 */
+	gfp = mapping_gfp_constraint(sb->s_bdev->bd_mapping, ~__GFP_FS) |
+		__GFP_MOVABLE;
+	bh = sb_getblk_gfp(sb, block, gfp);
 	if (unlikely(!bh))
 		return -ENOMEM;
 	if (ext4_buffer_uptodate(bh))
@@ -4912,7 +4920,7 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
 		start = inode_offset & ~(inodes_per_block - 1);
 
 		/* Is the inode bitmap in cache? */
-		bitmap_bh = sb_getblk(sb, ext4_inode_bitmap(sb, gdp));
+		bitmap_bh = sb_getblk_gfp(sb, ext4_inode_bitmap(sb, gdp), gfp);
 		if (unlikely(!bitmap_bh))
 			goto make_io;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 2+ messages in thread

* Re: [PATCH] ext4: avoid __GFP_NOFAIL in __ext4_get_inode_loc allocation
  2026-04-27 22:23 [PATCH] ext4: avoid __GFP_NOFAIL in __ext4_get_inode_loc allocation Chao Shi
@ 2026-04-28  1:28 ` Theodore Tso
  0 siblings, 0 replies; 2+ messages in thread
From: Theodore Tso @ 2026-04-28  1:28 UTC (permalink / raw)
  To: Chao Shi
  Cc: linux-ext4, adilger.kernel, jack, Sungwoo Kim, Dave Tian,
	Weidong Zhu

On Mon, Apr 27, 2026 at 06:23:00PM -0400, Chao Shi wrote:
> When kswapd shrinks the dcache, the last iput() on an ext4 inode can
> trigger ext4_orphan_del(), which calls ext4_reserve_inode_write() and
> ultimately __ext4_get_inode_loc().  That function calls sb_getblk(),
> which wraps __getblk() and carries implicit __GFP_NOFAIL.  Because
> kswapd runs with PF_MEMALLOC set, combining NOFAIL with a non-reclaimable
> context trips WARN_ON_ONCE(current->flags & PF_MEMALLOC) inside
> __alloc_pages_slowpath(), producing a spurious splat even though the
> allocation could simply fail and return -ENOMEM to the caller.

NAK.  As Sashiko correctly points out:

Sashiko AI review found 1 potential issue(s):
- [Critical] Removing __GFP_NOFAIL from __ext4_get_inode_loc causes transient memory
shortages to trigger a fatal filesystem abort (remount read-only) or severe metadata
corruption, trading a memory reclaim warning for a Denial of Service.

The warning in mm/page_alloc.c is the sort of thing that causes file
system developers to decide to drop __GFP_NOFAIL and replace it with a
retry loop just to shut the mm subsystem the heck up, since some mm
developers seem to view hangs in heavy OOM conditions as the worst
thing, where as fs developers consider data corruption to be far
worse, since users tend to get cranky when they lose their data, and
(a) in practice the OOM killer tends to get triggered first, and (b)
that's what software and hardware watchdogs are for.

In any case, there are *far* worse things than a random splat, and if
you really want to make it go away, my suggestion is to remove the
WARN_ON_ONCE from __alloc_pages_slowpath().

		/*
		 * PF_MEMALLOC request from this context is rather bizarre
		 * because we cannot reclaim anything and only can loop waiting
		 * for somebody to do a work for us.
		 */
		WARN_ON_ONCE(current->flags & PF_MEMALLOC);

I disagrr the premise; it's not bizzare at all.

						- Ted


^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2026-04-28  1:29 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-27 22:23 [PATCH] ext4: avoid __GFP_NOFAIL in __ext4_get_inode_loc allocation Chao Shi
2026-04-28  1:28 ` Theodore Tso

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox