public inbox for linux-mm@kvack.org
* [LSF/MM/BPF TOPIC] Filesystem inode reclaim
@ 2026-04-09  9:16 Jan Kara
  2026-04-09 12:57 ` [Lsf-pc] " Amir Goldstein
  2026-04-09 16:12 ` Darrick J. Wong
  0 siblings, 2 replies; 5+ messages in thread
From: Jan Kara @ 2026-04-09  9:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, Matthew Wilcox, lsf-pc

Hello!

This is a recurring topic Matthew has been kicking forward for the last
year so let me maybe offer a fs-person point of view on the problem and
possible solutions. The problem is very simple: When a filesystem (ext4,
btrfs, vfat) is about to reclaim an inode, it sometimes needs to perform a
complex cleanup - like trimming of preallocated blocks beyond end of file,
making sure the journalling machinery is done with the inode, etc. This may
require reading metadata into memory, which requires memory allocations, and
as inode eviction cannot fail, these are effectively GFP_NOFAIL
allocations (and there are other reasons why it would be very difficult to
make some of these required allocations in the filesystems failable).

GFP_NOFAIL allocations from reclaim context (be it kswapd or direct reclaim)
trigger warnings - and for a good reason, as forward progress isn't
guaranteed. It also leaves a bad taste that we sometimes perform rather
long-running operations blocking on IO from reclaim context, thus
stalling reclaim for a substantial amount of time to free 1k worth of slab
cache.

I have been mulling over possible solutions since I don't think each
filesystem should be inventing a complex inode lifetime management scheme
as XFS has invented to solve these issues. Here's what I think we could do:

1) Filesystems will be required to mark inodes that have non-trivial
cleanup work to do on reclaim with an inode flag I_RECLAIM_HARD (or
whatever :)). Usually I expect this to happen on first inode modification
or so. This will require some per-fs work but it shouldn't be that
difficult and filesystems can be adapted one-by-one as they decide to
address these warnings from reclaim.

2) Inodes without I_RECLAIM_HARD will be reclaimed as usual directly from
kswapd / direct reclaim. I'm keeping this variant of inode reclaim for
performance reasons. I expect this to be a significant portion of inodes
on average and in particular for some workloads which scan a lot of inodes
(find through the whole fs or similar) the efficiency of inode reclaim is
one of the determining factors for their performance.

3) Inodes with I_RECLAIM_HARD will be moved by the shrinker to a separate
per-sb list s_hard_reclaim_inodes and we'll queue work (per-sb work struct)
to process them.

4) The work will walk s_hard_reclaim_inodes list and call evict() for each
inode, doing the hard work.

This way, kswapd / direct reclaim don't wait for hard-to-reclaim inodes
and can keep working on freeing the memory needed to free those
inodes. So the warnings about GFP_NOFAIL allocations aren't merely papered
over; they should really be addressed.
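
As a rough illustration of steps 1) - 4), here is a minimal userspace model
of the split between inline reclaim and deferred reclaim. It is only a
sketch: apart from the names I_RECLAIM_HARD and s_hard_reclaim_inodes, which
come from the proposal above, every type, function and value below is
hypothetical, and there is no locking and no real workqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag name from the proposal; the value is made up for this model. */
#define I_RECLAIM_HARD 0x1u

/* Hypothetical stand-in for an inode; not the kernel's struct inode. */
struct model_inode {
    unsigned int flags;
    bool evicted;
    struct model_inode *next;   /* link on the hard-reclaim list */
};

/* Hypothetical stand-in for the per-sb state from steps 3) and 4). */
struct model_sb {
    struct model_inode *hard_head;  /* models s_hard_reclaim_inodes */
    struct model_inode *hard_tail;
    size_t nr_hard;                 /* inodes currently on the list */
    bool work_queued;               /* models "per-sb work is queued" */
};

/* Models evict(): here it only records that the inode is gone. */
static void model_evict(struct model_inode *inode)
{
    inode->evicted = true;
}

/*
 * Shrinker side. Easy inodes are evicted inline (step 2); inodes with
 * I_RECLAIM_HARD are appended to the per-sb list and the work is marked
 * as queued (step 3). Returns true if the inode was freed inline.
 */
static bool model_shrink_one(struct model_sb *sb, struct model_inode *inode)
{
    if (!(inode->flags & I_RECLAIM_HARD)) {
        model_evict(inode);
        return true;
    }

    inode->next = NULL;
    if (sb->hard_tail)
        sb->hard_tail->next = inode;
    else
        sb->hard_head = inode;
    sb->hard_tail = inode;
    sb->nr_hard++;
    sb->work_queued = true;
    return false;
}

/*
 * Worker side (step 4): drain the list in FIFO order, evicting each
 * inode outside of the reclaim path. Returns the number processed.
 */
static size_t model_hard_reclaim_work(struct model_sb *sb)
{
    size_t done = 0;

    sb->work_queued = false;
    while (sb->hard_head) {
        struct model_inode *inode = sb->hard_head;

        sb->hard_head = inode->next;
        if (!sb->hard_head)
            sb->hard_tail = NULL;
        inode->next = NULL;
        sb->nr_hard--;
        model_evict(inode);
        done++;
    }
    return done;
}
```

The point of the model is only that model_shrink_one() never does the
expensive work itself for flagged inodes; it returns immediately and leaves
the eviction to model_hard_reclaim_work().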

One possible concern is that the s_hard_reclaim_inodes list could grow out
of control for some workloads (in particular because there could be multiple
CPUs generating hard-to-reclaim inodes while the cleanup would be
single-threaded). This could be addressed by tracking the number of inodes
on that list and, if it grows over some limit, throttling processes when
they set the I_RECLAIM_HARD inode flag.
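
The throttling idea can be sketched the same way. This is again a
userspace model under assumptions: the limit value, the struct and all
function names are hypothetical, and a real implementation would need
atomics or locking and would actually block the caller rather than just
report that it should.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold; the proposal does not name a value. */
#define MODEL_HARD_RECLAIM_LIMIT ((size_t)4)

/* Hypothetical per-sb counter of inodes waiting for deferred reclaim. */
struct model_hard_stats {
    size_t nr_queued;
};

/* Called when the shrinker moves an inode to the hard-reclaim list. */
static void model_hard_queued(struct model_hard_stats *st)
{
    st->nr_queued++;
}

/* Called by the worker after it has evicted n inodes from the list. */
static void model_hard_done(struct model_hard_stats *st, size_t n)
{
    assert(n <= st->nr_queued);
    st->nr_queued -= n;
}

/*
 * Checked at the point where a filesystem would set I_RECLAIM_HARD on
 * an inode: returns true if the caller should be throttled because the
 * backlog is over the limit.
 */
static bool model_should_throttle(const struct model_hard_stats *st)
{
    return st->nr_queued > MODEL_HARD_RECLAIM_LIMIT;
}
```

Throttling at flag-setting time rather than in the shrinker slows down the
producers of hard-to-reclaim inodes, not reclaim itself.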

There's also a simpler approach to this problem but with more radical
changes to behavior. For example, getting rid of the inode LRU completely -
inodes without dentries referencing them anymore should be rare and it
isn't very useful to cache them. So we can always drop inodes on last
iput() (as we currently do for example for unlinked inodes). But I have a
nagging feeling that somebody is depending on the inode LRU somewhere - I'd
like to poll the collective knowledge of what could possibly go wrong here :)
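
To make the behavioral difference concrete, here is a small userspace model
of the two policies on the last reference drop. It is a deliberately
simplified sketch, not the kernel's iput(): the struct, the enum and the
use_lru switch are all hypothetical, and the only facts taken from the text
above are that unlinked inodes are dropped on last iput() today and that the
alternative would drop every inode at that point.

```c
#include <assert.h>
#include <stdbool.h>

/* What happened to the inode when a reference was dropped (model only). */
enum model_iput_result {
    MODEL_STILL_REFERENCED,  /* other references remain */
    MODEL_PARKED_ON_LRU,     /* kept cached for possible reuse */
    MODEL_EVICTED            /* dropped immediately */
};

/* Hypothetical stand-in for an in-memory inode. */
struct model_cached_inode {
    unsigned int refcount;
    unsigned int nlink;      /* 0 means the file has been unlinked */
};

/*
 * Drop one reference. With use_lru == true, a still-linked inode is
 * parked on the LRU on the last drop and only unlinked inodes are
 * evicted. With use_lru == false, every inode is evicted on the last
 * drop, which models the "no inode LRU" alternative.
 */
static enum model_iput_result
model_iput(struct model_cached_inode *inode, bool use_lru)
{
    assert(inode->refcount > 0);

    if (--inode->refcount > 0)
        return MODEL_STILL_REFERENCED;
    if (use_lru && inode->nlink > 0)
        return MODEL_PARKED_ON_LRU;
    return MODEL_EVICTED;
}
```

In this model, removing the LRU moves all eviction work into the context of
whoever drops the last reference, so nothing is left for the shrinker to do.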

In the session I'd like to discuss if people see some problems with these
approaches, what they'd prefer etc.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

