All of lore.kernel.org
 help / color / mirror / Atom feed
From: Usama Arif <usama.arif@linux.dev>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: brauner@kernel.org, jack@suse.cz, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Al Viro <viro@zeniv.linux.org.uk>,
	linux-mm@kvack.org, hughd@google.com, boris@bur.io, clm@fb.com,
	dsterba@suse.com, linux-btrfs@vger.kernel.org, cem@kernel.org,
	linux-xfs@vger.kernel.org, hannes@cmpxchg.org, riel@surriel.com,
	kernel-team@meta.com
Subject: Re: [PATCH] fs/super: skip non-memcg-aware nr_cached_objects in memcg slab shrink
Date: Wed, 24 Jun 2026 16:18:38 +0100	[thread overview]
Message-ID: <cfdb8620-ea55-4226-98bf-d006820e6270@linux.dev> (raw)
In-Reply-To: <ajrHa_JGxQvCMf8t@linux.dev>



On 23/06/2026 19:08, Shakeel Butt wrote:
> On Tue, Jun 09, 2026 at 05:30:47AM -0700, Usama Arif wrote:
>> The super_block shrinker is registered with SHRINKER_MEMCG_AWARE because its
>> dentry and inode LRUs are memcg-aware (via list_lru). But the optional
>> ->nr_cached_objects() hooks that the shrinker also drives are not memcg-aware:
>> btrfs extent maps and xfs inode reclaim operate on filesystem-global
>> state, and shmem's unused-huge shrinker walks a per-superblock shrinklist.
>> None of them filter by sc->memcg.
> 
> I see the underlying objects whose count is returned by ->nr_cached_objects()
> hook is memcg charged for shmem and xfs but not for btrfs. Do you envision
> there might be a rare scenario where we have a lot of memory charged to a memcg
> consumed by objects which ->nr_cached_objects() tracks and that memory becomes
> unreclaimable due to this patch?

Hello!

Thanks for the review.

For XFS, xfs_inode is SLAB_ACCOUNT, so a memcg can have memory charged in
XFS inodes sitting in XFS' internal reclaim state. But the current callback
does not target that memcg: xfs_fs_free_cached_objects() calls
xfs_reclaim_inodes_nr(), which walks the mount and reclaims
XFS_ICI_RECLAIM_TAG inodes without checking sc->memcg. So non-root memcg
reclaim was only getting an opportunistic mount-wide reclaim pass here; it
could reclaim the target memcg's inodes by chance, but it could just as well
reclaim inodes charged to other memcgs. This patch removes that cross-memcg
side effect, not a correct memcg-targeted reclaim path.

Those XFS inodes are still reclaimed by XFS' own reclaim worker, which is
queued when reclaimable inodes are tagged and requeues while reclaimable
inodes remain. With the default xfssyncd_centisecs value, that is about every
5 seconds. The root/global superblock shrinker path also continues to call
the XFS callbacks. This will keep the memory reclaimable.

For shmem, shrinklist_len counts inodes whose tail large folio could be                                                                                                                                                                                                                                                                                                                                               
split — splitting itself doesn't free anything; it just lets normal page                                                                                                                                                                                                                                                                                                                                              
LRU reclaim the truncated tail. The folios are on the memcg-aware LRU and                                                                                                                                                                                                                                                                                
will be aged/reclaimed there independently, so skipping the split from                                                                                                                                                                                                                                                                                   
memcg context just delays the split?

or btrfs extent maps, as you note, these objects are not memcg-charged.
> 
>>
>> The mismatch shows up under memcg-heavy slab reclaim. shrink_slab_memcg()
>> calls do_shrink_slab() once per (memcg, NUMA node) pair for every memcg
>> whose bit is set in the per-superblock shrinker bitmap, which on a busy
>> host means hundreds of calls per reclaim pass. Each scan queues the same
>> global shrinker work item that's already kicked from the root path.
>>
>> Because btrfs/xfs global count is typically non-zero on any in-use filesystem,
>> the returned total stays positive even if a memcg's own dentry/inode LRUs
>> are empty. shrink_slab_memcg() therefore never clears the SB shrinker bit
>> in the memcg bitmap, so subsequent reclaim passes from the same memcg
>> re-enter super_cache_count() and pay for the global counter walk again.
> 
> What is the main concern? Is it the amount of CPU wasted or are we over
> reclaiming or reclaiming from unrelated memcgs?
> 

The primary concern is wasted CPU. On the busy hosts where I saw this,
hundreds of memcgs all repeating that walk per reclaim pass is what made
it visible in profiles.

Another concern is misattribution. Reclaim from memcg X ends up
reclaiming another memcg Y, which could have an affect on Y and
won't provide proper isolation.

>>
>> Restrict ->nr_cached_objects() to the global shrink path (sc->memcg NULL
>> or root). The memcg-aware dentry/inode LRUs keep being counted and
>> scanned per memcg as before; only the global fs-specific hooks are skipped.
>> The root/global shrink path still drives those hooks; only their
>> invocation from non-root memcg slab reclaim is removed.
>>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> 
> I am fine with the stopgap but it would be nice to have proper memcg awareness
> in xfs and shmem callbacks. For btrfs, I am not sure if it makes sense to memcg
> charge btrfs_extent_map objects but at least to decision to skip memcg reclaim
> will be inside the fs callbacks i.e. nr_cached_objects.
> 

Agreed that this is the right long term approach. If the preference is to move this down to
the fs own callbacks instead of fs/super.c, I can do that in the revision as well.


      reply	other threads:[~2026-06-24 15:18 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-09 12:30 [PATCH] fs/super: skip non-memcg-aware nr_cached_objects in memcg slab shrink Usama Arif
2026-06-09 12:52 ` Jan Kara
2026-06-09 13:38   ` Usama Arif
2026-06-23 10:14 ` Christian Brauner
2026-06-23 18:08 ` Shakeel Butt
2026-06-24 15:18   ` Usama Arif [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cfdb8620-ea55-4226-98bf-d006820e6270@linux.dev \
    --to=usama.arif@linux.dev \
    --cc=boris@bur.io \
    --cc=brauner@kernel.org \
    --cc=cem@kernel.org \
    --cc=clm@fb.com \
    --cc=dsterba@suse.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=kernel-team@meta.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=riel@surriel.com \
    --cc=shakeel.butt@linux.dev \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.