linux-fsdevel.vger.kernel.org archive mirror
* Re: [DISCUSSION] Revisiting Slab Movable Objects
       [not found] <aAZMe21Ic2sDIAtY@harry>
@ 2025-04-21 21:54 ` Dave Chinner
  2025-04-23  1:47   ` Al Viro
  2025-04-25 11:09   ` Harry Yoo
  0 siblings, 2 replies; 9+ messages in thread
From: Dave Chinner @ 2025-04-21 21:54 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Christoph Lameter, David Rientjes, Andrew Morton, Roman Gushchin,
	Tobin C. Harding, Alexander Viro, Matthew Wilcox, Vlastimil Babka,
	Rik van Riel, Andrea Arcangeli, Liam R. Howlett, Lorenzo Stoakes,
	Jann Horn, Pedro Falcato, David Hildenbrand, Oscar Salvador,
	Michal Hocko, Byungchul Park, linux-mm, linux-fsdevel

On Mon, Apr 21, 2025 at 10:47:39PM +0900, Harry Yoo wrote:
> Hi folks,
> 
> As a long term project, I'm starting to look into resurrecting
> Slab Movable Objects. The goal is to make certain types of slab memory
> movable and thus enable targeted reclamation, migration, and
> defragmentation.
> 
> The main purpose of this posting is to briefly review what's been tried
> in the past, ask people why prior efforts have stalled (due to lack of
> time or insufficient justification for additional complexity?),
> and discuss what's feasible today.
> 
> Please add anyone I may have missed to Cc. :)

Adding -fsdevel because dentry/inode cache discussion needs to be
visible to all the fs/VFS developers.

I'm going to cut straight to the chase here, but I'll leave the rest
of the original email quoted below for -fsdevel readers.

> Previous Work on Slab Movable Objects
> =====================================

<snip>

Without including any sort of viable proposal for dentry/inode
relocation (i.e. the showstopper for past attempts), what is the
point of trying to resurrect this?

I don't have a solution for the dentry cache reference issues - the
dentry cache maintains the working set of files, so anything that
randomly shoots down unused dentries for compaction is likely to
have negative performance implications for dentry cache intensive
workloads.

However, I can think of two possible solutions to the untracked
external inode reference issue.

The first is that external inode references need to take an active
reference to the inode (like a dentry does), and this prevents
inodes from being relocated whilst such external references exist.

Josef has proposed an active/passive reference counting mechanism
for all references to inodes recently on -fsdevel here:

https://lore.kernel.org/linux-fsdevel/20250303170029.GA3964340@perftesting/

However, the ability to revoke external references and/or resolve
internal references atomically has not really been considered at
this point in time.

To allow referenced inodes to be relocated, I'd suggest that any
subsystem that takes an external reference to the inode needs to
provide something like a SRCU notifier block to allow the external
reference to be dynamically removed. Once the relocation is done,
another notifier method can be called allowing the external
reference to be updated with the new inode address.  Any attempt to
access the inode whilst it is being relocated through that external
mechanism should probably block.

[ Note: this could be leveraged as a general ->revoke mechanism for
external inode references. Instead of the external access blocking
after reference recall, it would return an error if access
revocation has occurred. This mechanism could likely also solve some
of the current lifetime issues with fsnotify and landlock objects. ]

This leaves internal (passive) references that can be resolved by
locking the inode itself. e.g. getting rid of mapping tree
references (e.g. folio->mapping->host) by invalidating the
inode page cache.

The other solution is to prevent excessive inode slab cache
fragmentation in the first place. i.e. *stop caching unreferenced
inodes*. In this case, the inode LRU goes away and we rely fully on
the dentry cache pinning inodes to maintain the working set of
inodes in memory. This works with/without Josef's proposed reference
counting changes - though Josef's proposed changes make getting rid
of the inode LRU a lot easier.

I talk about some of that stuff in the discussion of this superblock
inode list iteration patchset here:

https://lore.kernel.org/linux-fsdevel/20241002014017.3801899-1-david@fromorbit.com/

-Dave.

> 
> Previous Work on Slab Movable Objects
> =====================================
> 
> Christoph Lameter, Slab Defragmentation Reduction, 2007-2017 (V16: [2]):
> Christoph Lameter, Slab object migration for xarray, 2017-2018 (V2: [3]):
>   Christoph's long-standing effort (since 2007) aiming to defragment
>   slab memory in cases where sparsely populated slabs occupy an
>   excessive amount of memory.
> 
>   Early versions of the work focused on defragmenting slab caches
>   for filesystem data structures such as inode, dentry, and buffer head.
>   updatedb was suggested as the standard trigger for generating
>   sparsely populated slabs on file servers.
> 
>   However, defragmenting slabs for filesystem data structures has proven
>   to be very difficult to fully solve, because inodes and dentries are
>   neither reclaimable nor migratable, limiting the effectiveness of
>   defragmentation.
> 
>   In late 2018, the effort was revived with a new focus on migrating
>   XArray nodes. However, it appears the work was discontinued after
>   V2 [3]?
> 
> Tobin C. Harding, Slab Movable Objects, 2019 (First Non-RFC: [5])
> - Tobin C. Harding revived Christoph's earlier work and introduced
>   a few enhancements, including partial shrinking of dentries, moving
>   objects to and from a specific NUMA node, and balancing objects across
>   all NUMA nodes.
> 
>   Also appears to be discontinued after the first non-RFC version [5]? 
> 
> At LSFMM 2017, Andrea Arcangeli suggested [6] virtually mapped slabs,
> which might be useful since migrating them does not require changing the
> address of objects. But as Rik van Riel pointed out at that time, it
> isn't really useful for defragmentation. Andrea Arcangeli responded
> that it can be beneficial for memory hotplug, compaction and out-of-memory
> avoidance.
> 
> The exact mechanism wasn't described in [6], but I assume it'll involve
> 1) unmapping a slab (with page faults after the unmap waiting for
> migration to complete), 2) copying the objects to a new slab, and
> 3) mapping the new slab?
> But the idea hasn't gained enough attention for anyone to actually
> implement it.
> 
> Potential Candidates of SMO
> ===========================
> 
> Basic Rules
> -----------
> 
> - Slab memory can only be reclaimed or migrated if the user of the slab
>   provides a way to isolate / migrate objects.
> - If objects can be reclaimed, it makes sense to simply reclaim them
>   instead of migrating them (unless we know it's better to keep that
>   object in memory).
> - Some objects can't be reclaimed, but migrating them is (if possible)
>   still useful for defragmentation and compaction.
>   - However, this is not always feasible.
> 
> Potential candidates include (but not limited to):
> --------------------------------------------------
> 
> - XArray nodes can be migrated (can't be reclaimed as they're being used)
>   - Can be reclaimed if they only include shadow entries.
> - Maple tree nodes (barring external locking) and VMAs can be migrated
>   and obviously can't be reclaimed.
> - Negative dentries should be reclaimed, instead of being migrated.
> - Only unused dentries can be reclaimed without high cost.
>   - Dentries with nonzero refcount are not really relocatable? (per [1])
> - Even unused inodes can't be reclaimed nor relocated due to external
>   references? (per [4])
> 
> Al Viro made it clear [1] that inodes / dentries are not really
> relocatable. He also mentioned:
> > So from the correctness POV
> > 	* you can kick out everything with zero refcount not
> > on shrink lists.
> > 	* you _might_ try shrink_dcache_parent() on directory
> > dentries, in hope to drive their refcount to zero.  However,
> > that's almost certainly going to hit too hard and be too costly.
> > 	* d_invalidate() is no-go; if anything, you want something
> > weaker than shrink_dcache_parent(), not stronger.
> > 
> > For anything beyond "just kick out everything in that page that
> > happens to have zero refcount" I would really like to see the
> > stats - how much does it help, how costly it is _and_ how much
> > of the cache does it throw away (see above re running into a root
> > dentry of some filesystem and essentially trimming dcache for
> > that fs down to the unevictable stuff).
> 
> Dave Chinner mentioned [4] why it is hard to reclaim or migrate (in a
> targeted manner) even inodes with no active references:
> > On Wed, Dec 27, 2017 at 04:06:36PM -0600, Christoph Lameter wrote:
> > > This is a patchset on top of Matthew Wilcox Xarray code and implements
> > > object migration of xarray nodes. The migration is integrated into
> > > the defragmentation and shrinking logic of the slab allocator.
> > .....
> > > This is only possible for xarray for now but it would be worthwhile
> > > to extend this to dentries and inodes.
> > 
> > Christoph, you keep saying this is the goal, but I'm yet to see a
> > solution proposed for the atomic replacement of all the pointers to
> > an inode from external objects.  An inode that has no active
> > references still has an awful lot of passive and internal references
> > that need to be dealt with.
> > 
> > e.g. racing page operations accessing mapping->host, the inode in
> > various lists (e.g. superblock inode list, writeback lists, etc),
> > the inode lookup cache(s), backpointers from LSMs, fsnotify marks,
> > crypto information, internal filesystem pointers (e.g. log items,
> > journal handles, buffer references, etc) and so on. And each
> > filesystem has a different set of passive references, too.
> > 
> > Oh, and I haven't even mentioned deadlocks yet, either. :P
> > 
> > IOWs, just saying "it would be worthwhile to extend this to dentries
> > and inodes" completely misrepresents the sheer complexity of doing
> > so. We've known that atomic replacement is the big problem for
> > defragging inodes and dentries since this work was started, what,
> > more than 10 years? And while there's been many revisions of the
> > core defrag code since then, there has been no credible solution
> > presented for atomic replacement of objects with complex external
> > references. This is a show-stopper for inode/dentry slab defrag, and
> > I don't see that this new patchset is any different...
> 
> [1] https://lore.kernel.org/linux-mm/20190403190520.GW2217@ZenIV.linux.org.uk
> [2] https://lore.kernel.org/linux-mm/20170307212429.044249411@linux.com
> [3] https://marc.info/?l=linux-mm&m=154533371911133
> [4] https://lore.kernel.org/linux-mm/20171228222419.GQ1871@rh
> [5] https://lore.kernel.org/linux-mm/20190603042637.2018-1-tobin@kernel.org
> [6] https://lwn.net/Articles/717650
> 
> -- 
> Cheers,
> Harry / Hyeonggon
> 

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [DISCUSSION] Revisiting Slab Movable Objects
  2025-04-21 21:54 ` [DISCUSSION] Revisiting Slab Movable Objects Dave Chinner
@ 2025-04-23  1:47   ` Al Viro
  2025-04-23  7:20     ` Harry Yoo
  2025-04-25 11:09   ` Harry Yoo
  1 sibling, 1 reply; 9+ messages in thread
From: Al Viro @ 2025-04-23  1:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Andrew Morton,
	Roman Gushchin, Tobin C. Harding, Matthew Wilcox, Vlastimil Babka,
	Rik van Riel, Andrea Arcangeli, Liam R. Howlett, Lorenzo Stoakes,
	Jann Horn, Pedro Falcato, David Hildenbrand, Oscar Salvador,
	Michal Hocko, Byungchul Park, linux-mm, linux-fsdevel

On Tue, Apr 22, 2025 at 07:54:08AM +1000, Dave Chinner wrote:

> I don't have a solution for the dentry cache reference issues - the
> dentry cache maintains the working set of files, so anything that
> randomly shoots down unused dentries for compaction is likely to
> have negative performance implications for dentry cache intensive
> workloads.

Just to restate the obvious: _relocation_ of dentries is hopeless for
many, many reasons - starting with "hash of dentry depends upon
the address of its parent dentry".  Freeing anything with zero refcount...
sure, no problem - assuming that you are holding rcu_read_lock(),
	if (READ_ONCE(dentry->d_count) == 0) {
		spin_lock(&dentry->d_lock);
		if (dentry->d_count == 0)
			to_shrink_list(dentry, list);
		spin_unlock(&dentry->d_lock);
	}
followed by rcu_read_unlock() and shrink_dentry_list(&list) once you
are done collecting the candidates.  If you want to wait for them to
actually be freed, synchronize_rcu() after rcu_read_unlock() (freeing is
RCU-delayed).

Performance implications are separate story - it really depends upon
a lot of details.  But simple "I want all unused dentries in this
page kicked out" is doable.  And in-use dentries are no-go, no matter
what.


* Re: [DISCUSSION] Revisiting Slab Movable Objects
  2025-04-23  1:47   ` Al Viro
@ 2025-04-23  7:20     ` Harry Yoo
  2025-04-23  7:40       ` Al Viro
  0 siblings, 1 reply; 9+ messages in thread
From: Harry Yoo @ 2025-04-23  7:20 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Chinner, Christoph Lameter, David Rientjes, Andrew Morton,
	Roman Gushchin, Tobin C. Harding, Matthew Wilcox, Vlastimil Babka,
	Rik van Riel, Andrea Arcangeli, Liam R. Howlett, Lorenzo Stoakes,
	Jann Horn, Pedro Falcato, David Hildenbrand, Oscar Salvador,
	Michal Hocko, Byungchul Park, linux-mm, linux-fsdevel

On Wed, Apr 23, 2025 at 02:47:32AM +0100, Al Viro wrote:
> On Tue, Apr 22, 2025 at 07:54:08AM +1000, Dave Chinner wrote:
> 
> > I don't have a solution for the dentry cache reference issues - the
> > dentry cache maintains the working set of files, so anything that
> > randomly shoots down unused dentries for compaction is likely to
> > have negative performance implications for dentry cache intensive
> > workloads.
> 
> Just to restate the obvious: _relocation_ of dentries is hopeless for
> many, many reasons - starting with "hash of dentry depends upon
> the address of its parent dentry".

If we can't migrate or reclaim dentries with a nonzero refcount,
can we at least prevent slab pages from containing a mix of dentries
with zero and nonzero refcounts?

An idea: "Migrate a dentry (and inode?) _before_ it becomes unrelocatable"
This is somewhat similar to "Migrate a page out of the movable area before
pinning it" in MM.

For example, suppose we have two slab caches for dentry:
dentry_cache_unref and dentry_cache_ref.

When a dentry with a zero refcount is about to have its refcount
incremented, the VFS allocates a new object from dentry_cache_ref, copies
the dentry into it, frees the original dentry back to
dentry_cache_unref, and returns the newly allocated object.

Similarly when a dentry with a nonzero refcount drops to zero,
it is migrated to dentry_cache_unref. This should be handled on the VFS
side rather than by the slab allocator.

This approach could, at least, help reduce fragmentation.

> Freeing anything with zero refcount...
> sure, no problem - assuming that you are holding rcu_read_lock(),
> 	if (READ_ONCE(dentry->d_count) == 0) {
> 		spin_lock(&dentry->d_lock);
> 		if (dentry->d_count == 0)
> 			to_shrink_list(dentry, list);
> 		spin_unlock(&dentry->d_lock);
> 	}
> followed by rcu_read_unlock() and shrink_dentry_list(&list) once you
> are done collecting the candidates.  If you want to wait for them to
> actually be freed, synchronize_rcu() after rcu_read_unlock() (freeing is
> RCU-delayed).
> 
> Performance implications are separate story - it really depends upon
> a lot of details.  But simple "I want all unused dentries in this
> page kicked out" is doable.  And in-use dentries are no-go, no matter
> what.

Thank you for the detailed guidance and confirming that it's doable!
It will be very helpful when implementing this.

-- 
Cheers,
Harry / Hyeonggon


* Re: [DISCUSSION] Revisiting Slab Movable Objects
  2025-04-23  7:20     ` Harry Yoo
@ 2025-04-23  7:40       ` Al Viro
  0 siblings, 0 replies; 9+ messages in thread
From: Al Viro @ 2025-04-23  7:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Dave Chinner, Christoph Lameter, David Rientjes, Andrew Morton,
	Roman Gushchin, Tobin C. Harding, Matthew Wilcox, Vlastimil Babka,
	Rik van Riel, Andrea Arcangeli, Liam R. Howlett, Lorenzo Stoakes,
	Jann Horn, Pedro Falcato, David Hildenbrand, Oscar Salvador,
	Michal Hocko, Byungchul Park, linux-mm, linux-fsdevel

On Wed, Apr 23, 2025 at 04:20:20PM +0900, Harry Yoo wrote:

> If we can't migrate or reclaim dentries with a nonzero refcount,
> can we at least prevent slab pages from containing a mix of dentries
> with zero and nonzero refcounts?
> 
> An idea: "Migrate a dentry (and inode?) _before_ it becomes unrelocatable"
> This is somewhat similar to "Migrate a page out of the movable area before
> pinning it" in MM.
> 
> For example, suppose we have two slab caches for dentry:
> dentry_cache_unref and dentry_cache_ref.
> 
> When a dentry with a zero refcount is about to have its refcount
> incremented, the VFS allocates a new object from dentry_cache_ref, copies
> the dentry into it, frees the original dentry back to
> dentry_cache_unref, and returns the newly allocated object.
> 
> Similarly when a dentry with a nonzero refcount drops to zero,
> it is migrated to dentry_cache_unref. This should be handled on the VFS
> side rather than by the slab allocator.
> 
> This approach could, at least, help reduce fragmentation.

No.  This is utterly insane - you'd need to insert RCU delay on each
of those transitions and that is not to mention the frequency with
which those will happen on a lot of loads (any kind of builds included).
Not a chance.


* Re: [DISCUSSION] Revisiting Slab Movable Objects
  2025-04-21 21:54 ` [DISCUSSION] Revisiting Slab Movable Objects Dave Chinner
  2025-04-23  1:47   ` Al Viro
@ 2025-04-25 11:09   ` Harry Yoo
  2025-04-28 15:31     ` Jann Horn
  1 sibling, 1 reply; 9+ messages in thread
From: Harry Yoo @ 2025-04-25 11:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Lameter, David Rientjes, Andrew Morton, Roman Gushchin,
	Tobin C. Harding, Alexander Viro, Matthew Wilcox, Vlastimil Babka,
	Rik van Riel, Andrea Arcangeli, Liam R. Howlett, Lorenzo Stoakes,
	Jann Horn, Pedro Falcato, David Hildenbrand, Oscar Salvador,
	Michal Hocko, Byungchul Park, linux-mm, linux-fsdevel

On Tue, Apr 22, 2025 at 07:54:08AM +1000, Dave Chinner wrote:
> On Mon, Apr 21, 2025 at 10:47:39PM +0900, Harry Yoo wrote:
> > Hi folks,
> > 
> > As a long term project, I'm starting to look into resurrecting
> > Slab Movable Objects. The goal is to make certain types of slab memory
> > movable and thus enable targeted reclamation, migration, and
> > defragmentation.
> > 
> > The main purpose of this posting is to briefly review what's been tried
> > in the past, ask people why prior efforts have stalled (due to lack of
> > time or insufficient justification for additional complexity?),
> > and discuss what's feasible today.
> > 
> > Please add anyone I may have missed to Cc. :)
> 
> Adding -fsdevel because dentry/inode cache discussion needs to be
> visible to all the fs/VFS developers.
> 
> I'm going to cut straight to the chase here, but I'll leave the rest
> of the original email quoted below for -fsdevel readers.
> 
> > Previous Work on Slab Movable Objects
> > =====================================
> 
> <snip>
> 
> Without including any sort of viable proposal for dentry/inode
> relocation (i.e. the showstopper for past attempts), what is the
> point of trying to resurrect this?

Migrating slabs still makes sense for other objects such as XArray / maple
tree nodes and VMAs.

Of course, if filesystem folks could enhance it further and make more of
the dentry/inode objects movable, that would be very welcome.

> However, I can think of two possible solutions to the untracked
> external inode reference issue.
> 
> The first is that external inode references need to take an active
> reference to the inode (like a dentry does), and this prevents
> inodes from being relocated whilst such external references exist.
> 
> Josef has proposed an active/passive reference counting mechanism
> for all references to inodes recently on -fsdevel here:
>
> https://lore.kernel.org/linux-fsdevel/20250303170029.GA3964340@perftesting/
>
> However, the ability to revoke external references and/or resolve
> internal references atomically has not really been considered at
> this point in time.

...alright, I expect that'll be the trickier part.

> To allow referenced inodes to be relocated, I'd suggest that any
> subsystem that takes an external reference to the inode needs to
> provide something like a SRCU notifier block to allow the external
> reference to be dynamically removed. Once the relocation is done,
> another notifier method can be called allowing the external
> reference to be updated with the new inode address.  Any attempt to
> access the inode whilst it is being relocated through that external
> mechanism should probably block.
> 
> [ Note: this could be leveraged as a general ->revoke mechanism for
> external inode references. Instead of the external access blocking
> after reference recall, it would return an error if access
> revocation has occurred. This mechanism could likely also solve some
> of the current lifetime issues with fsnotify and landlock objects. ]
> 
> This leaves internal (passive) references that can be resolved by
> locking the inode itself. e.g. getting rid of mapping tree
> references (e.g. folio->mapping->host) by invalidating the
> inode page cache.

Thank you so much for such a detailed writeup.

The former approach would allow allocating them from movable areas,
help mm/compaction.c to build high-order folios, and help slab to reduce
fragmentation.

> The other solution is to prevent excessive inode slab cache
> fragmentation in the first place. i.e. *stop caching unreferenced
> inodes*. In this case, the inode LRU goes away and we rely fully on
> the dentry cache pinning inodes to maintain the working set of
> inodes in memory. This works with/without Josef's proposed reference
> counting changes - though Josef's proposed changes make getting rid
> of the inode LRU a lot easier.
> 
> I talk about some of that stuff in the discussion of this superblock
> inode list iteration patchset here:
> 
> https://lore.kernel.org/linux-fsdevel/20241002014017.3801899-1-david@fromorbit.com/

The latter approach, while it does not make inodes relocatable, will at
least reduce fragmentation.

Unfortunately, as an MM developer, I don’t have enough experience with
filesystems to assess which proposal is more feasible. It would be really
helpful to get consensus from the FS folks before we push this path
forward, whether that means relocating inode objects or avoiding their
fragmentation.

-- 
Cheers,
Harry / Hyeonggon


* Re: [DISCUSSION] Revisiting Slab Movable Objects
  2025-04-25 11:09   ` Harry Yoo
@ 2025-04-28 15:31     ` Jann Horn
  2025-04-30 13:11       ` Harry Yoo
  0 siblings, 1 reply; 9+ messages in thread
From: Jann Horn @ 2025-04-28 15:31 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Dave Chinner, Christoph Lameter, David Rientjes, Andrew Morton,
	Roman Gushchin, Tobin C. Harding, Alexander Viro, Matthew Wilcox,
	Vlastimil Babka, Rik van Riel, Andrea Arcangeli, Liam R. Howlett,
	Lorenzo Stoakes, Pedro Falcato, David Hildenbrand, Oscar Salvador,
	Michal Hocko, Byungchul Park, linux-mm, linux-fsdevel

On Fri, Apr 25, 2025 at 1:09 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> On Tue, Apr 22, 2025 at 07:54:08AM +1000, Dave Chinner wrote:
> > On Mon, Apr 21, 2025 at 10:47:39PM +0900, Harry Yoo wrote:
> > > Hi folks,
> > >
> > > As a long term project, I'm starting to look into resurrecting
> > > Slab Movable Objects. The goal is to make certain types of slab memory
> > > movable and thus enable targeted reclamation, migration, and
> > > defragmentation.
> > >
> > > The main purpose of this posting is to briefly review what's been tried
> > > in the past, ask people why prior efforts have stalled (due to lack of
> > > time or insufficient justification for additional complexity?),
> > > and discuss what's feasible today.
> > >
> > > Please add anyone I may have missed to Cc. :)
> >
> > Adding -fsdevel because dentry/inode cache discussion needs to be
> > visible to all the fs/VFS developers.
> >
> > I'm going to cut straight to the chase here, but I'll leave the rest
> > of the original email quoted below for -fsdevel readers.
> >
> > > Previous Work on Slab Movable Objects
> > > =====================================
> >
> > <snip>
> >
> > Without including any sort of viable proposal for dentry/inode
> > relocation (i.e. the showstopper for past attempts), what is the
> > point of trying to resurrect this?
>
> Migrating slabs still makes sense for other objects such as xarray / maple
> tree nodes, and VMAs.

Do we have examples of how much memory is actually wasted on
sparsely-used slabs, and which slabs this happens in, from some real
workloads?

If sparsely-used slabs are a sufficiently big problem, maybe another
big hammer we have is to use smaller slab pages, or something along
those lines? Though of course a straightforward implementation of that
would probably have negative effects on the performance of SLUB
fastpaths, and depending on object size it might waste more memory on
padding.

(An adventurous idea would be to try to align kmem_cache::size such
that objects start at some subpage boundaries of SLUB folios, and then
figure out a way to shatter SLUB folios into smaller folios at runtime
while they contain objects... but getting the SLUB locking right for
that without slowing down the fastpath for freeing an object would
probably be a large pain.)


* Re: [DISCUSSION] Revisiting Slab Movable Objects
  2025-04-28 15:31     ` Jann Horn
@ 2025-04-30 13:11       ` Harry Yoo
  2025-04-30 22:23         ` Jann Horn
  2025-05-05 23:29         ` Dave Chinner
  0 siblings, 2 replies; 9+ messages in thread
From: Harry Yoo @ 2025-04-30 13:11 UTC (permalink / raw)
  To: Jann Horn
  Cc: Dave Chinner, Christoph Lameter, David Rientjes, Andrew Morton,
	Roman Gushchin, Tobin C. Harding, Alexander Viro, Matthew Wilcox,
	Vlastimil Babka, Rik van Riel, Andrea Arcangeli, Liam R. Howlett,
	Lorenzo Stoakes, Pedro Falcato, David Hildenbrand, Oscar Salvador,
	Michal Hocko, Byungchul Park, linux-mm, linux-fsdevel

On Mon, Apr 28, 2025 at 05:31:35PM +0200, Jann Horn wrote:
> On Fri, Apr 25, 2025 at 1:09 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > On Tue, Apr 22, 2025 at 07:54:08AM +1000, Dave Chinner wrote:
> > > On Mon, Apr 21, 2025 at 10:47:39PM +0900, Harry Yoo wrote:
> > > > Hi folks,
> > > >
> > > > As a long term project, I'm starting to look into resurrecting
> > > > Slab Movable Objects. The goal is to make certain types of slab memory
> > > > movable and thus enable targeted reclamation, migration, and
> > > > defragmentation.
> > > >
> > > > The main purpose of this posting is to briefly review what's been tried
> > > > in the past, ask people why prior efforts have stalled (due to lack of
> > > > time or insufficient justification for additional complexity?),
> > > > and discuss what's feasible today.
> > > >
> > > > Please add anyone I may have missed to Cc. :)
> > >
> > > Adding -fsdevel because dentry/inode cache discussion needs to be
> > > visible to all the fs/VFS developers.
> > >
> > > I'm going to cut straight to the chase here, but I'll leave the rest
> > > of the original email quoted below for -fsdevel readers.
> > >
> > > > Previous Work on Slab Movable Objects
> > > > =====================================
> > >
> > > <snip>
> > >
> > > Without including any sort of viable proposal for dentry/inode
> > > relocation (i.e. the showstopper for past attempts), what is the
> > > point of trying to resurrect this?
> >
> > Migrating slabs still makes sense for other objects such as xarray / maple
> > tree nodes, and VMAs.
> 
> Do we have examples of how much memory is actually wasted on
> sparsely-used slabs, and which slabs this happens in, from some real
> workloads?

Workloads that use a large amount of reclaimable slab memory (inode,
dentry, etc.) and trigger reclamation can observe this problem.

On my laptop, I can reproduce the problem by running the 'updatedb'
command, which touches many files, and then triggering reclamation by
running programs that consume a large amount of memory. As slab memory
is reclaimed, it becomes sparsely populated (slab memory is not
reclaimed folio by folio).

During reclamation, the total slab memory utilization drops from 95% to 50%.
For very sparsely populated caches, the cache utilization is between
12% and 33%. (ext4_inode_cache, radix_tree_node, dentry, trace_event_file,
and some kmalloc caches on my machine).

At the time the OOM killer is invoked, about 50% of slab memory is wasted
on sparsely populated slabs, which amounts to about 236 MiB on my laptop.
I would say it's a sufficiently big problem to solve.

I wonder how worse this problem would be on large file servers,
but I don't run such servers :-)

> If sparsely-used slabs are a sufficiently big problem, maybe another
> big hammer we have is to use smaller slab pages, or something along
> those lines? Though of course a straightforward implementation of that
> would probably have negative effects on the performance of SLUB
> fastpaths, and depending on object size it might waste more memory on
> padding.

So it'll be something like preferring lower orders in calculate_order()
while keeping the fractional waste reasonable.

One problem could be making n->list_lock contention much worse
on larger machines, as you need to grab more slabs from the list?

> (An adventurous idea would be to try to align kmem_cache::size such
> that objects start at some subpage boundaries of SLUB folios, and then
> figure out a way to shatter SLUB folios into smaller folios at runtime
> while they contain objects... but getting the SLUB locking right for
> that without slowing down the fastpath for freeing an object would
> probably be a large pain.)

You can't make virt_to_slab() work if you shatter a slab folio
into smaller ones?

A more general question: will either shattering or allocating
smaller slabs help free more memory anyway? It likely depends on
the spatial pattern of how the objects are reclaimed and remain
populated within a slab?

-- 
Cheers,
Harry / Hyeonggon


* Re: [DISCUSSION] Revisiting Slab Movable Objects
  2025-04-30 13:11       ` Harry Yoo
@ 2025-04-30 22:23         ` Jann Horn
  2025-05-05 23:29         ` Dave Chinner
  1 sibling, 0 replies; 9+ messages in thread
From: Jann Horn @ 2025-04-30 22:23 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Dave Chinner, Christoph Lameter, David Rientjes, Andrew Morton,
	Roman Gushchin, Tobin C. Harding, Alexander Viro, Matthew Wilcox,
	Vlastimil Babka, Rik van Riel, Andrea Arcangeli, Liam R. Howlett,
	Lorenzo Stoakes, Pedro Falcato, David Hildenbrand, Oscar Salvador,
	Michal Hocko, Byungchul Park, linux-mm, linux-fsdevel

On Wed, Apr 30, 2025 at 3:11 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> On Mon, Apr 28, 2025 at 05:31:35PM +0200, Jann Horn wrote:
> > On Fri, Apr 25, 2025 at 1:09 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > On Tue, Apr 22, 2025 at 07:54:08AM +1000, Dave Chinner wrote:
> > > > On Mon, Apr 21, 2025 at 10:47:39PM +0900, Harry Yoo wrote:
> > > > > Hi folks,
> > > > >
> > > > > As a long term project, I'm starting to look into resurrecting
> > > > > Slab Movable Objects. The goal is to make certain types of slab memory
> > > > > movable and thus enable targeted reclamation, migration, and
> > > > > defragmentation.
> > > > >
> > > > > The main purpose of this posting is to briefly review what's been tried
> > > > > in the past, ask people why prior efforts have stalled (due to lack of
> > > > > time or insufficient justification for additional complexity?),
> > > > > and discuss what's feasible today.
> > > > >
> > > > > Please add anyone I may have missed to Cc. :)
> > > >
> > > > Adding -fsdevel because dentry/inode cache discussion needs to be
> > > > visible to all the fs/VFS developers.
> > > >
> > > > I'm going to cut straight to the chase here, but I'll leave the rest
> > > > of the original email quoted below for -fsdevel readers.
> > > >
> > > > > Previous Work on Slab Movable Objects
> > > > > =====================================
> > > >
> > > > <snip>
> > > >
> > > > Without including any sort of viable proposal for dentry/inode
> > > > relocation (i.e. the showstopper for past attempts), what is the
> > > > point of trying to resurrect this?
> > >
> > > Migrating slabs still makes sense for other objects such as xarray / maple
> > > tree nodes, and VMAs.
> >
> > Do we have examples of how much memory is actually wasted on
> > sparsely-used slabs, and which slabs this happens in, from some real
> > workloads?
>
> Workloads that use a large amount of reclaimable slab memory (inode,
> dentry, etc.) and trigger reclamation can observe this problem.
>
> On my laptop, I can reproduce the problem by running the 'updatedb'
> command, which touches many files, and then triggering reclamation by
> running programs that consume a large amount of memory. As slab memory
> is reclaimed, it becomes sparsely populated (since slab memory is not
> reclaimed folio by folio).
>
> During reclamation, the total slab memory utilization drops from 95% to 50%.
> For very sparsely populated caches, the cache utilization is between
> 12% and 33%. (ext4_inode_cache, radix_tree_node, dentry, trace_event_file,
> and some kmalloc caches on my machine).
>
> At the time OOM-killer is invoked, there's about 50% slab memory wasted
> due to sparsely populated slabs, which is about 236 MiB on my laptop.
> I would say it's a sufficiently big problem to solve.
>
> I wonder how much worse this problem would be on large file servers,
> but I don't run such servers :-)
>
> > If sparsely-used slabs are a sufficiently big problem, maybe another
> > big hammer we have is to use smaller slab pages, or something along
> > those lines? Though of course a straightforward implementation of that
> > would probably have negative effects on the performance of SLUB
> > fastpaths, and depending on object size it might waste more memory on
> > padding.
>
> So it'll be something like preferring low orders in calculate_order()
> while keeping fractional waste reasonable.
>
> One problem could be making n->list_lock contention much worse
> on larger machines as you need to grab more slabs from the list?

Maybe. I imagine using batched operations could help, such that the
amount of managed memory that is transferred per locking operation
stays the same...

> > (An adventurous idea would be to try to align kmem_cache::size such
> > that objects start at some subpage boundaries of SLUB folios, and then
> > figure out a way to shatter SLUB folios into smaller folios at runtime
> > while they contain objects... but getting the SLUB locking right for
> > that without slowing down the fastpath for freeing an object would
> > probably be a large pain.)
>
> You can't make virt_to_slab() work if you shatter a slab folio
> into smaller ones?

Yeah, I think that would be hard. We could maybe avoid the
virt_to_slab() on the active-slab fastpath, and maybe there is some
kind of RCU-transition scheme that could be used on the path for
non-active slabs (a bit similarly to how percpu refcounts transition
to atomic mode, with a transition period where objects are allowed to
still go on the freelist of the former head page)...

> A more general question: will either shattering or allocating
> smaller slabs help free more memory anyway? It likely depends on
> the spatial pattern of how the objects are reclaimed and remain
> populated within a slab?

Probably, yeah.

As a crude thought experiment, if you (somewhat pessimistically?)
assume that the spatial pattern is "we first allocate a lot of
objects, then for each object we roll a random number and free it with
a 90% probability", and you have something like a kmalloc-512 slab
(normal order 2, which fits 32 objects), then the probability that an
entire order-2 page will be empty would be
pow(0.9, 32) ~= 3.4%
while the probability that an individual order-0 page is empty would be
pow(0.9, 8) ~= 43%
There could be patterns that are worse, like "we preserve exactly
every fourth object"; though SLUB's freelist randomization (if
CONFIG_SLAB_FREELIST_RANDOM is enabled) would probably transform that
into a different pattern, so that it's not actually a sequential
pattern where every fourth object is allocated.
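FWIW, those numbers are easy to sanity-check with a quick simulation
(plain Python, nothing kernel-specific, using the same assumption as
above that each object is freed with 90% probability):

```python
import random

def empty_page_prob(objs_per_page, survive_p=0.1, trials=100_000):
    """Monte Carlo: fraction of pages whose objects all got freed,
    with each object independently surviving with probability survive_p."""
    empty = sum(
        all(random.random() >= survive_p for _ in range(objs_per_page))
        for _ in range(trials)
    )
    return empty / trials

# Analytic values from the thought experiment above:
print(0.9 ** 32)  # ~0.034: order-2 page holding 32 x 512-byte objects
print(0.9 ** 8)   # ~0.430: order-0 page holding 8 such objects

# The simulation should land close to those:
print(empty_page_prob(32), empty_page_prob(8))
```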

In case you want to do more detailed experiments with this: FYI, I
have a branch "slub-binary-snapshot" at https://github.com/thejh/linux
with a draft patch that provides a debugfs API for getting a binary
dump of SLUB allocations (I wrote that patch for another project):
https://github.com/thejh/linux/commit/685944dc69fd21e92bf110713b491d5c050328af
- maybe with some changes that would be useful for analyzing SLUB
fragmentation from userspace.

But IDK if that's a good way to experiment with this, or if it'd be
easier to directly analyze fragmentation in debugfs code in SLUB.
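(Aside: for the coarse per-cache utilization numbers quoted earlier in
the thread, /proc/slabinfo is already enough — a rough sketch along
these lines, keeping in mind that SLUB's active_objs counts are
approximate and the file normally needs root to read:)

```shell
# Print caches below 50% utilization, based on the slabinfo v2.x field
# layout: name, active_objs, num_objs, objsize, objperslab, pagesperslab.
# The first two lines of the file are headers, hence NR > 2.
awk 'NR > 2 && $3 > 0 {
    util = 100 * $2 / $3
    if (util < 50)
        printf "%-30s %6.1f%%  (%d/%d objs)\n", $1, util, $2, $3
}' /proc/slabinfo 2>/dev/null | sort -k2 -n
```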


* Re: [DISCUSSION] Revisiting Slab Movable Objects
  2025-04-30 13:11       ` Harry Yoo
  2025-04-30 22:23         ` Jann Horn
@ 2025-05-05 23:29         ` Dave Chinner
  1 sibling, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2025-05-05 23:29 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Jann Horn, Christoph Lameter, David Rientjes, Andrew Morton,
	Roman Gushchin, Tobin C. Harding, Alexander Viro, Matthew Wilcox,
	Vlastimil Babka, Rik van Riel, Andrea Arcangeli, Liam R. Howlett,
	Lorenzo Stoakes, Pedro Falcato, David Hildenbrand, Oscar Salvador,
	Michal Hocko, Byungchul Park, linux-mm, linux-fsdevel

On Wed, Apr 30, 2025 at 10:11:22PM +0900, Harry Yoo wrote:
> A more general question: will either shattering or allocating
> smaller slabs help free more memory anyway?

In general, no.

> It likely depends on
> the spatial pattern of how the objects are reclaimed and remain
> populated within a slab?

Right - the pattern of inode/dentry residency in slab pages is
defined by temporal access patterns of any given inode/dentry. If an
application creates a new file and then holds it open, then that
slab page is pinned in memory until the application closes that
file.

Hence if we mix short-term file accesses (e.g. access-once files (like
updatedb) or short-term temp files (like object files during a code
build)) with long-term open files, the slabs get fragmented
because of the few pinned long-term objects in each slab page.

IOWs, the moment we start mixing objects with different temporal
access patterns within a single slab page during rapid cache growth,
we will get fragmentation of the cache as reclaim only removes the
short term objects during subsequent rapid cache shrinkage....

The unsolvable problem here is that we do not know (and cannot know)
what the lifetime of the object is going to be at object
instantiation time (i.e. path lookup). Hence the temporal access
patterns of objects in the slab pages are going to be largely
random. Experience tells me that even single page slabs (old skool
SLAB cache) had these fragmentation problems (esp. with dentries)
because even a 20:1 ratio of short:long term accesses will leave a
single long term dentry per 4kB slab backing page...
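To put a rough number on that (a toy model, assuming ~192-byte dentries
so about 21 fit in a 4kB page, with each object independently
long-lived at the 20:1 ratio above — both figures are illustrative
assumptions, not measurements):

```python
# ~21 dentries per 4kB page; each is long-lived with probability 1/21.
# A page stays pinned after reclaim iff at least one of its objects is
# long-lived:
objs_per_page = 21
p_long = 1 / 21

p_pinned = 1 - (1 - p_long) ** objs_per_page
print(f"~{p_pinned:.0%} of pages stay pinned")
```

i.e. even at that ratio, well over half the slab pages are expected to
be left pinned by a single long-term object.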

Hence using smaller slabs and/or shattering larger slabs isn't
likely to have all that much impact on the fragmentation of the
slabs because it doesn't do anything to solve the underlying object
lifetime interleaving that causes the fragmentation in the first
place...

-Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread, other threads:[~2025-05-05 23:29 UTC | newest]

Thread overview: 9+ messages
     [not found] <aAZMe21Ic2sDIAtY@harry>
2025-04-21 21:54 ` [DISCUSSION] Revisiting Slab Movable Objects Dave Chinner
2025-04-23  1:47   ` Al Viro
2025-04-23  7:20     ` Harry Yoo
2025-04-23  7:40       ` Al Viro
2025-04-25 11:09   ` Harry Yoo
2025-04-28 15:31     ` Jann Horn
2025-04-30 13:11       ` Harry Yoo
2025-04-30 22:23         ` Jann Horn
2025-05-05 23:29         ` Dave Chinner
