* [LSF/MM/BPF TOPIC] Page cache tracking with the Maple Tree
From: Liam R. Howlett @ 2026-02-24 17:10 UTC
To: lsf-pc, linux-mm, linux-fsdevel
Cc: Matthew Wilcox, Johannes Weiner, David Hildenbrand, Jan Kara, Ryan Roberts, Christian Brauner

Hi,

The page cache currently uses a radix tree to store folios.  There are
several well-known deficiencies in this approach:

- Inefficient representation of large folios
- Inefficient representation of sparsely accessed files
- Poor cache locality
- Limited to 3 search marks

The Maple Tree solves these problems more efficiently than the XArray.
In this session, I'd like to discuss what needs to happen to switch the
page cache from the XArray to the Maple Tree.

The Maple Tree needs some enhancements:

- New node type to efficiently support entries of length 1
- New node type to support marks
- Searching and iterating over marks
- Support for purging shadow entries from the page cache

The motivation, beyond potential performance gains and a smaller memory
footprint, is to support more features natively in the data structure.

Thanks,
Liam
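
One way to picture the large-folio point: a folio of any order can occupy a
single range entry in a maple tree.  The sketch below uses the existing
mtree_store_range()/mtree_load() API; the demo_pages tree and demo_*
helpers are made up for illustration and are not page cache code.

#include <linux/maple_tree.h>
#include <linux/pagemap.h>

/*
 * Illustration only: a maple tree keyed by page index, where a large
 * folio occupies one [index, index + nr_pages - 1] range entry instead
 * of the per-slot multi-index/sibling entries the XArray uses today.
 */
static DEFINE_MTREE(demo_pages);

static int demo_add_folio(struct folio *folio, gfp_t gfp)
{
        unsigned long first = folio->index;
        unsigned long last = first + folio_nr_pages(folio) - 1;

        /* One store covers the whole folio, whatever its order. */
        return mtree_store_range(&demo_pages, first, last, folio, gfp);
}

static struct folio *demo_lookup(unsigned long index)
{
        /* Any index inside the folio's range resolves to the same entry. */
        return mtree_load(&demo_pages, index);
}

A lookup at any index covered by the folio returns the same entry, which is
what makes large folios and sparsely populated files cheaper to represent
than with per-slot entries.
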
* Re: [LSF/MM/BPF TOPIC] Page cache tracking with the Maple Tree
From: Matthew Wilcox @ 2026-04-17 19:50 UTC
To: Liam R. Howlett, lsf-pc, linux-mm, linux-fsdevel, Johannes Weiner, David Hildenbrand, Jan Kara, Ryan Roberts, Christian Brauner

On Tue, Feb 24, 2026 at 12:10:26PM -0500, Liam R. Howlett wrote:
> The Maple Tree needs some enhancements:
> - Support for purging shadow entries from the page cache

Liam and I had some preliminary discussions yesterday around this and we'd
like some feedback if anyone has time before LSFMM.

For those who aren't aware, when a folio falls off the end of the LRU list,
we store a shadow entry in the page cache in its place so that if we access
that page again, we know where to put its folio in the LRU list.

But this creates a problem (documented in mm/workingset.c) where we can
fill up memory with shadow entries.  Currently, we embed a list_head in
xa_node and add nodes which contain only shadow entries to a list which can
be walked by a shrinker when we're low on memory.  Ideally we wouldn't do
that with the maple tree.  There are a few options.

The first question we have is whether it's best to keep nodes around to
wait for a shrinker to kick in.  Was any experimentation done to see
whether eagerly freeing a node that contains only shadow entries has a bad
effect on performance?

The second idea we talked about is that the maple tree is much more
flexible than the radix tree.  Having even a single folio in a node pins
the entire node, so it's "free" to keep the shadow entries in that node
around.  But with the maple tree, we can be much more granular and delete
shadow entries in arbitrary positions.  So we could (for example) keep
track of inodes which contain shadow entries and purge shadow entries when
they reach, say, 10% of the number of pages.  Or 1000 entries, or some
other threshold.

The third idea is that, instead of having an injected list_head, we keep a
tree pointing to inodes (or even just to maple tree nodes which contain a
lot of shadow entries).  That's not how list_lru works today, so a certain
amount of development work would need to be done.

Liam, anything I missed?
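
The mechanism described above boils down to roughly the sketch below, a
simplified paraphrase of workingset_update_node() in mm/workingset.c (not
the verbatim kernel code): demo_update_node() and the locally declared
shadow_nodes are stand-ins, the memcg/lruvec accounting is elided, and the
list_lru_add_obj()/list_lru_del_obj() names are the ones used in recent
kernels.

#include <linux/xarray.h>
#include <linux/list_lru.h>

/* Stand-in for the real shadow_nodes list_lru in mm/workingset.c. */
static struct list_lru shadow_nodes;

/*
 * Simplified paraphrase of the current scheme: an xa_node goes onto the
 * shadow-node list_lru once every remaining slot holds a shadow (value)
 * entry, and comes back off as soon as a folio reappears in it.
 */
static void demo_update_node(struct xa_node *node)
{
        if (node->count && node->count == node->nr_values) {
                /* Only shadow entries left: expose the node to the shrinker. */
                if (list_empty(&node->private_list))
                        list_lru_add_obj(&shadow_nodes, &node->private_list);
        } else {
                /* A folio lives here again (or the node is empty): hide it. */
                if (!list_empty(&node->private_list))
                        list_lru_del_obj(&shadow_nodes, &node->private_list);
        }
}
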
* Re: [LSF/MM/BPF TOPIC] Page cache tracking with the Maple Tree
From: Liam R. Howlett @ 2026-04-17 23:14 UTC
To: Matthew Wilcox
Cc: lsf-pc, linux-mm, linux-fsdevel, Johannes Weiner, David Hildenbrand, Jan Kara, Ryan Roberts, Christian Brauner

On 26/04/17 08:50PM, Matthew Wilcox wrote:
> On Tue, Feb 24, 2026 at 12:10:26PM -0500, Liam R. Howlett wrote:
> > The Maple Tree needs some enhancements:
> > - Support for purging shadow entries from the page cache
>
> Liam and I had some preliminary discussions yesterday around this and
> we'd like some feedback if anyone has time before LSFMM.
>
> For those who aren't aware, when a folio falls off the end of the LRU
> list, we store a shadow entry in the page cache in its place so that if
> we access that page again, we know where to put its folio in the LRU
> list.
>
> But this creates a problem (documented in mm/workingset.c) where we can
> fill up memory with shadow entries.  Currently, we embed a list_head in
> xa_node and add nodes which contain only shadow entries to a list which
> can be walked by a shrinker when we're low on memory.  Ideally we
> wouldn't do that with the maple tree.  There are a few options.
>
> The first question we have is whether it's best to keep nodes around to
> wait for a shrinker to kick in.  Was any experimentation done to see
> whether eagerly freeing a node that contains only shadow entries has a
> bad effect on performance?
>
> The second idea we talked about is that the maple tree is much more
> flexible than the radix tree.  Having even a single folio in a node pins
> the entire node, so it's "free" to keep the shadow entries in that node
> around.  But with the maple tree, we can be much more granular and
> delete shadow entries in arbitrary positions.  So we could (for example)
> keep track of inodes which contain shadow entries and purge shadow
> entries when they reach, say, 10% of the number of pages.  Or 1000
> entries, or some other threshold.
>
> The third idea is that, instead of having an injected list_head, we keep
> a tree pointing to inodes (or even just to maple tree nodes which
> contain a lot of shadow entries).  That's not how list_lru works today,
> so a certain amount of development work would need to be done.
>
> Liam, anything I missed?

The only thing is that the order-6 trick can be removed and we can support
folios of any order.  Besides that, there are some ideas on implementing
the different paths.
* Re: [LSF/MM/BPF TOPIC] Page cache tracking with the Maple Tree
From: Johannes Weiner @ 2026-04-20 20:18 UTC
To: Matthew Wilcox
Cc: Liam R. Howlett, lsf-pc, linux-mm, linux-fsdevel, David Hildenbrand, Jan Kara, Ryan Roberts, Christian Brauner

On Fri, Apr 17, 2026 at 08:50:01PM +0100, Matthew Wilcox wrote:
> On Tue, Feb 24, 2026 at 12:10:26PM -0500, Liam R. Howlett wrote:
> > The Maple Tree needs some enhancements:
> > - Support for purging shadow entries from the page cache
>
> Liam and I had some preliminary discussions yesterday around this and
> we'd like some feedback if anyone has time before LSFMM.
>
> For those who aren't aware, when a folio falls off the end of the LRU
> list, we store a shadow entry in the page cache in its place so that if
> we access that page again, we know where to put its folio in the LRU
> list.
>
> But this creates a problem (documented in mm/workingset.c) where we can
> fill up memory with shadow entries.  Currently, we embed a list_head in
> xa_node and add nodes which contain only shadow entries to a list which
> can be walked by a shrinker when we're low on memory.  Ideally we
> wouldn't do that with the maple tree.  There are a few options.
>
> The first question we have is whether it's best to keep nodes around to
> wait for a shrinker to kick in.  Was any experimentation done to see
> whether eagerly freeing a node that contains only shadow entries has a
> bad effect on performance?

Hm, I'm not sure how that could work.

The LRU order created by readahead makes it highly likely that all the
folios in a cache node are reclaimed/made non-resident at once.  Going
this route would destroy a large part of the non-resident cache the moment
it is created.

The goal is to garbage-collect the oldest shadow entries whose distances
are too long to be actionable at this point.  Specifically, their distance
to lruvec->nonresident_age (per-cgroup, per-node).

In the current scheme, we just go in the order in which nodes became
all-shadow - oldest first.  And we only do so lazily when the non-resident
cache is far into that territory (cache set vastly larger than available
memory).  That gives us confidence that we're mostly dropping very old
entries without having to look at them one by one.

We don't have to stick with that design, but whatever replaces it should
meet the goal, or approximate it well enough.

> The second idea we talked about is that the maple tree is much more
> flexible than the radix tree.  Having even a single folio in a node pins
> the entire node, so it's "free" to keep the shadow entries in that node
> around.  But with the maple tree, we can be much more granular and
> delete shadow entries in arbitrary positions.  So we could (for example)
> keep track of inodes which contain shadow entries and purge shadow
> entries when they reach, say, 10% of the number of pages.  Or 1000
> entries, or some other threshold.

It's not the volume or the concentration of shadows, it's their age that
makes them good candidates for garbage collection.

Searching the tree for all-shadow nodes should be possible, but you'd
still have to filter for age.  Naively, that would be unpack_shadow() ->
workingset_test_recent() for each one, but that's likely too heavy.

What might work is a monotonic ID counter for each node that becomes
all-shadow, then search the trees with quick comparisons for any IDs below
a certain cutoff?

> The third idea is that, instead of having an injected list_head, we keep
> a tree pointing to inodes (or even just to maple tree nodes which
> contain a lot of shadow entries).  That's not how list_lru works today,
> so a certain amount of development work would need to be done.

Ah, so you don't need storage inside the inode or maple tree node.

That could work well with the monotonic ID counter: you'd just scan and
reclaim through that tree, oldest node first.
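
As a sketch of what that monotonic ID could look like: shadow_group,
shadow_gen and the helpers below are hypothetical names, and where the
stamp would actually live (inode, node metadata, or a side structure) is
exactly the open question.

#include <linux/atomic.h>
#include <linux/fs.h>
#include <linux/list.h>

/* Hypothetical generation stamp, bumped once per group that turns all-shadow. */
static atomic_long_t shadow_gen = ATOMIC_LONG_INIT(0);

/* Hypothetical record of a contiguous run of shadow entries in one mapping. */
struct shadow_group {
        struct list_head lru;           /* FIFO: oldest generation first */
        struct address_space *mapping;
        pgoff_t first, last;            /* page index range covered */
        unsigned long gen;              /* shadow_gen value at creation */
};

static void shadow_group_created(struct shadow_group *grp)
{
        grp->gen = atomic_long_inc_return(&shadow_gen);
}

static bool shadow_group_reclaimable(struct shadow_group *grp,
                                     unsigned long cutoff)
{
        /* One cheap comparison instead of unpack_shadow() per entry. */
        return grp->gen <= cutoff;
}

The shrinker would then advance the cutoff as lruvec->nonresident_age moves
on and free everything stamped before it, approximating oldest-first
without unpacking individual shadow entries.
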
* Re: [LSF/MM/BPF TOPIC] Page cache tracking with the Maple Tree
From: Liam R. Howlett @ 2026-04-24 16:45 UTC
To: Johannes Weiner
Cc: Matthew Wilcox, lsf-pc, linux-mm, linux-fsdevel, David Hildenbrand, Jan Kara, Ryan Roberts, Christian Brauner

On 26/04/20 04:18PM, Johannes Weiner wrote:
> On Fri, Apr 17, 2026 at 08:50:01PM +0100, Matthew Wilcox wrote:
> > On Tue, Feb 24, 2026 at 12:10:26PM -0500, Liam R. Howlett wrote:
> > > The Maple Tree needs some enhancements:
> > > - Support for purging shadow entries from the page cache
> >
> > Liam and I had some preliminary discussions yesterday around this and
> > we'd like some feedback if anyone has time before LSFMM.
> >
> > For those who aren't aware, when a folio falls off the end of the LRU
> > list, we store a shadow entry in the page cache in its place so that
> > if we access that page again, we know where to put its folio in the
> > LRU list.
> >
> > But this creates a problem (documented in mm/workingset.c) where we
> > can fill up memory with shadow entries.  Currently, we embed a
> > list_head in xa_node and add nodes which contain only shadow entries
> > to a list which can be walked by a shrinker when we're low on memory.
> > Ideally we wouldn't do that with the maple tree.  There are a few
> > options.
> >
> > The first question we have is whether it's best to keep nodes around
> > to wait for a shrinker to kick in.  Was any experimentation done to
> > see whether eagerly freeing a node that contains only shadow entries
> > has a bad effect on performance?
>
> Hm, I'm not sure how that could work.
>
> The LRU order created by readahead makes it highly likely that all the
> folios in a cache node are reclaimed/made non-resident at once.  Going
> this route would destroy a large part of the non-resident cache the
> moment it is created.

Not the moment it is created, but the moment the entire node only has
NULLs or shadow entries.  Meaning it's removed the moment the last entry
becomes a shadow entry.  Still much sooner, but not as extreme as just
dropping them immediately.

> The goal is to garbage-collect the oldest shadow entries whose distances
> are too long to be actionable at this point.  Specifically, their
> distance to lruvec->nonresident_age (per-cgroup, per-node).
>
> In the current scheme, we just go in the order in which nodes became
> all-shadow - oldest first.  And we only do so lazily when the
> non-resident cache is far into that territory (cache set vastly larger
> than available memory).  That gives us confidence that we're mostly
> dropping very old entries without having to look at them one by one.
>
> We don't have to stick with that design, but whatever replaces it should
> meet the goal, or approximate it well enough.

The goal is to garbage collect the oldest shadow entries based on the
nonresident_age.  That's a broad target, which I think is made more
specific within the implementation to 'groupings of 64 shadow entries',
which I bring up later.

Besides the locking improvements that we can bring, is there anything that
you have found doesn't work optimally in the current solution that may be
nice to fix?

> > The second idea we talked about is that the maple tree is much more
> > flexible than the radix tree.  Having even a single folio in a node
> > pins the entire node, so it's "free" to keep the shadow entries in
> > that node around.
> > But with the maple tree, we can be much more granular and delete
> > shadow entries in arbitrary positions.  So we could (for example) keep
> > track of inodes which contain shadow entries and purge shadow entries
> > when they reach, say, 10% of the number of pages.  Or 1000 entries, or
> > some other threshold.
>
> It's not the volume or the concentration of shadows, it's their age that
> makes them good candidates for garbage collection.

Maybe I'm reading this wrong, but it seems workingset_update_node() adds
nodes that are composed entirely of shadow entries to the list (or removes
a node if we have changed one of its entries back to resident).

Doesn't that mean there is a concentration of shadow entries, since the
entries in the tree are in order?

That is, reclaim is at node-level granularity, and the nodes are sorted
sets.  I think, sibling entries aside, it's fair to call that a grouping
of 64 shadow entries?

This also means that with today's code we are keeping older entries over
newer entries because the node has at least one resident entry.  But
that's fine, because we can't gain anything by removing them today.

So we can change this behaviour to something new if we want, if 64 shadow
entries are too many (or too few) to accumulate before removal.

> Searching the tree for all-shadow nodes should be possible, but you'd
> still have to filter for age.  Naively, that would be unpack_shadow() ->
> workingset_test_recent() for each one, but that's likely too heavy.
>
> What might work is a monotonic ID counter for each node that becomes
> all-shadow, then search the trees with quick comparisons for any IDs
> below a certain cutoff?
>
> > The third idea is that, instead of having an injected list_head, we
> > keep a tree pointing to inodes (or even just to maple tree nodes which
> > contain a lot of shadow entries).  That's not how list_lru works
> > today, so a certain amount of development work would need to be done.
>
> Ah, so you don't need storage inside the inode or maple tree node.
>
> That could work well with the monotonic ID counter: you'd just scan and
> reclaim through that tree, oldest node first.

I think this would end up looking like a linked list with a few extra
metadata points.  Basically a linked list, node pointer, min of the node,
maybe a size if we support multiple node types.  I'm not sure the node
pointer will do us much good, considering how the maple tree node
allocation needs to track the tree status on walking to minimize
allocations.  In fact, if we do a write that rebalances the tree, we may
have stale pointers.  I don't expect this to happen under any normal
circumstance - the node would need to be altered, but maybe the node type
changes.

It might be better to just track ranges and the largest (thus newest to
become a shadow) monotonic ID counter.  There are many ways to do this,
but it'd probably be best to decide on a size that makes sense (I say size
because of the siblings).

On that note, the siblings stuff will go away with the maple tree as well.

Thanks,
Liam
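
On the stale-pointer concern: if what gets recorded is an index range
rather than a node pointer, the purge can be keyed by index and survives
any rebalancing that happens in between.  A rough sketch using the
existing maple tree iteration API; purge_shadow_range() itself is
hypothetical, and a real version would have to sort out locking and the
allocations an erase can need.

#include <linux/maple_tree.h>
#include <linux/xarray.h>       /* xa_is_value() for shadow entries */

/*
 * Hypothetical purge of the shadow entries in [first, last] of a mapping
 * tracked by a maple tree.  Because the walk is keyed by index, it does
 * not matter whether the tree was rebalanced since the range was
 * recorded; a remembered node pointer could have gone stale.
 */
static void purge_shadow_range(struct maple_tree *mt,
                               unsigned long first, unsigned long last)
{
        MA_STATE(mas, mt, first, last);
        void *entry;

        mas_lock(&mas);
        mas_for_each(&mas, entry, last) {
                if (xa_is_value(entry))         /* shadow, not a folio */
                        mas_erase(&mas);
        }
        mas_unlock(&mas);
}
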
* Re: [LSF/MM/BPF TOPIC] Page cache tracking with the Maple Tree
From: Johannes Weiner @ 2026-04-24 20:00 UTC
To: Liam R. Howlett
Cc: Matthew Wilcox, lsf-pc, linux-mm, linux-fsdevel, David Hildenbrand, Jan Kara, Ryan Roberts, Christian Brauner

On Fri, Apr 24, 2026 at 12:45:32PM -0400, Liam R. Howlett wrote:
> On 26/04/20 04:18PM, Johannes Weiner wrote:
> > On Fri, Apr 17, 2026 at 08:50:01PM +0100, Matthew Wilcox wrote:
> > > On Tue, Feb 24, 2026 at 12:10:26PM -0500, Liam R. Howlett wrote:
> > > > The Maple Tree needs some enhancements:
> > > > - Support for purging shadow entries from the page cache
> > >
> > > Liam and I had some preliminary discussions yesterday around this
> > > and we'd like some feedback if anyone has time before LSFMM.
> > >
> > > For those who aren't aware, when a folio falls off the end of the
> > > LRU list, we store a shadow entry in the page cache in its place so
> > > that if we access that page again, we know where to put its folio
> > > in the LRU list.
> > >
> > > But this creates a problem (documented in mm/workingset.c) where we
> > > can fill up memory with shadow entries.  Currently, we embed a
> > > list_head in xa_node and add nodes which contain only shadow
> > > entries to a list which can be walked by a shrinker when we're low
> > > on memory.  Ideally we wouldn't do that with the maple tree.  There
> > > are a few options.
> > >
> > > The first question we have is whether it's best to keep nodes
> > > around to wait for a shrinker to kick in.  Was any experimentation
> > > done to see whether eagerly freeing a node that contains only
> > > shadow entries has a bad effect on performance?
> >
> > Hm, I'm not sure how that could work.
> >
> > The LRU order created by readahead makes it highly likely that all
> > the folios in a cache node are reclaimed/made non-resident at once.
> > Going this route would destroy a large part of the non-resident cache
> > the moment it is created.
>
> Not the moment it is created, but the moment the entire node only has
> NULLs or shadow entries.  Meaning it's removed the moment the last
> entry becomes a shadow entry.

I'm saying they can often be the same event.

Readahead means LRU order matches file index / node range order.  So
there is a good chance that reclaim batches will create all shadow
entries in a node in one go.

> Still much sooner, but not as extreme as just dropping them immediately.
>
> > The goal is to garbage-collect the oldest shadow entries whose
> > distances are too long to be actionable at this point.  Specifically,
> > their distance to lruvec->nonresident_age (per-cgroup, per-node).
> >
> > In the current scheme, we just go in the order in which nodes became
> > all-shadow - oldest first.  And we only do so lazily when the
> > non-resident cache is far into that territory (cache set vastly
> > larger than available memory).  That gives us confidence that we're
> > mostly dropping very old entries without having to look at them one
> > by one.
> >
> > We don't have to stick with that design, but whatever replaces it
> > should meet the goal, or approximate it well enough.
>
> The goal is to garbage collect the oldest shadow entries based on the
> nonresident_age.  That's a broad target, which I think is made more
> specific within the implementation to 'groupings of 64 shadow entries',
> which I bring up later.
>
> Besides the locking improvements that we can bring, is there anything
> that you have found doesn't work optimally in the current solution that
> may be nice to fix?
>
> > > The second idea we talked about is that the maple tree is much more
> > > flexible than the radix tree.  Having even a single folio in a node
> > > pins the entire node, so it's "free" to keep the shadow entries in
> > > that node around.  But with the maple tree, we can be much more
> > > granular and delete shadow entries in arbitrary positions.  So we
> > > could (for example) keep track of inodes which contain shadow
> > > entries and purge shadow entries when they reach, say, 10% of the
> > > number of pages.  Or 1000 entries, or some other threshold.
> >
> > It's not the volume or the concentration of shadows, it's their age
> > that makes them good candidates for garbage collection.
>
> Maybe I'm reading this wrong, but it seems workingset_update_node()
> adds nodes that are composed entirely of shadow entries to the list (or
> removes a node if we have changed one of its entries back to resident).
>
> Doesn't that mean there is a concentration of shadow entries, since the
> entries in the tree are in order?
>
> That is, reclaim is at node-level granularity, and the nodes are sorted
> sets.  I think, sibling entries aside, it's fair to call that a
> grouping of 64 shadow entries?
>
> This also means that with today's code we are keeping older entries
> over newer entries because the node has at least one resident entry.
> But that's fine, because we can't gain anything by removing them today.

Yes, that's a great catch.  It's the last eviction, i.e. the very
youngest shadow entry at this point, that suddenly demotes the whole node
to the "old" set.  And there could be much older entries in circulation
that just happen to still be next to a resident entry.

But I think this would require a pretty spectacular breakdown in access
locality.  More likely you get mixed nodes when individual entries
refault.  And yes, not much we can or need to do about it, since the
resident entries pin the whole node anyway.

So the question, I think, is still: in what order do you reclaim a
gazillion all-shadow nodes when the time comes?  And the answer to that,
IMO, is still best approximated by going in the order in which they
became all-shadow.
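
Tying the earlier sketches together, the consumer side could then be a
plain oldest-first walk of a FIFO of shadow groups.  Everything here is
hypothetical (there is no i_pages_mt maple tree in struct address_space
today; mapping->i_pages is still an XArray), and mapping refcounting and
locking are ignored.

#include <linux/list.h>
#include <linux/slab.h>

/*
 * Hypothetical scan callback.  The fifo list is kept in the order in
 * which groups became all-shadow, so walking from the head frees the
 * oldest ranges first - the same ordering the xa_node list_lru gives us
 * today.  Reuses the shadow_group and purge_shadow_range() sketches
 * above; i_pages_mt is a made-up maple tree member of the mapping.
 */
static unsigned long shadow_groups_scan(struct list_head *fifo,
                                        unsigned long nr_to_scan)
{
        struct shadow_group *grp, *next;
        unsigned long freed = 0;

        list_for_each_entry_safe(grp, next, fifo, lru) {
                if (freed >= nr_to_scan)
                        break;
                purge_shadow_range(&grp->mapping->i_pages_mt,
                                   grp->first, grp->last);
                list_del(&grp->lru);
                kfree(grp);
                freed++;
        }
        return freed;
}
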
Thread overview: 6+ messages (newest: 2026-04-24 20:00 UTC)

2026-02-24 17:10 [LSF/MM/BPF TOPIC] Page cache tracking with the Maple Tree Liam R. Howlett
2026-04-17 19:50 ` Matthew Wilcox
2026-04-17 23:14   ` Liam R. Howlett
2026-04-20 20:18   ` Johannes Weiner
2026-04-24 16:45     ` Liam R. Howlett
2026-04-24 20:00       ` Johannes Weiner