* [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
@ 2026-03-30 21:23 Lorenzo Stoakes (Oracle)
2026-03-31 23:30 ` Barry Song
2026-05-02 6:53 ` Lorenzo Stoakes
0 siblings, 2 replies; 10+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-30 21:23 UTC (permalink / raw)
To: lsf-pc
Cc: linux-mm, David Hildenbrand, Liam R. Howlett, Vlastimil Babka,
Suren Baghdasaryan, Pedro Falcato, Ryan Roberts, Harry Yoo,
Rik van Riel, Jann Horn, Chris Li, Barry Song
[sorry subject line was typo'd, resending with correct subject line for
visibility. Original at
https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/]
Currently we track the reverse mapping between folios and VMAs at a VMA level,
utilising a complicated and confusing combination of anon_vma objects and
anon_vma_chain's linking them, which must be updated when VMAs are split,
merged, remapped or forked.
It's further complicated by various optimisations intended to avoid scalability
issues in locking and memory allocation.
I have done recent work to improve the situation [0] which has also led to a
reported improvement in lock scalability [1], but fundamentally the situation
remains the same.
The logic is actually, when you think hard enough about it, a fairly
reasonable means of implementing the reverse mapping at a VMA level.
It is, however, a very broken abstraction as it stands. In order to work with
the logic, you have to essentially keep a broad understanding of the entire
implementation in your head at one time - that is, not much is really
abstracted.
This results in confusion, mistakes, and bit rot. It's also very time-consuming
to work with - personally I've gone to the lengths of writing a private set of
slides for myself on the topic as a reminder each time I come back to it.
There are also issues with lock scalability - the use of interval trees to
maintain a connection between an anon_vma and AVCs connected to VMAs requires
that a lock must be held across the entire 'CoW hierarchy' of parent and child
VMAs whenever performing an rmap walk or performing a merge, split, remap or
fork.
This is because we tear down all interval tree mappings and reestablish them
each time we might see changes in VMA geometry. This is an issue Barry Song
identified as problematic in a real world use case [2].
So what do we do to improve the situation?
Recently I have been working on an experimental new approach to the anonymous
reverse mapping, in which we instead track anonymous remaps, and then use the
VMA's virtual page offset to locate VMAs from the folio.
I have got the implementation working to the point where it tracks the exact
same VMAs as the anon_vma implementation, and it seems a lot of it can be done
under RCU.
It avoids the need to maintain expensive mappings at a VMA level, though it
incurs a cost in tracking remaps, and MAP_PRIVATE files are very much a TODO
(they maintain a file vma->vm_pgoff, even when CoW'd, so the remap tracking is
pretty sub-optimal).
I am investigating whether I can change how MAP_PRIVATE file-backed mappings
work to avoid this issue, and will be developing tests to see how lock
scalability, throughput and memory usage compare to the anon_vma approach under
different workloads.
This experiment may or may not work out, either way it will be interesting to
discuss it.
By the time LSF/MM comes around I may even have already decided on a different
approach but that's what makes things interesting :)
[0]: https://lore.kernel.org/all/cover.1767711638.git.lorenzo.stoakes@oracle.com/
[1]: https://lore.kernel.org/all/202602061747.855f053f-lkp@intel.com/
[2]: https://lore.kernel.org/linux-mm/CAGsJ_4x=YsQR=nNcHA-q=0vg0b7ok=81C_qQqKmoJ+BZ+HVduQ@mail.gmail.com/
Cheers, Lorenzo
* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
2026-03-30 21:23 [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND] Lorenzo Stoakes (Oracle)
@ 2026-03-31 23:30 ` Barry Song
2026-04-01 8:43 ` Lorenzo Stoakes (Oracle)
2026-05-02 6:53 ` Lorenzo Stoakes
1 sibling, 1 reply; 10+ messages in thread
From: Barry Song @ 2026-03-31 23:30 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett,
Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts,
Harry Yoo, Rik van Riel, Jann Horn, Chris Li
Hi Lorenzo,
Thank you very much for bringing this up for discussion.
On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> [sorry subject line was typo'd, resending with correct subject line for
> visibility. Original at
> https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/]
>
> Currently we track the reverse mapping between folios and VMAs at a VMA level,
> utilising a complicated and confusing combination of anon_vma objects and
> anon_vma_chain's linking them, which must be updated when VMAs are split,
> merged, remapped or forked.
>
> It's further complicated by various optimisations intended to avoid scalability
> issues in locking and memory allocation.
>
> I have done recent work to improve the situation [0] which has also led to a
> reported improvement in lock scalability [1], but fundamentally the situation
> remains the same.
>
> The logic is actually, when you think hard enough about it, a fairly
> reasonable means of implementing the reverse mapping at a VMA level.
>
> It is, however, a very broken abstraction as it stands. In order to work with
> the logic, you have to essentially keep a broad understanding of the entire
> implementation in your head at one time - that is, not much is really
> abstracted.
>
> This results in confusion, mistakes, and bit rot. It's also very time-consuming
> to work with - personally I've gone to the lengths of writing a private set of
> slides for myself on the topic as a reminder each time I come back to it.
>
> There are also issues with lock scalability - the use of interval trees to
> maintain a connection between an anon_vma and AVCs connected to VMAs requires
> that a lock must be held across the entire 'CoW hierarchy' of parent and child
> VMAs whenever performing an rmap walk or performing a merge, split, remap or
> fork.
>
> This is because we tear down all interval tree mappings and reestablish them
> each time we might see changes in VMA geometry. This is an issue Barry Song
> identified as problematic in a real world use case [2].
>
> So what do we do to improve the situation?
>
> Recently I have been working on an experimental new approach to the anonymous
> reverse mapping, in which we instead track anonymous remaps, and then use the
> VMA's virtual page offset to locate VMAs from the folio.
Please forgive my confusion. I’m still struggling to fully
understand your approach of “tracking anonymous remaps.”
Could you provide a concrete example to illustrate how it works?
For example, if A forks B, and then B forks C, how do we
determine the VMAs for a folio from the original A that has
not yet been COWed in B or C?
Additionally, if B COWs and obtains a new folio before forking
C, how do we determine its VMAs in B and C?
Also, what happens if C performs a remap on the inherited VMA
in the two cases described above?
>
> I have got the implementation working to the point where it tracks the exact
> same VMAs as the anon_vma implementation, and it seems a lot of it can be done
> under RCU.
>
> It avoids the need to maintain expensive mappings at a VMA level, though it
> incurs a cost in tracking remaps, and MAP_PRIVATE files are very much a TODO
> (they maintain a file vma->vm_pgoff, even when CoW'd, so the remap tracking is
> pretty sub-optimal).
>
> I am investigating whether I can change how MAP_PRIVATE file-backed mappings
> work to avoid this issue, and will be developing tests to see how lock
> scalability, throughput and memory usage compare to the anon_vma approach under
> different workloads.
>
> This experiment may or may not work out, either way it will be interesting to
> discuss it.
>
> By the time LSF/MM comes around I may even have already decided on a different
> approach but that's what makes things interesting :)
>
> [0]:https://lore.kernel.org/all/cover.1767711638.git.lorenzo.stoakes@oracle.com/
> [1]:https://lore.kernel.org/all/202602061747.855f053f-lkp@intel.com/
> [2]:https://lore.kernel.org/linux-mm/CAGsJ_4x=YsQR=nNcHA-q=0vg0b7ok=81C_qQqKmoJ+BZ+HVduQ@mail.gmail.com/
>
> Cheers, Lorenzo
Thanks
Barry
* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
2026-03-31 23:30 ` Barry Song
@ 2026-04-01 8:43 ` Lorenzo Stoakes (Oracle)
2026-04-01 21:03 ` Barry Song
0 siblings, 1 reply; 10+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-04-01 8:43 UTC (permalink / raw)
To: Barry Song
Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett,
Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts,
Harry Yoo, Rik van Riel, Jann Horn, Chris Li
On Wed, Apr 01, 2026 at 07:30:41AM +0800, Barry Song wrote:
> Hi Lorenzo,
>
> Thank you very much for bringing this up for discussion.
>
> On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > [sorry subject line was typo'd, resending with correct subject line for
> > visibility. Original at
> > https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/]
> >
> > Currently we track the reverse mapping between folios and VMAs at a VMA level,
> > utilising a complicated and confusing combination of anon_vma objects and
> > anon_vma_chain's linking them, which must be updated when VMAs are split,
> > merged, remapped or forked.
> >
> > It's further complicated by various optimisations intended to avoid scalability
> > issues in locking and memory allocation.
> >
> > I have done recent work to improve the situation [0] which has also led to a
> > reported improvement in lock scalability [1], but fundamentally the situation
> > remains the same.
> >
> > The logic is actually, when you think hard enough about it, a fairly
> > reasonable means of implementing the reverse mapping at a VMA level.
> >
> > It is, however, a very broken abstraction as it stands. In order to work with
> > the logic, you have to essentially keep a broad understanding of the entire
> > implementation in your head at one time - that is, not much is really
> > abstracted.
> >
> > This results in confusion, mistakes, and bit rot. It's also very time-consuming
> > to work with - personally I've gone to the lengths of writing a private set of
> > slides for myself on the topic as a reminder each time I come back to it.
> >
> > There are also issues with lock scalability - the use of interval trees to
> > maintain a connection between an anon_vma and AVCs connected to VMAs requires
> > that a lock must be held across the entire 'CoW hierarchy' of parent and child
> > VMAs whenever performing an rmap walk or performing a merge, split, remap or
> > fork.
> >
> > This is because we tear down all interval tree mappings and reestablish them
> > each time we might see changes in VMA geometry. This is an issue Barry Song
> > identified as problematic in a real world use case [2].
> >
> > So what do we do to improve the situation?
> >
> > Recently I have been working on an experimental new approach to the anonymous
> > reverse mapping, in which we instead track anonymous remaps, and then use the
> > VMA's virtual page offset to locate VMAs from the folio.
>
> Please forgive my confusion. I’m still struggling to fully
> understand your approach of “tracking anonymous remaps.”
> Could you provide a concrete example to illustrate how it works?
I should really put this code somewhere :)
>
> For example, if A forks B, and then B forks C, how do we
> determine the VMAs for a folio from the original A that has
> not yet been COWed in B or C?
The folio references the cow_context associated with the mm in A.
So the mm has a new cow_context field that points to a cow_context, and the
cow_context can outlive the mm if it has children.
Each cow context tracks its forked children also, so an rmap search will
traverse A, B, C.
>
> Additionally, if B COWs and obtains a new folio before forking
> C, how do we determine its VMAs in B and C?
The new folio would point to B's cow context, and the walk would traverse B and
C to find the relevant mappings.
Overall we pay a higher search price (though arguably, not too bad still) but
get to do it _all_ under RCU.
In exchange, we avoid the locking issues and use ~30x less memory.
(Of course I have yet to solve rmap lock stabilisation, so I've got to try and
do that first :)
>
> Also, what happens if C performs a remap on the inherited VMA
> in the two cases described above?
Remaps are tracked within cow_context's via an extended maple tree (currently
maple tree -> dynamic arrays) that also handles multiple entries and overlaps.
>
> >
> > I have got the implementation working to the point where it tracks the exact
> > same VMAs as the anon_vma implementation, and it seems a lot of it can be done
> > under RCU.
> >
> > It avoids the need to maintain expensive mappings at a VMA level, though it
> > incurs a cost in tracking remaps, and MAP_PRIVATE files are very much a TODO
> > (they maintain a file vma->vm_pgoff, even when CoW'd, so the remap tracking is
> > pretty sub-optimal).
> >
> > I am investigating whether I can change how MAP_PRIVATE file-backed mappings
> > work to avoid this issue, and will be developing tests to see how lock
> > scalability, throughput and memory usage compare to the anon_vma approach under
> > different workloads.
> >
> > This experiment may or may not work out, either way it will be interesting to
> > discuss it.
> >
> > By the time LSF/MM comes around I may even have already decided on a different
> > approach but that's what makes things interesting :)
> >
> > [0]:https://lore.kernel.org/all/cover.1767711638.git.lorenzo.stoakes@oracle.com/
> > [1]:https://lore.kernel.org/all/202602061747.855f053f-lkp@intel.com/
> > [2]:https://lore.kernel.org/linux-mm/CAGsJ_4x=YsQR=nNcHA-q=0vg0b7ok=81C_qQqKmoJ+BZ+HVduQ@mail.gmail.com/
> >
> > Cheers, Lorenzo
>
> Thanks
> Barry
Cheers, Lorenzo
* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
2026-04-01 8:43 ` Lorenzo Stoakes (Oracle)
@ 2026-04-01 21:03 ` Barry Song
2026-04-02 12:20 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2026-04-01 21:03 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett,
Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts,
Harry Yoo, Rik van Riel, Jann Horn, Chris Li
On Wed, Apr 1, 2026 at 4:43 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Wed, Apr 01, 2026 at 07:30:41AM +0800, Barry Song wrote:
> > Hi Lorenzo,
> >
> > Thank you very much for bringing this up for discussion.
> >
> > On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > >
> > > [sorry subject line was typo'd, resending with correct subject line for
> > > visibility. Original at
> > > https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/]
> > >
> > > Currently we track the reverse mapping between folios and VMAs at a VMA level,
> > > utilising a complicated and confusing combination of anon_vma objects and
> > > anon_vma_chain's linking them, which must be updated when VMAs are split,
> > > merged, remapped or forked.
> > >
> > > It's further complicated by various optimisations intended to avoid scalability
> > > issues in locking and memory allocation.
> > >
> > > I have done recent work to improve the situation [0] which has also led to a
> > > reported improvement in lock scalability [1], but fundamentally the situation
> > > remains the same.
> > >
> > > The logic is actually, when you think hard enough about it, a fairly
> > > reasonable means of implementing the reverse mapping at a VMA level.
> > >
> > > It is, however, a very broken abstraction as it stands. In order to work with
> > > the logic, you have to essentially keep a broad understanding of the entire
> > > implementation in your head at one time - that is, not much is really
> > > abstracted.
> > >
> > > This results in confusion, mistakes, and bit rot. It's also very time-consuming
> > > to work with - personally I've gone to the lengths of writing a private set of
> > > slides for myself on the topic as a reminder each time I come back to it.
> > >
> > > There are also issues with lock scalability - the use of interval trees to
> > > maintain a connection between an anon_vma and AVCs connected to VMAs requires
> > > that a lock must be held across the entire 'CoW hierarchy' of parent and child
> > > VMAs whenever performing an rmap walk or performing a merge, split, remap or
> > > fork.
> > >
> > > This is because we tear down all interval tree mappings and reestablish them
> > > each time we might see changes in VMA geometry. This is an issue Barry Song
> > > identified as problematic in a real world use case [2].
> > >
> > > So what do we do to improve the situation?
> > >
> > > Recently I have been working on an experimental new approach to the anonymous
> > > reverse mapping, in which we instead track anonymous remaps, and then use the
> > > VMA's virtual page offset to locate VMAs from the folio.
> >
> > Please forgive my confusion. I’m still struggling to fully
> > understand your approach of “tracking anonymous remaps.”
> > Could you provide a concrete example to illustrate how it works?
>
> I should really put this code somewhere :)
>
> >
> > For example, if A forks B, and then B forks C, how do we
> > determine the VMAs for a folio from the original A that has
> > not yet been COWed in B or C?
>
> The folio references the cow_context associated with the mm in A.
>
> So mm has a new cow_context field that points to cow_context, and the
> cow_context can outlive the mm if it has children.
So we can’t use list_for_each_entry_rcu(child, &parent->children, sibling)
because in vfork() and exec() cases the mm_struct is not inherited?
>
> Each cow context tracks its forked children also, so an rmap search will
> traverse A, B, C.
I still don’t understand how we can get a folio’s VMA from the folio itself.
For anonymous VMAs, vma->vm_pgoff is always zero, right?
Are you changing vm_pgoff to a value equal to vm_start >> PAGE_SHIFT?
In case A forks B, and B unmaps a VMA then maps a new
VMA at the same address as before, what happens? Will the
traversal find the new VMA, which doesn’t actually map the folio?
>
> >
> > Additionally, if B COWs and obtains a new folio before forking
> > C, how do we determine its VMAs in B and C?
>
> The new folio would point to B's cow context, and it'd traverse B and C to find
> relevant folios.
>
> Overall we pay a higher search price (though arguably, not too bad still) but
> get to do it _all_ under RCU.
Yep. I see that list_for_each_entry_rcu(child, &parent->children, sibling)
can work safely under RCU.
>
> In exchange, we avoid the locking issues and use ~30x less memory.
>
> (Of course I am yet to solve rmap lock stabilisation so got to try and do that
> first :)
>
> >
> > Also, what happens if C performs a remap on the inherited VMA
> > in the two cases described above?
>
> Remaps are tracked within cow_context's via an extended maple tree (currently
> maple tree -> dynamic arrays) that also handles multiple entries and overlaps.
If we have multiple remaps for multiple VMAs within one mm_struct,
will we end up traversing all the dynamic arrays for any folio that
might be located in a VMA that has been remapped?
>
> >
> > >
> > > I have got the implementation working to the point where it tracks the exact
> > > same VMAs as the anon_vma implementation, and it seems a lot of it can be done
> > > under RCU.
> > >
> > > It avoids the need to maintain expensive mappings at a VMA level, though it
> > > incurs a cost in tracking remaps, and MAP_PRIVATE files are very much a TODO
> > > (they maintain a file vma->vm_pgoff, even when CoW'd, so the remap tracking is
> > > pretty sub-optimal).
> > >
> > > I am investigating whether I can change how MAP_PRIVATE file-backed mappings
> > > work to avoid this issue, and will be developing tests to see how lock
> > > scalability, throughput and memory usage compare to the anon_vma approach under
> > > different workloads.
> > >
> > > This experiment may or may not work out, either way it will be interesting to
> > > discuss it.
> > >
> > > By the time LSF/MM comes around I may even have already decided on a different
> > > approach but that's what makes things interesting :)
> > >
> > > [0]:https://lore.kernel.org/all/cover.1767711638.git.lorenzo.stoakes@oracle.com/
> > > [1]:https://lore.kernel.org/all/202602061747.855f053f-lkp@intel.com/
> > > [2]:https://lore.kernel.org/linux-mm/CAGsJ_4x=YsQR=nNcHA-q=0vg0b7ok=81C_qQqKmoJ+BZ+HVduQ@mail.gmail.com/
> > >
Thanks
Barry
* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
2026-04-01 21:03 ` Barry Song
@ 2026-04-02 12:20 ` Lorenzo Stoakes (Oracle)
2026-04-02 21:49 ` Barry Song
0 siblings, 1 reply; 10+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-04-02 12:20 UTC (permalink / raw)
To: Barry Song
Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett,
Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts,
Harry Yoo, Rik van Riel, Jann Horn, Chris Li
On Thu, Apr 02, 2026 at 05:03:42AM +0800, Barry Song wrote:
> On Wed, Apr 1, 2026 at 4:43 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Wed, Apr 01, 2026 at 07:30:41AM +0800, Barry Song wrote:
> > > Hi Lorenzo,
> > >
> > > Thank you very much for bringing this up for discussion.
> > >
> > > On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > > >
> > > > [sorry subject line was typo'd, resending with correct subject line for
> > > > visibility. Original at
> > > > https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/]
> > > >
> > > > Currently we track the reverse mapping between folios and VMAs at a VMA level,
> > > > utilising a complicated and confusing combination of anon_vma objects and
> > > > anon_vma_chain's linking them, which must be updated when VMAs are split,
> > > > merged, remapped or forked.
> > > >
> > > > It's further complicated by various optimisations intended to avoid scalability
> > > > issues in locking and memory allocation.
> > > >
> > > > I have done recent work to improve the situation [0] which has also led to a
> > > > reported improvement in lock scalability [1], but fundamentally the situation
> > > > remains the same.
> > > >
> > > > The logic is actually, when you think hard enough about it, a fairly
> > > > reasonable means of implementing the reverse mapping at a VMA level.
> > > >
> > > > It is, however, a very broken abstraction as it stands. In order to work with
> > > > the logic, you have to essentially keep a broad understanding of the entire
> > > > implementation in your head at one time - that is, not much is really
> > > > abstracted.
> > > >
> > > > This results in confusion, mistakes, and bit rot. It's also very time-consuming
> > > > to work with - personally I've gone to the lengths of writing a private set of
> > > > slides for myself on the topic as a reminder each time I come back to it.
> > > >
> > > > There are also issues with lock scalability - the use of interval trees to
> > > > maintain a connection between an anon_vma and AVCs connected to VMAs requires
> > > > that a lock must be held across the entire 'CoW hierarchy' of parent and child
> > > > VMAs whenever performing an rmap walk or performing a merge, split, remap or
> > > > fork.
> > > >
> > > > This is because we tear down all interval tree mappings and reestablish them
> > > > each time we might see changes in VMA geometry. This is an issue Barry Song
> > > > identified as problematic in a real world use case [2].
> > > >
> > > > So what do we do to improve the situation?
> > > >
> > > > Recently I have been working on an experimental new approach to the anonymous
> > > > reverse mapping, in which we instead track anonymous remaps, and then use the
> > > > VMA's virtual page offset to locate VMAs from the folio.
> > >
> > > Please forgive my confusion. I’m still struggling to fully
> > > understand your approach of “tracking anonymous remaps.”
> > > Could you provide a concrete example to illustrate how it works?
> >
> > I should really put this code somewhere :)
> >
> > >
> > > For example, if A forks B, and then B forks C, how do we
> > > determine the VMAs for a folio from the original A that has
> > > not yet been COWed in B or C?
> >
> > The folio references the cow_context associated with the mm in A.
> >
> > So mm has a new cow_context field that points to cow_context, and the
> > cow_context can outlive the mm if it has children.
>
> So we can’t use list_for_each_entry_rcu(child, &parent->children, sibling)
> because in vfork() and exec() cases the mm_struct is not inherited?
Umm, memory is not preserved across an exec() :) so it works fine with that.
vfork() is CLONE_VM so the mm is shared and everything works fine.
>
> >
> > Each cow context tracks its forked children also, so an rmap search will
> > traverse A, B, C.
>
> I still don’t understand how we can get a folio’s VMA from the folio itself.
> For anonymous VMAs, vma->vm_pgoff is always zero, right?
No, not at all.
vma->vm_pgoff is equal to vma->vm_start >> PAGE_SHIFT when first faulted in for
anon.
That reduces the problem to tracking remaps, which I do.
>
> Are you changing vm_pgoff to a value equal to vm_start >> PAGE_SHIFT?
No, that's how anon works already.
>
> In case A forks B, and B unmaps a VMA then maps a new
> VMA at the same address as before, what happens? Will the
> traversal find the new VMA, which doesn’t actually map the folio?
Well, you're missing something there: the folio would have to be non-anon
exclusive (which is rare). Yes, it'd find the new VMA, traverse it, find the
folio does not match, and then traverse the children.
rmap walks _always_ allow for walking VMAs that a folio does not belong
to.
For instance, with anon_vma, if you CoW a bunch of folios into child process
VMAs, an rmap walk of a non-CoW'd folio will _still_ traverse all of that uselessly.
In any case, this isn't a common case.
However note that if a folio _becomes_ anon exclusive, it switches its 'root'
cow context to the one associated with the mm which it became exclusive to.
>
> >
> > >
> > > Additionally, if B COWs and obtains a new folio before forking
> > > C, how do we determine its VMAs in B and C?
> >
> > The new folio would point to B's cow context, and it'd traverse B and C to find
> > relevant folios.
> >
> > Overall we pay a higher search price (though arguably, not too bad still) but
> > get to do it _all_ under RCU.
>
> Yep. I see that list_for_each_entry_rcu(child, &parent->children, sibling)
> can work safely under RCU.
>
> >
> > In exchange, we avoid the locking issues and use ~30x less memory.
> >
> > (Of course I am yet to solve rmap lock stabilisation so got to try and do that
> > first :)
> >
> > >
> > > Also, what happens if C performs a remap on the inherited VMA
> > > in the two cases described above?
> >
> > Remaps are tracked within cow_context's via an extended maple tree (currently
> > maple tree -> dynamic arrays) that also handles multiple entries and overlaps.
>
> If we have multiple remaps for multiple VMAs within one mm_struct,
> will we end up traversing all the dynamic arrays for any folio that
> might be located in a VMA that has been remapped?
Yup. But there aren't all that many, and it's all under RCU so :)
That part of the search should be quick, parts of the search involving page
tables, less so.
Also I need to figure out how to maintain stabilisation without an rmap lock, an
ongoing open problem in all this.
In the end, as the original mail said, I may conclude _this_ approach is
unworkable and come up with an alternative that's more conventional.
BUT. Doing it this way uses ~30x less kernel-allocated memory. I tried a
heavy-load case and the saving was very substantial. That's not to be sniffed at.
In any case, all of this is going to be _very_ driven by metrics. How slow is
it, how much overhead does it actually produce, is it workable, are the
trade-offs right, etc.
It's an exploration rather than a fait accompli.
Cheers, Lorenzo
* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
2026-04-02 12:20 ` Lorenzo Stoakes (Oracle)
@ 2026-04-02 21:49 ` Barry Song
2026-05-04 8:10 ` Lorenzo Stoakes
0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2026-04-02 21:49 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett,
Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts,
Harry Yoo, Rik van Riel, Jann Horn, Chris Li
On Thu, Apr 2, 2026 at 8:20 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Thu, Apr 02, 2026 at 05:03:42AM +0800, Barry Song wrote:
> > On Wed, Apr 1, 2026 at 4:43 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > >
> > > On Wed, Apr 01, 2026 at 07:30:41AM +0800, Barry Song wrote:
> > > > Hi Lorenzo,
> > > >
> > > > Thank you very much for bringing this up for discussion.
> > > >
> > > > On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > > > >
> > > > > [sorry subject line was typo'd, resending with correct subject line for
> > > > > visibility. Original at
> > > > > https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/]
> > > > >
> > > > > Currently we track the reverse mapping between folios and VMAs at a VMA level,
> > > > > utilising a complicated and confusing combination of anon_vma objects and
> > > > > anon_vma_chain's linking them, which must be updated when VMAs are split,
> > > > > merged, remapped or forked.
> > > > >
> > > > > It's further complicated by various optimisations intended to avoid scalability
> > > > > issues in locking and memory allocation.
> > > > >
> > > > > I have done recent work to improve the situation [0] which has also led to a
> > > > > reported improvement in lock scalability [1], but fundamentally the situation
> > > > > remains the same.
> > > > >
> > > > > The logic is actually, when you think hard enough about it, a fairly
> > > > > reasonable means of implementing the reverse mapping at a VMA level.
> > > > >
> > > > > It is, however, a very broken abstraction as it stands. In order to work with
> > > > > the logic, you have to essentially keep a broad understanding of the entire
> > > > > implementation in your head at one time - that is, not much is really
> > > > > abstracted.
> > > > >
> > > > > This results in confusion, mistakes, and bit rot. It's also very time-consuming
> > > > > to work with - personally I've gone to the lengths of writing a private set of
> > > > > slides for myself on the topic as a reminder each time I come back to it.
> > > > >
> > > > > There are also issues with lock scalability - the use of interval trees to
> > > > > maintain a connection between an anon_vma and AVCs connected to VMAs requires
> > > > > that a lock must be held across the entire 'CoW hierarchy' of parent and child
> > > > > VMAs whenever performing an rmap walk or performing a merge, split, remap or
> > > > > fork.
> > > > >
> > > > > This is because we tear down all interval tree mappings and reestablish them
> > > > > each time we might see changes in VMA geometry. This is an issue Barry Song
> > > > > identified as problematic in a real world use case [2].
> > > > >
> > > > > So what do we do to improve the situation?
> > > > >
> > > > > Recently I have been working on an experimental new approach to the anonymous
> > > > > reverse mapping, in which we instead track anonymous remaps, and then use the
> > > > > VMA's virtual page offset to locate VMAs from the folio.
> > > >
> > > > Please forgive my confusion. I’m still struggling to fully
> > > > understand your approach of “tracking anonymous remaps.”
> > > > Could you provide a concrete example to illustrate how it works?
> > >
> > > I should really put this code somewhere :)
> > >
> > > >
> > > > For example, if A forks B, and then B forks C, how do we
> > > > determine the VMAs for a folio from the original A that has
> > > > not yet been COWed in B or C?
> > >
> > > The folio references the cow_context associated with the mm in A.
> > >
> > > So mm has a new cow_context field that points to cow_context, and the
> > > cow_context can outlive the mm if it has children.
> >
> > So we can’t use list_for_each_entry_rcu(child, &parent->children, sibling)
> > because in vfork() and exec() cases the mm_struct is not inherited?
>
> Umm, memory is not preserved across an exec() :) so it works fine with that.
>
> vfork() is CLONE_VM so the mm is shared and everything works fine.
My question is whether we can reuse the process tree, similar to
walk_tg_tree_from(). With some flags in mm_struct, it might be
possible to distinguish whether an mm_struct was copied from the
parent or created by a new exec.
>
> >
> > >
> > > Each cow context tracks its forked children also, so an rmap search will
> > > traverse A, B, C.
> >
> > I still don’t understand how we can get a folio’s VMA from the folio itself.
> > For anonymous VMAs, vma->vm_pgoff is always zero, right?
>
> No, not at all.
>
> vma->vm_pgoff is equal to vma->vm_start >> PAGE_SHIFT when first faulted in for
> anon.
>
> That reduces the problem to tracking remaps, which I do.
>
> >
> > Are you changing vm_pgoff to a value equal to vm_start >> PAGE_SHIFT?
>
> No, that's how anon works already.
Sorry for my mistake. I was somehow reading incorrect information from
/proc/<pid>/maps, where anonymous VMAs always appeared as zero.
A simple patch like the one below proves that you are absolutely right:
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 33e5094a7842..0cecff1c6307 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -475,9 +475,9 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 		dev = inode->i_sb->s_dev;
 		ino = inode->i_ino;
-		pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
+		//pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
 	}
-
+	pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
 	start = vma->vm_start;
 	end = vma->vm_end;
 	show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino);
>
> >
> > In case A forks B, and B unmaps a VMA then maps a new
> > VMA at the same address as before, what happens? Will the
> > traversal find the new VMA, which doesn’t actually map the folio?
>
> Well you're missing stuff there, the folio would have to be non-anon exclusive
> (which is rare). Yes it'd find the new VMA, then traverse, and find the folio
> does not match, and traverse children.
>
> rmap walks _always_ allow for you walking VMAs that a folio does not belong
> to.
I understand that we can check whether the folio belongs to the new
VMA, but I’m curious whether this will occur more frequently in practice
after the change. In the rmap case, I assume the original A’s folio
anon_vma would be detached from process B once B unmaps and then maps
a new VMA, so we wouldn’t search B anymore—is that correct?
>
> For instance, with anon_vma, if you CoW a bunch of folios to child process VMAs,
> the non-CoW'd folio will _still_ traverse all of that uselessly.
>
> In any case, this isn't a common case.
>
> However note that if a folio _becomes_ anon exclusive, it switches its 'root'
> cow context to the one associated with the mm which it became exclusive to.
>
Agreed. I’m curious about the case of A’s folio, whose VMA has been
completely replaced in B after the unmap and map. In the old anon_vma
case, we wouldn’t search B anymore, but now we’ll need to check B's
vm_pgoff since it covers the folio’s address—is that correct?
> >
> > >
> > > >
> > > > Additionally, if B COWs and obtains a new folio before forking
> > > > C, how do we determine its VMAs in B and C?
> > >
> > > The new folio would point to B's cow context, and it'd traverse B and C to find
> > > relevant folios.
> > >
> > > Overall we pay a higher search price (though arguably, not too bad still) but
> > > get to do it _all_ under RCU.
> >
> > Yep. I see that list_for_each_entry_rcu(child, &parent->children, sibling)
> > can work safely under RCU.
> >
> > >
> > > In exchange, we avoid the locking issues and use ~30x less memory.
> > >
> > > (Of course I am yet to solve rmap lock stabilisation so got to try and do that
> > > first :)
> > >
> > > >
> > > > Also, what happens if C performs a remap on the inherited VMA
> > > > in the two cases described above?
> > >
> > > Remaps are tracked within cow_context's via an extended maple tree (currently
> > > maple tree -> dynamic arrays) that also handles multiple entries and overlaps.
> >
> > If we have multiple remaps for multiple VMAs within one mm_struct,
> > will we end up traversing all the dynamic arrays for any folio that
> > might be located in a VMA that has been remapped?
>
> Yup. But there aren't all that many, and it's all under RCU so :)
>
> That part of the search should be quick, parts of the search involving page
> tables, less so.
>
> Also I need to figure out how to maintain stabilisation without an rmap lock, an
> ongoing open problem in all this.
>
> In the end, as the original mail said, I may conclude _this_ approach is
> unworkable and come up with an alternative that's more conventional.
I’m genuinely interested in the new approach. If you have the code, I’d be
happy to read, test, and work on it.
>
> BUT. Doing it this way saves 30x the amount of kernel allocated memory. I tried
> a heavy load case and it was very substantial. That's not to be sniffed at.
>
> In any case, all of this is going to be _very_ driven by metrics. How slow is
> it, how much overhead does it actually produce, is it workable, are the
> trade-offs right, etc.
>
> It's an exploration rather than a fait accompli.
Right now, I’m still at the stage of trying to understand the details of
your new approach and would like to learn more—so I might have quite a
few naive questions :-)
Thanks
Barry
* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
2026-03-30 21:23 [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND] Lorenzo Stoakes (Oracle)
2026-03-31 23:30 ` Barry Song
@ 2026-05-02 6:53 ` Lorenzo Stoakes
2026-05-03 18:26 ` Rik van Riel
1 sibling, 1 reply; 10+ messages in thread
From: Lorenzo Stoakes @ 2026-05-02 6:53 UTC (permalink / raw)
To: lsf-pc
Cc: linux-mm, David Hildenbrand, Liam R. Howlett, Vlastimil Babka,
Suren Baghdasaryan, Pedro Falcato, Ryan Roberts, Harry Yoo,
Rik van Riel, Jann Horn, Chris Li, Barry Song
As is time-honoured LSF tradition, I am sharing code for my proposal.
I worked a very long day yesterday and got the _very_ rough PoC code into
some kind of vaguely shareable state.
https://git.kernel.org/pub/scm/linux/kernel/git/ljs/linux.git/log/?h=project/cow-context
CAVEATS:
* The code is not great, it's 'experimental, wave your arms, hope for the
best' stuff used for experimentation.
* I know the dynamic array implementation is probably entirely broken from
a concurrency point of view, inefficient (an n gets squared, *gasp*!),
etc. etc. - it is _not_ what I am proposing to actually do in any even
RFC of this code, it's just for PoC purposes.
* By default it runs CoW context alongside anon_vma, and will pr_err() if
there are mismatches between the two.
* However you can enable 'pure' CoW context mode via
CONFIG_COW_CONTEXT_ANON_RMAP.
* This is, as the talk will cover, currently broken for migration, not
because of bugs etc. but because I've not decided on the synchronisation
method yet (_everything_ is RCU in this mode).
The kernel boots in either mode :)
Obviously this is going to go through a lot more changes before any RFC,
but wanted to get this code out there in the 'discussion topic at LSF, have
code for it' tradition.
See all those who are attending in Zagreb! :)
Cheers, Lorenzo
On Mon, Mar 30, 2026 at 10:23:57PM +0100, Lorenzo Stoakes (Oracle) wrote:
> [sorry subject line was typo'd, resending with correct subject line for
> visibility. Original at
> https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/]
>
> Currently we track the reverse mapping between folios and VMAs at a VMA level,
> utilising a complicated and confusing combination of anon_vma objects and
> anon_vma_chain's linking them, which must be updated when VMAs are split,
> merged, remapped or forked.
>
> It's further complicated by various optimisations intended to avoid scalability
> issues in locking and memory allocation.
>
> I have done recent work to improve the situation [0] which has also led to a
> reported improvement in lock scalability [1], but fundamentally the situation
> remains the same.
>
> The logic, when you think hard enough about it, is actually a fairly
> reasonable means of implementing the reverse mapping at a VMA level.
>
> It is, however, a very broken abstraction as it stands. In order to work with
> the logic, you have to essentially keep a broad understanding of the entire
> implementation in your head at one time - that is, not much is really
> abstracted.
>
> This results in confusion, mistakes, and bit rot. It's also very time-consuming
> to work with - personally I've gone to the lengths of writing a private set of
> slides for myself on the topic as a reminder each time I come back to it.
>
> There are also issues with lock scalability - the use of interval trees to
> maintain a connection between an anon_vma and AVCs connected to VMAs requires
> that a lock must be held across the entire 'CoW hierarchy' of parent and child
> VMAs whenever performing an rmap walk or performing a merge, split, remap or
> fork.
>
> This is because we tear down all interval tree mappings and reestablish them
> each time we might see changes in VMA geometry. This is an issue Barry Song
> identified as problematic in a real world use case [2].
>
> So what do we do to improve the situation?
>
> Recently I have been working on an experimental new approach to the anonymous
> reverse mapping, in which we instead track anonymous remaps, and then use the
> VMA's virtual page offset to locate VMAs from the folio.
>
> I have got the implementation working to the point where it tracks the exact
> same VMAs as the anon_vma implementation, and it seems a lot of it can be done
> under RCU.
>
> It avoids the need to maintain expensive mappings at a VMA level, though it
> incurs a cost in tracking remaps, and MAP_PRIVATE files are very much a TODO
> (they maintain a file vma->vm_pgoff, even when CoW'd, so the remap tracking is
> pretty sub-optimal).
>
> I am investigating whether I can change how MAP_PRIVATE file-backed mappings
> work to avoid this issue, and will be developing tests to see how lock
> scalability, throughput and memory usage compare to the anon_vma approach under
> different workloads.
>
> This experiment may or may not work out, either way it will be interesting to
> discuss it.
>
> By the time LSF/MM comes around I may even have already decided on a different
> approach but that's what makes things interesting :)
>
> [0]:https://lore.kernel.org/all/cover.1767711638.git.lorenzo.stoakes@oracle.com/
> [1]:https://lore.kernel.org/all/202602061747.855f053f-lkp@intel.com/
> [2]:https://lore.kernel.org/linux-mm/CAGsJ_4x=YsQR=nNcHA-q=0vg0b7ok=81C_qQqKmoJ+BZ+HVduQ@mail.gmail.com/
>
> Cheers, Lorenzo
* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
2026-05-02 6:53 ` Lorenzo Stoakes
@ 2026-05-03 18:26 ` Rik van Riel
2026-05-04 8:01 ` Lorenzo Stoakes
0 siblings, 1 reply; 10+ messages in thread
From: Rik van Riel @ 2026-05-03 18:26 UTC (permalink / raw)
To: Lorenzo Stoakes, lsf-pc
Cc: linux-mm, David Hildenbrand, Liam R. Howlett, Vlastimil Babka,
Suren Baghdasaryan, Pedro Falcato, Ryan Roberts, Harry Yoo,
Jann Horn, Chris Li, Barry Song
On Sat, 2026-05-02 at 07:53 +0100, Lorenzo Stoakes wrote:
> As is time-honoured LSF tradition, I am sharing code for my proposal.
>
> I worked a very long day yesterday and got the _very_ rough PoC code
> into
> some kind of vaguely shareable state.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ljs/linux.git/log/?h=project/cow-context
>
> CAVEATS:
>
> * The code is not great, it's 'experimental, wave your arms, hope for
> the
> best' stuff used for experimentation.
First, some refcounting that confuses me.
The changelog, and the code in dup_cow_context
shows that only the parent's cow context gets
an increased refcount.
However, the code in __put_cow_context seems
to unconditionally decrement refcounts all up
the hierarchy, instead of bailing out once it
encounters a parent that still has a non-zero
refcount.
How is that supposed to work?
Now, having the remaps array cloned at fork
time does make the refcounting on that side
a lot simpler. I like that.
However, it does raise another question.
Say we have process A, with child process B.
Process A has memory mapped at address X.
Process B munmaps memory at address X, and
then maps new memory at address X.
If I haven't missed something important, the
remap table does not need to get used, because
the offset and the virtual address match.
How does the COW walk handle that situation?
Overall, I like that you are trying to tackle
the problems associated with anon_vma, but
have to wonder if this implementation will
be able to avoid some of the complexity
inherent in the problem space.
--
All Rights Reversed.
* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
2026-05-03 18:26 ` Rik van Riel
@ 2026-05-04 8:01 ` Lorenzo Stoakes
0 siblings, 0 replies; 10+ messages in thread
From: Lorenzo Stoakes @ 2026-05-04 8:01 UTC (permalink / raw)
To: Rik van Riel
Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett,
Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts,
Harry Yoo, Jann Horn, Chris Li, Barry Song
On Sun, May 03, 2026 at 02:26:36PM -0400, Rik van Riel wrote:
> On Sat, 2026-05-02 at 07:53 +0100, Lorenzo Stoakes wrote:
> > As is time-honoured LSF tradition, I am sharing code for my proposal.
> >
> > I worked a very long day yesterday and got the _very_ rough PoC code
> > into
> > some kind of vaguely shareable state.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/ljs/linux.git/log/?h=project/cow-context
> >
> > CAVEATS:
> >
> > * The code is not great, it's 'experimental, wave your arms, hope for
> > the
> > best' stuff used for experimentation.
>
> First, some refcounting that confuses me.
>
> The changelog, and the code in dup_cow_context
> shows that only the parent's cow context gets
> an increased refcount.
>
> However, the code in __put_cow_context seems
> to unconditionally decrement refcounts all up
> the hierarchy, instead of bailing out once it
> encounters a parent that still has a non-zero
> refcount.
>
> How is that supposed to work?
Ah no it does bail out :)
This code is very much PoC so not perhaps ideally clear :)
So __put_cow_context() calls delete_child_from_parent():
void __put_cow_context(struct cow_context *context)
{
	...
	for (curr = context; curr; curr = parent) {
		...
		parent = delete_child_from_parent(curr);
		...
	}
}

static struct cow_context *delete_child_from_parent(struct cow_context *context)
{
	...
	struct cow_context *parent = context->parent;

	if (!parent)
		return NULL;
	...
	if (!refcount_dec_and_test(&parent->refcnt))
		return NULL;

	/*
	 * Only if the refcount drops to 0 do we propagate (because the
	 * parent being dropped drops a ref from its parent).
	 */
	return parent;
}
>
> Now, having the remaps array cloned at fork
> time does make the refcounting on that side
> a lot simpler. I like that.
Thanks :)
>
> However, it does raise another question.
>
> Say we have process A, with child process B.
>
> Process A has memory mapped at address X.
>
> Process B munmaps memory at address X, and
> then maps new memory at address X.
>
> If I haven't missed something important, the
> remap table does not need to get used, because
> the offset and the virtual address match.
>
> How does the COW walk handle that situation?
So in the example given the folio would become AnonExclusive (or rather
!folio_maybe_mapped_shared(folio)) so would only walk process A.
If you unmapped in the parent, the folio would become AnonExclusive and
then get moved to process B's mm's cow context.
So it'd work perfectly fine and be efficient in that case.
But in an example where say process C also forks so the folio remains
shared, we would end up doing a useless walk into process B, then find
either that there isn't a folio there or that the folio was unrelated.
In my slides (I will put them somewhere after LSF) I argue that these kinds
of situations are likely to be the minority, because most memory is
AnonExclusive and that which isn't largely remains untouched (i.e. all the
walks would be valid).
There are also cases where anon_vma can do useless walks (anon_vma operates
at the mapping granularity, whereas a folio might or might not still be mapped
at lower levels, assuming it hasn't been moved by folio_move_anon_rmap()).
>
> Overall, I like that you are trying to tackle
> the problems associated with anon_vma, but
> have to wonder if this implementation will
> be able to avoid some of the complexity
> inherent in the problem space.
Thanks :) And yeah I think unavoidably there will be difficult corner
cases. I also don't suggest that this approach is necessarily going to be
the one that ultimately works, there's a HUGE TODO left on stabilisation
(esp. in the migration case), and I plan to do a lot of testing around
latency and edge cases to really exercise it.
However, no matter the outcome, this should give us insights into the anon
rmap (and the testing work will provide a good testbed too) - so either
way, I am determined to improve the anon rmap even if I need to look to
another approach.
>
> --
> All Rights Reversed.
Cheers, Lorenzo
* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
2026-04-02 21:49 ` Barry Song
@ 2026-05-04 8:10 ` Lorenzo Stoakes
0 siblings, 0 replies; 10+ messages in thread
From: Lorenzo Stoakes @ 2026-05-04 8:10 UTC (permalink / raw)
To: Barry Song
Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett,
Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts,
Harry Yoo, Rik van Riel, Jann Horn, Chris Li
Sorry my email is a mess lately, finally catching up after a month or so...
On Fri, Apr 03, 2026 at 05:49:10AM +0800, Barry Song wrote:
> >
> > Umm, memory is not preserved across an exec() :) so it works fine with that.
> >
> > vfork() is CLONE_VM so the mm is shared and everything works fine.
>
> My question is whether we can reuse the process tree, similar to
> walk_tg_tree_from(). With some flags in mm_struct, it might be
That's an interesting bit of code thanks for pointing me at that :)
> possible to distinguish whether an mm_struct was copied from the
> parent or created by a new exec.
In the new exec case there's no copying right? You're always overwriting the mm?
> > Well you're missing stuff there, the folio would have to be non-anon exclusive
> > (which is rare). Yes it'd find the new VMA, then traverse, and find the folio
> > does not match, and traverse children.
> >
> > rmap walks _always_ allow for you walking VMAs that a folio does not belong
> > to.
>
> I understand that we can check whether the folio belongs to the new
> VMA, but I’m curious whether this will occur more frequently in practice
> after the change. In the rmap case, I assume the original A’s folio
> anon_vma would be detached from process B once B unmaps and then maps
> a new VMA, so we wouldn’t search B anymore—is that correct?
In both cases folio_move_anon_rmap() changes folio->mapping. In the anon_vma
case, it moves it to the 'leaf' anon_vma, in the Cow context case it moves it to
the leaf cow context.
For CoW context we stop looking past the first CoW context if
!folio_maybe_mapped_shared().
So the usual situation will incur just the same amount of walking (but some edge
cases might be slower yes)
>
> >
> > For instance, with anon_vma, if you CoW a bunch of folios to child process VMAs,
> > the non-CoW'd folio will _still_ traverse all of that uselessly.
> >
> > In any case, this isn't a common case.
> >
> > However note that if a folio _becomes_ anon exclusive, it switches its 'root'
> > cow context to the one associated with the mm which it became exclusive to.
> >
>
> Agreed. I’m curious about the case of A’s folio, whose VMA has been
> completely replaced in B after the unmap and map. In the old anon_vma
> case, we wouldn’t search B anymore, but now we’ll need to check B's
> vm_pgoff since it covers the folio’s address—is that correct?
It's anon exclusive so we wouldn't bother looking past the first CoW context
level.
If it was shared we might do a useless walk, yes. But again I think it's a rare case.
> > > If we have multiple remaps for multiple VMAs within one mm_struct,
> > > will we end up traversing all the dynamic arrays for any folio that
> > > might be located in a VMA that has been remapped?
> >
> > Yup. But there aren't all that many, and it's all under RCU so :)
> >
> > That part of the search should be quick, parts of the search involving page
> > tables, less so.
> >
> > Also I need to figure out how to maintain stabilisation without an rmap lock, an
> > ongoing open problem in all this.
> >
> > In the end, as the original mail said, I may conclude _this_ approach is
> > unworkable and come up with an alternative that's more conventional.
>
> I’m genuinely interested in the new approach. If you have the code, I’d be
> happy to read, test, and work on it.
I've posted on the thread, but it's very much a proof of concept and
stabilisation is currently broken so it's not in a testable state YET. But you
can see the rough shape of it now:
https://git.kernel.org/pub/scm/linux/kernel/git/ljs/linux.git?h=project%2Fcow-context
>
> >
> > BUT. Doing it this way saves 30x the amount of kernel allocated memory. I tried
> > a heavy load case and it was very substantial. That's not to be sniffed at.
> >
> > In any case, all of this is going to be _very_ driven by metrics. How slow is
> > it, how much overhead does it actually produce, is it workable, are the
> > trade-offs right, etc.
> >
> > It's an exploration rather than a fait accompli.
>
> Right now, I’m still at the stage of trying to understand the details of
> your new approach and would like to learn more—so I might have quite a
> few naive questions :-)
No problem, you will never ask anything more naive than what I might ask, and I
may very well have made some very naive mistakes, so :) healthy to discuss it I
think!
>
> Thanks
> Barry
Cheers, Lorenzo