From: Lorenzo Stoakes <ljs@kernel.org>
To: wangtao <tao.wangtao@honor.com>
Cc: Harry Yoo <harry@kernel.org>,
"catalin.marinas@arm.com" <catalin.marinas@arm.com>,
"will@kernel.org" <will@kernel.org>,
"tglx@kernel.org" <tglx@kernel.org>,
"mingo@redhat.com" <mingo@redhat.com>,
"bp@alien8.de" <bp@alien8.de>,
"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
"x86@kernel.org" <x86@kernel.org>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"david@kernel.org" <david@kernel.org>,
"willy@infradead.org" <willy@infradead.org>,
"sj@kernel.org" <sj@kernel.org>,
"kees@kernel.org" <kees@kernel.org>,
"luizcap@redhat.com" <luizcap@redhat.com>,
"zhangjiao2@cmss.chinamobile.com"
<zhangjiao2@cmss.chinamobile.com>,
"kas@kernel.org" <kas@kernel.org>,
"hpa@zytor.com" <hpa@zytor.com>,
"liam@infradead.org" <liam@infradead.org>,
"vbabka@kernel.org" <vbabka@kernel.org>,
"rppt@kernel.org" <rppt@kernel.org>,
"surenb@google.com" <surenb@google.com>,
"mhocko@suse.com" <mhocko@suse.com>,
"jack@suse.cz" <jack@suse.cz>,
"riel@surriel.com" <riel@surriel.com>,
"jannh@google.com" <jannh@google.com>,
"jgg@ziepe.ca" <jgg@ziepe.ca>,
"jhubbard@nvidia.com" <jhubbard@nvidia.com>,
"peterx@redhat.com" <peterx@redhat.com>,
"ziy@nvidia.com" <ziy@nvidia.com>,
"baolin.wang@linux.alibaba.com" <baolin.wang@linux.alibaba.com>,
"npache@redhat.com" <npache@redhat.com>,
"ryan.roberts@arm.com" <ryan.roberts@arm.com>,
"dev.jain@arm.com" <dev.jain@arm.com>,
"baohua@kernel.org" <baohua@kernel.org>,
"lance.yang@linux.dev" <lance.yang@linux.dev>,
"xu.xin16@zte.com.cn" <xu.xin16@zte.com.cn>,
"chengming.zhou@linux.dev" <chengming.zhou@linux.dev>,
"nao.horiguchi@gmail.com" <nao.horiguchi@gmail.com>,
"matthew.brost@intel.com" <matthew.brost@intel.com>,
"joshua.hahnjy@gmail.com" <joshua.hahnjy@gmail.com>,
"rakie.kim@sk.com" <rakie.kim@sk.com>,
"byungchul@sk.com" <byungchul@sk.com>,
"gourry@gourry.net" <gourry@gourry.net>,
"ying.huang@linux.alibaba.com" <ying.huang@linux.alibaba.com>,
"apopple@nvidia.com" <apopple@nvidia.com>,
"pfalcato@suse.de" <pfalcato@suse.de>,
"linux-arm-kernel@lists.infradead.org"
<linux-arm-kernel@lists.infradead.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"damon@lists.linux.dev" <damon@lists.linux.dev>,
"shakeel.butt@linux.dev" <shakeel.butt@linux.dev>,
"ryncsn@gmail.com" <ryncsn@gmail.com>,
"21cnbao@gmail.com" <21cnbao@gmail.com>,
"jparsana@google.com" <jparsana@google.com>,
"dvander@google.com" <dvander@google.com>,
zhangji <zhangji1@honor.com>,
wangzicheng <wangzicheng@honor.com>
Subject: Re: [PATCH 0/15] mm: introduce ANON_VMA_LAZY for deferred anon_vma creation
Date: Wed, 3 Jun 2026 08:54:08 +0100
Message-ID: <ah_VMf0ZJTRsrArV@lucifer>
In-Reply-To: <7319ad82f9ee4fc4b18b50b1842c9f99@honor.com>
On Wed, Jun 03, 2026 at 02:59:04AM +0000, wangtao wrote:
> > On 5/27/26 8:01 PM, tao wrote:
> > > Design overview
> > > ---------------
> > >
> > > ANON_VMA_LAZY defers anon_vma allocation until it is actually needed
> > > (for example during fork). VMAs that never participate in sharing can
> > > avoid creating anon_vma structures entirely.
> > >
> > > Before an anon_vma exists, rmap operations rely directly on VMA
> > > information, so no anon_vma locking is required. An anon_vma is
> > > created and linked only when sharing semantics are required.
> >
> > It is unfortunate that the design overview doesn't cover the correctness
> > aspect at all. VMAs are subject to change (even before being shared with
> > other processes), and rmap needs something that doesn't go away across VMA
> > merging, split, etc.
> >
> > I'm not sure how the idea is supposed to work correctly.
> >
> > --
> > Cheers,
> > Harry / Hyeonggon
>
Against my better judgment I'll address the stuff here...
> VMA operations can be roughly divided into three categories. The handling
> of ANON_VMA_LAZY is briefly described below.
I don't agree, there are plenty more VMA operations. But with respect to anon
rmap there are:
- fork
- merge/split
- remap
Your approach seems to completely ignore VMA split and the need to maintain
an interval tree to _multiple_ VMAs from a single anon_vma.
You may also actually split a VMA against a single large folio (waiting on
the deferred shrinker) and have a SINGLE _leaf_ anonymous folio that is
mapped in two places.
The lazy approach doesn't seem to address this properly. And fatally, as far
as I can tell, it ties an actual VMA to the folio and has to implement a VMA
reference count mechanism, which interferes with the ordinary VMA lifecycle,
to do it.
Taking advantage of the fact that most stuff is AnonExclusive, i.e.
'leaves', is exactly what my approach takes into account.
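To make the large folio case above concrete, here is a minimal userspace
sketch (not kernel code). struct toy_vma, toy_folio_overlaps() and
toy_count_overlapping_vmas() are made-up names for illustration, and the
sketch only compares page-offset ranges; it does not model whether PTEs are
actually present:

```c
#include <stdbool.h>

#define TOY_PAGE_SHIFT 12

/* Toy stand-in for the vm_area_struct fields that matter here. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

/* Does [folio_pgoff, folio_pgoff + nr_pages) intersect the VMA's page range? */
static bool toy_folio_overlaps(const struct toy_vma *vma,
			       unsigned long folio_pgoff,
			       unsigned long nr_pages)
{
	unsigned long vma_pages = (vma->vm_end - vma->vm_start) >> TOY_PAGE_SHIFT;
	unsigned long vma_first = vma->vm_pgoff;
	unsigned long vma_last = vma_first + vma_pages;	/* exclusive */

	return folio_pgoff < vma_last && vma_first < folio_pgoff + nr_pages;
}

/* Count how many of the given VMAs a single folio's page range touches. */
static unsigned int toy_count_overlapping_vmas(const struct toy_vma *vmas,
					       unsigned int nr_vmas,
					       unsigned long folio_pgoff,
					       unsigned long nr_pages)
{
	unsigned int i, count = 0;

	for (i = 0; i < nr_vmas; i++)
		if (toy_folio_overlaps(&vmas[i], folio_pgoff, nr_pages))
			count++;
	return count;
}
```

With a 16-page folio at page offset 0x200 and a VMA split in the middle of
it, both halves overlap the same folio, so a single VMA cached against that
folio cannot describe every place it is mapped.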
Of course also extending anon_vma is a real non-starter.
Also, both the description below and the series ignore MAP_PRIVATE
file-backed mappings, which is a pretty fatal flaw.
It also, as Harry says, has zero description of correctness in a way we'd
want and no tests.
>
> 1. fork
>
> fork duplicates the parent's mm/mmap. (exec creates a new mm/mmap and is
> not involved here.) This can be viewed as copying the VMAs with identical
> virtual addresses into a new address space.
>
> If the parent VMA (pvma) is ANON_VMA_LAZY, it is first upgraded to a
> regular anon_vma. The corresponding folio->mapping is then fixed in
> try_dup_anon_rmap().
And so we make fork, a very sensitive path in the kernel, more expensive.
I also question the locking situation with the conversion mentioned;
updating folios in this manner is extremely difficult.
>
> 2. mmap / brk / mprotect / munmap
>
> These operations create, modify, or remove VMAs in the current mm. They
> may split existing VMAs, merge adjacent VMAs, or remove a VMA from mm_mt.
mmap and brk are not at all relevant to anon_vma, as no anon_vma is
assigned upon mapping. It's on fault.
mprotect/mlock/munmap/etc. might split, but I don't see how the lazy
approach in any way addresses any of that.
>
> When a new VMA is created, vm_start, vm_end and vm_pgoff are initialized
> and the VMA is inserted into mm_mt. Although these fields may later be
> modified, the following value remains invariant:
>
> (vm_start - vm_pgoff * PAGE_SIZE)
Err no it doesn't at all?
If I fault in a VMA at vm_start, vm_pgoff = vm_start >> PAGE_SHIFT.
Then if I remap it, vm_start changes, vm_pgoff stays the same, so:
vm_start - vm_pgoff * PAGE_SIZE
Changes right? And then that becomes essentially the offset from where it
was faulted in.
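A minimal userspace sketch of that arithmetic (not kernel code). struct
toy_vma and the toy_*() helpers are made up for illustration, and
toy_move() only models the behaviour described above, i.e. the address
range changes while vm_pgoff is kept:

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Toy stand-in for the three vm_area_struct fields discussed here. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

/* The quantity the quoted text calls vma_mapping_base(). */
static long toy_mapping_base(const struct toy_vma *vma)
{
	return (long)vma->vm_start - (long)(vma->vm_pgoff << TOY_PAGE_SHIFT);
}

/* Anonymous VMA as faulted in: vm_pgoff = vm_start >> PAGE_SHIFT. */
static struct toy_vma toy_anon_vma(unsigned long start, unsigned long len)
{
	struct toy_vma vma = { start, start + len, start >> TOY_PAGE_SHIFT };

	return vma;
}

/* Model of a move: vm_start/vm_end change, vm_pgoff stays the same. */
static struct toy_vma toy_move(struct toy_vma vma, unsigned long new_start)
{
	unsigned long len = vma.vm_end - vma.vm_start;

	assert((new_start & (TOY_PAGE_SIZE - 1)) == 0);
	vma.vm_start = new_start;
	vma.vm_end = new_start + len;
	return vma;
}
```

For a VMA faulted in at 0x200000 the base is 0; after toy_move() to
0x600000 the base is 0x400000, which is exactly the distance moved.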
>
> We refer to this value as:
>
> vma_mapping_base(vma) = vma->vm_start - vma->vm_pgoff * PAGE_SIZE
This is mysteriously close to being the offset I mention in my CoW context
work...
I'm not sure what 'mapping base' means here.
>
> This value also remains unchanged when the VMA is removed from mm_mt.
Why does it matter what this value is on unmap?
>
> If a VMA is split and produces new_vma, the following holds:
>
> vma_mapping_base(new_vma) == vma_mapping_base(vma)
This is a roundabout way of saying we offset the vma->vm_pgoff after split.
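As a rough userspace sketch of why that holds (not kernel code; struct
toy_vma, toy_mapping_base() and toy_split() are hypothetical names used only
for illustration): the new VMA's vm_pgoff is advanced by the number of pages
skipped, so the subtraction cancels out.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Toy stand-in for the three vm_area_struct fields discussed here. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

/* The quantity the quoted text calls vma_mapping_base(). */
static long toy_mapping_base(const struct toy_vma *vma)
{
	return (long)vma->vm_start - (long)(vma->vm_pgoff << TOY_PAGE_SHIFT);
}

/*
 * Split *vma at addr: *vma keeps [vm_start, addr) and the returned VMA
 * covers [addr, vm_end), with vm_pgoff advanced by the pages skipped.
 */
static struct toy_vma toy_split(struct toy_vma *vma, unsigned long addr)
{
	struct toy_vma new_vma;

	assert(addr > vma->vm_start && addr < vma->vm_end);
	assert((addr & (TOY_PAGE_SIZE - 1)) == 0);

	new_vma.vm_start = addr;
	new_vma.vm_end = vma->vm_end;
	new_vma.vm_pgoff = vma->vm_pgoff +
			   ((addr - vma->vm_start) >> TOY_PAGE_SHIFT);
	vma->vm_end = addr;
	return new_vma;
}
```

Splitting { 0x20000, 0x25000, pgoff 0x10 } at 0x23000 gives the new VMA a
vm_pgoff of 0x13, and both halves compute the same base of 0x10000.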
>
> If two adjacent VMAs vma_a and vma_b are merged into vma_x, then:
>
> vma_mapping_base(vma_a) == vma_mapping_base(vma_b) ==
> vma_mapping_base(vma_x)
This is just a roundabout way of saying the pgoff has to be aligned.
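A simplified userspace sketch of that alignment condition (not kernel code;
toy_can_merge() is a hypothetical helper that checks only adjacency and
page-offset contiguity, and ignores every other property a real merge would
have to compare):

```c
#include <stdbool.h>

#define TOY_PAGE_SHIFT 12

/* Toy stand-in for the three vm_area_struct fields discussed here. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

/* The quantity the quoted text calls vma_mapping_base(). */
static long toy_mapping_base(const struct toy_vma *vma)
{
	return (long)vma->vm_start - (long)(vma->vm_pgoff << TOY_PAGE_SHIFT);
}

/*
 * a and b can only be merged if b starts where a ends, both in address
 * space and in page-offset space.
 */
static bool toy_can_merge(const struct toy_vma *a, const struct toy_vma *b)
{
	unsigned long a_pages = (a->vm_end - a->vm_start) >> TOY_PAGE_SHIFT;

	return a->vm_end == b->vm_start &&
	       a->vm_pgoff + a_pages == b->vm_pgoff;
}
```

For adjacent VMAs, "the pgoff is contiguous" and "the bases are equal" are
the same statement, which is why the quoted equality adds no new
information.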
>
> Assume the VMA where the first page fault occurs is called root_vma, and
> ensure that any VMA produced by split or merge holds a reference to
> root_vma.
But this VMA can be unmapped later? Or remapped?
Holding on to a VMA and treating it as some kind of canonical reference
with a reference count completely changes what VMAs are, impacts the VMA
lifecycle, and produces unwanted memory overhead in itself.
It also raises concerns and issues around lock order which is very
sensitive.
>
> During rmap we can compute the folio address using root_vma:
>
> vma_address(vma, pgoff, 1) =
What are the parameters here? What's the 1?
> vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
> = vma_mapping_base(vma) + pgoff * PAGE_SIZE
> = vma_mapping_base(root_vma) + folio_pgoff * PAGE_SIZE
>
> We can then use folio_addr to locate the VMA covering this folio.
I'm really confused by this, you're kind of mixing and matching parameters
here.
What I think you're saying is that, if a folio hasn't been remapped, you
can figure out its address based on page offset.
That's completely broken for MAP_PRIVATE file-backed mappings which also
use anon_vma and also have to keep on working.
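For the purely anonymous case, the address-from-page-offset computation
being discussed can be sketched in userspace roughly as follows (not kernel
code). toy_vma_address() and TOY_NO_ADDR are made-up names, the sketch
handles a single page only, and the third argument in the quoted call is
not modelled:

```c
#define TOY_PAGE_SHIFT 12
#define TOY_NO_ADDR    (~0UL)	/* hypothetical "not mapped here" marker */

/* Toy stand-in for the three vm_area_struct fields discussed here. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

/*
 * Translate a page offset into a virtual address within this VMA, or
 * return TOY_NO_ADDR if the offset falls outside the VMA.
 */
static unsigned long toy_vma_address(const struct toy_vma *vma,
				     unsigned long pgoff)
{
	unsigned long addr;

	if (pgoff < vma->vm_pgoff)
		return TOY_NO_ADDR;

	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << TOY_PAGE_SHIFT);
	if (addr >= vma->vm_end)
		return TOY_NO_ADDR;

	return addr;
}
```

The result depends on the vm_start and vm_pgoff of the VMA you pass in. For
{ 0x20000, 0x24000, pgoff 0x20 }, page offset 0x22 maps to 0x22000; if the
same range is moved to 0x50000 with vm_pgoff kept, it maps to 0x52000, so an
address derived from the pre-move VMA is stale.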
It seems that for the lazy approach what you are doing is essentially
caching the 'root' VMA in the folio. But this doesn't account for large
folios and split VMAs.
Even if you disabled it for those cases (which adds a ton of complexity in
itself), you then have issues with locking - you still have to take a lock
(which cannot be a VMA-level lock, as that results in lock inversion) even
on these leaf entries, or you break locking.
And we can't reasonably start pinning VMAs and using them as a sort of
proto cached thing on top of the existing anon_vma logic.
You also then need to, on remap, undo all this, which requires updating
folio->mapping on remap, something I tried doing previously myself, but
that's fraught with issues around lock inversion itself.
>
> 3. mremap / uffd_move
userfaultfd moving is not relevant as it actually updates the folio
correctly.
>
> If only the size changes and the start address remains the same, there
> is no impact.
>
> If the start address changes, the page is moved from (vma, addr) to
> (new_vma, new_addr). In this case:
>
> vma_mapping_base(new_vma) =
> vma_mapping_base(vma) + new_addr - old_addr
You say above that the mapping base never changes? But here it changes?
>
> We first upgrade the VMA, and then fix folio->mapping in move_ptes().
What's 'upgrading' a VMA? Do you mean converting the lazy anon_vma to a
'normal' one?
As above, this is fraught with lock inversion issues.
>
> If performance becomes a concern, ANON_VMA_LAZY can be enabled only for
> relatively small VMAs.
I think you've got serious correctness, lock management and complexity
issues, and it's all a non-starter as the costs far exceed the benefits.
This is one of the fundamental, frustrating aspects of the anon rmap - you
keep thinking that 'surely' you can do sensible thing X, but it turns out
you can't for various annoying reasons.
It's one of the reasons it's really fraught for somebody coming to make
changes, and one of the reasons why I am very keen on fundamentally
changing it.
And also on a not-wasting-time basis - I was already working in parallel on
a rework here, so I think the civil thing is to at least wait for my work
before issuing alternative solutions.
Thanks, Lorenzo