From: wangtao <tao.wangtao@honor.com>
To: Lorenzo Stoakes <ljs@kernel.org>
Cc: Harry Yoo <harry@kernel.org>,
"catalin.marinas@arm.com" <catalin.marinas@arm.com>,
"will@kernel.org" <will@kernel.org>,
"tglx@kernel.org" <tglx@kernel.org>,
"mingo@redhat.com" <mingo@redhat.com>,
"bp@alien8.de" <bp@alien8.de>,
"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
"x86@kernel.org" <x86@kernel.org>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"david@kernel.org" <david@kernel.org>,
"willy@infradead.org" <willy@infradead.org>,
"sj@kernel.org" <sj@kernel.org>,
"kees@kernel.org" <kees@kernel.org>,
"luizcap@redhat.com" <luizcap@redhat.com>,
"zhangjiao2@cmss.chinamobile.com"
<zhangjiao2@cmss.chinamobile.com>,
"kas@kernel.org" <kas@kernel.org>,
"hpa@zytor.com" <hpa@zytor.com>,
"liam@infradead.org" <liam@infradead.org>,
"vbabka@kernel.org" <vbabka@kernel.org>,
"rppt@kernel.org" <rppt@kernel.org>,
"surenb@google.com" <surenb@google.com>,
"mhocko@suse.com" <mhocko@suse.com>,
"jack@suse.cz" <jack@suse.cz>,
"riel@surriel.com" <riel@surriel.com>,
"jannh@google.com" <jannh@google.com>,
"jgg@ziepe.ca" <jgg@ziepe.ca>,
"jhubbard@nvidia.com" <jhubbard@nvidia.com>,
"peterx@redhat.com" <peterx@redhat.com>,
"ziy@nvidia.com" <ziy@nvidia.com>,
"baolin.wang@linux.alibaba.com" <baolin.wang@linux.alibaba.com>,
"npache@redhat.com" <npache@redhat.com>,
"ryan.roberts@arm.com" <ryan.roberts@arm.com>,
"dev.jain@arm.com" <dev.jain@arm.com>,
"baohua@kernel.org" <baohua@kernel.org>,
"lance.yang@linux.dev" <lance.yang@linux.dev>,
"xu.xin16@zte.com.cn" <xu.xin16@zte.com.cn>,
"chengming.zhou@linux.dev" <chengming.zhou@linux.dev>,
"nao.horiguchi@gmail.com" <nao.horiguchi@gmail.com>,
"matthew.brost@intel.com" <matthew.brost@intel.com>,
"joshua.hahnjy@gmail.com" <joshua.hahnjy@gmail.com>,
"rakie.kim@sk.com" <rakie.kim@sk.com>,
"byungchul@sk.com" <byungchul@sk.com>,
"gourry@gourry.net" <gourry@gourry.net>,
"ying.huang@linux.alibaba.com" <ying.huang@linux.alibaba.com>,
"apopple@nvidia.com" <apopple@nvidia.com>,
"pfalcato@suse.de" <pfalcato@suse.de>,
"linux-arm-kernel@lists.infradead.org"
<linux-arm-kernel@lists.infradead.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"damon@lists.linux.dev" <damon@lists.linux.dev>,
"shakeel.butt@linux.dev" <shakeel.butt@linux.dev>,
"ryncsn@gmail.com" <ryncsn@gmail.com>,
"21cnbao@gmail.com" <21cnbao@gmail.com>,
"jparsana@google.com" <jparsana@google.com>,
"dvander@google.com" <dvander@google.com>,
zhangji <zhangji1@honor.com>, wangzicheng <wangzicheng@honor.com>
Subject: RE: [PATCH 0/15] mm: introduce ANON_VMA_LAZY for deferred anon_vma creation
Date: Thu, 4 Jun 2026 03:50:54 +0000 [thread overview]
Message-ID: <a073c529df7841d98dcec9ddc3dad8bc@honor.com> (raw)
In-Reply-To: <aiAMHhq9QCp-z3V9@lucifer>
>
> Thanks for your replies, but I really have to stop doing deeper analyses like
> these for time management purposes.
Of course I will respond to technical discussions.
>
> I did this more so to make the point from [0] as to why, in lower trust
> environments, this is just not feasible.
>
> We could loop around for hours and hours and hours here.
>
> In general as before, even if all worked perfectly (I'm very much not at all
> convinced), extending anon_vma and pinning VMAs is simply a no-go for
> architectural and complexity reasons.
>
> I also find the locking story dubious and the lack of tests or anything
> corroborating correctness is additionally fatal.
>
During rmap, anon_vma provides a superset of VMAs. We first confirm
with vma_address(), and then in each rmap_one we further check whether
the VMA needs to be processed through page_vma_mapped_walk() and
check_pte().
The lazy VMA used by ANON_VMA_LAZY provides only one VMA: if there is
no fork or mremap, then this single VMA is sufficient. To avoid taking
the folio_lock during fork and mremap, after anon_walk_anon, if
folio->mapping is upgraded to anon_vma, we retry once.
If your concern is about the lack of locking during rmap, you could
also refer to folio_wait_table and add a set of anon_vma_locks. That
was how I handled it during my initial debugging. Later, after
reviewing the code flow, I found that the lock might not be necessary,
so I removed it.
> And finally, I was already working on a replacement for anon_vma, and the
> generally done thing in these situations is for my work to take precedence.
>
> So I'm going to bail out on futher deeper analyses here as otherwise I simply
> can't work on anything else :)
>
> Thanks, Lorenzo
>
> [0]:https://lore.kernel.org/all/ah887A5VkXOcmq-g@lucifer/
>
>
> On Wed, Jun 03, 2026 at 11:05:28AM +0000, wangtao wrote:
> > > >
> > >
> > > Against my better judgment I'll address the stuff here...
> > >
> > > > VMA operations can be roughly divided into three categories. The
> > > > handling of ANON_VMA_LAZY is briefly described below.
> > >
> > > I don't agree, there are plenty more VMA operations. But with
> > > respect to anon rmap there are:
> > >
> > > - fork
> > > - merge/split
> > > - remap
> > >
> >
> > Yes, these are the three categories. I originally intended to explain
> > them by classifying based on system calls; I should have used mremap
> instead of move_vma.
>
> I don't think you mentioned move_vma()? Maybe I missed it.
>
> The categorisation is most usefully based on callers of anon_vma_clone().
>
> >
> > 是的,是这三类,我本想从系统调用去分类说明,应该将move_vma
> 换成mremap的。
> >
> > > Your approach seems to completely ignore VMA split and the need to
> > > maintain an interval tree to _multiple_ VMAs from a single anon_vma.
> > >
> >
> > The folio uses vma->root_vma to compute folio_address. A VMA split
> > from it, vma_a, also uses vma_a->root_vma = vma->root_vma to compute
> folio_address.
> > During rmap, once folio_address is obtained, the VMA can be found
> > through mm_mt. Without fork, there is no need to maintain the interval
> tree.
>
> Well you need to search for every possible split VMA in mm_mt now, so you
> have to go page-by-page searching for each page for the rmap walked range.
>
ANON_VMA_LAZY has only one VMA. When I first looked at
rmap_walk_ksm, I also thought it would need to search page by page,
which seemed unacceptable. Later I realized that it only needs to
check whether this VMA falls within the rmap walk range.
@@ -3173,20 +3171,20 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
- anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
+ anon_rmap_foreach_vma(vma, vmac, anon_rmap,
0, ULONG_MAX) {
> You're also potentially racing against a remap, as you say below you don't
> folio lock on remap so concurrent rmap walkers can be present, the VMA can
> already be copied.
>
> We already have VMA lifecycle state around detached VMAs, so a VMA
> could be in a detached state, assumed by the existing logic to be entirely
> unavailable for use, out of the maple tree altogether but kept around in a
> zombie state.
>
> We'd then have lifecycle issues and races and edge cases around process
> teardown otherwise we might leak memory.
>
> Also, presumably you set vma->anon_vma to some lazy sentinel value so
> that mremap doesn't change vma->vm_pgoff when unfaulted?
>
> You would need to update any path that manipulates vma->anon_vma also
> so it doesn't incorrectly dereference it.
>
Yes, most of the code in this patch series is intended to prevent
incorrect dereferencing of anon_vma. If we assume it will not be
misused, some of the code could be simplified or removed.
> >
> > folio使用vma->root_vma 计算folio_address;从vma拆分出的vma_a,
> 使用vma_a->root_vma =
> > folio使用vma->vma->root_vma计算folio_address。
> > rmap时得到folio_address就可以通过mm_mt查找到vma。
> > 不fork就不需要维护interval tree。
> >
> > > You may also actually split a VMA against a single large folio
> > > (waiting on the deferred shrinker) and have a SINGLE _leaf_
> > > anonymous folio that is mapped in two places.
> > >
> > > The lazy approach doesn't seem to address this properly. And fatally
> > > it ties an actual VMA afaict to the folio and has to implement a VMA
> > > reference count mechanism which interferes with the ordinarily VMA
> lifecycle to do it.
> > >
> > > The fact of us taking advantage of most stuff being AnonExclusive, i.e.
> > > 'leaves' is something that my approach is exactly taking into account.
> > >
> > > Of course also extending anon_vma is a real non-starter.
> > >
> > > Also the below + the series ignores MAP_PRIVATE file-backed mappings
> > > which is a pretty fatal flaw.
> > >
> > > It also, as Harry says, has zero description of correctness in a way
> > > we'd want and no tests.
> > >
> >
> > 可以正确处理拆分vma在一个大页。拆分的vma_a或vma_b上的
> sub_page使用如下方式计算地址。
> > 对于文件vma的cow 匿名页,也用同样方式计算page/folio地址。
> >
> > It can correctly handle the case where a VMA is split within a large
> > page. The address of a sub_page in the split VMA (vma_a or vma_b) is
> > computed using the following method.
> >
> > For COW anonymous pages originating from file VMAs, the page/folio
> > address is also computed using the same method.
> >
> > subpage_address = vma_address(vma_a, subpage_pgoff, 1) =
> > vma_a->vm_start + (subpage_pgoff - vma_a->vm_pgoff) * PAGE_SIZE =
> > vma_a->vm_start - vma_a->vm_pgoff * PAGE_SIZE + subpage_pgoff *
> > PAGE_SIZE = vma_mapping_base(vma_a) + subpage_pgoff * PAGE_SIZE
> =
> > vma_mapping_base(root_vma) + subpage_pgoff * PAGE_SIZE
>
> OK but you want to walk entries in a _range_ in the interval tree.
>
> So you are then now looking up VMAs (in a racey way) using mm_mt (which
> is the whole basis of my work actually) which could change under you.
>
> I guess what you're doing is using the pinned 'root' VMA as the basis of
> everything, and the second a VMA is moved you (somehow) walk the page
> tables to update the folio->mapping.
>
> Again pinning the VMA like this and putting it in a folio is really not something
> we want to do.
>
> It adds a ton of complexity and also impacts VMA lifecycle which is already
> fairly fraught.
>
> It makes the VMA no longer just a VMA but rather also a 'memory' of where
> something was first faulted in as a hack more or less.
>
Maybe you're right. mm/mm_mt/vma/pagetable each have their own roles
in implementing VM. Perhaps considering them together could lead to
better ideas.
next prev parent reply other threads:[~2026-06-04 3:51 UTC|newest]
Thread overview: 64+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-27 11:01 [PATCH 0/15] mm: introduce ANON_VMA_LAZY for deferred anon_vma creation tao
2026-05-27 11:01 ` [PATCH 01/15] mm/rmap: introduce anon_rmap APIs for anonymous folios tao
2026-05-27 11:44 ` Lorenzo Stoakes
2026-05-28 7:47 ` wangtao
2026-05-27 11:01 ` [PATCH 02/15] mm: convert anon_vma rmap APIs to anon_rmap tao
2026-05-27 11:49 ` Lorenzo Stoakes
2026-05-28 8:55 ` wangtao
2026-05-27 11:01 ` [PATCH 03/15] mm: introduce anon_vma_tree_t for multiple anon_vma topologies tao
2026-05-27 11:56 ` Lorenzo Stoakes
2026-05-28 9:00 ` wangtao
2026-05-27 11:01 ` [PATCH 04/15] mm: switch to anon_vma_tree_t APIs in preparation for ANON_VMA_LAZY tao
2026-05-27 11:01 ` [PATCH 05/15] mm: add CONFIG_ANON_VMA_LAZY and folio helpers tao
2026-05-27 11:01 ` [PATCH 06/15] mm: add CONFIG_VMA_REF and VMA helpers tao
2026-05-27 11:01 ` [PATCH 07/15] mm: replace direct FOLIO_MAPPING_ANON usage with helpers tao
2026-05-27 11:01 ` [PATCH 08/15] mm: prepare rmap infrastructure for ANON_VMA_LAZY tao
2026-05-27 11:01 ` [PATCH 09/15] mm: implement ANON_VMA_LAZY rmap semantics tao
2026-05-27 11:01 ` [PATCH 10/15] mm: defer anon_vma creation with ANON_VMA_LAZY tao
2026-05-27 11:01 ` [PATCH 11/15] mm: handle ANON_VMA_LAZY in huge page operations tao
2026-05-27 11:01 ` [PATCH 12/15] mm: handle ANON_VMA_LAZY during migration tao
2026-05-27 11:01 ` [PATCH 13/15] mm: support setup and upgrade of ANON_VMA_LAZY folios tao
2026-05-27 11:01 ` [PATCH 14/15] mm: support merging of ANON_VMA_LAZY VMAs tao
2026-05-27 11:01 ` [PATCH 15/15] mm: enable CONFIG_ANON_VMA_LAZY on arm64 and x86_64 tao
2026-05-27 11:23 ` [PATCH 0/15] mm: introduce ANON_VMA_LAZY for deferred anon_vma creation Pedro Falcato
2026-05-28 6:45 ` wangtao
2026-05-28 7:14 ` Lorenzo Stoakes
2026-05-27 11:30 ` Lorenzo Stoakes
2026-05-28 7:11 ` wangtao
2026-05-28 7:22 ` Lorenzo Stoakes
2026-05-27 14:33 ` Lorenzo Stoakes
2026-05-28 7:57 ` wangtao
2026-05-28 8:14 ` Lorenzo Stoakes
[not found] ` <CAGsJ_4zy=-m5wjm0BC-vQXMHGRkHymC-5S_L9Oi708v339vvPw@mail.gmail.com>
2026-05-29 2:20 ` wangzicheng
2026-05-29 6:56 ` Lorenzo Stoakes
2026-05-29 6:45 ` Lorenzo Stoakes
2026-05-29 9:41 ` wangtao
2026-05-29 12:03 ` Lorenzo Stoakes
2026-06-01 1:46 ` wangtao
2026-06-02 2:15 ` Barry Song
2026-06-02 2:46 ` Lance Yang
2026-06-02 15:37 ` Lorenzo Stoakes
2026-06-02 19:44 ` Pedro Falcato
2026-06-02 23:03 ` Barry Song
2026-06-03 7:07 ` Lorenzo Stoakes
2026-06-02 19:56 ` Harry Yoo
2026-06-02 22:27 ` Barry Song
2026-06-02 20:47 ` Lorenzo Stoakes
2026-05-29 15:07 ` Jonathan Corbet
2026-05-29 15:40 ` Lorenzo Stoakes
2026-05-30 11:28 ` Barry Song
2026-06-02 16:07 ` Harry Yoo
2026-06-03 2:59 ` wangtao
2026-06-03 3:12 ` wangtao
2026-06-03 7:54 ` Lorenzo Stoakes
2026-06-03 11:05 ` wangtao
2026-06-03 11:53 ` Lorenzo Stoakes
2026-06-04 3:50 ` wangtao [this message]
2026-06-03 20:25 ` David Hildenbrand (Arm)
2026-06-03 22:14 ` Barry Song
2026-06-04 4:03 ` wangtao
2026-06-04 4:20 ` Barry Song
2026-06-04 7:35 ` wangtao
2026-06-04 3:10 ` xu.xin16
2026-06-04 4:10 ` wangtao
2026-06-04 9:40 ` Lorenzo Stoakes
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=a073c529df7841d98dcec9ddc3dad8bc@honor.com \
--to=tao.wangtao@honor.com \
--cc=21cnbao@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=apopple@nvidia.com \
--cc=baohua@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=bp@alien8.de \
--cc=byungchul@sk.com \
--cc=catalin.marinas@arm.com \
--cc=chengming.zhou@linux.dev \
--cc=damon@lists.linux.dev \
--cc=dave.hansen@linux.intel.com \
--cc=david@kernel.org \
--cc=dev.jain@arm.com \
--cc=dvander@google.com \
--cc=gourry@gourry.net \
--cc=harry@kernel.org \
--cc=hpa@zytor.com \
--cc=jack@suse.cz \
--cc=jannh@google.com \
--cc=jgg@ziepe.ca \
--cc=jhubbard@nvidia.com \
--cc=joshua.hahnjy@gmail.com \
--cc=jparsana@google.com \
--cc=kas@kernel.org \
--cc=kees@kernel.org \
--cc=lance.yang@linux.dev \
--cc=liam@infradead.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=luizcap@redhat.com \
--cc=matthew.brost@intel.com \
--cc=mhocko@suse.com \
--cc=mingo@redhat.com \
--cc=nao.horiguchi@gmail.com \
--cc=npache@redhat.com \
--cc=peterx@redhat.com \
--cc=pfalcato@suse.de \
--cc=rakie.kim@sk.com \
--cc=riel@surriel.com \
--cc=rppt@kernel.org \
--cc=ryan.roberts@arm.com \
--cc=ryncsn@gmail.com \
--cc=shakeel.butt@linux.dev \
--cc=sj@kernel.org \
--cc=surenb@google.com \
--cc=tglx@kernel.org \
--cc=vbabka@kernel.org \
--cc=wangzicheng@honor.com \
--cc=will@kernel.org \
--cc=willy@infradead.org \
--cc=x86@kernel.org \
--cc=xu.xin16@zte.com.cn \
--cc=ying.huang@linux.alibaba.com \
--cc=zhangji1@honor.com \
--cc=zhangjiao2@cmss.chinamobile.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox