RE: [PATCH 0/15] mm: introduce ANON_VMA_LAZY for deferred anon_vma creation

From: wangtao <tao.wangtao@honor.com>
To: Lorenzo Stoakes <ljs@kernel.org>
Cc: Harry Yoo <harry@kernel.org>,
	"catalin.marinas@arm.com" <catalin.marinas@arm.com>,
	"will@kernel.org" <will@kernel.org>,
	"tglx@kernel.org" <tglx@kernel.org>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"bp@alien8.de" <bp@alien8.de>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"x86@kernel.org" <x86@kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"david@kernel.org" <david@kernel.org>,
	"willy@infradead.org" <willy@infradead.org>,
	"sj@kernel.org" <sj@kernel.org>,
	"kees@kernel.org" <kees@kernel.org>,
	"luizcap@redhat.com" <luizcap@redhat.com>,
	"zhangjiao2@cmss.chinamobile.com"
	<zhangjiao2@cmss.chinamobile.com>,
	"kas@kernel.org" <kas@kernel.org>,
	"hpa@zytor.com" <hpa@zytor.com>,
	"liam@infradead.org" <liam@infradead.org>,
	"vbabka@kernel.org" <vbabka@kernel.org>,
	"rppt@kernel.org" <rppt@kernel.org>,
	"surenb@google.com" <surenb@google.com>,
	"mhocko@suse.com" <mhocko@suse.com>,
	"jack@suse.cz" <jack@suse.cz>,
	"riel@surriel.com" <riel@surriel.com>,
	"jannh@google.com" <jannh@google.com>,
	"jgg@ziepe.ca" <jgg@ziepe.ca>,
	"jhubbard@nvidia.com" <jhubbard@nvidia.com>,
	"peterx@redhat.com" <peterx@redhat.com>,
	"ziy@nvidia.com" <ziy@nvidia.com>,
	"baolin.wang@linux.alibaba.com" <baolin.wang@linux.alibaba.com>,
	"npache@redhat.com" <npache@redhat.com>,
	"ryan.roberts@arm.com" <ryan.roberts@arm.com>,
	"dev.jain@arm.com" <dev.jain@arm.com>,
	"baohua@kernel.org" <baohua@kernel.org>,
	"lance.yang@linux.dev" <lance.yang@linux.dev>,
	"xu.xin16@zte.com.cn" <xu.xin16@zte.com.cn>,
	"chengming.zhou@linux.dev" <chengming.zhou@linux.dev>,
	"nao.horiguchi@gmail.com" <nao.horiguchi@gmail.com>,
	"matthew.brost@intel.com" <matthew.brost@intel.com>,
	"joshua.hahnjy@gmail.com" <joshua.hahnjy@gmail.com>,
	"rakie.kim@sk.com" <rakie.kim@sk.com>,
	"byungchul@sk.com" <byungchul@sk.com>,
	"gourry@gourry.net" <gourry@gourry.net>,
	"ying.huang@linux.alibaba.com" <ying.huang@linux.alibaba.com>,
	"apopple@nvidia.com" <apopple@nvidia.com>,
	"pfalcato@suse.de" <pfalcato@suse.de>,
	"linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"damon@lists.linux.dev" <damon@lists.linux.dev>,
	"shakeel.butt@linux.dev" <shakeel.butt@linux.dev>,
	"ryncsn@gmail.com" <ryncsn@gmail.com>,
	"21cnbao@gmail.com" <21cnbao@gmail.com>,
	"jparsana@google.com" <jparsana@google.com>,
	"dvander@google.com" <dvander@google.com>,
	zhangji <zhangji1@honor.com>, wangzicheng <wangzicheng@honor.com>
Subject: RE: [PATCH 0/15] mm: introduce ANON_VMA_LAZY for deferred anon_vma creation
Date: Thu, 4 Jun 2026 03:50:54 +0000
Message-ID: <a073c529df7841d98dcec9ddc3dad8bc@honor.com>
In-Reply-To: <aiAMHhq9QCp-z3V9@lucifer>

> 
> Thanks for your replies, but I really have to stop doing deeper analyses like
> these for time management purposes.
Of course I will respond to technical discussions.

> 
> I did this more so to make the point from [0] as to why, in lower trust
> environments, this is just not feasible.
> 
> We could loop around for hours and hours and hours here.
> 
> In general as before, even if all worked perfectly (I'm very much not at all
> convinced), extending anon_vma and pinning VMAs is simply a no-go for
> architectural and complexity reasons.
> 
> I also find the locking story dubious and the lack of tests or anything
> corroborating correctness is additionally fatal.
> 
During rmap, the anon_vma provides a superset of the VMAs that might
map the folio. We first confirm each candidate with vma_address(), and
then each rmap_one callback further checks whether the VMA actually
needs to be processed, through page_vma_mapped_walk() and check_pte().

The lazy VMA used by ANON_VMA_LAZY provides only one VMA: if there is
no fork or mremap, then this single VMA is sufficient. To avoid taking
the folio lock during fork and mremap, we re-check folio->mapping after
anon_walk_anon and, if it has been upgraded to an anon_vma in the
meantime, retry the walk once.
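
The retry can be sketched roughly as below. This is a single-threaded
userspace model, not code from the series: the sketch_ types, the
callback signature and the two example passes are all hypothetical,
and the memory ordering a real implementation would need when reading
folio->mapping is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two states folio->mapping can be in. */
enum sketch_mapping_kind {
	SKETCH_MAPPING_LAZY_VMA,	/* points at the single lazy VMA */
	SKETCH_MAPPING_ANON_VMA,	/* upgraded to a full anon_vma */
};

struct sketch_folio {
	enum sketch_mapping_kind kind;
};

/* One rmap pass over whatever the folio's mapping currently refers to. */
typedef void (*sketch_walk_fn)(struct sketch_folio *folio, void *arg);

/*
 * Walk once. If the folio was lazy when the walk started but has been
 * upgraded by the time the walk finished, walk one more time so that
 * VMAs reachable only through the anon_vma are not missed.
 * Returns the number of passes performed (1 or 2).
 */
static int sketch_rmap_walk(struct sketch_folio *folio,
			    sketch_walk_fn walk, void *arg)
{
	bool was_lazy;

	assert(folio != NULL && walk != NULL);
	was_lazy = folio->kind == SKETCH_MAPPING_LAZY_VMA;

	walk(folio, arg);
	if (was_lazy && folio->kind == SKETCH_MAPPING_ANON_VMA) {
		walk(folio, arg);
		return 2;
	}
	return 1;
}

/* Example pass: only counts invocations through @arg. */
static void sketch_walk_count(struct sketch_folio *folio, void *arg)
{
	(void)folio;
	++*(int *)arg;
}

/* Example pass during which a simulated fork upgrades the mapping. */
static void sketch_walk_racing_fork(struct sketch_folio *folio, void *arg)
{
	++*(int *)arg;
	folio->kind = SKETCH_MAPPING_ANON_VMA;
}
```

With sketch_walk_count the walk runs once; with
sketch_walk_racing_fork on a lazy folio it runs twice, and on an
already upgraded folio it runs once again.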

If your concern is about the lack of locking during rmap, one option
is to add a set of anon_vma locks modelled on folio_wait_table. That
was how I handled it during my initial debugging. Later, after
reviewing the code flow, I concluded that the lock might not be
necessary, so I removed it.

> And finally, I was already working on a replacement for anon_vma, and the
> generally done thing in these situations is for my work to take precedence.
> 
> So I'm going to bail out on further deeper analyses here as otherwise I simply
> can't work on anything else :)
> 
> Thanks, Lorenzo
> 
> [0]:https://lore.kernel.org/all/ah887A5VkXOcmq-g@lucifer/
> 
> 
> On Wed, Jun 03, 2026 at 11:05:28AM +0000, wangtao wrote:
> > > >
> > >
> > > Against my better judgment I'll address the stuff here...
> > >
> > > > VMA operations can be roughly divided into three categories. The
> > > > handling of ANON_VMA_LAZY is briefly described below.
> > >
> > > I don't agree, there are plenty more VMA operations. But with
> > > respect to anon rmap there are:
> > >
> > > - fork
> > > - merge/split
> > > - remap
> > >
> >
> > Yes, these are the three categories. I originally intended to explain
> > them by classifying based on system calls; I should have used mremap
> > instead of move_vma.
> 
> I don't think you mentioned move_vma()? Maybe I missed it.
> 
> The categorisation is most usefully based on callers of anon_vma_clone().
> 
> >
> > > Your approach seems to completely ignore VMA split and the need to
> > > maintain an interval tree to _multiple_ VMAs from a single anon_vma.
> > >
> >
> > The folio uses vma->root_vma to compute folio_address. A VMA split
> > from it, vma_a, also uses vma_a->root_vma = vma->root_vma to compute
> > folio_address.
> > During rmap, once folio_address is obtained, the VMA can be found
> > through mm_mt. Without fork, there is no need to maintain the interval
> > tree.
> 
> Well you need to search for every possible split VMA in mm_mt now, so you
> have to go page-by-page searching for each page for the rmap walked range.
> 
ANON_VMA_LAZY has only one VMA. When I first looked at
rmap_walk_ksm, I also thought it would need to search page by page,
which seemed unacceptable. Later I realized that it only needs to
check whether this VMA falls within the rmap walk range.

@@ -3173,20 +3171,20 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
-		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
+		anon_rmap_foreach_vma(vma, vmac, anon_rmap,
 					       0, ULONG_MAX) {
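
The check can be sketched as a plain interval overlap test on page
offsets. This is a userspace model with hypothetical sketch_ names and
an assumed 4 KiB page size, not the anon_rmap_foreach_vma()
implementation from the series.

```c
#include <assert.h>
#include <limits.h>	/* ULONG_MAX, for a "whole range" walk */
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the three vm_area_struct fields used here. */
struct sketch_vma {
	unsigned long vm_start;	/* first mapped address */
	unsigned long vm_end;	/* one past the last mapped address */
	unsigned long vm_pgoff;	/* page offset corresponding to vm_start */
};

/* Last page offset covered by @vma (inclusive). */
static unsigned long sketch_vma_last_pgoff(const struct sketch_vma *vma)
{
	assert(vma->vm_end > vma->vm_start);
	return vma->vm_pgoff +
	       ((vma->vm_end - vma->vm_start) >> SKETCH_PAGE_SHIFT) - 1;
}

/*
 * With a single lazy VMA there is nothing to iterate over: the walk
 * only has to ask whether that one VMA overlaps the inclusive page
 * offset range [first, last].
 */
static bool sketch_vma_in_walk_range(const struct sketch_vma *vma,
				     unsigned long first,
				     unsigned long last)
{
	return vma->vm_pgoff <= last && sketch_vma_last_pgoff(vma) >= first;
}
```

For the rmap_walk_ksm() case above the range is 0..ULONG_MAX, so the
test is trivially true and the single VMA is always visited.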

> You're also potentially racing against a remap, as you say below you don't
> folio lock on remap so concurrent rmap walkers can be present, the VMA can
> already be copied.
> 
> We already have VMA lifecycle state around detached VMAs, so a VMA
> could be in a detached state, assumed by the existing logic to be entirely
> unavailable for use, out of the maple tree altogether but kept around in a
> zombie state.
> 
> We'd then have lifecycle issues and races and edge cases around process
> teardown otherwise we might leak memory.
> 
> Also, presumably you set vma->anon_vma to some lazy sentinel value so
> that mremap doesn't change vma->vm_pgoff when unfaulted?
> 
> You would need to update any path that manipulates vma->anon_vma also
> so it doesn't incorrectly dereference it.
> 
Yes, most of the code in this patch series is intended to prevent
incorrect dereferencing of anon_vma. If we assume it will not be
misused, some of the code could be simplified or removed.

> >
> > > You may also actually split a VMA against a single large folio
> > > (waiting on the deferred shrinker) and have a SINGLE _leaf_
> > > anonymous folio that is mapped in two places.
> > >
> > > The lazy approach doesn't seem to address this properly. And fatally
> > > it ties an actual VMA afaict to the folio and has to implement a VMA
> > > reference count mechanism which interferes with the ordinary VMA
> > > lifecycle to do it.
> > >
> > > The fact of us taking advantage of most stuff being AnonExclusive, i.e.
> > > 'leaves' is something that my approach is exactly taking into account.
> > >
> > > Of course also extending anon_vma is a real non-starter.
> > >
> > > Also the below + the series ignores MAP_PRIVATE file-backed mappings
> > > which is a pretty fatal flaw.
> > >
> > > It also, as Harry says, has zero description of correctness in a way
> > > we'd want and no tests.
> > >
> >
> > It can correctly handle the case where a VMA is split within a large
> > page. The address of a sub_page in the split VMA (vma_a or vma_b) is
> > computed using the following method.
> >
> > For COW anonymous pages originating from file VMAs, the page/folio
> > address is also computed using the same method.
> >
> > subpage_address
> >   = vma_address(vma_a, subpage_pgoff, 1)
> >   = vma_a->vm_start + (subpage_pgoff - vma_a->vm_pgoff) * PAGE_SIZE
> >   = vma_a->vm_start - vma_a->vm_pgoff * PAGE_SIZE + subpage_pgoff * PAGE_SIZE
> >   = vma_mapping_base(vma_a) + subpage_pgoff * PAGE_SIZE
> >   = vma_mapping_base(root_vma) + subpage_pgoff * PAGE_SIZE
> 
> OK but you want to walk entries in a _range_ in the interval tree.
> 
> So you are then now looking up VMAs (in a racy way) using mm_mt (which
> is the whole basis of my work actually) which could change under you.
> 
> I guess what you're doing is using the pinned 'root' VMA as the basis of
> everything, and the second a VMA is moved you (somehow) walk the page
> tables to update the folio->mapping.
> 
> Again pinning the VMA like this and putting it in a folio is really not something
> we want to do.
> 
> It adds a ton of complexity and also impacts VMA lifecycle which is already
> fairly fraught.
> 
> It makes the VMA no longer just a VMA but rather also a 'memory' of where
> something was first faulted in as a hack more or less.
> 
Maybe you're right. The mm, mm_mt, VMAs and page tables each have
their own role in implementing virtual memory. Perhaps considering
them together could lead to better ideas.
