Date: Wed, 3 Jun 2026 08:54:08 +0100
From: Lorenzo Stoakes
To: wangtao
Cc: Harry Yoo, catalin.marinas@arm.com, will@kernel.org, tglx@kernel.org,
 mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
 akpm@linux-foundation.org, david@kernel.org, willy@infradead.org,
 sj@kernel.org, kees@kernel.org, luizcap@redhat.com,
 zhangjiao2@cmss.chinamobile.com, kas@kernel.org, hpa@zytor.com,
 liam@infradead.org, vbabka@kernel.org, rppt@kernel.org, surenb@google.com,
 mhocko@suse.com, jack@suse.cz, riel@surriel.com, jannh@google.com,
 jgg@ziepe.ca, jhubbard@nvidia.com, peterx@redhat.com, ziy@nvidia.com,
 baolin.wang@linux.alibaba.com, npache@redhat.com, ryan.roberts@arm.com,
 dev.jain@arm.com, baohua@kernel.org, lance.yang@linux.dev,
 xu.xin16@zte.com.cn, chengming.zhou@linux.dev, nao.horiguchi@gmail.com,
 matthew.brost@intel.com, joshua.hahnjy@gmail.com, rakie.kim@sk.com,
 byungchul@sk.com, gourry@gourry.net, ying.huang@linux.alibaba.com,
 apopple@nvidia.com, pfalcato@suse.de, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, damon@lists.linux.dev, shakeel.butt@linux.dev,
 ryncsn@gmail.com, 21cnbao@gmail.com, jparsana@google.com, dvander@google.com,
 zhangji, wangzicheng
Subject: Re: [PATCH 0/15] mm: introduce ANON_VMA_LAZY for deferred anon_vma creation
References: <20260527110147.17815-1-tao.wangtao@honor.com>
 <0867dac0-bc48-4aa9-891f-2066a3eff989@kernel.org>
 <7319ad82f9ee4fc4b18b50b1842c9f99@honor.com>
In-Reply-To: <7319ad82f9ee4fc4b18b50b1842c9f99@honor.com>

On Wed, Jun 03, 2026 at 02:59:04AM +0000, wangtao wrote:
> > On 5/27/26 8:01 PM, tao wrote:
> > > Design overview
> > > ---------------
> > >
> > > ANON_VMA_LAZY defers anon_vma allocation until it is actually needed
> > > (for example during fork). VMAs that never participate in sharing can
> > > avoid creating anon_vma structures entirely.
> > >
> > > Before an anon_vma exists, rmap operations rely directly on VMA
> > > information, so no anon_vma locking is required. An anon_vma is
> > > created and linked only when sharing semantics are required.
> >
> > It is unfortunate that the design overview doesn't cover the correctness
> > aspect at all. VMAs are subject to change (even before being shared with
> > other processes), and rmap needs something that doesn't go away across VMA
> > merging, split, etc.
> >
> > I'm not sure how the idea is supposed to work correctly.
> >
> > --
> > Cheers,
> > Harry / Hyeonggon
>

Against my better judgment I'll address the stuff here...

> VMA operations can be roughly divided into three categories. The handling
> of ANON_VMA_LAZY is briefly described below.

I don't agree, there are plenty more VMA operations. But with respect to anon
rmap there are:

- fork
- merge/split
- remap

Your approach seems to completely ignore VMA split and the need to maintain an
interval tree to _multiple_ VMAs from a single anon_vma.

You may also actually split a VMA against a single large folio (waiting on the
deferred shrinker) and have a SINGLE _leaf_ anonymous folio that is mapped in
two places. The lazy approach doesn't seem to address this properly.

And fatally, afaict, it ties an actual VMA to the folio and has to implement a
VMA reference count mechanism, which interferes with the ordinary VMA
lifecycle, to do it.

The fact that we can take advantage of most stuff being AnonExclusive, i.e.
'leaves', is exactly what my approach takes into account.

Of course also extending anon_vma is a real non-starter.

Also the below + the series ignores MAP_PRIVATE file-backed mappings, which is
a pretty fatal flaw.

It also, as Harry says, has zero description of correctness in a way we'd want,
and no tests.

>
> 1. fork
>
> fork duplicates the parent's mm/mmap. (exec creates a new mm/mmap and is
> not involved here.)
> This can be viewed as copying the VMAs with identical
> virtual addresses into a new address space.
>
> If the parent VMA (pvma) is ANON_VMA_LAZY, it is first upgraded to a
> regular anon_vma. The corresponding folio->mapping is then fixed in
> try_dup_anon_rmap().

And so we make fork, a very sensitive path in the kernel, more expensive.

I also question the locking situation with the conversion mentioned; updating
folios in this manner is extremely difficult.

>
> 2. mmap / brk / mprotect / munmap
>
> These operations create, modify, or remove VMAs in the current mm. They
> may split existing VMAs, merge adjacent VMAs, or remove a VMA from mm_mt.

mmap and brk are not at all relevant to anon_vma, as no anon_vma is assigned
upon mapping. It's assigned on fault.

mprotect/mlock/munmap/etc. might split, but I don't see how the lazy approach
in any way addresses any of that.

>
> When a new VMA is created, vm_start, vm_end and vm_pgoff are initialized
> and the VMA is inserted into mm_mt. Although these fields may later be
> modified, the following value remains invariant:
>
>     (vm_start - vm_pgoff * PAGE_SIZE)

Err no it doesn't at all?

If I fault in a VMA at vm_start, vm_pgoff = vm_start >> PAGE_SHIFT.

Then if I remap it, vm_start changes, vm_pgoff stays the same, so:

    vm_start - vm_pgoff * PAGE_SIZE

Changes, right? And then that becomes essentially the offset from where it was
faulted in.

>
> We refer to this value as:
>
>     vma_mapping_base(vma) = vma->vm_start - vma->vm_pgoff * PAGE_SIZE

This is mysteriously close to being the offset I mention in my CoW context
work... I'm not sure what 'mapping base' means here.

>
> This value also remains unchanged when the VMA is removed from mm_mt.

Why does it matter what this value is on unmap?

>
> If a VMA is split and produces new_vma, the following holds:
>
>     vma_mapping_base(new_vma) == vma_mapping_base(vma)

This is a roundabout way of saying we offset the vma->vm_pgoff after split.
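
To make the arithmetic in the exchange above concrete, here is a minimal
userspace sketch. It is not kernel code: struct toy_vma, toy_mapping_base(),
toy_split() and toy_move() are hypothetical stand-ins, PAGE_SHIFT is assumed to
be 12, and toy_move() only models the case described above where vm_start
changes while vm_pgoff is kept.

```c
#include <assert.h>

#define PAGE_SHIFT 12UL                 /* assumption: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical stand-in for the three VMA fields under discussion. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

/* The quantity the quoted text calls vma_mapping_base(). */
static unsigned long toy_mapping_base(const struct toy_vma *vma)
{
	return vma->vm_start - vma->vm_pgoff * PAGE_SIZE;
}

/*
 * Split at addr: the lower part stays in *vma, the upper part is returned
 * with vm_pgoff advanced by the number of pages it skips.
 */
static struct toy_vma toy_split(struct toy_vma *vma, unsigned long addr)
{
	struct toy_vma upper;

	assert(addr > vma->vm_start && addr < vma->vm_end);
	assert((addr & (PAGE_SIZE - 1)) == 0);

	upper.vm_start = addr;
	upper.vm_end = vma->vm_end;
	upper.vm_pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
	vma->vm_end = addr;
	return upper;
}

/* Move to new_start while keeping vm_pgoff (the remap case modelled here). */
static struct toy_vma toy_move(const struct toy_vma *vma, unsigned long new_start)
{
	struct toy_vma moved = *vma;

	moved.vm_start = new_start;
	moved.vm_end = new_start + (vma->vm_end - vma->vm_start);
	return moved;
}
```

With a toy VMA covering 0x10000..0x14000 and vm_pgoff 0x10, splitting at
0x12000 leaves the value identical in both halves, while moving the upper half
to 0x20000 shifts it by 0x20000 - 0x12000.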
>
> If two adjacent VMAs vma_a and vma_b are merged into vma_x, then:
>
>     vma_mapping_base(vma_a) == vma_mapping_base(vma_b) ==
>     vma_mapping_base(vma_x)

This is just a roundabout way of saying the pgoff has to be aligned.

>
> Assume the VMA where the first page fault occurs is called root_vma, and
> ensure that any VMA produced by split or merge holds a reference to
> root_vma.

But this VMA can be unmapped later? Or remapped?

Holding on to a VMA and treating it as some kind of canonical reference with a
reference count completely changes what VMAs are, impacts the VMA lifecycle,
and produces unwanted memory overhead in itself.

It also raises concerns and issues around lock order, which is very sensitive.

>
> During rmap we can compute the folio address using root_vma:
>
>     vma_address(vma, pgoff, 1) =

What are the parameters here? What's 1?

>         vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
>       = vma_mapping_base(vma) + pgoff * PAGE_SIZE
>       = vma_mapping_base(root_vma) + folio_pgoff * PAGE_SIZE
>
> We can then use folio_addr to locate the VMA covering this folio.

I'm really confused by this, you're kind of mixing and matching parameters
here.

What I think you're saying is that, if a folio hasn't been remapped, you can
figure out its address based on page offset.

That's completely broken for MAP_PRIVATE file-backed mappings, which also use
anon_vma and also have to keep on working.

It seems that for the lazy approach what you are doing is essentially caching
the 'root' VMA in the folio. But this doesn't account for large folios and
split VMAs.

Even if you disabled it for those cases (which adds a ton of complexity in
itself), you then have issues with locking - you still have to take a lock
(which cannot be a VMA-level lock, as that results in lock inversion) even on
these leaf entries, or you break locking.

And we can't reasonably start pinning VMAs and using them as a sort of proto
cached thing on top of the existing anon_vma logic.
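
The address computation quoted above can be sketched in userspace C. This is a
toy model under stated assumptions, not the kernel's vma_address(): struct
toy_vma and all toy_* helpers are hypothetical, PAGE_SHIFT is assumed to be 12,
the third argument from the quoted call is left out, and TOY_BAD_ADDR is an
invented out-of-range marker.

```c
#include <assert.h>

#define PAGE_SHIFT   12UL               /* assumption: 4 KiB pages */
#define PAGE_SIZE    (1UL << PAGE_SHIFT)
#define TOY_BAD_ADDR (~0UL)             /* hypothetical "not in this VMA" marker */

/* Hypothetical stand-in for the three VMA fields under discussion. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

/* The quantity the quoted text calls vma_mapping_base(). */
static unsigned long toy_mapping_base(const struct toy_vma *vma)
{
	return vma->vm_start - vma->vm_pgoff * PAGE_SIZE;
}

/* First line of the quoted derivation, with a range check added. */
static unsigned long toy_vma_address(const struct toy_vma *vma,
				     unsigned long pgoff)
{
	unsigned long addr;

	if (pgoff < vma->vm_pgoff)
		return TOY_BAD_ADDR;
	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
	return addr < vma->vm_end ? addr : TOY_BAD_ADDR;
}

/* Last lines of the quoted derivation: base + pgoff * PAGE_SIZE. */
static unsigned long toy_address_from_base(unsigned long base,
					   unsigned long pgoff)
{
	return base + pgoff * PAGE_SIZE;
}

/* Returns 1 if both forms give the same address for a page inside vma. */
static int toy_forms_agree(const struct toy_vma *vma, unsigned long pgoff)
{
	unsigned long addr = toy_vma_address(vma, pgoff);

	assert(addr != TOY_BAD_ADDR);
	return addr == toy_address_from_base(toy_mapping_base(vma), pgoff);
}
```

For a VMA that has not moved, the two forms agree. If the same range is moved
while vm_pgoff is kept, a base value cached from before the move still yields
the old address, so the shortcut only holds for folios that have not been
remapped.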

You also then need to, on remap, undo all this, which requires updating
folio->mapping on remap, something I tried doing previously myself, but that's
fraught with issues around lock inversion itself.

>
> 3. mremap / uffd_move

userfaultfd moving is not relevant as it actually updates the folio correctly.

>
> If only the size changes and the start address remains the same, there
> is no impact.
>
> If the start address changes, the page is moved from (vma, addr) to
> (new_vma, new_addr). In this case:
>
>     vma_mapping_base(new_vma) =
>         vma_mapping_base(vma) + new_addr - old_addr

You say above that the mapping base never changes? But here it changes?

>
> We first upgrade the VMA, and then fix folio->mapping in move_ptes().

What's 'upgrading' a VMA? You mean converting the lazy anon_vma to a 'normal'
one.

As above, this is fraught with lock inversion issues.

>
> If performance becomes a concern, ANON_VMA_LAZY can be enabled only for
> relatively small VMAs.

I think you've got serious correctness, lock management and complexity issues
and it's all a non-starter as the costs far exceed the benefits.

This is one of the fundamental, frustrating aspects of the anon rmap - you keep
thinking that 'surely' you can do sensible thing X, but it turns out you can't
for various annoying reasons.

It's one of the reasons it's really fraught for somebody coming to make
changes, and one of the reasons why I am very keen on fundamentally changing
it.

And also on a not-wasting-time basis - I was already working in parallel on a
rework here, so I think the civil thing is to at least wait for my work before
issuing alternative solutions.

Thanks, Lorenzo

>
>
> VMA operations can be divided into three categories; the ANON_VMA_LAZY
> handling for each is briefly described below:
>
> 1. fork copies mm/mmap from the parent process (exec creates a new
> mm/mmap and is not involved).
> This can be understood as copying the VMAs at the same addresses into a
> new address space.
> If pvma is ANON_VMA_LAZY, it is first upgraded to a regular anon_vma, and
> folio->mapping is fixed up in try_dup_anon_rmap.
>
> 2. mmap/brk/mprotect/munmap
> These create, modify or delete VMAs of the current mm; they may merge
> VMAs, split off new VMAs, or remove a VMA from mm_mt.
> After a new vma is created with vm_start, vm_end and vm_pgoff set and is
> inserted into mm_mt, those fields may later be modified, but
> (vm_start - vm_pgoff * PAGE_SIZE) stays unchanged. We can call this
> vma_mapping_base(vma) = vma->vm_start - vma->vm_pgoff * PAGE_SIZE.
> When this vma is removed from mm_mt, vma_mapping_base(vma) also stays
> unchanged.
> For a new_vma split from this vma,
> vma_mapping_base(new_vma) == vma_mapping_base(vma).
> When adjacent vma_a and vma_b are merged into vma_x, likewise
> vma_mapping_base(vma_a) == vma_mapping_base(vma_b) == vma_mapping_base(vma_x).
> Suppose the VMA where the first page fault occurs is called root_vma, and
> that on split or merge the vma in use always holds a reference to root_vma.
> During rmap we can then compute the folio address using root_vma:
>     vma_address(vma, pgoff, 1) = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
>       = vma_mapping_base(vma) + pgoff * PAGE_SIZE
>       = vma_mapping_base(root_vma) + folio_pgoff * PAGE_SIZE
> folio_addr is then used to look up the vma containing the folio.
>
> 3. mremap/uffd_move
> If only the size changes and the start address stays the same, there is
> no impact.
> If the start address changes, the page is moved from vma/addr to
> new_vma/new_addr.
> In that case vma_mapping_base(new_vma) = vma_mapping_base(vma) + new_addr - old_addr.
> We first upgrade the vma, then fix folio->mapping in move_ptes.
> If the performance impact is a concern, ANON_VMA_LAZY can be enabled only
> on smaller vmas.
>