From mboxrd@z Thu Jan  1 00:00:00 1970
From: Lokesh Gidra <lokeshgidra@google.com>
Date: Tue, 26 Aug 2025 15:23:28 -0700
Subject: Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk()
To: Lorenzo Stoakes
Cc: David Hildenbrand, Andrew Morton, Harry Yoo, Zi Yan,
 Barry Song <21cnbao@gmail.com>,
 "open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>,
 Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel,
 Jann Horn, Rik van Riel, Vlastimil Babka, "Liam R. Howlett"
Content-Type: text/plain; charset="UTF-8"

On Tue, Aug 26, 2025 at 8:52 AM Lorenzo Stoakes wrote:
>
> On Fri, Aug 22, 2025 at 10:29:52AM -0700, Lokesh Gidra wrote:
> > Hi all,
> >
> > Currently, some callers of rmap_walk() conditionally avoid try-locking
> > non-ksm anon folios.
> > This necessitates serialization through the anon_vma
> > write-lock elsewhere when folio->mapping and/or folio->index (fields
> > involved in rmap_walk()) are to be updated. This hurts scalability due
> > to the coarse granularity of the lock. For instance, when multiple threads
> > invoke userfaultfd's MOVE ioctl simultaneously to move distinct pages
> > from the same src VMA, they all contend for the corresponding
> > anon_vma's lock. Field traces for arm64 android devices reveal over
> > 30ms of uninterruptible sleep in the main UI thread, leading to janky
> > user interactions.
>
> Can we clarify whether this is simply an example, or rather the entire
> motivating reason for raising this issue?
>

When I started off I thought maybe there are other cases too, but it
looks like as of now only uffd MOVE updates folio->mapping to a
different root anon_vma.

> It's important, because it strikes me that this is a very specific use
> case, and you're now suggesting changing core locking to suit it.
>
> While this is a discussion, and I'm glad you raised it, I think it's
> important in these cases to really exhaustively examine all of the possible
> consequences.
>
> OK so to clarify:
>
> - You want to traverse the rmap entirely without any rmap locks whatsoever
>   for anon, relying solely on the folio lock to serialise, because
>   otherwise rmap read locks here block other rmap write lock calls.
>

There is a misunderstanding. I'm suggesting locking *both* the folio as
well as the anon_vma during the rmap walk.
To avoid any confusion, here are the simplifications in mm/rmap.c that
I suggest:

diff --git a/mm/rmap.c b/mm/rmap.c
index 568198e9efc2..81c177b0cddf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -547,7 +547,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
 	struct anon_vma *root_anon_vma;
 	unsigned long anon_mapping;
 
-retry:
 	rcu_read_lock();
 	anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
 	if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
@@ -558,17 +557,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
 	anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
 	root_anon_vma = READ_ONCE(anon_vma->root);
 	if (down_read_trylock(&root_anon_vma->rwsem)) {
-		/*
-		 * folio_move_anon_rmap() might have changed the anon_vma as we
-		 * might not hold the folio lock here.
-		 */
-		if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
-			     anon_mapping)) {
-			up_read(&root_anon_vma->rwsem);
-			rcu_read_unlock();
-			goto retry;
-		}
-
 		/*
 		 * If the folio is still mapped, then this anon_vma is still
 		 * its anon_vma, and holding the mutex ensures that it will
@@ -603,18 +591,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
 	rcu_read_unlock();
 	anon_vma_lock_read(anon_vma);
 
-	/*
-	 * folio_move_anon_rmap() might have changed the anon_vma as we might
-	 * not hold the folio lock here.
-	 */
-	if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
-		     anon_mapping)) {
-		anon_vma_unlock_read(anon_vma);
-		put_anon_vma(anon_vma);
-		anon_vma = NULL;
-		goto retry;
-	}
-
 	if (atomic_dec_and_test(&anon_vma->refcount)) {
 		/*
 		 * Oops, we held the last refcount, release the lock
@@ -1006,7 +982,7 @@ int folio_referenced(struct folio *folio, int is_locked,
 	if (!folio_raw_mapping(folio))
 		return 0;
 
-	if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
+	if (!is_locked) {
 		we_locked = folio_trylock(folio);
 		if (!we_locked)
 			return 1;

> - You want to unconditionally folio lock all anon and KSM folios for at
>   least folio_referenced().
>

Actually, file and KSM folios are always locked today; only anon folios
are conditionally left out. So my proposal standardizes this locking,
which is an overall simplification.

> In order to resolve a scalability issue specific to a uffd usecase?
>

With the requirement of locking the anon_vma in write mode, uffd MOVE is
currently unusable in practice due to poor scalability. The above change
in mm/rmap.c allows us to make the following improvement to the MOVE
ioctl:

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 45e6290e2e8b..c4fc87d73ab7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1192,7 +1192,6 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 	pmd_t dummy_pmdval;
 	pmd_t dst_pmdval;
 	struct folio *src_folio = NULL;
-	struct anon_vma *src_anon_vma = NULL;
 	struct mmu_notifier_range range;
 	int err = 0;
 
@@ -1353,28 +1352,6 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 			goto retry;
 		}
 
-		if (!src_anon_vma) {
-			/*
-			 * folio_referenced walks the anon_vma chain
-			 * without the folio lock. Serialize against it with
-			 * the anon_vma lock, the folio lock is not enough.
-			 */
-			src_anon_vma = folio_get_anon_vma(src_folio);
-			if (!src_anon_vma) {
-				/* page was unmapped from under us */
-				err = -EAGAIN;
-				goto out;
-			}
-			if (!anon_vma_trylock_write(src_anon_vma)) {
-				pte_unmap(src_pte);
-				pte_unmap(dst_pte);
-				src_pte = dst_pte = NULL;
-				/* now we can block and wait */
-				anon_vma_lock_write(src_anon_vma);
-				goto retry;
-			}
-		}
-
 		err = move_present_pte(mm, dst_vma, src_vma,
 				       dst_addr, src_addr, dst_pte, src_pte,
 				       orig_dst_pte, orig_src_pte, dst_pmd,
@@ -1445,10 +1422,6 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 	}
 
 out:
-	if (src_anon_vma) {
-		anon_vma_unlock_write(src_anon_vma);
-		put_anon_vma(src_anon_vma);
-	}
 	if (src_folio) {
 		folio_unlock(src_folio);
 		folio_put(src_folio);

> Is this the case? Happy to be corrected if I've misinterpreted.
>
> I don't see how this could possibly work, unless I'm missing something
> here, because:
>
> 1. When we lock anon_vma's it's at the root which covers all anon_vma's
>    covering parent/children of forked processes.
>
> 2. We do "top down" operations that acquire the rmap lock on the
>    assumption we have exclusive access to the rmapping that have nothing
>    to do with the folio nor could we even know what the folio is at this
>    point.
>
> 3. We manipulate higher level page tables on the basis that the rmap lock
>    excludes other page table walkers.
>
> So this proposal seems to violate all of that?
>
> For instance, in many VMA operations we perform:
>
>     anon_vma_interval_tree_pre_update_vma()
>
> and
>
>     anon_vma_interval_tree_post_update_vma()
>
> which remove _all_ R/B tree mappings.
>
> So you can now race with this (it of course doesn't care about the folio
> lock) and then get completely incorrect results?
>
> This seems fairly disastrous?
>
> In free_pgtables() also we call unlink_anon_vmas(), which iterates
> through the vma->anon_vma_chain and uses the anon lock to tear down
> higher order page tables, which you now might race with, and that seems
> even more disastrous...
>
> >
> > Among all rmap_walk() callers that don't lock anon folios,
> > folio_referenced() is the most critical (others are
> > page_idle_clear_pte_refs(), damon_folio_young(), and
> > damon_folio_mkold()). The relevant code in folio_referenced() is:
> >
> > 	if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> > 		we_locked = folio_trylock(folio);
> > 		if (!we_locked)
> > 			return 1;
> > 	}
> >
> > It's unclear why locking the anon_vma exclusively (when updating
> > folio->mapping, like in uffd MOVE) is beneficial over walking the rmap
> > with the folio locked. It's in the reclaim path, so it should not be a
> > critical path that necessitates special treatment, unless I'm missing
> > something.
> >
> > Therefore, I propose simplifying the locking mechanism by ensuring the
> > folio is locked before calling rmap_walk(). This helps avoid locking
> > the anon_vma when updating folio->mapping, which, for instance, will
> > help eliminate the uninterruptible sleep observed in the field traces
> > mentioned earlier. Furthermore, it enables us to simplify the code in
> > folio_lock_anon_vma_read() by removing the re-check that ensures the
> > field hasn't changed under us.
>
> I mean this is why I get confused here though, because you seem to be
> saying 'don't take rmap lock at all' while referencing
> folio_lock_anon_vma_read()?
>
> Perhaps I misinterpreted (forgive me if so) and indeed you meant this, but
> then I don't see how you impact contention on the anon_vma lock by making
> this change?
>
> I think in general - let's clarify what exactly you intend to do here, and
> then we need to delineate what we need to confirm and test to have any
> confidence in making such a change.
>
> anon_vma locks (and rmap locks) are very sensitive in general and we've
> had actual security issues come up due to race windows emerging from
> inappropriate handling, not to mention that performance around this
> obviously matters a great deal.

I couldn't agree more. My changes looked like a simplification to me,
otherwise I wouldn't have suggested them. And David's reply yesterday
gives me confidence that they wouldn't negatively affect performance
either.

Thanks,
Lokesh

> So we must tread carefully here.
>
> Thanks, Lorenzo