* [DISCUSSION] Unconditionally lock folios when calling rmap_walk() @ 2025-08-22 17:29 Lokesh Gidra 2025-08-22 17:44 ` Matthew Wilcox ` (3 more replies) 0 siblings, 4 replies; 21+ messages in thread From: Lokesh Gidra @ 2025-08-22 17:29 UTC (permalink / raw) To: David Hildenbrand, Lorenzo Stoakes, Andrew Morton Cc: Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett Hi all, Currently, some callers of rmap_walk() conditionally avoid try-locking non-ksm anon folios. This necessitates serialization through anon_vma write-lock elsewhere when folio->mapping and/or folio->index (fields involved in rmap_walk()) are to be updated. This hurts scalability due to coarse granularity of the lock. For instance, when multiple threads invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages from the same src VMA, they all contend for the corresponding anon_vma’s lock. Field traces for arm64 android devices reveal over 30ms of uninterruptible sleep in the main UI thread, leading to janky user interactions. Among all rmap_walk() callers that don’t lock anon folios, folio_referenced() is the most critical (others are page_idle_clear_pte_refs(), damon_folio_young(), and damon_folio_mkold()). The relevant code in folio_referenced() is: if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { we_locked = folio_trylock(folio); if (!we_locked) return 1; } It’s unclear why locking anon_vma exclusively (when updating folio->mapping, like in uffd MOVE) is beneficial over walking rmap with folio locked. It’s in the reclaim path, so should not be a critical path that necessitates some special treatment, unless I’m missing something. Therefore, I propose simplifying the locking mechanism by ensuring the folio is locked before calling rmap_walk(). This helps avoid locking anon_vma when updating folio->mapping, which, for instance, will help eliminate the uninterruptible sleep observed in the field traces mentioned earlier. Furthermore, it enables us to simplify the code in folio_lock_anon_vma_read() by removing the re-check to ensure that the field hasn’t changed under us. Thanks, Lokesh ^ permalink raw reply [flat|nested] 21+ messages in thread
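For readers joining from the uffd side: the write-side serialization being criticized above looks roughly like the sketch below, loosely based on move_pages_pte() in mm/userfaultfd.c (retry loops, PTE manipulation and error paths are elided, so treat it as an illustration rather than the exact upstream code):

	/*
	 * Sketch: how uffd MOVE currently retargets a folio's rmap fields.
	 * The src folio is assumed already isolated and locked here.
	 */
	struct anon_vma *anon_vma = folio_get_anon_vma(src_folio);

	/*
	 * Exclusive anon_vma lock: every concurrent MOVE whose folios hang
	 * off the same root anon_vma funnels through here, even when the
	 * threads are moving completely distinct pages.
	 */
	anon_vma_lock_write(anon_vma);

	folio_move_anon_rmap(src_folio, dst_vma);	/* rewrites folio->mapping */
	src_folio->index = linear_page_index(dst_vma, dst_addr);

	anon_vma_unlock_write(anon_vma);
	put_anon_vma(anon_vma);

Under the proposal, the anon_vma_lock_write()/anon_vma_unlock_write() pair would go away and the already-held folio lock would do the serializing against rmap walkers.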
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-22 17:29 [DISCUSSION] Unconditionally lock folios when calling rmap_walk() Lokesh Gidra @ 2025-08-22 17:44 ` Matthew Wilcox 2025-08-22 18:08 ` Lorenzo Stoakes 2025-08-24 4:18 ` Lokesh Gidra ` (2 subsequent siblings) 3 siblings, 1 reply; 21+ messages in thread From: Matthew Wilcox @ 2025-08-22 17:44 UTC (permalink / raw) To: Lokesh Gidra Cc: David Hildenbrand, Lorenzo Stoakes, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Fri, Aug 22, 2025 at 10:29:52AM -0700, Lokesh Gidra wrote: > Currently, some callers of rmap_walk() conditionally avoid try-locking > non-ksm anon folios. This necessitates serialization through anon_vma ... this seems awfully familiar. Why did you send it again after a bunch of people had already said useful things on that thread? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-22 17:44 ` Matthew Wilcox @ 2025-08-22 18:08 ` Lorenzo Stoakes 0 siblings, 0 replies; 21+ messages in thread From: Lorenzo Stoakes @ 2025-08-22 18:08 UTC (permalink / raw) To: Matthew Wilcox Cc: Lokesh Gidra, David Hildenbrand, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Fri, Aug 22, 2025 at 06:44:13PM +0100, Matthew Wilcox wrote: > On Fri, Aug 22, 2025 at 10:29:52AM -0700, Lokesh Gidra wrote: > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > non-ksm anon folios. This necessitates serialization through anon_vma > > ... this seems awfully familiar. Why did you send it again after > a bunch of people had already said useful things on that thread? This is on me - I asked him to resend because the original was mislabelled and didn't cc the right people, and lei blew up when I tried to grab it, so I asked various people to hold off on replying or to resend their replies on the resent thread, and asked Lokesh to resend. Apologies for the loss of context. rmap locking is important to me as it is deeply sensitive and has caused numerous painful issues in the past, so I want to make sure myself, David and the reviewers are all included. Cheers, Lorenzo ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-22 17:29 [DISCUSSION] Unconditionally lock folios when calling rmap_walk() Lokesh Gidra 2025-08-22 17:44 ` Matthew Wilcox @ 2025-08-24 4:18 ` Lokesh Gidra 2025-08-24 5:31 ` Harry Yoo 2025-08-25 15:19 ` David Hildenbrand 2025-08-26 15:52 ` Lorenzo Stoakes 3 siblings, 1 reply; 21+ messages in thread From: Lokesh Gidra @ 2025-08-24 4:18 UTC (permalink / raw) To: Harry Yoo Cc: Lorenzo Stoakes, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett, David Hildenbrand, Andrew Morton On Fri, Aug 22, 2025 at 10:29 AM Lokesh Gidra <lokeshgidra@google.com> wrote: > > Hi all, > > Currently, some callers of rmap_walk() conditionally avoid try-locking > non-ksm anon folios. This necessitates serialization through anon_vma > write-lock elsewhere when folio->mapping and/or folio->index (fields > involved in rmap_walk()) are to be updated. This hurts scalability due > to coarse granularity of the lock. For instance, when multiple threads > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > from the same src VMA, they all contend for the corresponding > anon_vma’s lock. Field traces for arm64 android devices reveal over > 30ms of uninterruptible sleep in the main UI thread, leading to janky > user interactions. > > Among all rmap_walk() callers that don’t lock anon folios, > folio_referenced() is the most critical (others are > page_idle_clear_pte_refs(), damon_folio_young(), and > damon_folio_mkold()). The relevant code in folio_referenced() is: > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > we_locked = folio_trylock(folio); > if (!we_locked) > return 1; > } > > It’s unclear why locking anon_vma exclusively (when updating > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > with folio locked. It’s in the reclaim path, so should not be a > critical path that necessitates some special treatment, unless I’m > missing something. > > Therefore, I propose simplifying the locking mechanism by ensuring the > folio is locked before calling rmap_walk(). This helps avoid locking > anon_vma when updating folio->mapping, which, for instance, will help > eliminate the uninterruptible sleep observed in the field traces > mentioned earlier. Furthermore, it enables us to simplify the code in > folio_lock_anon_vma_read() by removing the re-check to ensure that the > field hasn’t changed under us. Hi Harry, Your comment [1] in the other thread was quite useful and also needed to be responded to. So bringing it here for continuing discussion. It seems from your comment that you misunderstood my proposal. I am not suggesting replacing anon_vma lock with folio lock during rmap walk. Clearly, it is essential for all the reasons that you enumerated. My proposal is to lock anon folios during rmap_walk(), like file and KSM folios. This helps in improving scalability (and also simplifying code in folio_lock_anon_vma_read()) as then we can serialize on folio lock instead of anon_vma lock when moving the folio to a different root anon_vma in folio_move_anon_rmap() [2]. [1] https://lore.kernel.org/all/aKhIL3OguViS9myH@hyeyoo/ [2] https://lore.kernel.org/all/e5d41fbe-a91b-9491-7b93-733f67e75a54@redhat.com/ > > Thanks, > Lokesh ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-24 4:18 ` Lokesh Gidra @ 2025-08-24 5:31 ` Harry Yoo 2025-08-24 6:45 ` Lokesh Gidra 0 siblings, 1 reply; 21+ messages in thread From: Harry Yoo @ 2025-08-24 5:31 UTC (permalink / raw) To: Lokesh Gidra Cc: Lorenzo Stoakes, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett, David Hildenbrand, Andrew Morton On Sat, Aug 23, 2025 at 09:18:11PM -0700, Lokesh Gidra wrote: > On Fri, Aug 22, 2025 at 10:29 AM Lokesh Gidra <lokeshgidra@google.com> wrote: > > > > Hi all, > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > non-ksm anon folios. This necessitates serialization through anon_vma > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > involved in rmap_walk()) are to be updated. This hurts scalability due > > to coarse granularity of the lock. For instance, when multiple threads > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > from the same src VMA, they all contend for the corresponding > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > user interactions. > > > > Among all rmap_walk() callers that don’t lock anon folios, > > folio_referenced() is the most critical (others are > > page_idle_clear_pte_refs(), damon_folio_young(), and > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > we_locked = folio_trylock(folio); > > if (!we_locked) > > return 1; > > } > > > > It’s unclear why locking anon_vma exclusively (when updating > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > with folio locked. It’s in the reclaim path, so should not be a > > critical path that necessitates some special treatment, unless I’m > > missing something. > > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > folio is locked before calling rmap_walk(). This helps avoid locking > > anon_vma when updating folio->mapping, which, for instance, will help > > eliminate the uninterruptible sleep observed in the field traces > > mentioned earlier. Furthermore, it enables us to simplify the code in > > folio_lock_anon_vma_read() by removing the re-check to ensure that the > > field hasn’t changed under us. > Hi Harry, > > Your comment [1] in the other thread was quite useful and also needed > to be responded to. So bringing it here for continuing discussion. Hi Lokesh, Here I'm quoting my previous comment for discussion. I should have done it earlier but you know, it was Friday night in Korea :) My previous comment was: Simply acquiring the folio lock instead of anon_vma lock isn't enough 1) because the kernel can't stabilize anon_vma without anon_vma lock (an anon_vma cannot be freed while someone's holding anon_vma lock, see anon_vma_free()). 2) without anon_vma lock the kernel can't reliably unmap folios because they can be mapped to other processes (by fork()) while the kernel is iterating the list of VMAs that can possibly map the folio. fork() doesn't and shouldn't acquire folio lock. 3) Holding anon_vma lock also prevents anon_vma_chains from being freed while holding the lock. [Are there more things to worry about that I missed?
Please add them if so] Any idea to relax the locking while still addressing these requirements? If some users don't care about missing some PTE A bits due to a race against fork() (perhaps folio_referenced()?), a crazy idea might be to RCU-protect anon_vma_chains (but then we need to check if the VMA is still alive) and use a refcount to stabilize anon_vmas? > It seems from your comment that you misunderstood my proposal. I am > not suggesting replacing anon_vma lock with folio lock during rmap > walk. Clearly, it is essential for all the reasons that you > enumerated. My proposal is to lock anon folios during rmap_walk(), > like file and KSM folios. Still not sure if I follow your proposal. Let's clarify a little bit. As the anon_vma lock is a reader-writer semaphore, maybe you're saying 1) readers should acquire both the folio lock and the anon_vma lock, and > This helps in improving scalability (and also simplifying code in > folio_lock_anon_vma_read()) as then we can serialize on folio lock > instead of anon_vma lock when moving the folio to a different root > anon_vma in folio_move_anon_rmap() [2]. 2) some of the existing writers (e.g., move_pages_pte() in mm/userfaultfd.c) simply update folio->index and folio->mapping, and they should be able to run in parallel if they're not updating the same folio, by taking the folio lock and avoiding the anon_vma lock? I see a comment in move_pages_pte(): /* * folio_referenced walks the anon_vma chain * without the folio lock. Serialize against it with * the anon_vma lock, the folio lock is not enough. */ > [1] https://lore.kernel.org/all/aKhIL3OguViS9myH@hyeyoo > [2] https://lore.kernel.org/all/e5d41fbe-a91b-9491-7b93-733f67e75a54@redhat.com > > > > Thanks, > > Lokesh -- Cheers, Harry / Hyeonggon ^ permalink raw reply [flat|nested] 21+ messages in thread
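To make the two halves of the clarified proposal concrete, here is a minimal sketch (assuming every rmap_walk() caller gets converted; the reader side mirrors folio_referenced(), the writer side is a simplified, hypothetical uffd MOVE):

	/* Reader side: lock every folio before walking, anon included. */
	if (!is_locked) {
		we_locked = folio_trylock(folio);
		if (!we_locked)
			return 1;	/* report "referenced" rather than block */
	}
	rmap_walk(folio, &rwc);		/* still takes the anon_vma read lock inside */

	/* Writer side: the folio lock alone now pins ->mapping/->index. */
	folio_lock(folio);
	folio_move_anon_rmap(folio, dst_vma);
	folio->index = linear_page_index(dst_vma, dst_addr);
	folio_unlock(folio);

Note that the anon_vma read lock inside rmap_walk() is still required for all the reasons Harry enumerates above (stabilizing the anon_vma, the anon_vma_chains, and the VMA list); the folio lock only replaces the write-side serialization on the rmap fields.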
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-24 5:31 ` Harry Yoo @ 2025-08-24 6:45 ` Lokesh Gidra 2025-08-25 10:52 ` Harry Yoo 0 siblings, 1 reply; 21+ messages in thread From: Lokesh Gidra @ 2025-08-24 6:45 UTC (permalink / raw) To: Harry Yoo Cc: Lorenzo Stoakes, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett, David Hildenbrand, Andrew Morton On Sat, Aug 23, 2025 at 10:31 PM Harry Yoo <harry.yoo@oracle.com> wrote: > > On Sat, Aug 23, 2025 at 09:18:11PM -0700, Lokesh Gidra wrote: > > On Fri, Aug 22, 2025 at 10:29 AM Lokesh Gidra <lokeshgidra@google.com> wrote: > > > > > > Hi all, > > > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > > non-ksm anon folios. This necessitates serialization through anon_vma > > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > > involved in rmap_walk()) are to be updated. This hurts scalability due > > > to coarse granularity of the lock. For instance, when multiple threads > > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > > from the same src VMA, they all contend for the corresponding > > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > > user interactions. > > > > > > Among all rmap_walk() callers that don’t lock anon folios, > > > folio_referenced() is the most critical (others are > > > page_idle_clear_pte_refs(), damon_folio_young(), and > > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > > we_locked = folio_trylock(folio); > > > if (!we_locked) > > > return 1; > > > } > > > > > > It’s unclear why locking anon_vma exclusively (when updating > > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > > with folio locked. It’s in the reclaim path, so should not be a > > > critical path that necessitates some special treatment, unless I’m > > > missing something. > > > > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > > folio is locked before calling rmap_walk(). This helps avoid locking > > > anon_vma when updating folio->mapping, which, for instance, will help > > > eliminate the uninterruptible sleep observed in the field traces > > > mentioned earlier. Furthermore, it enables us to simplify the code in > > > folio_lock_anon_vma_read() by removing the re-check to ensure that the > > > field hasn’t changed under us. > > Hi Harry, > > > > Your comment [1] in the other thread was quite useful and also needed > > to be responded to. So bringing it here for continuing discussion. > > Hi Lokesh, > > Here I'm quoting my previous comment for discussion. I should have done it > earlier but you know, it was Friday night in Korea :) No problem at all. :) > > My previous comment was: > Simply acquiring the folio lock instead of anon_vma lock isn't enough > 1) because the kernel can't stablize anon_vma without anon_vma lock > (an anon_vma cannot be freed while someone's holding anon_vma lock, > see anon_vma_free()). > > 2) without anon_vma lock the kernel can't reliably unmap folios because > they can be mapped to other processes (by fork()) while the kernel is > iterating list of VMAs that can possibly map the folio. fork() doens't > and shouldn't acquire folio lock. 
> > 3) Holding anon_vma lock also prevents anon_vma_chains from > being freed while holding the lock. > > [Are there more things to worry about that I missed? > Please add them if so] > > Any idea to relax locking requirements while addressing these > requirements? > > If some users don't care about missing some PTE A bits due to race > against fork() (perhaps folio_referenced()?), a crazy idea might be to > RCU-protect anon_vma_chains (but then we need to check if the VMA is > still alive) and use refcount to stablize anon_vmas? > > > It seems from your comment that you misunderstood my proposal. I am > > not suggesting replacing anon_vma lock with folio lock during rmap > > walk. Clearly, it is essential for all the reasons that you > > enumerated. My proposal is to lock anon folios during rmap_walk(), > > like file and KSM folios. > > Still not sure if I follow your proposal. Let's clarify a little bit. > > As anon_vma lock is reader-writer semaphore, maybe you're saying > 1) readers should acquire both folio lock and anon_vma lock, and > > > This helps in improving scalability (and also simplifying code in > > folio_lock_anon_vma_read()) as then we can serialize on folio lock > > instead of anon_vma lock when moving the folio to a different root > > anon_vma in folio_move_anon_rmap() [2]. > > 2) some of existing writers (e.g., move_pages_pte() in mm/userfaultfd.c) > simply update folio->index and folio->mapping, and they should be able > to run in parallel if they're not updating the same folio, > by taking folio lock and avoiding anon_vma lock? Yes, that's exactly what I am hoping to achieve. > > I see a comment in move_pages_pte(): > /* > * folio_referenced walks the anon_vma chain > * without the folio lock. Serialize against it with > * the anon_vma lock, the folio lock is not enough. > */ > > > [1] https://lore.kernel.org/all/aKhIL3OguViS9myH@hyeyoo > > [2] https://lore.kernel.org/all/e5d41fbe-a91b-9491-7b93-733f67e75a54@redhat.com > > > > > Thanks, > > > Lokesh > > -- > Cheers, > Harry / Hyeonggon ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-24 6:45 ` Lokesh Gidra @ 2025-08-25 10:52 ` Harry Yoo 2025-08-28 11:32 ` Lorenzo Stoakes 0 siblings, 1 reply; 21+ messages in thread From: Harry Yoo @ 2025-08-25 10:52 UTC (permalink / raw) To: Lokesh Gidra Cc: Lorenzo Stoakes, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett, David Hildenbrand, Andrew Morton On Sat, Aug 23, 2025 at 11:45:03PM -0700, Lokesh Gidra wrote: > On Sat, Aug 23, 2025 at 10:31 PM Harry Yoo <harry.yoo@oracle.com> wrote: > > > > On Sat, Aug 23, 2025 at 09:18:11PM -0700, Lokesh Gidra wrote: > > > On Fri, Aug 22, 2025 at 10:29 AM Lokesh Gidra <lokeshgidra@google.com> wrote: > > > > > > > > Hi all, > > > > > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > > > non-ksm anon folios. This necessitates serialization through anon_vma > > > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > > > involved in rmap_walk()) are to be updated. This hurts scalability due > > > > to coarse granularity of the lock. For instance, when multiple threads > > > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > > > from the same src VMA, they all contend for the corresponding > > > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > > > user interactions. > > > > > > > > Among all rmap_walk() callers that don’t lock anon folios, > > > > folio_referenced() is the most critical (others are > > > > page_idle_clear_pte_refs(), damon_folio_young(), and > > > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > > > we_locked = folio_trylock(folio); > > > > if (!we_locked) > > > > return 1; > > > > } > > > > > > > > It’s unclear why locking anon_vma exclusively (when updating > > > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > > > with folio locked. It’s in the reclaim path, so should not be a > > > > critical path that necessitates some special treatment, unless I’m > > > > missing something. > > > > > > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > > > folio is locked before calling rmap_walk(). This helps avoid locking > > > > anon_vma when updating folio->mapping, which, for instance, will help > > > > eliminate the uninterruptible sleep observed in the field traces > > > > mentioned earlier. Furthermore, it enables us to simplify the code in > > > > folio_lock_anon_vma_read() by removing the re-check to ensure that the > > > > field hasn’t changed under us. > > > Hi Harry, > > > > > > Your comment [1] in the other thread was quite useful and also needed > > > to be responded to. So bringing it here for continuing discussion. > > > > Hi Lokesh, > > > > Here I'm quoting my previous comment for discussion. I should have done it > > earlier but you know, it was Friday night in Korea :) > > No problem at all. :) > > > > My previous comment was: > > Simply acquiring the folio lock instead of anon_vma lock isn't enough > > 1) because the kernel can't stablize anon_vma without anon_vma lock > > (an anon_vma cannot be freed while someone's holding anon_vma lock, > > see anon_vma_free()). 
> > > > 2) without anon_vma lock the kernel can't reliably unmap folios because > > they can be mapped to other processes (by fork()) while the kernel is > > iterating list of VMAs that can possibly map the folio. fork() doens't > > and shouldn't acquire folio lock. > > > > 3) Holding anon_vma lock also prevents anon_vma_chains from > > being freed while holding the lock. > > > > [Are there more things to worry about that I missed? > > Please add them if so] > > > > Any idea to relax locking requirements while addressing these > > requirements? > > > > If some users don't care about missing some PTE A bits due to race > > against fork() (perhaps folio_referenced()?), a crazy idea might be to > > RCU-protect anon_vma_chains (but then we need to check if the VMA is > > still alive) and use refcount to stablize anon_vmas? > > > > > It seems from your comment that you misunderstood my proposal. I am > > > not suggesting replacing anon_vma lock with folio lock during rmap > > > walk. Clearly, it is essential for all the reasons that you > > > enumerated. My proposal is to lock anon folios during rmap_walk(), > > > like file and KSM folios. > > > > Still not sure if I follow your proposal. Let's clarify a little bit. > > > > As anon_vma lock is reader-writer semaphore, maybe you're saying > > 1) readers should acquire both folio lock and anon_vma lock, and > > > > > This helps in improving scalability (and also simplifying code in > > > folio_lock_anon_vma_read()) as then we can serialize on folio lock > > > instead of anon_vma lock when moving the folio to a different root > > > anon_vma in folio_move_anon_rmap() [2]. > > > > 2) some of existing writers (e.g., move_pages_pte() in mm/userfaultfd.c) > > simply update folio->index and folio->mapping, and they should be able > > to run in parallel if they're not updating the same folio, > > by taking folio lock and avoiding anon_vma lock? > > Yes, that's exactly what I am hoping to achieve. I think technically that should work. I can't think of a way this could go wrong (Or maybe I just lack imagination :/ ). Adding folio_lock/unlock in folio_referenced() is likely to affect reclamation performance. How bad is it? I don't know, need data. And adding cost only for the uffd case might not be justifiable... > > I see a comment in move_pages_pte(): > > /* > > * folio_referenced walks the anon_vma chain > > * without the folio lock. Serialize against it with > > * the anon_vma lock, the folio lock is not enough. > > */ > > > > > [1] https://lore.kernel.org/all/aKhIL3OguViS9myH@hyeyoo > > > [2] https://lore.kernel.org/all/e5d41fbe-a91b-9491-7b93-733f67e75a54@redhat.com > > > > > > Thanks, > > > > Lokesh -- Cheers, Harry / Hyeonggon ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-25 10:52 ` Harry Yoo @ 2025-08-28 11:32 ` Lorenzo Stoakes 0 siblings, 0 replies; 21+ messages in thread From: Lorenzo Stoakes @ 2025-08-28 11:32 UTC (permalink / raw) To: Harry Yoo Cc: Lokesh Gidra, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett, David Hildenbrand, Andrew Morton On Mon, Aug 25, 2025 at 07:52:10PM +0900, Harry Yoo wrote: > Adding folio_lock/unlock in folio_referenced() is likely to affect > reclamation performance. How bad is it? I don't know, need data. > > And adding cost only for the uffd case might not be justifiable... Exactly. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-22 17:29 [DISCUSSION] Unconditionally lock folios when calling rmap_walk() Lokesh Gidra 2025-08-22 17:44 ` Matthew Wilcox 2025-08-24 4:18 ` Lokesh Gidra @ 2025-08-25 15:19 ` David Hildenbrand 2025-08-25 18:46 ` Lokesh Gidra 2025-08-28 12:04 ` Lorenzo Stoakes 2025-08-26 15:52 ` Lorenzo Stoakes 3 siblings, 2 replies; 21+ messages in thread From: David Hildenbrand @ 2025-08-25 15:19 UTC (permalink / raw) To: Lokesh Gidra, Lorenzo Stoakes, Andrew Morton Cc: Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On 22.08.25 19:29, Lokesh Gidra wrote: > Hi all, > > Currently, some callers of rmap_walk() conditionally avoid try-locking > non-ksm anon folios. This necessitates serialization through anon_vma > write-lock elsewhere when folio->mapping and/or folio->index (fields > involved in rmap_walk()) are to be updated. This hurts scalability due > to coarse granularity of the lock. For instance, when multiple threads > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > from the same src VMA, they all contend for the corresponding > anon_vma’s lock. Field traces for arm64 android devices reveal over > 30ms of uninterruptible sleep in the main UI thread, leading to janky > user interactions. > > Among all rmap_walk() callers that don’t lock anon folios, > folio_referenced() is the most critical (others are > page_idle_clear_pte_refs(), damon_folio_young(), and > damon_folio_mkold()). The relevant code in folio_referenced() is: > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > we_locked = folio_trylock(folio); > if (!we_locked) > return 1; > } > > It’s unclear why locking anon_vma exclusively (when updating > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > with folio locked. It’s in the reclaim path, so should not be a > critical path that necessitates some special treatment, unless I’m > missing something. > > Therefore, I propose simplifying the locking mechanism by ensuring the > folio is locked before calling rmap_walk(). Essentially, what you mean is roughly: diff --git a/mm/rmap.c b/mm/rmap.c index 34333ae3bd80f..0800e73c0796e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1005,7 +1005,7 @@ int folio_referenced(struct folio *folio, int is_locked, if (!folio_raw_mapping(folio)) return 0; - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { + if (!is_locked) { we_locked = folio_trylock(folio); if (!we_locked) return 1; The downside of that change is that ordinary (!ksm) folios will observe being locked when we are actually only trying to assess if they were referenced. Does it matter? I can only speculate that it might have been very relevant before 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"). Essentially any R/O fault would have resulted in us copying the page, simply because there is concurrent folio_referenced() happening. Before 09854ba94c6a ("mm: do_wp_page() simplification") that wasn't an issue, but it would have meant that the write fault would be stuck until folio_referenced() would be done, which is also suboptimal. So likely, avoiding grabbing the folio lock was beneficial. Today, this would only affect R/O pages after fork (PageAnonExclusive not set).
Staring at shrink_active_list()->folio_referenced(), we isolate the folio first (grabbing reference+clearing LRU), so do_wp_page()->wp_can_reuse_anon_folio() would already refuse to reuse immediately, because it would spot a raised reference. The folio lock does not make a difference anymore. Is there any other anon-specific (!ksm) folio locking? Nothing exciting comes to mind, except maybe some folio splitting or khugepaged that maybe would have to wait. But khugepaged would already also fail to isolate these folios, so probably it's not that relevant anymore ... -- Cheers David / dhildenb ^ permalink raw reply related [flat|nested] 21+ messages in thread
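David's interleaving argument, condensed into a sketch (simplified from mm/vmscan.c and mm/memory.c; the exact reference-count checks inside wp_can_reuse_anon_folio() are elided):

	/* Reclaim side, shrink_active_list(): */
	isolate_lru_folios(...);	/* takes a folio reference, clears PG_lru */
	folio_referenced(folio, 0, sc->target_mem_cgroup, &vm_flags);
					/* with the proposal: under folio_trylock() */

	/*
	 * Concurrent write fault on the same folio, do_wp_page():
	 * wp_can_reuse_anon_folio() already bails out because of the extra
	 * reference taken at isolation time and falls back to wp_page_copy(),
	 * so whether reclaim additionally holds the folio lock does not
	 * change the reuse decision.
	 */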
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-25 15:19 ` David Hildenbrand @ 2025-08-25 18:46 ` Lokesh Gidra 2025-08-28 12:04 ` Lorenzo Stoakes 1 sibling, 0 replies; 21+ messages in thread From: Lokesh Gidra @ 2025-08-25 18:46 UTC (permalink / raw) To: David Hildenbrand Cc: Lorenzo Stoakes, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Mon, Aug 25, 2025 at 8:19 AM David Hildenbrand <david@redhat.com> wrote: > > On 22.08.25 19:29, Lokesh Gidra wrote: > > Hi all, > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > non-ksm anon folios. This necessitates serialization through anon_vma > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > involved in rmap_walk()) are to be updated. This hurts scalability due > > to coarse granularity of the lock. For instance, when multiple threads > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > from the same src VMA, they all contend for the corresponding > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > user interactions. > > > > Among all rmap_walk() callers that don’t lock anon folios, > > folio_referenced() is the most critical (others are > > page_idle_clear_pte_refs(), damon_folio_young(), and > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > we_locked = folio_trylock(folio); > > if (!we_locked) > > return 1; > > } > > > > It’s unclear why locking anon_vma exclusively (when updating > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > with folio locked. It’s in the reclaim path, so should not be a > > critical path that necessitates some special treatment, unless I’m > > missing something. > > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > folio is locked before calling rmap_walk(). > > Essentially, what you mean is roughly: > > diff --git a/mm/rmap.c b/mm/rmap.c > index 34333ae3bd80f..0800e73c0796e 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1005,7 +1005,7 @@ int folio_referenced(struct folio *folio, int is_locked, > if (!folio_raw_mapping(folio)) > return 0; > > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > + if (!is_locked) { > we_locked = folio_trylock(folio); > if (!we_locked) > return 1; > > > The downside of that change is that ordinary (!ksm) folios will observe being locked > when we are actually only trying to asses if they were referenced. > > Does it matter? > > I can only speculate that it might have been very relevant before > 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"). > > Essentially any R/O fault would have resulted in us copying the page, simply because > there is concurrent folio_referenced() happening. > > Before 09854ba94c6a ("mm: do_wp_page() simplification") that wasn't an issue, but > it would have meant that the write fault would be stuck until folio_referenced() > would be done, which is also suboptimal. > > So likely, avoiding grabbing the folio lock was beneficial. > > > Today, this would only affect R/O pages after fork (PageAnonExclusive not set). 
> > > Staring at shrink_active_list()->folio_referenced(), we isolate the folio first > (grabbing reference+clearing LRU), so do_wp_page()->wp_can_reuse_anon_folio() > would already refuse to reuse immediately, because it would spot a raised reference. > The folio lock does not make a difference anymore. > > > Is there any other anon-specific (!ksm) folio locking? Nothing exciting comes to mind, > except maybe some folio splitting or khugepaged that maybe would have to wait. > > But khugepaged would already also fail to isolate these folios, so probably it's not that > relevant anymore ... Thanks so much for your thorough analysis. Very useful! For folio splitting, it seems anon_vma lock is acquired exclusively, so it serializes against folio_referenced() anyways. > > -- > Cheers > > David / dhildenb > ^ permalink raw reply [flat|nested] 21+ messages in thread
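For reference, the anon side of the split path Lokesh refers to looks roughly like this in mm/huge_memory.c (error handling and the actual split machinery elided; a sketch, not the exact upstream code):

	anon_vma = folio_get_anon_vma(folio);
	if (!anon_vma)
		return -EBUSY;		/* folio was unmapped under us */

	anon_vma_lock_write(anon_vma);	/* excludes all rmap walkers */
	unmap_folio(folio);
	/* ... freeze refcount, split, remap ... */
	anon_vma_unlock_write(anon_vma);
	put_anon_vma(anon_vma);

So a folio_referenced() walker holding the anon_vma read lock already serializes against a concurrent split, independent of the folio lock.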
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-25 15:19 ` David Hildenbrand 2025-08-25 18:46 ` Lokesh Gidra @ 2025-08-28 12:04 ` Lorenzo Stoakes 2025-08-29 0:23 ` Lokesh Gidra 1 sibling, 1 reply; 21+ messages in thread From: Lorenzo Stoakes @ 2025-08-28 12:04 UTC (permalink / raw) To: David Hildenbrand Cc: Lokesh Gidra, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Mon, Aug 25, 2025 at 05:19:05PM +0200, David Hildenbrand wrote: > On 22.08.25 19:29, Lokesh Gidra wrote: > > Hi all, > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > non-ksm anon folios. This necessitates serialization through anon_vma > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > involved in rmap_walk()) are to be updated. This hurts scalability due > > to coarse granularity of the lock. For instance, when multiple threads > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > from the same src VMA, they all contend for the corresponding > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > user interactions. > > > > Among all rmap_walk() callers that don’t lock anon folios, > > folio_referenced() is the most critical (others are > > page_idle_clear_pte_refs(), damon_folio_young(), and > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > we_locked = folio_trylock(folio); > > if (!we_locked) > > return 1; > > } > > > > It’s unclear why locking anon_vma exclusively (when updating > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > with folio locked. It’s in the reclaim path, so should not be a > > critical path that necessitates some special treatment, unless I’m > > missing something. > > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > folio is locked before calling rmap_walk(). > > Essentially, what you mean is roughly: > > diff --git a/mm/rmap.c b/mm/rmap.c > index 34333ae3bd80f..0800e73c0796e 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1005,7 +1005,7 @@ int folio_referenced(struct folio *folio, int is_locked, > if (!folio_raw_mapping(folio)) > return 0; > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > + if (!is_locked) { > we_locked = folio_trylock(folio); > if (!we_locked) > return 1; > > > The downside of that change is that ordinary (!ksm) folios will observe being locked Well anon folios, I guess this is what you meant :) > when we are actually only trying to asses if they were referenced. > > Does it matter? Also another downside is we try lock and abort if we fail, so we've made this conditional on that. But surely this is going to impact reclaim performance esp. under heavy memory pressure? It is at least a trylock. It's dangerous waters, and I'd really want some detailed data + analysis to prove the point here, I don't think theorising about it is enough. > > I can only speculate that it might have been very relevant before > 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"). > > Essentially any R/O fault would have resulted in us copying the page, simply because > there is concurrent folio_referenced() happening. Fun. 
Thing is, we now have to consider _every case_ where contention might cause an issue. One thing I _was_ concerned about was: - uffd move locks folios - now folio_referenced() 'fails' returning 1 But the second case is only in shrink_active_list(), which uses this as a boolean... OK so maybe fine for this one. I really do also hate that any future callers are going to possibly be confused about how this function works, but I guess it was already 'weird' for file-backed/KSM. So the issue remains really - folio lock contention as a result of this. It's one thing to theorise, but you may be forgetting something... and then we've changed an absolutely core thing to suit a niche UFFD use case. I do wonder if we can identify this case and handle things differently. Perhaps even saying 'try and get the rmap lock, but if there's "too much" contention, grab the folio lock'. > > Before 09854ba94c6a ("mm: do_wp_page() simplification") that wasn't an issue, but > it would have meant that the write fault would be stuck until folio_referenced() > would be done, which is also suboptimal. > > So likely, avoiding grabbing the folio lock was beneficial. > > > Today, this would only affect R/O pages after fork (PageAnonExclusive not set). Hm, probably less of a problem, that. > > > Staring at shrink_active_list()->folio_referenced(), we isolate the folio first > (grabbing reference+clearing LRU), so do_wp_page()->wp_can_reuse_anon_folio() > would already refuse to reuse immediately, because it would spot a raised reference. > The folio lock does not make a difference anymore. folio_check_references() we're good with anyway as the folio is already locked. So obviously shrink_active_list() is the only caller we really care about. That at least reduces this case, but we then have to deal with the fact we're contending this lock elsewhere. > > > Is there any other anon-specific (!ksm) folio locking? Nothing exciting comes to mind, > except maybe some folio splitting or khugepaged that maybe would have to wait. > > But khugepaged would already also fail to isolate these folios, so probably it's not that > relevant anymore ... This is it... there are a lot of possibilities and we need to tread extremely carefully. If we could find a way to make uffd deal with this one way or another (or possibly - detecting heavy anon_vma lock contention) maybe that'd be better... but then adding more complexity obv. > > -- > Cheers > > David / dhildenb > I mean, having said all the above and also in the other thread - I am open to being convinced I'm wrong and this is ok. Obviously removing the complicated special case for anon would _in general_ be nice, it's just very sensitive stuff :) Cheers, Lorenzo ^ permalink raw reply [flat|nested] 21+ messages in thread
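Worth noting for the "too much contention" idea: folio_referenced() already backs off from a contended anon_vma rwsem via rmap_walk_control's try_lock/contended fields; today the back-off just means "return 1 and pretend referenced". The variant Lorenzo describes would fall back to the folio lock instead, something like (pure sketch, no such fallback exists today):

	/* Existing back-off in rmap_walk_anon_lock(), roughly: */
	if (anon_vma_trylock_read(anon_vma))
		goto out;
	if (rwc->try_lock) {
		anon_vma = NULL;
		rwc->contended = true;	/* folio_referenced() then returns 1 */
		goto out;
	}
	anon_vma_lock_read(anon_vma);

	/*
	 * Hypothetical extension: instead of setting contended, take the
	 * folio lock (pinning ->mapping/->index) and retry the anon_vma
	 * read lock -- at the cost of the folio-lock contention discussed
	 * above.
	 */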
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-28 12:04 ` Lorenzo Stoakes @ 2025-08-29 0:23 ` Lokesh Gidra 2025-08-29 8:42 ` David Hildenbrand 2025-08-29 9:01 ` Lorenzo Stoakes 0 siblings, 2 replies; 21+ messages in thread From: Lokesh Gidra @ 2025-08-29 0:23 UTC (permalink / raw) To: Lorenzo Stoakes Cc: David Hildenbrand, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Thu, Aug 28, 2025 at 5:04 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Mon, Aug 25, 2025 at 05:19:05PM +0200, David Hildenbrand wrote: > > On 22.08.25 19:29, Lokesh Gidra wrote: > > > Hi all, > > > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > > non-ksm anon folios. This necessitates serialization through anon_vma > > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > > involved in rmap_walk()) are to be updated. This hurts scalability due > > > to coarse granularity of the lock. For instance, when multiple threads > > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > > from the same src VMA, they all contend for the corresponding > > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > > user interactions. > > > > > > Among all rmap_walk() callers that don’t lock anon folios, > > > folio_referenced() is the most critical (others are > > > page_idle_clear_pte_refs(), damon_folio_young(), and > > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > > we_locked = folio_trylock(folio); > > > if (!we_locked) > > > return 1; > > > } > > > > > > It’s unclear why locking anon_vma exclusively (when updating > > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > > with folio locked. It’s in the reclaim path, so should not be a > > > critical path that necessitates some special treatment, unless I’m > > > missing something. > > > > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > > folio is locked before calling rmap_walk(). > > > > Essentially, what you mean is roughly: > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > index 34333ae3bd80f..0800e73c0796e 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -1005,7 +1005,7 @@ int folio_referenced(struct folio *folio, int is_locked, > > if (!folio_raw_mapping(folio)) > > return 0; > > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > + if (!is_locked) { > > we_locked = folio_trylock(folio); > > if (!we_locked) > > return 1; > > > > > > The downside of that change is that ordinary (!ksm) folios will observe being locked > > Well anon folios, I guess this is what you meant :) > > > when we are actually only trying to asses if they were referenced. > > > > Does it matter? > > Also another downside is we try lock and abort if we fail, so we've made this > conditional on that. > > But surely this is going to impact reclaim performance esp. under heavy memory > pressure? It is at least a trylock. > > It's dangerous waters, and I'd really want some detailed data + analysis to > prove the point here, I don't think theorising about it is enough. 
> > > > I can only speculate that it might have been very relevant before > > 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"). > > > > Essentially any R/O fault would have resulted in us copying the page, simply because > > there is concurrent folio_referenced() happening. > > Fun. > > Thing is we now have to consider _every case_ where a contention might cause an > issue. > > One thing I _was_ concerned about was: > > - uffd move locks folios > - now folio_referenced() 'fails' returning 1 > > But case 2 is only in shrink_active_list() which uses this as a boolean... > > OK so maybe fine for this one. For shrink_active_list() it seems like it doesn't matter what folio_referenced() returns unless it's an executable file-backed folio. > > I really do also hate that any future callers are going to possibly be confused > about how this function works, but I guess it was already 'weird' for > file-backed/KSM. Actually, wouldn't the simplification remove the already existing confusion, rather than adding to it? :) We can then simply say that rmap_walk() expects the folio to be locked. > > So the issue remains really - folio lock contention as a result of this. > > It's one thing to theorise, but you may be forgetting something... and then > we've changed an absolutely core thing to suit a niche UFFD use case. I really wish there was a way to avoid this within the UFFD code :( In fact, the real pain point is multiple UFFD threads contending for the anon_vma write-lock, even when they don't need to serialize among themselves. > > I do wonder if we can identify this case and handle things differently. > > Perhaps even saying 'try and get the rmap lock, but if there's "too much" > contention, grab the folio lock. Can you please elaborate what you mean? Where do you mean we can possibly do something like this? UFFD move only works on PageAnonExclusive folios. So, would it help (in terms of avoiding contention) if we were to change the condition: diff --git a/mm/rmap.c b/mm/rmap.c index 568198e9efc2..1638e27347e7 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1006,7 +1006,8 @@ int folio_referenced(struct folio *folio, int is_locked, if (!folio_raw_mapping(folio)) return 0; - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { + if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio) || + PageAnonExclusive(&folio->page))) { we_locked = folio_trylock(folio); if (!we_locked) return 1; Obviously, this is the opposite of simplification :) But since we know that shrink_active_list() uses this as a boolean, do we even need to walk rmap for PageAnonExclusive folios at all? Can't we simply do: diff --git a/mm/rmap.c b/mm/rmap.c index 568198e9efc2..a26523de321f 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1006,10 +1006,14 @@ int folio_referenced(struct folio *folio, int is_locked, if (!folio_raw_mapping(folio)) return 0; - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { - we_locked = folio_trylock(folio); - if (!we_locked) + if (!is_locked) { + if (!folio_test_anon(folio) || folio_test_ksm(folio)) { + we_locked = folio_trylock(folio); + if (!we_locked) + return 1; + } else if (PageAnonExclusive(&folio->page)) { return 1; + } } rmap_walk(folio, &rwc); I'm not at all an expert on this, so pardon my ignorance if this is wrong. > > > > > Before 09854ba94c6a ("mm: do_wp_page() simplification") that wasn't an issue, but > > it would have meant that the write fault would be stuck until folio_referenced() > > would be done, which is also suboptimal.
> > > > So likely, avoiding grabbing the folio lock was beneficial. > > > > > > Today, this would only affect R/O pages after fork (PageAnonExclusive not set). > > Hm probably less of a problem that. > > > > > > > Staring at shrink_active_list()->folio_referenced(), we isolate the folio first > > (grabbing reference+clearing LRU), so do_wp_page()->wp_can_reuse_anon_folio() > > would already refuse to reuse immediately, because it would spot a raised reference. > > The folio lock does not make a difference anymore. > > folio_check_references() we're good with anyway as folio already locked. > > So obviously shrink_active_list() is the only caller we really care about. > > That at least reduces this case, but we then have to deal with the fact we're > contending this lock elsewhere. > > > > > > > Is there any other anon-specific (!ksm) folio locking? Nothing exciting comes to mind, > > except maybe some folio splitting or khugepaged that maybe would have to wait. > > > > But khugepaged would already also fail to isolate these folios, so probably it's not that > > relevant anymore ... > > This is it... there's a lot of possibilities and we need to tread extremely > carefully. > > if we could find a way to make uffd deal with this one way or another (or > possibly - detecting heavy anon vma lock contention) maybe that'd be > better... but then adding more complexity obv. > > > > > -- > > Cheers > > > > David / dhildenb > > > > I mean having said all the above and also in other thread - I am open to being > convinced I'm wrong and this is ok. > > Obviously removing the complicated special case for anon would _in general_ be > nice, it's just very sensitive stuff :) > > Cheers, Lorenzo ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-29 0:23 ` Lokesh Gidra @ 2025-08-29 8:42 ` David Hildenbrand 2025-08-29 9:04 ` Lorenzo Stoakes 1 sibling, 1 reply; 21+ messages in thread From: David Hildenbrand @ 2025-08-29 8:42 UTC (permalink / raw) To: Lokesh Gidra, Lorenzo Stoakes Cc: Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett >> >> I do wonder if we can identify this case and handle things differently. >> >> Perhaps even saying 'try and get the rmap lock, but if there's "too much" >> contention, grab the folio lock. > > Can you please elaborate what you mean? Where do you mean we can > possibly do something like this? > > UFFD move only works on PageAnonExclusive folios. So, would it help > (in terms of avoiding contention) if we were to change the condition: I think we shouldn't be using PAE here. One could consider using folio_maybe_mapped_shared(), and assume contention on the folio lock if it is maybe mapped shared. But the real question is with whom we would be contending for the folio lock. Is it really other processes mapping that folio? I'm not so sure. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 21+ messages in thread
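A sketch of what David's folio_maybe_mapped_shared() variant might look like in folio_referenced() (assuming the predicate is an acceptable proxy for "other processes might contend the folio lock"):

	if (!is_locked) {
		/*
		 * Lock file, KSM and exclusively-mapped anon folios; keep
		 * today's lockless walk for maybe-shared anon folios, where
		 * folio-lock contention from other mappers is plausible.
		 */
		if (!folio_test_anon(folio) || folio_test_ksm(folio) ||
		    !folio_maybe_mapped_shared(folio)) {
			we_locked = folio_trylock(folio);
			if (!we_locked)
				return 1;
		}
	}

This would still let the exclusively-mapped uffd MOVE case serialize on the folio lock, which is the case the thread cares about.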
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-29 8:42 ` David Hildenbrand @ 2025-08-29 9:04 ` Lorenzo Stoakes 2025-09-02 18:59 ` Lokesh Gidra 0 siblings, 1 reply; 21+ messages in thread From: Lorenzo Stoakes @ 2025-08-29 9:04 UTC (permalink / raw) To: David Hildenbrand Cc: Lokesh Gidra, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Fri, Aug 29, 2025 at 10:42:45AM +0200, David Hildenbrand wrote: > > > > > > I do wonder if we can identify this case and handle things differently. > > > > > > Perhaps even saying 'try and get the rmap lock, but if there's "too much" > > > contention, grab the folio lock. > > > > Can you please elaborate what you mean? Where do you mean we can > > possibly do something like this? > > > > UFFD move only works on PageAnonExclusive folios. So, would it help > > (in terms of avoiding contention) if we were to change the condition: > > I think we shouldn't be using PAE here. Once could consider using > folio_maybe_mapped_shared(), and assume contention on the folio lock if it > is maybe mapped shared. Interesting! > > But the real question is with whom we would be contending for the folio > lock. > > Is it really other processes mapping that folio? I'm not so sure. Yeah, I might go off and do some research myself on this, actually. Nail down where this might actually happen. Generally I'm softening on this and maybe we're good with the proposed change. But I still want to be super careful here... :) > > -- > Cheers > > David / dhildenb > Cheers, Lorenzo ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-29 9:04 ` Lorenzo Stoakes @ 2025-09-02 18:59 ` Lokesh Gidra 2025-09-02 19:01 ` David Hildenbrand 0 siblings, 1 reply; 21+ messages in thread From: Lokesh Gidra @ 2025-09-02 18:59 UTC (permalink / raw) To: Lorenzo Stoakes Cc: David Hildenbrand, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Fri, Aug 29, 2025 at 2:04 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Fri, Aug 29, 2025 at 10:42:45AM +0200, David Hildenbrand wrote: > > > > > > > > I do wonder if we can identify this case and handle things differently. > > > > > > > > Perhaps even saying 'try and get the rmap lock, but if there's "too much" > > > > contention, grab the folio lock. > > > > > > Can you please elaborate what you mean? Where do you mean we can > > > possibly do something like this? > > > > > > UFFD move only works on PageAnonExclusive folios. So, would it help > > > (in terms of avoiding contention) if we were to change the condition: > > > > I think we shouldn't be using PAE here. Once could consider using > > folio_maybe_mapped_shared(), and assume contention on the folio lock if it > > is maybe mapped shared. > > Interesting! > > > > > But the real question is with whom we would be contending for the folio > > lock. > > > > Is it really other processes mapping that folio? I'm not so sure. > > Yeah, I might go off and do some research myself on this, actually. Nail down > wehre this might actually happen. > > Generally I'm softening on this and maybe we're good with the proposed change. > > But still want to be super careful here... :) > Anxiously waiting for your assessment. Fingers crossed :) > > > > -- > > Cheers > > > > David / dhildenb > > > > Cheers, Lorenzo ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-09-02 18:59 ` Lokesh Gidra @ 2025-09-02 19:01 ` David Hildenbrand 2025-09-02 19:04 ` Lokesh Gidra 0 siblings, 1 reply; 21+ messages in thread From: David Hildenbrand @ 2025-09-02 19:01 UTC (permalink / raw) To: Lokesh Gidra, Lorenzo Stoakes Cc: Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On 02.09.25 20:59, Lokesh Gidra wrote: > On Fri, Aug 29, 2025 at 2:04 AM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: >> >> On Fri, Aug 29, 2025 at 10:42:45AM +0200, David Hildenbrand wrote: >>>>> >>>>> I do wonder if we can identify this case and handle things differently. >>>>> >>>>> Perhaps even saying 'try and get the rmap lock, but if there's "too much" >>>>> contention, grab the folio lock. >>>> >>>> Can you please elaborate what you mean? Where do you mean we can >>>> possibly do something like this? >>>> >>>> UFFD move only works on PageAnonExclusive folios. So, would it help >>>> (in terms of avoiding contention) if we were to change the condition: >>> >>> I think we shouldn't be using PAE here. Once could consider using >>> folio_maybe_mapped_shared(), and assume contention on the folio lock if it >>> is maybe mapped shared. >> >> Interesting! >> >>> >>> But the real question is with whom we would be contending for the folio >>> lock. >>> >>> Is it really other processes mapping that folio? I'm not so sure. >> >> Yeah, I might go off and do some research myself on this, actually. Nail down >> wehre this might actually happen. >> >> Generally I'm softening on this and maybe we're good with the proposed change. >> >> But still want to be super careful here... :) >> > Anxiously waiting for your assessment. Fingers crossed :) I'd suggest you prepare an RFC patch where you neatly summarize all we learned so far. :) -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-09-02 19:01 ` David Hildenbrand @ 2025-09-02 19:04 ` Lokesh Gidra 0 siblings, 0 replies; 21+ messages in thread From: Lokesh Gidra @ 2025-09-02 19:04 UTC (permalink / raw) To: David Hildenbrand Cc: Lorenzo Stoakes, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Tue, Sep 2, 2025 at 12:01 PM David Hildenbrand <david@redhat.com> wrote: > > On 02.09.25 20:59, Lokesh Gidra wrote: > > On Fri, Aug 29, 2025 at 2:04 AM Lorenzo Stoakes > > <lorenzo.stoakes@oracle.com> wrote: > >> > >> On Fri, Aug 29, 2025 at 10:42:45AM +0200, David Hildenbrand wrote: > >>>>> > >>>>> I do wonder if we can identify this case and handle things differently. > >>>>> > >>>>> Perhaps even saying 'try and get the rmap lock, but if there's "too much" > >>>>> contention, grab the folio lock. > >>>> > >>>> Can you please elaborate what you mean? Where do you mean we can > >>>> possibly do something like this? > >>>> > >>>> UFFD move only works on PageAnonExclusive folios. So, would it help > >>>> (in terms of avoiding contention) if we were to change the condition: > >>> > >>> I think we shouldn't be using PAE here. One could consider using > >>> folio_maybe_mapped_shared(), and assume contention on the folio lock if it > >>> is maybe mapped shared. > >> > >> Interesting! > >> > >>> > >>> But the real question is with whom we would be contending for the folio > >>> lock. > >>> > >>> Is it really other processes mapping that folio? I'm not so sure. > >> > >> Yeah, I might go off and do some research myself on this, actually. Nail down > >> where this might actually happen. > >> > >> Generally I'm softening on this and maybe we're good with the proposed change. > >> > >> But still want to be super careful here... :) > >> > > Anxiously waiting for your assessment. Fingers crossed :) > > I'd suggest you prepare an RFC patch where you neatly summarize all we > learned so far. :) > Sounds good. Will do. Thanks. > -- > Cheers > > David / dhildenb > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-29 0:23 ` Lokesh Gidra 2025-08-29 8:42 ` David Hildenbrand @ 2025-08-29 9:01 ` Lorenzo Stoakes 1 sibling, 0 replies; 21+ messages in thread From: Lorenzo Stoakes @ 2025-08-29 9:01 UTC (permalink / raw) To: Lokesh Gidra Cc: David Hildenbrand, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Thu, Aug 28, 2025 at 05:23:56PM -0700, Lokesh Gidra wrote: > On Thu, Aug 28, 2025 at 5:04 AM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > > > On Mon, Aug 25, 2025 at 05:19:05PM +0200, David Hildenbrand wrote: > > > On 22.08.25 19:29, Lokesh Gidra wrote: > > > > Hi all, > > > > > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > > > non-ksm anon folios. This necessitates serialization through anon_vma > > > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > > > involved in rmap_walk()) are to be updated. This hurts scalability due > > > > to coarse granularity of the lock. For instance, when multiple threads > > > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > > > from the same src VMA, they all contend for the corresponding > > > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > > > user interactions. > > > > > > > > Among all rmap_walk() callers that don’t lock anon folios, > > > > folio_referenced() is the most critical (others are > > > > page_idle_clear_pte_refs(), damon_folio_young(), and > > > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > > > we_locked = folio_trylock(folio); > > > > if (!we_locked) > > > > return 1; > > > > } > > > > > > > > It’s unclear why locking anon_vma exclusively (when updating > > > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > > > with folio locked. It’s in the reclaim path, so should not be a > > > > critical path that necessitates some special treatment, unless I’m > > > > missing something. > > > > > > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > > > folio is locked before calling rmap_walk(). > > > > > > Essentially, what you mean is roughly: > > > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > > index 34333ae3bd80f..0800e73c0796e 100644 > > > --- a/mm/rmap.c > > > +++ b/mm/rmap.c > > > @@ -1005,7 +1005,7 @@ int folio_referenced(struct folio *folio, int is_locked, > > > if (!folio_raw_mapping(folio)) > > > return 0; > > > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > > + if (!is_locked) { > > > we_locked = folio_trylock(folio); > > > if (!we_locked) > > > return 1; > > > > > > > > > The downside of that change is that ordinary (!ksm) folios will observe being locked > > > > Well anon folios, I guess this is what you meant :) > > > > > when we are actually only trying to assess if they were referenced. > > > > > > Does it matter? > > > > Also another downside is we trylock and abort if we fail, so we've made this > > conditional on that. > > > > But surely this is going to impact reclaim performance esp. under heavy memory > > pressure? It is at least a trylock.
> > > > It's dangerous waters, and I'd really want some detailed data + analysis to > > prove the point here; I don't think theorising about it is enough. > > > > > > > > I can only speculate that it might have been very relevant before > > > 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"). > > > > > > Essentially any R/O fault would have resulted in us copying the page, simply because > > > there is concurrent folio_referenced() happening. > > > > Fun. > > > > Thing is we now have to consider _every case_ where a contention might cause an > > issue. > > > > One thing I _was_ concerned about was: > > > > - uffd move locks folios > > - now folio_referenced() 'fails' returning 1 > > > > But case 2 is only in shrink_active_list() which uses this as a boolean... > > > > OK so maybe fine for this one. > > For shrink_active_list() it seems like it doesn't matter what > folio_referenced() returns unless it's an executable file-backed > folio. Yes agreed, was chatting with David I think yesterday (my review load atm makes remembering when I did stuff harder :P) and was going through the code as a result and had a closer look and you're right. So it returning 1 is fine. > > > > I really do also hate that any future callers are going to possibly be confused > > about how this function works, but I guess it was already 'weird' for > > file-backed/KSM. > > Actually, wouldn't the simplification remove the already existing > confusion, rather than adding to it? :) > We can then simply say, rmap_walk() expects folio to be locked. Yeah it does simplify in that sense; the real issue is - will we see contention in some workloads. I'm sort of gradually softening on this as we talk... but I feel like we really need to check this more thoroughly. > > > > > So the issue remains really - folio lock contention as a result of this. > > > > It's one thing to theorise, but you may be forgetting something... and then > > we've changed an absolutely core thing to suit a niche UFFD use case. > > I really wish there was a way to avoid this within the UFFD code :( In > fact, the real pain point is multiple UFFD threads contending for > write-lock on anon_vma, even when they don't need to serialize among > themselves. > > > > I do wonder if we can identify this case and handle things differently. > > > > Perhaps even saying 'try and get the rmap lock, but if there's "too much" > > contention, grab the folio lock. > > Can you please elaborate what you mean? Where do you mean we can > possibly do something like this? It's vague hand waving, but generally if we could detect contention then we could conditionally change how we handle this, perhaps... > > UFFD move only works on PageAnonExclusive folios. So, would it help > (in terms of avoiding contention) if we were to change the condition: > > diff --git a/mm/rmap.c b/mm/rmap.c > index 568198e9efc2..1638e27347e7 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1006,7 +1006,8 @@ int folio_referenced(struct folio *folio, int is_locked, > if (!folio_raw_mapping(folio)) > return 0; > > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > + if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio) || > + PageAnonExclusive(&folio->page))) { > we_locked = folio_trylock(folio); > if (!we_locked) > return 1; > > Obviously, this is the opposite of simplification :) > > But as we know that shrink_active_list() uses this as a boolean, do > we even need to walk rmap for PageAnonExclusive folios?
Can't we > simply do: > > diff --git a/mm/rmap.c b/mm/rmap.c > index 568198e9efc2..a26523de321f 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1006,10 +1006,14 @@ int folio_referenced(struct folio *folio, int is_locked, > if (!folio_raw_mapping(folio)) > return 0; > > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > - we_locked = folio_trylock(folio); > - if (!we_locked) > + if (!is_locked) { > + if (!folio_test_anon(folio) || folio_test_ksm(folio)) { > + we_locked = folio_trylock(folio); > + if (!we_locked) > + return 1; > + } else if (PageAnonExclusive(&folio->page)) { > return 1; > + } > } > > rmap_walk(folio, &rwc); > > I'm not at all an expert on this, so pardon my ignorance if this is wrong. I see David's replied and he _is_ the expert on PAE so we'll see :) > > > > > > > > Before 09854ba94c6a ("mm: do_wp_page() simplification") that wasn't an issue, but > > > it would have meant that the write fault would be stuck until folio_referenced() > > > would be done, which is also suboptimal. > > > > > > So likely, avoiding grabbing the folio lock was beneficial. > > > > > > > > > Today, this would only affect R/O pages after fork (PageAnonExclusive not set). > > > > Hm, probably less of a problem, that. > > > > > > > > > > > Staring at shrink_active_list()->folio_referenced(), we isolate the folio first > > > (grabbing reference+clearing LRU), so do_wp_page()->wp_can_reuse_anon_folio() > > > would already refuse to reuse immediately, because it would spot a raised reference. > > > The folio lock does not make a difference anymore. > > > > folio_check_references() we're good with anyway as the folio is already locked. > > > > So obviously shrink_active_list() is the only caller we really care about. > > > > That at least reduces this case, but we then have to deal with the fact we're > > contending this lock elsewhere. > > > > > > > > > > > Is there any other anon-specific (!ksm) folio locking? Nothing exciting comes to mind, > > > except maybe some folio splitting or khugepaged that maybe would have to wait. > > > > > > But khugepaged would already also fail to isolate these folios, so probably it's not that > > > relevant anymore ... > > > > This is it... there are a lot of possibilities and we need to tread extremely > > carefully. > > > > if we could find a way to make uffd deal with this one way or another (or > > possibly - detecting heavy anon vma lock contention) maybe that'd be > > better... but then adding more complexity obv. > > > > > > > > -- > > > Cheers > > > > > > David / dhildenb > > > > > > > I mean having said all the above and also in the other thread - I am open to being > > convinced I'm wrong and this is ok. > > > > Obviously removing the complicated special case for anon would _in general_ be > > nice, it's just very sensitive stuff :) > > > > Cheers, Lorenzo ^ permalink raw reply [flat|nested] 21+ messages in thread
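For context on the "uses this as a boolean" point: the shrink_active_list() consumption of folio_referenced() has roughly the following shape (paraphrased from mm/vmscan.c, with the surrounding isolation and accounting elided), which is why a conservative "return 1" merely keeps the folio on the active list:

	if (folio_referenced(folio, 0, sc->target_mem_cgroup, &vm_flags)) {
		/*
		 * Only referenced, executable file-backed folios get special
		 * treatment (another trip around the active list); for
		 * everything else only zero vs. non-zero matters, not the
		 * exact count of referencing PTEs.
		 */
		if ((vm_flags & VM_EXEC) && folio_is_file_lru(folio)) {
			nr_rotated += folio_nr_pages(folio);
			list_add(&folio->lru, &l_active);
			continue;
		}
	}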
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-22 17:29 [DISCUSSION] Unconditionally lock folios when calling rmap_walk() Lokesh Gidra ` (2 preceding siblings ...) 2025-08-25 15:19 ` David Hildenbrand @ 2025-08-26 15:52 ` Lorenzo Stoakes 2025-08-26 22:23 ` Lokesh Gidra 3 siblings, 1 reply; 21+ messages in thread From: Lorenzo Stoakes @ 2025-08-26 15:52 UTC (permalink / raw) To: Lokesh Gidra Cc: David Hildenbrand, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Fri, Aug 22, 2025 at 10:29:52AM -0700, Lokesh Gidra wrote: > Hi all, > > Currently, some callers of rmap_walk() conditionally avoid try-locking > non-ksm anon folios. This necessitates serialization through anon_vma > write-lock elsewhere when folio->mapping and/or folio->index (fields > involved in rmap_walk()) are to be updated. This hurts scalability due > to coarse granularity of the lock. For instance, when multiple threads > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > from the same src VMA, they all contend for the corresponding > anon_vma’s lock. Field traces for arm64 android devices reveal over > 30ms of uninterruptible sleep in the main UI thread, leading to janky > user interactions. Can we clarify whether this is simply an example, or rather the entire motivating reason for raising this issue? It's important, because it strikes me that this is a very specific use case, and you're now suggesting changing core locking to suit it. While this is a discussion, and I'm glad you raised it, I think it's important in these cases to really exhaustively examine all of the possible consequences. OK so to clarify: - You want to traverse the rmap entirely without any rmap locks whatsoever for anon, relying solely on the folio lock to serialise, because otherwise rmap read locks here block other rmap write lock calls. - You want to unconditionally folio lock all anon and KSM folios for at least folio_referenced(). In order to resolve a scalability issue specific to a uffd use case? Is this the case? Happy to be corrected if I've misinterpreted. I don't see how this could possibly work, unless I'm missing something here, because: 1. When we lock anon_vma's it's at the root which covers all anon_vma's covering parent/children of forked processes. 2. We do "top down" operations that acquire the rmap lock on the assumption that we have exclusive access to the rmapping; these operations have nothing to do with the folio, nor could we even know what the folio is at this point. 3. We manipulate higher level page tables on the basis that the rmap lock excludes other page table walkers. So this proposal seems to violate all of that? For instance, in many VMA operations we perform: anon_vma_interval_tree_pre_update_vma() and anon_vma_interval_tree_post_update_vma() Which removes _all_ R/B tree mappings. So you can now race with this (it of course doesn't care about folio lock) and then get completely incorrect results? This seems fairly disastrous? In free_pgtables() also we call unlink_anon_vmas() which iterates through the vma->anon_vma_chain and uses the anon lock to tear down higher order page tables which you now might race with and that seems even more disastrous...
> > Among all rmap_walk() callers that don’t lock anon folios, > folio_referenced() is the most critical (others are > page_idle_clear_pte_refs(), damon_folio_young(), and > damon_folio_mkold()). The relevant code in folio_referenced() is: > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > we_locked = folio_trylock(folio); > if (!we_locked) > return 1; > } > > It’s unclear why locking anon_vma exclusively (when updating > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > with folio locked. It’s in the reclaim path, so should not be a > critical path that necessitates some special treatment, unless I’m > missing something. > Therefore, I propose simplifying the locking mechanism by ensuring the > folio is locked before calling rmap_walk(). This helps avoid locking > anon_vma when updating folio->mapping, which, for instance, will help > eliminate the uninterruptible sleep observed in the field traces > mentioned earlier. Furthermore, it enables us to simplify the code in > folio_lock_anon_vma_read() by removing the re-check to ensure that the > field hasn’t changed under us. I mean this is why I get confused here though, because you seem to go from saying 'don't take the rmap lock at all' to then referencing folio_lock_anon_vma_read()? Perhaps I misinterpreted (forgive me if so) and indeed you meant this, but then I don't see how you impact contention on the anon_vma lock by making this change? I think in general - let's clarify what exactly you intend to do here, and then we need to delineate what we need to confirm and test to have any confidence in making such a change. anon_vma locks (and rmap locks) are very very sensitive in general and we've had actual security issues come up due to race windows emerging from inappropriate handling, not to mention that performance around this obviously matters a great deal. So we must tread carefully here. Thanks, Lorenzo ^ permalink raw reply [flat|nested] 21+ messages in thread
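To make the hazard Lorenzo raises concrete, the pre/post update helpers are roughly the following (paraphrased from the kernel sources; mm/vma.c in recent trees). Their callers hold the anon_vma write lock around the pair, and nothing here takes or checks a folio lock, so it is the rmap lock alone that keeps walkers out of the window where the interval tree is empty:

	static void anon_vma_interval_tree_pre_update_vma(struct vm_area_struct *vma)
	{
		struct anon_vma_chain *avc;

		/* Drop every rmap interval-tree node for this VMA. */
		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
			anon_vma_interval_tree_remove(avc, &avc->anon_vma->rb_root);
	}

	static void anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
	{
		struct anon_vma_chain *avc;

		/* Re-insert the nodes once the VMA has been updated. */
		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
			anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
	}

As Lokesh clarifies in his reply below, the proposal keeps the rmap read lock in rmap_walk() and only adds the folio lock on top, so these writers would still exclude walkers.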
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-26 15:52 ` Lorenzo Stoakes @ 2025-08-26 22:23 ` Lokesh Gidra 2025-08-28 11:31 ` Lorenzo Stoakes 0 siblings, 1 reply; 21+ messages in thread From: Lokesh Gidra @ 2025-08-26 22:23 UTC (permalink / raw) To: Lorenzo Stoakes Cc: David Hildenbrand, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Tue, Aug 26, 2025 at 8:52 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Fri, Aug 22, 2025 at 10:29:52AM -0700, Lokesh Gidra wrote: > > Hi all, > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > non-ksm anon folios. This necessitates serialization through anon_vma > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > involved in rmap_walk()) are to be updated. This hurts scalability due > > to coarse granularity of the lock. For instance, when multiple threads > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > from the same src VMA, they all contend for the corresponding > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > user interactions. > > Can we clarify whether this is simply an example, or rather the entire > motivating reason for raising this issue? > When I started off I thought maybe there are other cases too, but it looks like as of now only uffd MOVE updates folio->mapping to a different root anon_vma. > It's important, because it strikes me that this is a very specific use > case, and you're now suggesting changing core locking to suit it. > > While this is a discussion, and I'm glad you raised it, I think it's > important in these cases to really exhaustively examine all of the possible > consequences. > > OK so to clarify: > > - You want to traverse the rmap entirely without any rmap locks whatsoever > for anon, relying solely on the folio lock to serialise, because > otherwise rmap read locks here block other rmap write lock calls. > There is a misunderstanding. I'm suggesting locking *both* the folio and the anon_vma during the rmap walk. To avoid any confusion, here are the simplifications in mm/rmap.c that I suggest: diff --git a/mm/rmap.c b/mm/rmap.c index 568198e9efc2..81c177b0cddf 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -547,7 +547,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio, struct anon_vma *root_anon_vma; unsigned long anon_mapping; -retry: rcu_read_lock(); anon_mapping = (unsigned long)READ_ONCE(folio->mapping); if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON) @@ -558,17 +557,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio, anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON); root_anon_vma = READ_ONCE(anon_vma->root); if (down_read_trylock(&root_anon_vma->rwsem)) { - /* - * folio_move_anon_rmap() might have changed the anon_vma as we - * might not hold the folio lock here.
- */ - if (unlikely((unsigned long)READ_ONCE(folio->mapping) != - anon_mapping)) { - up_read(&root_anon_vma->rwsem); - rcu_read_unlock(); - goto retry; - } - /* * If the folio is still mapped, then this anon_vma is still * its anon_vma, and holding the mutex ensures that it will @@ -603,18 +591,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio, rcu_read_unlock(); anon_vma_lock_read(anon_vma); - /* - * folio_move_anon_rmap() might have changed the anon_vma as we might - * not hold the folio lock here. - */ - if (unlikely((unsigned long)READ_ONCE(folio->mapping) != - anon_mapping)) { - anon_vma_unlock_read(anon_vma); - put_anon_vma(anon_vma); - anon_vma = NULL; - goto retry; - } - if (atomic_dec_and_test(&anon_vma->refcount)) { /* * Oops, we held the last refcount, release the lock @@ -1006,7 +982,7 @@ int folio_referenced(struct folio *folio, int is_locked, if (!folio_raw_mapping(folio)) return 0; - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { + if (!is_locked) { we_locked = folio_trylock(folio); if (!we_locked) return 1; > - You want to unconditionally folio lock all anon and KSM folios for at > least folio_referenced(). > Actually file and KSM folios are always locked today. The anon folios are conditionally left out. So my proposal actually standardizes this locking, which is an overall simplification. > In order to resolve a scalability issue specific to a uffd use case? > With the requirement of locking anon_vma in write mode, uffd MOVE currently is unusable in practice due to poor scalability. The above change in mm/rmap.c allows us to make the following improvement to MOVE ioctl: diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 45e6290e2e8b..c4fc87d73ab7 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1192,7 +1192,6 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pmd_t dummy_pmdval; pmd_t dst_pmdval; struct folio *src_folio = NULL; - struct anon_vma *src_anon_vma = NULL; struct mmu_notifier_range range; int err = 0; @@ -1353,28 +1352,6 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, goto retry; } - if (!src_anon_vma) { - /* - * folio_referenced walks the anon_vma chain - * without the folio lock. Serialize against it with - * the anon_vma lock, the folio lock is not enough. - */ - src_anon_vma = folio_get_anon_vma(src_folio); - if (!src_anon_vma) { - /* page was unmapped from under us */ - err = -EAGAIN; - goto out; - } - if (!anon_vma_trylock_write(src_anon_vma)) { - pte_unmap(src_pte); - pte_unmap(dst_pte); - src_pte = dst_pte = NULL; - /* now we can block and wait */ - anon_vma_lock_write(src_anon_vma); - goto retry; - } - } - err = move_present_pte(mm, dst_vma, src_vma, dst_addr, src_addr, dst_pte, src_pte, orig_dst_pte, orig_src_pte, dst_pmd, @@ -1445,10 +1422,6 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, } out: - if (src_anon_vma) { - anon_vma_unlock_write(src_anon_vma); - put_anon_vma(src_anon_vma); - } if (src_folio) { folio_unlock(src_folio); folio_put(src_folio); > Is this the case? Happy to be corrected if I've misinterpreted. > > I don't see how this could possibly work, unless I'm missing something > here, because: > > 1. When we lock anon_vma's it's at the root which covers all anon_vma's > covering parent/children of forked processes. > > 2. 
We do "top down" operations that acquire the rmap lock on the assumption > that we have exclusive access to the rmapping; these operations have nothing to do with > the folio, nor could we even know what the folio is at this point. > > 3. We manipulate higher level page tables on the basis that the rmap lock > excludes other page table walkers. > > So this proposal seems to violate all of that? > > For instance, in many VMA operations we perform: > > anon_vma_interval_tree_pre_update_vma() > > and > > anon_vma_interval_tree_post_update_vma() > > Which removes _all_ R/B tree mappings. > > So you can now race with this (it of course doesn't care about folio lock) > and then get completely incorrect results? > > This seems fairly disastrous? > > In free_pgtables() also we call unlink_anon_vmas() which iterates through > the vma->anon_vma_chain and uses the anon lock to tear down higher order > page tables which you now might race with and that seems even more > disastrous... > > > > > > Among all rmap_walk() callers that don’t lock anon folios, > > folio_referenced() is the most critical (others are > > page_idle_clear_pte_refs(), damon_folio_young(), and > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > we_locked = folio_trylock(folio); > > if (!we_locked) > > return 1; > > } > > > > It’s unclear why locking anon_vma exclusively (when updating > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > with folio locked. It’s in the reclaim path, so should not be a > > critical path that necessitates some special treatment, unless I’m > > missing something. > > Therefore, I propose simplifying the locking mechanism by ensuring the > > folio is locked before calling rmap_walk(). This helps avoid locking > > anon_vma when updating folio->mapping, which, for instance, will help > > eliminate the uninterruptible sleep observed in the field traces > > mentioned earlier. Furthermore, it enables us to simplify the code in > > folio_lock_anon_vma_read() by removing the re-check to ensure that the > > field hasn’t changed under us. > > > I mean this is why I get confused here though, because you seem to go > from saying 'don't take the rmap lock at all' to then referencing > folio_lock_anon_vma_read()? > > Perhaps I misinterpreted (forgive me if so) and indeed you meant this, but > then I don't see how you impact contention on the anon_vma lock by making > this change? > > I think in general - let's clarify what exactly you intend to do here, and > then we need to delineate what we need to confirm and test to have any > confidence in making such a change. > > anon_vma locks (and rmap locks) are very very sensitive in general and > we've had actual security issues come up due to race windows emerging from > inappropriate handling, not to mention that performance around this > obviously matters a great deal. I couldn't agree more. My changes seemed to simplify things; otherwise I wouldn't have suggested this. And David's reply yesterday gives confidence that it wouldn't negatively affect performance either. Thanks, Lokesh > > So we must tread carefully here. > > Thanks, Lorenzo ^ permalink raw reply related [flat|nested] 21+ messages in thread
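For reference, folio_move_anon_rmap() (the writer that the deleted re-checks in folio_lock_anon_vma_read() were guarding against) already asserts the folio lock; it looks roughly like this (paraphrased from mm/rmap.c, comments reworded):

	void folio_move_anon_rmap(struct folio *folio, struct vm_area_struct *vma)
	{
		void *anon_vma = vma->anon_vma;

		/* Callers must hold the folio lock across the rebind. */
		VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
		VM_BUG_ON_VMA(!anon_vma, vma);

		anon_vma = (void *) anon_vma + FOLIO_MAPPING_ANON;
		/*
		 * Publish the new anon_vma and the anon flag together so
		 * concurrent readers of folio->mapping never see a torn
		 * value.
		 */
		WRITE_ONCE(folio->mapping, anon_vma);
	}

This is the invariant the proposed simplification leans on: if every rmap_walk() caller holds the folio lock, folio->mapping cannot be rebound under the walk, so the retry logic above (and the anon_vma write lock in uffd MOVE) would no longer be needed.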
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-26 22:23 ` Lokesh Gidra @ 2025-08-28 11:31 ` Lorenzo Stoakes 0 siblings, 0 replies; 21+ messages in thread From: Lorenzo Stoakes @ 2025-08-28 11:31 UTC (permalink / raw) To: Lokesh Gidra Cc: David Hildenbrand, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Tue, Aug 26, 2025 at 03:23:28PM -0700, Lokesh Gidra wrote: > On Tue, Aug 26, 2025 at 8:52 AM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > > > On Fri, Aug 22, 2025 at 10:29:52AM -0700, Lokesh Gidra wrote: > > > Hi all, > > > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > > non-ksm anon folios. This necessitates serialization through anon_vma > > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > > involved in rmap_walk()) are to be updated. This hurts scalability due > > > to coarse granularity of the lock. For instance, when multiple threads > > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > > from the same src VMA, they all contend for the corresponding > > > anon_vma’s lock. Field traces for arm64 android devices reveal over Hmm, I started by responding below but now have a vague thought of: What if we find a way to somehow detect this scenario, and mark the anon_vma in some way to indicate that a folio lock should be tried? That'd be a lot less egregious than changing things fundamentally for everyone. > > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > > user interactions. > > > > Can we clarify whether this is simply an example, or rather the entire > > motivating reason for raising this issue? > > > When I started off I thought maybe there are other cases too, but it > looks like as of now only uffd MOVE updates folio->mapping to a > different root anon_vma. Yup, I mean I looked into making mremap() do it, but it was insanely difficult to make it work (sadly!). But indeed. I think it's important to highlight that this is the use case. I wonder if we can't do something specific to uffd then that would be less potentially problematic for the rest of core. Because I just don't really see this as upstreamable otherwise. > > > It's important, because it strikes me that this is a very specific use > > case, and you're now suggesting changing core locking to suit it. > > > > While this is a discussion, and I'm glad you raised it, I think it's > > important in these cases to really exhaustively examine all of the possible > > consequences. > > > > OK so to clarify: > > > > - You want to traverse the rmap entirely without any rmap locks whatsoever > > for anon, relying solely on the folio lock to serialise, because > > otherwise rmap read locks here block other rmap write lock calls. > > > There is a misunderstanding. I'm suggesting locking *both* the folio > and the anon_vma during the rmap walk. To avoid any confusion, here are > the simplifications in mm/rmap.c that I suggest: OK. Well that's less extreme :) But then, if we're taking both locks, how does this prevent contention on the anon_vma lock? Even so, this is adding a bunch of overhead.
> > diff --git a/mm/rmap.c b/mm/rmap.c > index 568198e9efc2..81c177b0cddf 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -547,7 +547,6 @@ struct anon_vma *folio_lock_anon_vma_read(const > struct folio *folio, > struct anon_vma *root_anon_vma; > unsigned long anon_mapping; > > -retry: > rcu_read_lock(); > anon_mapping = (unsigned long)READ_ONCE(folio->mapping); > if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON) > @@ -558,17 +557,6 @@ struct anon_vma *folio_lock_anon_vma_read(const > struct folio *folio, > anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON); > root_anon_vma = READ_ONCE(anon_vma->root); > if (down_read_trylock(&root_anon_vma->rwsem)) { > - /* > - * folio_move_anon_rmap() might have changed the anon_vma as we > - * might not hold the folio lock here. > - */ > - if (unlikely((unsigned long)READ_ONCE(folio->mapping) != > - anon_mapping)) { > - up_read(&root_anon_vma->rwsem); > - rcu_read_unlock(); > - goto retry; > - } > - > /* > * If the folio is still mapped, then this anon_vma is still > * its anon_vma, and holding the mutex ensures that it will > @@ -603,18 +591,6 @@ struct anon_vma *folio_lock_anon_vma_read(const > struct folio *folio, > rcu_read_unlock(); > anon_vma_lock_read(anon_vma); > > - /* > - * folio_move_anon_rmap() might have changed the anon_vma as we might > - * not hold the folio lock here. > - */ > - if (unlikely((unsigned long)READ_ONCE(folio->mapping) != > - anon_mapping)) { > - anon_vma_unlock_read(anon_vma); > - put_anon_vma(anon_vma); > - anon_vma = NULL; > - goto retry; > - } > - > if (atomic_dec_and_test(&anon_vma->refcount)) { > /* > * Oops, we held the last refcount, release the lock > @@ -1006,7 +982,7 @@ int folio_referenced(struct folio *folio, int is_locked, > if (!folio_raw_mapping(folio)) > return 0; > > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > + if (!is_locked) { > we_locked = folio_trylock(folio); > if (!we_locked) > return 1; This is still a really big change: we're going to be contending the folio lock potentially a LOT more, for the sake of a very specific and peculiar uffd use case. It's hard to justify. And any such justification would need _really serious_ testing on very many arches/workloads to even come close to being ok in my view. This is a pretty huge ask + it's for a specific use case. > > > - You want to unconditionally folio lock all anon and KSM folios for at > > least folio_referenced(). > > > Actually file and KSM folios are always locked today. The anon folios > are conditionally left out. So my proposal actually standardizes this > locking, which is an overall simplification. Right, yes, sorry, I misspoke; I meant to say anon. This is a HUGE exception though, because that covers the majority of a process's memory allocation. > > > In order to resolve a scalability issue specific to a uffd use case? > > > With the requirement of locking anon_vma in write mode, uffd MOVE > currently is unusable in practice due to poor scalability.
The above > change in mm/rmap.c allows us to make the following improvement to > MOVE ioctl: > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > index 45e6290e2e8b..c4fc87d73ab7 100644 > --- a/mm/userfaultfd.c > +++ b/mm/userfaultfd.c > @@ -1192,7 +1192,6 @@ static int move_pages_pte(struct mm_struct *mm, > pmd_t *dst_pmd, pmd_t *src_pmd, > pmd_t dummy_pmdval; > pmd_t dst_pmdval; > struct folio *src_folio = NULL; > - struct anon_vma *src_anon_vma = NULL; > struct mmu_notifier_range range; > int err = 0; > > @@ -1353,28 +1352,6 @@ static int move_pages_pte(struct mm_struct *mm, > pmd_t *dst_pmd, pmd_t *src_pmd, > goto retry; > } > > - if (!src_anon_vma) { > - /* > - * folio_referenced walks the anon_vma chain > - * without the folio lock. Serialize against it with > - * the anon_vma lock, the folio lock is not enough. > - */ > - src_anon_vma = folio_get_anon_vma(src_folio); > - if (!src_anon_vma) { > - /* page was unmapped from under us */ > - err = -EAGAIN; > - goto out; > - } > - if (!anon_vma_trylock_write(src_anon_vma)) { > - pte_unmap(src_pte); > - pte_unmap(dst_pte); > - src_pte = dst_pte = NULL; > - /* now we can block and wait */ > - anon_vma_lock_write(src_anon_vma); > - goto retry; > - } > - } > - > err = move_present_pte(mm, dst_vma, src_vma, > dst_addr, src_addr, dst_pte, src_pte, > orig_dst_pte, orig_src_pte, dst_pmd, > @@ -1445,10 +1422,6 @@ static int move_pages_pte(struct mm_struct *mm, > pmd_t *dst_pmd, pmd_t *src_pmd, > } > > out: > - if (src_anon_vma) { > - anon_vma_unlock_write(src_anon_vma); > - put_anon_vma(src_anon_vma); > - } > if (src_folio) { > folio_unlock(src_folio); > folio_put(src_folio); Right, but again it's a niche use case, sorry. Changing how _the whole system_ does rmap to suit a very specific use case isn't really a viable approach. > > > > Is this the case? Happy to be corrected if I've misinterpreted. > > > > I don't see how this could possibly work, unless I'm missing something > > here, because: > > > > 1. When we lock anon_vma's it's at the root which covers all anon_vma's > > covering parent/children of forked processes. > > > > 2. We do "top down" operations that acquire the rmap lock on the assumption > > that we have exclusive access to the rmapping; these operations have nothing to do with > > the folio, nor could we even know what the folio is at this point. > > > > 3. We manipulate higher level page tables on the basis that the rmap lock > > excludes other page table walkers. > > > > So this proposal seems to violate all of that? > > > > For instance, in many VMA operations we perform: > > > > anon_vma_interval_tree_pre_update_vma() > > > > and > > > > anon_vma_interval_tree_post_update_vma() > > > > Which removes _all_ R/B tree mappings. > > > > So you can now race with this (it of course doesn't care about folio lock) > > and then get completely incorrect results? > > > > This seems fairly disastrous? > > > > In free_pgtables() also we call unlink_anon_vmas() which iterates through > > the vma->anon_vma_chain and uses the anon lock to tear down higher order > > page tables which you now might race with and that seems even more > > disastrous... > > > > > > > > > > Among all rmap_walk() callers that don’t lock anon folios, > > > folio_referenced() is the most critical (others are > > > page_idle_clear_pte_refs(), damon_folio_young(), and > > > damon_folio_mkold()). 
The relevant code in folio_referenced() is: > > > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > > we_locked = folio_trylock(folio); > > > if (!we_locked) > > > return 1; > > > } > > > > > > It’s unclear why locking anon_vma exclusively (when updating > > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > > with folio locked. It’s in the reclaim path, so should not be a > > > critical path that necessitates some special treatment, unless I’m > > > missing something. > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > > folio is locked before calling rmap_walk(). This helps avoid locking > > > anon_vma when updating folio->mapping, which, for instance, will help > > > eliminate the uninterruptible sleep observed in the field traces > > > mentioned earlier. Furthermore, it enables us to simplify the code in > > > folio_lock_anon_vma_read() by removing the re-check to ensure that the > > > field hasn’t changed under us. > > > > > > I mean this is why I get confused here though, because you seem to go > > from saying 'don't take the rmap lock at all' to then referencing > > folio_lock_anon_vma_read()? > > > > Perhaps I misinterpreted (forgive me if so) and indeed you meant this, but > > then I don't see how you impact contention on the anon_vma lock by making > > this change? > > > > I think in general - let's clarify what exactly you intend to do here, and > > then we need to delineate what we need to confirm and test to have any > > confidence in making such a change. > > > > anon_vma locks (and rmap locks) are very very sensitive in general and > > we've had actual security issues come up due to race windows emerging from > > inappropriate handling, not to mention that performance around this > > obviously matters a great deal. > > I couldn't agree more. My changes seemed to simplify things; otherwise I > wouldn't have suggested this. And David's reply yesterday gives > confidence that it wouldn't negatively affect performance either. This isn't a simplification though; this is taking a new lock in a core mm code path _for everyone_ for a specific UFFD use case. Everyone _but_ people using this UFFD stuff will just pay an overhead. Another aspect is you're now making it much more likely that taking the lock will fail, since it's a trylock... It really needs very deep analysis and justification for me to be in any way convinced this is ok. It's hard to justify and I think any workable solution would need to know this case applies. A very simple but horrible thing could be to have a config flag to enable/disable this. An even more really truly horrible thing would be a prctl()... but let's not. > > Thanks, > Lokesh > > > > So we must tread carefully here. > > > > Thanks, Lorenzo I remain very very skeptical of this as-is. Of course always happy to be convinced otherwise... ^ permalink raw reply [flat|nested] 21+ messages in thread
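As a closing note on the "mark the anon_vma" idea floated earlier in the thread, one hand-wavy shape it could take is sketched below. Every identifier here is invented for illustration; no such field or helpers exist in the kernel, and safely finding the anon_vma at this point has its own subtleties, which is part of why this stayed hand-waving:

	/* Hypothetical: a per-anon_vma counter of active uffd MOVE callers. */
	struct anon_vma {
		/* ... existing fields ... */
		atomic_t uffd_movers;		/* invented name */
	};

	/* uffd MOVE would bump it around the folio->mapping rebind,
	 * with the folio locked: */
	atomic_inc(&src_anon_vma->uffd_movers);
	folio_move_anon_rmap(src_folio, dst_vma);
	atomic_dec(&src_anon_vma->uffd_movers);

	/* ... and folio_referenced() would pay the folio-lock cost only
	 * when a mover is active (anon_vma_has_movers() is an invented
	 * helper that would first have to look the anon_vma up, which is
	 * itself racy without the folio lock): */
	if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio) ||
			   anon_vma_has_movers(folio))) {
		we_locked = folio_trylock(folio);
		if (!we_locked)
			return 1;
	}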
Thread overview: 21+ messages -- 2025-08-22 17:29 [DISCUSSION] Unconditionally lock folios when calling rmap_walk() Lokesh Gidra 2025-08-22 17:44 ` Matthew Wilcox 2025-08-22 18:08 ` Lorenzo Stoakes 2025-08-24 4:18 ` Lokesh Gidra 2025-08-24 5:31 ` Harry Yoo 2025-08-24 6:45 ` Lokesh Gidra 2025-08-25 10:52 ` Harry Yoo 2025-08-28 11:32 ` Lorenzo Stoakes 2025-08-25 15:19 ` David Hildenbrand 2025-08-25 18:46 ` Lokesh Gidra 2025-08-28 12:04 ` Lorenzo Stoakes 2025-08-29 0:23 ` Lokesh Gidra 2025-08-29 8:42 ` David Hildenbrand 2025-08-29 9:04 ` Lorenzo Stoakes 2025-09-02 18:59 ` Lokesh Gidra 2025-09-02 19:01 ` David Hildenbrand 2025-09-02 19:04 ` Lokesh Gidra 2025-08-29 9:01 ` Lorenzo Stoakes 2025-08-26 15:52 ` Lorenzo Stoakes 2025-08-26 22:23 ` Lokesh Gidra 2025-08-28 11:31 ` Lorenzo Stoakes