* [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
@ 2025-08-15 10:18 Will Deacon
  2025-08-16  1:03 ` John Hubbard
  2025-08-16  4:14 ` Hugh Dickins
  0 siblings, 2 replies; 20+ messages in thread
From: Will Deacon @ 2025-08-15 10:18 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Will Deacon, Hugh Dickins, Keir Fraser,
	Jason Gunthorpe, David Hildenbrand, John Hubbard, Frederick Mayle,
	Andrew Morton, Peter Xu

When taking a longterm GUP pin via pin_user_pages(),
__gup_longterm_locked() tries to migrate target folios that should not
be longterm pinned, for example because they reside in a CMA region or
movable zone. This is done by first pinning all of the target folios
anyway, collecting all of the longterm-unpinnable target folios into a
list, dropping the pins that were just taken and finally handing the
list off to migrate_pages() for the actual migration.

It is critically important that no unexpected references are held on
the folios being migrated, otherwise the migration will fail and
pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it
is relatively easy to observe migration failures when running pKVM
(which uses pin_user_pages() on crosvm's virtual address space to
resolve stage-2 page faults from the guest) on a 6.15-based Pixel 6
device and this results in the VM terminating prematurely.

In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
mapping of guest memory prior to the pinning. Subsequently, when
pin_user_pages() walks the page-table, the relevant 'pte' is not
present and so the faulting logic allocates a new folio, mlocks it
with mlock_folio() and maps it in the page-table.

Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
batch by pagevec"), mlock/munlock operations on a folio (formerly page)
are deferred. For example, mlock_folio() takes an additional reference
on the target folio before placing it into a per-cpu 'folio_batch' for
later processing by mlock_folio_batch(), which drops the refcount once
the operation is complete. Processing of the batches is coupled with
the LRU batch logic and can be forcefully drained with
lru_add_drain_all() but as long as a folio remains unprocessed on the
batch, its refcount will be elevated.

This deferred batching therefore interacts poorly with the pKVM pinning
scenario as we can find ourselves in a situation where the migration
code fails to migrate a folio due to the elevated refcount from the
pending mlock operation.

Extend the existing LRU draining logic in
collect_longterm_unpinnable_folios() so that unpinnable mlocked folios
on the LRU also trigger a drain.

Cc: Hugh Dickins <hughd@google.com>
Cc: Keir Fraser <keirf@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Frederick Mayle <fmayle@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Fixes: 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
Signed-off-by: Will Deacon <will@kernel.org>
---

This has been quite unpleasant to debug and, as I'm not intimately
familiar with the mm internals, I've tried to include all the relevant
details in the commit message in case there's a preferred alternative
way of solving the problem or there's a flaw in my logic.
 mm/gup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index adffe663594d..656835890f05 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
 			continue;
 		}
 
-		if (!folio_test_lru(folio) && drain_allow) {
+		if (drain_allow &&
+		    (!folio_test_lru(folio) || folio_test_mlocked(folio))) {
 			lru_add_drain_all();
 			drain_allow = false;
 		}
-- 
2.51.0.rc1.167.g924127e9c0-goog

^ permalink raw reply related	[flat|nested] 20+ messages in thread
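For readers who are not familiar with mm/mlock.c, the deferral that the
commit message describes boils down to something like the following
simplified sketch (paraphrased; the real function also handles the
mlocked page flag and NR_MLOCK statistics, and details differ between
kernel versions):

	/*
	 * Simplified sketch of the deferred mlock path: the folio gets an
	 * extra reference and is parked on a per-CPU batch, so its refcount
	 * stays elevated until the batch is drained.
	 */
	void mlock_folio(struct folio *folio)
	{
		struct folio_batch *fbatch;

		local_lock(&mlock_fbatch.lock);
		fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);

		folio_get(folio);		/* extra ref held while batched */
		if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
		    folio_test_large(folio) || lru_cache_disabled())
			mlock_folio_batch(fbatch);	/* drain: do the mlock, drop the ref */

		local_unlock(&mlock_fbatch.lock);
	}

Until mlock_folio_batch() processes the entry, the reference taken by
folio_get() is exactly the kind of "unexpected" reference that makes
migrate_pages(), and hence pin_user_pages(), fail.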
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-15 10:18 [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration Will Deacon @ 2025-08-16 1:03 ` John Hubbard 2025-08-16 4:33 ` Hugh Dickins 2025-08-18 13:38 ` Will Deacon 2025-08-16 4:14 ` Hugh Dickins 1 sibling, 2 replies; 20+ messages in thread From: John Hubbard @ 2025-08-16 1:03 UTC (permalink / raw) To: Will Deacon, linux-mm Cc: linux-kernel, Hugh Dickins, Keir Fraser, Jason Gunthorpe, David Hildenbrand, Frederick Mayle, Andrew Morton, Peter Xu On 8/15/25 3:18 AM, Will Deacon wrote: > When taking a longterm GUP pin via pin_user_pages(), > __gup_longterm_locked() tries to migrate target folios that should not > be longterm pinned, for example because they reside in a CMA region or > movable zone. This is done by first pinning all of the target folios > anyway, collecting all of the longterm-unpinnable target folios into a > list, dropping the pins that were just taken and finally handing the > list off to migrate_pages() for the actual migration. > > It is critically important that no unexpected references are held on the > folios being migrated, otherwise the migration will fail and > pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is > relatively easy to observe migration failures when running pKVM (which > uses pin_user_pages() on crosvm's virtual address space to resolve > stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and > this results in the VM terminating prematurely. > > In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its > mapping of guest memory prior to the pinning. Subsequently, when > pin_user_pages() walks the page-table, the relevant 'pte' is not > present and so the faulting logic allocates a new folio, mlocks it > with mlock_folio() and maps it in the page-table. > > Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() > batch by pagevec"), mlock/munlock operations on a folio (formerly page), > are deferred. For example, mlock_folio() takes an additional reference > on the target folio before placing it into a per-cpu 'folio_batch' for > later processing by mlock_folio_batch(), which drops the refcount once > the operation is complete. Processing of the batches is coupled with > the LRU batch logic and can be forcefully drained with > lru_add_drain_all() but as long as a folio remains unprocessed on the > batch, its refcount will be elevated. > > This deferred batching therefore interacts poorly with the pKVM pinning I would go even a little broader (more general), and claim that this deferred batching interacts poorly with gup FOLL_LONGTERM when trying to pin folios in CMA or ZONE_MOVABLE, in fact. More on this below. > scenario as we can find ourselves in a situation where the migration > code fails to migrate a folio due to the elevated refcount from the > pending mlock operation. > > Extend the existing LRU draining logic in > collect_longterm_unpinnable_folios() so that unpinnable mlocked folios > on the LRU also trigger a drain. 
> 
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Keir Fraser <keirf@google.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Frederick Mayle <fmayle@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Peter Xu <peterx@redhat.com>
> Fixes: 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
> 
> This has been quite unpleasant to debug and, as I'm not intimately

I'll bet! It's not even pleasant to *read* about it! haha. Sorry you
had to suffer through this.

> familiar with the mm internals, I've tried to include all the relevant
> details in the commit message in case there's a preferred alternative
> way of solving the problem or there's a flaw in my logic.
> 
>  mm/gup.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index adffe663594d..656835890f05 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
>  			continue;
>  		}
>  
> -		if (!folio_test_lru(folio) && drain_allow) {
> +		if (drain_allow &&
> +		    (!folio_test_lru(folio) || folio_test_mlocked(folio))) {

That should work, yes.

Alternatively, after thinking about this a bit today, it seems to me
that the mlock batching is a little too bold, given the presence of
gup/pup. And so I'm tempted to fix the problem closer to the root
cause, like this (below).

But maybe this is actually *less* wise than what you have proposed...

I'd like to hear other mm folks' opinion on this approach:

diff --git a/mm/mlock.c b/mm/mlock.c
index a1d93ad33c6d..edecdd32996e 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -278,7 +278,15 @@ void mlock_new_folio(struct folio *folio)
 
 	folio_get(folio);
 	if (!folio_batch_add(fbatch, mlock_new(folio)) ||
-	    folio_test_large(folio) || lru_cache_disabled())
+	    folio_test_large(folio) || lru_cache_disabled() ||
+	    /*
+	     * If this is being called as part of a gup FOLL_LONGTERM operation in
+	     * CMA/MOVABLE zones with MLOCK_ONFAULT active, then the newly faulted
+	     * in folio will need to immediately migrate to a pinnable zone.
+	     * Allowing the mlock operation to batch would break the ability to
+	     * migrate the folio. Instead, force immediate processing.
+	     */
+	    (current->flags & PF_MEMALLOC_PIN))
 		mlock_folio_batch(fbatch);
 	local_unlock(&mlock_fbatch.lock);
 }

>  			lru_add_drain_all();
>  			drain_allow = false;
>  		}

thanks,
-- 
John Hubbard

^ permalink raw reply related	[flat|nested] 20+ messages in thread
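For context on the flag being tested here: PF_MEMALLOC_PIN is set on
the task only for the duration of the longterm pin itself, roughly as
in this sketch (using the real memalloc_pin_save()/memalloc_pin_restore()
helpers from <linux/sched/mm.h>; the surrounding code is illustrative
only):

	unsigned int flags;

	flags = memalloc_pin_save();	/* sets PF_MEMALLOC_PIN on current */

	/*
	 * Work done here, including any faults taken from the GUP walk
	 * itself, runs with the flag set, so a check like the one above
	 * would see it.
	 */

	memalloc_pin_restore(flags);	/* restores the previous state */

Faults that populate the mapping before the pin starts run without the
flag, which turns out to be the catch later in the thread.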
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-16 1:03 ` John Hubbard @ 2025-08-16 4:33 ` Hugh Dickins 2025-08-18 13:38 ` Will Deacon 1 sibling, 0 replies; 20+ messages in thread From: Hugh Dickins @ 2025-08-16 4:33 UTC (permalink / raw) To: John Hubbard Cc: Will Deacon, linux-mm, linux-kernel, Hugh Dickins, Keir Fraser, Jason Gunthorpe, David Hildenbrand, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka On Fri, 15 Aug 2025, John Hubbard wrote: > On 8/15/25 3:18 AM, Will Deacon wrote: > > > > diff --git a/mm/gup.c b/mm/gup.c > > index adffe663594d..656835890f05 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios( > > continue; > > } > > > > - if (!folio_test_lru(folio) && drain_allow) { > > + if (drain_allow && > > + (!folio_test_lru(folio) || folio_test_mlocked(folio))) { > > That should work, yes. > > Alternatively, after thinking about this a bit today, it seems to me that the > mlock batching is a little too bold, given the presence of gup/pup. And so I'm > tempted to fix the problem closer to the root cause, like this (below). > > But maybe this is actually *less* wise than what you have proposed... > > I'd like to hear other mm folks' opinion on this approach: > > diff --git a/mm/mlock.c b/mm/mlock.c > index a1d93ad33c6d..edecdd32996e 100644 > --- a/mm/mlock.c > +++ b/mm/mlock.c > @@ -278,7 +278,15 @@ void mlock_new_folio(struct folio *folio) > > folio_get(folio); > if (!folio_batch_add(fbatch, mlock_new(folio)) || > - folio_test_large(folio) || lru_cache_disabled()) > + folio_test_large(folio) || lru_cache_disabled() || > + /* > + * If this is being called as part of a gup FOLL_LONGTERM operation in > + * CMA/MOVABLE zones with MLOCK_ONFAULT active, then the newly faulted > + * in folio will need to immediately migrate to a pinnable zone. > + * Allowing the mlock operation to batch would break the ability to > + * migrate the folio. Instead, force immediate processing. > + */ > + (current->flags & PF_MEMALLOC_PIN)) > mlock_folio_batch(fbatch); > local_unlock(&mlock_fbatch.lock); > } It's certainly worth considering this approach: it is consistent with the lru_cache_disabled() approach (but I'm not a great fan of the lru_cache_disabled() approach, often wonder how much damage it does). But I think you've placed this in the wrong function: mlock_new_folio() should already be satisfactorily handled, it's mlock_folio() that's the problematic one. I didn't know of PF_MEMALLOC_PIN at all: as you say, let's hear other opinions. Hugh ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-16 1:03 ` John Hubbard 2025-08-16 4:33 ` Hugh Dickins @ 2025-08-18 13:38 ` Will Deacon 1 sibling, 0 replies; 20+ messages in thread From: Will Deacon @ 2025-08-18 13:38 UTC (permalink / raw) To: John Hubbard Cc: linux-mm, linux-kernel, Hugh Dickins, Keir Fraser, Jason Gunthorpe, David Hildenbrand, Frederick Mayle, Andrew Morton, Peter Xu On Fri, Aug 15, 2025 at 06:03:17PM -0700, John Hubbard wrote: > On 8/15/25 3:18 AM, Will Deacon wrote: > > When taking a longterm GUP pin via pin_user_pages(), > > __gup_longterm_locked() tries to migrate target folios that should not > > be longterm pinned, for example because they reside in a CMA region or > > movable zone. This is done by first pinning all of the target folios > > anyway, collecting all of the longterm-unpinnable target folios into a > > list, dropping the pins that were just taken and finally handing the > > list off to migrate_pages() for the actual migration. > > > > It is critically important that no unexpected references are held on the > > folios being migrated, otherwise the migration will fail and > > pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is > > relatively easy to observe migration failures when running pKVM (which > > uses pin_user_pages() on crosvm's virtual address space to resolve > > stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and > > this results in the VM terminating prematurely. > > > > In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its > > mapping of guest memory prior to the pinning. Subsequently, when > > pin_user_pages() walks the page-table, the relevant 'pte' is not > > present and so the faulting logic allocates a new folio, mlocks it > > with mlock_folio() and maps it in the page-table. > > > > Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() > > batch by pagevec"), mlock/munlock operations on a folio (formerly page), > > are deferred. For example, mlock_folio() takes an additional reference > > on the target folio before placing it into a per-cpu 'folio_batch' for > > later processing by mlock_folio_batch(), which drops the refcount once > > the operation is complete. Processing of the batches is coupled with > > the LRU batch logic and can be forcefully drained with > > lru_add_drain_all() but as long as a folio remains unprocessed on the > > batch, its refcount will be elevated. > > > > This deferred batching therefore interacts poorly with the pKVM pinning > > I would go even a little broader (more general), and claim that this > deferred batching interacts poorly with gup FOLL_LONGTERM when trying > to pin folios in CMA or ZONE_MOVABLE, in fact. That's much better, thanks. > > diff --git a/mm/gup.c b/mm/gup.c > > index adffe663594d..656835890f05 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios( > > continue; > > } > > > > - if (!folio_test_lru(folio) && drain_allow) { > > + if (drain_allow && > > + (!folio_test_lru(folio) || folio_test_mlocked(folio))) { > > That should work, yes. > > Alternatively, after thinking about this a bit today, it seems to me that the > mlock batching is a little too bold, given the presence of gup/pup. And so I'm > tempted to fix the problem closer to the root cause, like this (below). > > But maybe this is actually *less* wise than what you have proposed... 
> > I'd like to hear other mm folks' opinion on this approach: > > diff --git a/mm/mlock.c b/mm/mlock.c > index a1d93ad33c6d..edecdd32996e 100644 > --- a/mm/mlock.c > +++ b/mm/mlock.c > @@ -278,7 +278,15 @@ void mlock_new_folio(struct folio *folio) > > folio_get(folio); > if (!folio_batch_add(fbatch, mlock_new(folio)) || > - folio_test_large(folio) || lru_cache_disabled()) > + folio_test_large(folio) || lru_cache_disabled() || > + /* > + * If this is being called as part of a gup FOLL_LONGTERM operation in > + * CMA/MOVABLE zones with MLOCK_ONFAULT active, then the newly faulted > + * in folio will need to immediately migrate to a pinnable zone. > + * Allowing the mlock operation to batch would break the ability to > + * migrate the folio. Instead, force immediate processing. > + */ > + (current->flags & PF_MEMALLOC_PIN)) > mlock_folio_batch(fbatch); > local_unlock(&mlock_fbatch.lock); > } So after Hugh's eagle eyes spotted mlock_folio() in my description, it turns out that the mlock happens on the user page fault path rather than during the pin itself. I think that means that checking for PF_MEMALLOC_PIN isn't going to work, as the pinning comes later. Hrm. I posted some stacktraces in my reply to Hugh that might help (and boy do I have plenty more of those). Will ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-15 10:18 [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration Will Deacon 2025-08-16 1:03 ` John Hubbard @ 2025-08-16 4:14 ` Hugh Dickins 2025-08-16 8:15 ` David Hildenbrand 2025-08-18 13:31 ` Will Deacon 1 sibling, 2 replies; 20+ messages in thread From: Hugh Dickins @ 2025-08-16 4:14 UTC (permalink / raw) To: Will Deacon Cc: linux-mm, linux-kernel, Hugh Dickins, Keir Fraser, Jason Gunthorpe, David Hildenbrand, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka On Fri, 15 Aug 2025, Will Deacon wrote: > When taking a longterm GUP pin via pin_user_pages(), > __gup_longterm_locked() tries to migrate target folios that should not > be longterm pinned, for example because they reside in a CMA region or > movable zone. This is done by first pinning all of the target folios > anyway, collecting all of the longterm-unpinnable target folios into a > list, dropping the pins that were just taken and finally handing the > list off to migrate_pages() for the actual migration. > > It is critically important that no unexpected references are held on the > folios being migrated, otherwise the migration will fail and > pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is > relatively easy to observe migration failures when running pKVM (which > uses pin_user_pages() on crosvm's virtual address space to resolve > stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and > this results in the VM terminating prematurely. > > In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its > mapping of guest memory prior to the pinning. Subsequently, when > pin_user_pages() walks the page-table, the relevant 'pte' is not > present and so the faulting logic allocates a new folio, mlocks it > with mlock_folio() and maps it in the page-table. > > Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() > batch by pagevec"), mlock/munlock operations on a folio (formerly page), > are deferred. For example, mlock_folio() takes an additional reference > on the target folio before placing it into a per-cpu 'folio_batch' for > later processing by mlock_folio_batch(), which drops the refcount once > the operation is complete. Processing of the batches is coupled with > the LRU batch logic and can be forcefully drained with > lru_add_drain_all() but as long as a folio remains unprocessed on the > batch, its refcount will be elevated. > > This deferred batching therefore interacts poorly with the pKVM pinning > scenario as we can find ourselves in a situation where the migration > code fails to migrate a folio due to the elevated refcount from the > pending mlock operation. Thanks for the very full description, Will, that helped me a lot (I know very little of the GUP pinning end). But one thing would help me to understand better: are the areas being pinned anonymous or shmem or file memory (or COWed shmem or file)? From "the faulting logic allocates a new folio" I first assumed anonymous, but later came to think "mlocks it with mlock_folio()" implies they are shmem or file folios (which, yes, can also be allocated by fault). IIRC anonymous and COW faults would go the mlock_new_folio() way, where the folio goes on to the mlock folio batch without having yet reached LRU: those should be dealt with by the existing !folio_test_lru() check. 
> > Extend the existing LRU draining logic in > collect_longterm_unpinnable_folios() so that unpinnable mlocked folios > on the LRU also trigger a drain. > > Cc: Hugh Dickins <hughd@google.com> > Cc: Keir Fraser <keirf@google.com> > Cc: Jason Gunthorpe <jgg@ziepe.ca> > Cc: David Hildenbrand <david@redhat.com> > Cc: John Hubbard <jhubbard@nvidia.com> > Cc: Frederick Mayle <fmayle@google.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Peter Xu <peterx@redhat.com> > Fixes: 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() batch by pagevec") > Signed-off-by: Will Deacon <will@kernel.org> > --- > > This has been quite unpleasant to debug and, as I'm not intimately > familiar with the mm internals, I've tried to include all the relevant > details in the commit message in case there's a preferred alternative > way of solving the problem or there's a flaw in my logic. > > mm/gup.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/mm/gup.c b/mm/gup.c > index adffe663594d..656835890f05 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios( > continue; > } > > - if (!folio_test_lru(folio) && drain_allow) { > + if (drain_allow && > + (!folio_test_lru(folio) || folio_test_mlocked(folio))) { > lru_add_drain_all(); > drain_allow = false; > } Hmm. That is going to call lru_add_drain_all() whenever any of the pages in the list is mlocked, and lru_add_drain_all() is a function we much prefer to avoid calling (it's much better than the old days when it could involve every CPU IPIing every other CPU at the same time; but it's still raising doubts to this day, and best avoided). (Not as bad as I first thought: those unpinnably-placed mlocked folios will get migrated, not appearing again in repeat runs.) I think replace the folio_test_mlocked(folio) part of it by (folio_test_mlocked(folio) && !folio_test_unevictable(folio)). That should reduce the extra calls to a much more reasonable number, while still solving your issue. But in addition, please add an unconditional lru_add_drain() (the local CPU one, not the inter-CPU _all) at the head of collect_longterm_unpinnable_folios(). My guess is that that would eliminate 90% of the calls to the lru_add_drain_all() below: not quite enough to satisfy you, but enough to be a good improvement. I realize that there has been a recent move to cut down even on unjustified lru_add_drain()s; but an lru_add_drain() to avoid an lru_add_drain_all() is a good trade. (Vlastimil, yes, I've Cc'ed you because this reminds me of my "Agreed" in that "Realtime threads" thread two or three weeks ago: I haven't rethought it through again, and will probably still agree with your "should be rare", but answering this mail forces me to realize that I was thinking there of the folio being mlocked before it reaches LRU, forgetting this case of the folio already on LRU being mlocked.) Hugh ^ permalink raw reply [flat|nested] 20+ messages in thread
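For illustration, folding both of Hugh's suggestions into the
collection loop would look roughly like this (an untested sketch of the
idea only, not a replacement for the patch under discussion):

	lru_add_drain();	/* cheap drain of this CPU's batches up front */

	for (folio = pofs_get_folio(pofs, i); folio;
	     folio = pofs_next_folio(folio, pofs, &i)) {
		/* ... pinnability checks ... */

		if (drain_allow &&
		    (!folio_test_lru(folio) ||
		     (folio_test_mlocked(folio) &&
		      !folio_test_unevictable(folio)))) {
			lru_add_drain_all();	/* remote drain, at most once */
			drain_allow = false;
		}

		/* ... isolate the folio for migration ... */
	}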
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-16 4:14 ` Hugh Dickins @ 2025-08-16 8:15 ` David Hildenbrand 2025-08-18 13:31 ` Will Deacon 1 sibling, 0 replies; 20+ messages in thread From: David Hildenbrand @ 2025-08-16 8:15 UTC (permalink / raw) To: Hugh Dickins, Will Deacon Cc: linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka On 16.08.25 06:14, Hugh Dickins wrote: > On Fri, 15 Aug 2025, Will Deacon wrote: > >> When taking a longterm GUP pin via pin_user_pages(), >> __gup_longterm_locked() tries to migrate target folios that should not >> be longterm pinned, for example because they reside in a CMA region or >> movable zone. This is done by first pinning all of the target folios >> anyway, collecting all of the longterm-unpinnable target folios into a >> list, dropping the pins that were just taken and finally handing the >> list off to migrate_pages() for the actual migration. >> >> It is critically important that no unexpected references are held on the >> folios being migrated, otherwise the migration will fail and >> pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is >> relatively easy to observe migration failures when running pKVM (which >> uses pin_user_pages() on crosvm's virtual address space to resolve >> stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and >> this results in the VM terminating prematurely. >> >> In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its >> mapping of guest memory prior to the pinning. Subsequently, when >> pin_user_pages() walks the page-table, the relevant 'pte' is not >> present and so the faulting logic allocates a new folio, mlocks it >> with mlock_folio() and maps it in the page-table. >> >> Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() >> batch by pagevec"), mlock/munlock operations on a folio (formerly page), >> are deferred. For example, mlock_folio() takes an additional reference >> on the target folio before placing it into a per-cpu 'folio_batch' for >> later processing by mlock_folio_batch(), which drops the refcount once >> the operation is complete. Processing of the batches is coupled with >> the LRU batch logic and can be forcefully drained with >> lru_add_drain_all() but as long as a folio remains unprocessed on the >> batch, its refcount will be elevated. >> >> This deferred batching therefore interacts poorly with the pKVM pinning >> scenario as we can find ourselves in a situation where the migration >> code fails to migrate a folio due to the elevated refcount from the >> pending mlock operation. > > Thanks for the very full description, Will, that helped me a lot > (I know very little of the GUP pinning end). > > But one thing would help me to understand better: are the areas being > pinned anonymous or shmem or file memory (or COWed shmem or file)? > > From "the faulting logic allocates a new folio" I first assumed > anonymous, but later came to think "mlocks it with mlock_folio()" > implies they are shmem or file folios (which, yes, can also be > allocated by fault). > > IIRC anonymous and COW faults would go the mlock_new_folio() way, > where the folio goes on to the mlock folio batch without having yet > reached LRU: those should be dealt with by the existing > !folio_test_lru() check. Very right. 
Because I don't remember easily running into that in wp_can_reuse_anon_folio(), where we also drain the lru if we spot a !lru folio. Having to enlighten all such code to check for if (!folio_test_lru(folio) || folio_test_mlocked(folio)) is a bit suboptimal, but we could add a helper for that. (we might end up draining for actually mlocked folios though :( ) We recently changed in 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding to LRU batch") that a folio can really only be once in an LRU batch. So if we spot !folio_test_lru(folio) there might be at most one reference from LRU batches. So naturally I wonder if we could somehow make it work that a folio is either on the ordinary LRU batch or on the mlock batch, but not on both. And that the folio LRU flag would be sufficient to check. That would mean that we would clear the LRU flag when adding a folio to the mlock batch. But not sure how feasible that is (e.g., folio already isolated). -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 20+ messages in thread
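A helper of the kind David mentions might look something like this; the
name and exact predicate are hypothetical (no such helper exists at
this point in the thread), and as he notes it could still trigger
drains for folios that are merely mlocked rather than actually sitting
on a batch:

	/*
	 * Hypothetical helper (illustration only): true if the folio might
	 * still be held on a per-CPU LRU or mlock folio_batch, i.e. a
	 * transient extra reference could be pending somewhere.
	 */
	static inline bool folio_maybe_in_lru_batch(struct folio *folio)
	{
		return !folio_test_lru(folio) || folio_test_mlocked(folio);
	}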
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-16 4:14 ` Hugh Dickins 2025-08-16 8:15 ` David Hildenbrand @ 2025-08-18 13:31 ` Will Deacon 2025-08-18 14:31 ` Will Deacon 1 sibling, 1 reply; 20+ messages in thread From: Will Deacon @ 2025-08-18 13:31 UTC (permalink / raw) To: Hugh Dickins Cc: linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe, David Hildenbrand, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka Hi Hugh, On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote: > On Fri, 15 Aug 2025, Will Deacon wrote: > > > When taking a longterm GUP pin via pin_user_pages(), > > __gup_longterm_locked() tries to migrate target folios that should not > > be longterm pinned, for example because they reside in a CMA region or > > movable zone. This is done by first pinning all of the target folios > > anyway, collecting all of the longterm-unpinnable target folios into a > > list, dropping the pins that were just taken and finally handing the > > list off to migrate_pages() for the actual migration. > > > > It is critically important that no unexpected references are held on the > > folios being migrated, otherwise the migration will fail and > > pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is > > relatively easy to observe migration failures when running pKVM (which > > uses pin_user_pages() on crosvm's virtual address space to resolve > > stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and > > this results in the VM terminating prematurely. > > > > In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its > > mapping of guest memory prior to the pinning. Subsequently, when > > pin_user_pages() walks the page-table, the relevant 'pte' is not > > present and so the faulting logic allocates a new folio, mlocks it > > with mlock_folio() and maps it in the page-table. > > > > Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() > > batch by pagevec"), mlock/munlock operations on a folio (formerly page), > > are deferred. For example, mlock_folio() takes an additional reference > > on the target folio before placing it into a per-cpu 'folio_batch' for > > later processing by mlock_folio_batch(), which drops the refcount once > > the operation is complete. Processing of the batches is coupled with > > the LRU batch logic and can be forcefully drained with > > lru_add_drain_all() but as long as a folio remains unprocessed on the > > batch, its refcount will be elevated. > > > > This deferred batching therefore interacts poorly with the pKVM pinning > > scenario as we can find ourselves in a situation where the migration > > code fails to migrate a folio due to the elevated refcount from the > > pending mlock operation. > > Thanks for the very full description, Will, that helped me a lot > (I know very little of the GUP pinning end). > > But one thing would help me to understand better: are the areas being > pinned anonymous or shmem or file memory (or COWed shmem or file)? crosvm is using memfd_create() so I think that means it's anonymous and shmem? I'm not entirely sure what the right terminology is for a memfd. > From "the faulting logic allocates a new folio" I first assumed > anonymous, but later came to think "mlocks it with mlock_folio()" > implies they are shmem or file folios (which, yes, can also be > allocated by fault). 
> 
> IIRC anonymous and COW faults would go the mlock_new_folio() way,
> where the folio goes on to the mlock folio batch without having yet
> reached LRU: those should be dealt with by the existing
> !folio_test_lru() check.

Most of my analysis has been constructed from backtraces when we've
managed to catch the problem. Thankfully, I've saved most of them, so I
went back to have a look and you're absolutely right -- it's _not_ the
pin_user_pages() which allocates the page in this case, so apologies
for getting that wrong in the commit message. I've clearly been reading
too many trace logs!

The page is allocated and mlocked just before the pin thanks to a user
page fault:

  crosvm_VmRunApp-6493 [007] ...2. 144.767288: page_ref_mod: pfn=0x9f83a8 flags=locked|uptodate|swapbacked|mlocked count=4 mapcount=0 mapping=00000000f0e9c5fd mt=4 val=1
  crosvm_VmRunApp-6493 [007] ...2. 144.767288: <stack trace>
 => trace_event_raw_event_page_ref_mod_template
 => __page_ref_mod
 => mlock_folio
 => folio_add_file_rmap_ptes
 => set_pte_range
 => finish_fault
 => do_pte_missing
 => handle_mm_fault
 => do_page_fault
 => do_translation_fault
 => do_mem_abort
 => el0_da
 => el0t_64_sync_handler
 => el0t_64_sync

and the pin_user_pages() comes in a little later (on a different CPU):

  crosvm_vcpu0-6499 [003] ...1. 144.786828: page_ref_mod: pfn=0x9f83a8 flags=uptodate|dirty|lru|swapbacked|unevictable|mlocked count=1027 mapcount=0 mapping=00000000f0e9c5fd mt=4 val=1024
  crosvm_vcpu0-6499 [003] ...1. 144.786832: <stack trace>
 => trace_event_raw_event_page_ref_mod_template
 => __page_ref_mod
 => try_grab_folio
 => follow_page_pte
 => __get_user_pages
 => __gup_longterm_locked
 => pin_user_pages
 => __pkvm_pin_user_pages
 => pkvm_mem_abort
 => pkvm_mem_abort_prefault
 => kvm_handle_guest_abort
 => handle_exit
 => kvm_arch_vcpu_ioctl_run
 => kvm_vcpu_ioctl
 => __arm64_sys_ioctl
 => invoke_syscall
 => el0_svc_common
 => do_el0_svc
 => el0_svc
 => el0t_64_sync_handler
 => el0t_64_sync

Between the two there's an interesting filemap fault triggering
readahead and that's what adds the folio to the LRU:

  crosvm_VmRunApp-6493 [007] ...1. 144.767358: page_ref_mod_and_test: pfn=0x9f83a8 flags=uptodate|dirty|lru|swapbacked|unevictable|mlocked count=3 mapcount=0 mapping=00000000f0e9c5fd mt=4 val=-1 ret=0
  crosvm_VmRunApp-6493 [007] ...1. 144.767359: <stack trace>
 => trace_event_raw_event_page_ref_mod_and_test_template
 => __page_ref_mod_and_test
 => folios_put_refs
 => folio_batch_move_lru
 => __folio_batch_add_and_move
 => folio_add_lru
 => filemap_add_folio
 => page_cache_ra_unbounded
 => page_cache_ra_order
 => do_sync_mmap_readahead
 => filemap_fault
 => __do_fault
 => do_pte_missing
 => handle_mm_fault
 => do_page_fault
 => do_translation_fault
 => do_mem_abort
 => el0_ia
 => el0t_64_sync_handler
 => el0t_64_sync

> > diff --git a/mm/gup.c b/mm/gup.c
> > index adffe663594d..656835890f05 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> >  			continue;
> >  		}
> >  
> > -		if (!folio_test_lru(folio) && drain_allow) {
> > +		if (drain_allow &&
> > +		    (!folio_test_lru(folio) || folio_test_mlocked(folio))) {
> >  			lru_add_drain_all();
> >  			drain_allow = false;
> >  		}
> 
> Hmm. That is going to call lru_add_drain_all() whenever any of the
> pages in the list is mlocked, and lru_add_drain_all() is a function
> we much prefer to avoid calling (it's much better than the old days
> when it could involve every CPU IPIing every other CPU at the same
> time; but it's still raising doubts to this day, and best avoided).
> > (Not as bad as I first thought: those unpinnably-placed mlocked > folios will get migrated, not appearing again in repeat runs.) > > I think replace the folio_test_mlocked(folio) part of it by > (folio_test_mlocked(folio) && !folio_test_unevictable(folio)). > That should reduce the extra calls to a much more reasonable > number, while still solving your issue. Alas, I fear that the folio may be unevictable by this point (which seems to coincide with the readahead fault adding it to the LRU above) but I can try it out. > But in addition, please add an unconditional lru_add_drain() > (the local CPU one, not the inter-CPU _all) at the head of > collect_longterm_unpinnable_folios(). My guess is that that > would eliminate 90% of the calls to the lru_add_drain_all() below: > not quite enough to satisfy you, but enough to be a good improvement. Sure, that's straightforward enough. We do already see an mlock_drain_local() from try_to_migrate_one() but that happens after all the unpinnable folios have been collected so it doesn't help us here. Will ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-18 13:31 ` Will Deacon @ 2025-08-18 14:31 ` Will Deacon 2025-08-25 1:25 ` Hugh Dickins 0 siblings, 1 reply; 20+ messages in thread From: Will Deacon @ 2025-08-18 14:31 UTC (permalink / raw) To: Hugh Dickins Cc: linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe, David Hildenbrand, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka On Mon, Aug 18, 2025 at 02:31:42PM +0100, Will Deacon wrote: > On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote: > > I think replace the folio_test_mlocked(folio) part of it by > > (folio_test_mlocked(folio) && !folio_test_unevictable(folio)). > > That should reduce the extra calls to a much more reasonable > > number, while still solving your issue. > > Alas, I fear that the folio may be unevictable by this point (which > seems to coincide with the readahead fault adding it to the LRU above) > but I can try it out. I gave this a spin but I still see failures with this change. Will ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-18 14:31 ` Will Deacon @ 2025-08-25 1:25 ` Hugh Dickins 2025-08-25 16:04 ` David Hildenbrand 2025-08-28 8:47 ` Hugh Dickins 0 siblings, 2 replies; 20+ messages in thread From: Hugh Dickins @ 2025-08-25 1:25 UTC (permalink / raw) To: Will Deacon, David Hildenbrand Cc: Hugh Dickins, linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang On Mon, 18 Aug 2025, Will Deacon wrote: > On Mon, Aug 18, 2025 at 02:31:42PM +0100, Will Deacon wrote: > > On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote: > > > I think replace the folio_test_mlocked(folio) part of it by > > > (folio_test_mlocked(folio) && !folio_test_unevictable(folio)). > > > That should reduce the extra calls to a much more reasonable > > > number, while still solving your issue. > > > > Alas, I fear that the folio may be unevictable by this point (which > > seems to coincide with the readahead fault adding it to the LRU above) > > but I can try it out. > > I gave this a spin but I still see failures with this change. Many thanks, Will, for the precisely relevant traces (in which, by the way, mapcount=0 really means _mapcount=0 hence mapcount=1). Yes, those do indeed illustrate a case which my suggested (folio_test_mlocked(folio) && !folio_test_unevictable(folio)) failed to cover. Very helpful to have an example of that. And many thanks, David, for your reminder of commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding to LRU batch"). Yes, I strongly agree with your suggestion that the mlock batch be brought into line with its change to the ordinary LRU batches, and agree that doing so will be likely to solve Will's issue (and similar cases elsewhere, without needing to modify them). Now I just have to cool my head and get back down into those mlock batches. I am fearful that making a change there to suit this case will turn out later to break another case (and I just won't have time to redevelop as thorough a grasp of the races as I had back then). But if we're lucky, applying that "one batch at a time" rule will actually make it all more comprehensible. (I so wish we had spare room in struct page to keep the address of that one batch entry, or the CPU to which that one batch belongs: then, although that wouldn't eliminate all uses of lru_add_drain_all(), it would allow us to efficiently extract a target page from its LRU batch without a remote drain.) I have not yet begun to write such a patch, and I'm not yet sure that it's even feasible: this mail sent to get the polite thank yous out of my mind, to help clear it for getting down to work. Hugh ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-25 1:25 ` Hugh Dickins @ 2025-08-25 16:04 ` David Hildenbrand 2025-08-28 8:47 ` Hugh Dickins 1 sibling, 0 replies; 20+ messages in thread From: David Hildenbrand @ 2025-08-25 16:04 UTC (permalink / raw) To: Hugh Dickins, Will Deacon Cc: linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang On 25.08.25 03:25, Hugh Dickins wrote: > On Mon, 18 Aug 2025, Will Deacon wrote: >> On Mon, Aug 18, 2025 at 02:31:42PM +0100, Will Deacon wrote: >>> On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote: >>>> I think replace the folio_test_mlocked(folio) part of it by >>>> (folio_test_mlocked(folio) && !folio_test_unevictable(folio)). >>>> That should reduce the extra calls to a much more reasonable >>>> number, while still solving your issue. >>> >>> Alas, I fear that the folio may be unevictable by this point (which >>> seems to coincide with the readahead fault adding it to the LRU above) >>> but I can try it out. >> >> I gave this a spin but I still see failures with this change. > > Many thanks, Will, for the precisely relevant traces (in which, > by the way, mapcount=0 really means _mapcount=0 hence mapcount=1). > > Yes, those do indeed illustrate a case which my suggested > (folio_test_mlocked(folio) && !folio_test_unevictable(folio)) > failed to cover. Very helpful to have an example of that. > > And many thanks, David, for your reminder of commit 33dfe9204f29 > ("mm/gup: clear the LRU flag of a page before adding to LRU batch"). > > Yes, I strongly agree with your suggestion that the mlock batch > be brought into line with its change to the ordinary LRU batches, > and agree that doing so will be likely to solve Will's issue > (and similar cases elsewhere, without needing to modify them). > > Now I just have to cool my head and get back down into those > mlock batches. I am fearful that making a change there to suit > this case will turn out later to break another case (and I just > won't have time to redevelop as thorough a grasp of the races as > I had back then). But if we're lucky, applying that "one batch > at a time" rule will actually make it all more comprehensible. > > (I so wish we had spare room in struct page to keep the address > of that one batch entry, or the CPU to which that one batch > belongs: then, although that wouldn't eliminate all uses of > lru_add_drain_all(), it would allow us to efficiently extract > a target page from its LRU batch without a remote drain.) I like the idea of identifying what exactly to drain, especially regarding remote LRU draining. With separately allocated folios we later might have that space, but it could mean growing the folio size, so it depends on other factors (and also how to store that information). For now, I don't think we have any space to store this ... briefly thought about using folio->lru for that purpose, but the whole reason for batching is to not mess with folio->lru modifications but instead to ... defer batch them :) -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-25 1:25 ` Hugh Dickins 2025-08-25 16:04 ` David Hildenbrand @ 2025-08-28 8:47 ` Hugh Dickins 2025-08-28 8:59 ` David Hildenbrand 2025-08-29 11:57 ` Will Deacon 1 sibling, 2 replies; 20+ messages in thread From: Hugh Dickins @ 2025-08-28 8:47 UTC (permalink / raw) To: Will Deacon Cc: Hugh Dickins, David Hildenbrand, linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang On Sun, 24 Aug 2025, Hugh Dickins wrote: > On Mon, 18 Aug 2025, Will Deacon wrote: > > On Mon, Aug 18, 2025 at 02:31:42PM +0100, Will Deacon wrote: > > > On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote: > > > > I think replace the folio_test_mlocked(folio) part of it by > > > > (folio_test_mlocked(folio) && !folio_test_unevictable(folio)). > > > > That should reduce the extra calls to a much more reasonable > > > > number, while still solving your issue. > > > > > > Alas, I fear that the folio may be unevictable by this point (which > > > seems to coincide with the readahead fault adding it to the LRU above) > > > but I can try it out. > > > > I gave this a spin but I still see failures with this change. > > Many thanks, Will, for the precisely relevant traces (in which, > by the way, mapcount=0 really means _mapcount=0 hence mapcount=1). > > Yes, those do indeed illustrate a case which my suggested > (folio_test_mlocked(folio) && !folio_test_unevictable(folio)) > failed to cover. Very helpful to have an example of that. > > And many thanks, David, for your reminder of commit 33dfe9204f29 > ("mm/gup: clear the LRU flag of a page before adding to LRU batch"). > > Yes, I strongly agree with your suggestion that the mlock batch > be brought into line with its change to the ordinary LRU batches, > and agree that doing so will be likely to solve Will's issue > (and similar cases elsewhere, without needing to modify them). > > Now I just have to cool my head and get back down into those > mlock batches. I am fearful that making a change there to suit > this case will turn out later to break another case (and I just > won't have time to redevelop as thorough a grasp of the races as > I had back then). But if we're lucky, applying that "one batch > at a time" rule will actually make it all more comprehensible. > > (I so wish we had spare room in struct page to keep the address > of that one batch entry, or the CPU to which that one batch > belongs: then, although that wouldn't eliminate all uses of > lru_add_drain_all(), it would allow us to efficiently extract > a target page from its LRU batch without a remote drain.) > > I have not yet begun to write such a patch, and I'm not yet sure > that it's even feasible: this mail sent to get the polite thank > yous out of my mind, to help clear it for getting down to work. It took several days in search of the least bad compromise, but in the end I concluded the opposite of what we'd intended above. There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() batch by pagevec") and Ge Yang's 6.11 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding to LRU batch"). It turns out that the mm/swap.c folio batches (apart from lru_add) are all for best-effort, doesn't matter if it's missed, operations; whereas mlock and munlock are more serious. Probably mlock could be (not very satisfactorily) converted, but then munlock? 
Because of failed folio_test_clear_lru()s, it would be far too likely
to err on either side, munlocking too soon or too late.

I've concluded that one or the other has to go. If we're having a
beauty contest, there's no doubt that 33dfe9204f29 is much nicer than
2fbb0c10d1e8 (which is itself far from perfect). But functionally, I'm
afraid that removing the mlock/munlock batching will show up as a
perceptible regression in realistic workloads; and on consideration,
I've found no real justification for the LRU flag clearing change.

Unless I'm mistaken, collect_longterm_unpinnable_folios() should never
have been relying on folio_test_lru(), and should simply be checking
for expected ref_count instead.

Will, please give the portmanteau patch (combination of four) below a
try: reversion of 33dfe9204f29 and a later MGLRU fixup, corrected test
in collect...(), preparatory lru_add_drain() there.

I hope you won't be proving me wrong again, and I can move on to
writing up those four patches (and adding probably three more that
make sense in such a series, but should not affect your testing). I've
tested enough to know that it's not harmful, but am hoping to take
advantage of your superior testing, particularly in the GUP pin area.

But if you're uneasy with the combination, and would prefer to check
just the minimum, then ignore the reversions and try just the mm/gup.c
part of it - that will probably be good enough for you even without
the reversions.

Patch is against 6.17-rc3; but if you'd prefer the patch against 6.12
(or an intervening release), I already did the backport so please just
ask.

Thanks!

 mm/gup.c    |  5 ++++-
 mm/swap.c   | 50 ++++++++++++++++++++++++++------------------------
 mm/vmscan.c |  2 +-
 3 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index adffe663594d..9f7c87f504a9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2291,6 +2291,8 @@ static unsigned long collect_longterm_unpinnable_folios(
 	struct folio *folio;
 	long i = 0;
 
+	lru_add_drain();
+
 	for (folio = pofs_get_folio(pofs, i); folio;
 	     folio = pofs_next_folio(folio, pofs, &i)) {
 
@@ -2307,7 +2309,8 @@ static unsigned long collect_longterm_unpinnable_folios(
 			continue;
 		}
 
-		if (!folio_test_lru(folio) && drain_allow) {
+		if (drain_allow && folio_ref_count(folio) !=
+				   folio_expected_ref_count(folio) + 1) {
 			lru_add_drain_all();
 			drain_allow = false;
 		}
diff --git a/mm/swap.c b/mm/swap.c
index 3632dd061beb..8ef3dea20e39 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -164,6 +164,10 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 	for (i = 0; i < folio_batch_count(fbatch); i++) {
 		struct folio *folio = fbatch->folios[i];
 
+		/* block memcg migration while the folio moves between lru */
+		if (move_fn != lru_add && !folio_test_clear_lru(folio))
+			continue;
+
 		folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
 		move_fn(lruvec, folio);
 
@@ -176,14 +180,10 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 }
 
 static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
-		struct folio *folio, move_fn_t move_fn,
-		bool on_lru, bool disable_irq)
+		struct folio *folio, move_fn_t move_fn, bool disable_irq)
 {
 	unsigned long flags;
 
-	if (on_lru && !folio_test_clear_lru(folio))
-		return;
-
 	folio_get(folio);
 
 	if (disable_irq)
@@ -191,8 +191,8 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
 	else
 		local_lock(&cpu_fbatches.lock);
 
-	if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || folio_test_large(folio) ||
-	    lru_cache_disabled())
+	if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
+	    folio_test_large(folio) || lru_cache_disabled())
 		folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
 
 	if (disable_irq)
@@ -201,13 +201,13 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
 		local_unlock(&cpu_fbatches.lock);
 }
 
-#define folio_batch_add_and_move(folio, op, on_lru)					\
-	__folio_batch_add_and_move(							\
-		&cpu_fbatches.op,							\
-		folio,									\
-		op,									\
-		on_lru,									\
-		offsetof(struct cpu_fbatches, op) >= offsetof(struct cpu_fbatches, lock_irq)	\
+#define folio_batch_add_and_move(folio, op)						\
+	__folio_batch_add_and_move(							\
+		&cpu_fbatches.op,							\
+		folio,									\
+		op,									\
+		offsetof(struct cpu_fbatches, op) >=					\
+		offsetof(struct cpu_fbatches, lock_irq)					\
 	)
 
 static void lru_move_tail(struct lruvec *lruvec, struct folio *folio)
@@ -231,10 +231,10 @@ static void lru_move_tail(struct lruvec *lruvec, struct folio *folio)
 void folio_rotate_reclaimable(struct folio *folio)
 {
 	if (folio_test_locked(folio) || folio_test_dirty(folio) ||
-	    folio_test_unevictable(folio))
+	    folio_test_unevictable(folio) || !folio_test_lru(folio))
 		return;
 
-	folio_batch_add_and_move(folio, lru_move_tail, true);
+	folio_batch_add_and_move(folio, lru_move_tail);
 }
 
 void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
@@ -328,10 +328,11 @@ static void folio_activate_drain(int cpu)
 
 void folio_activate(struct folio *folio)
 {
-	if (folio_test_active(folio) || folio_test_unevictable(folio))
+	if (folio_test_active(folio) || folio_test_unevictable(folio) ||
+	    !folio_test_lru(folio))
 		return;
 
-	folio_batch_add_and_move(folio, lru_activate, true);
+	folio_batch_add_and_move(folio, lru_activate);
 }
 
 #else
@@ -507,7 +508,7 @@ void folio_add_lru(struct folio *folio)
 	    lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
 		folio_set_active(folio);
 
-	folio_batch_add_and_move(folio, lru_add, false);
+	folio_batch_add_and_move(folio, lru_add);
 }
 EXPORT_SYMBOL(folio_add_lru);
 
@@ -685,13 +686,13 @@ void lru_add_drain_cpu(int cpu)
 void deactivate_file_folio(struct folio *folio)
 {
 	/* Deactivating an unevictable folio will not accelerate reclaim */
-	if (folio_test_unevictable(folio))
+	if (folio_test_unevictable(folio) || !folio_test_lru(folio))
 		return;
 
 	if (lru_gen_enabled() && lru_gen_clear_refs(folio))
 		return;
 
-	folio_batch_add_and_move(folio, lru_deactivate_file, true);
+	folio_batch_add_and_move(folio, lru_deactivate_file);
 }
 
 /*
@@ -704,13 +705,13 @@ void deactivate_file_folio(struct folio *folio)
  */
 void folio_deactivate(struct folio *folio)
 {
-	if (folio_test_unevictable(folio))
+	if (folio_test_unevictable(folio) || !folio_test_lru(folio))
 		return;
 
 	if (lru_gen_enabled() ? lru_gen_clear_refs(folio) : !folio_test_active(folio))
 		return;
 
-	folio_batch_add_and_move(folio, lru_deactivate, true);
+	folio_batch_add_and_move(folio, lru_deactivate);
 }
 
 /**
@@ -723,10 +724,11 @@ void folio_deactivate(struct folio *folio)
 void folio_mark_lazyfree(struct folio *folio)
 {
 	if (!folio_test_anon(folio) || !folio_test_swapbacked(folio) ||
+	    !folio_test_lru(folio) ||
 	    folio_test_swapcache(folio) || folio_test_unevictable(folio))
 		return;
 
-	folio_batch_add_and_move(folio, lru_lazyfree, true);
+	folio_batch_add_and_move(folio, lru_lazyfree);
 }
 
 void lru_add_drain(void)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a48aec8bfd92..674999999cd0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4507,7 +4507,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	}
 
 	/* ineligible */
-	if (!folio_test_lru(folio) || zone > sc->reclaim_idx) {
+	if (zone > sc->reclaim_idx) {
 		gen = folio_inc_gen(lruvec, folio, false);
 		list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
 		return true;

^ permalink raw reply related	[flat|nested] 20+ messages in thread
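The mm/gup.c hunk above is the heart of the series: instead of guessing
from the LRU flag, it keys the drain off the one thing migration
actually cares about, an unexpected reference. Annotated for clarity
(the comment is editorial, the code is from the patch above):

		/*
		 * folio_expected_ref_count() accounts for the references the
		 * kernel expects from the folio's page-table mappings and its
		 * page/swap cache presence; the "+ 1" is the pin we ourselves
		 * hold.  Anything beyond that - for example the extra
		 * reference held by a pending per-CPU LRU or mlock batch -
		 * would defeat migrate_pages(), so only then is the expensive
		 * lru_add_drain_all() issued, and at most once.
		 */
		if (drain_allow && folio_ref_count(folio) !=
				   folio_expected_ref_count(folio) + 1) {
			lru_add_drain_all();
			drain_allow = false;
		}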
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-28 8:47 ` Hugh Dickins @ 2025-08-28 8:59 ` David Hildenbrand 2025-08-28 16:12 ` Hugh Dickins 2025-08-29 11:57 ` Will Deacon 1 sibling, 1 reply; 20+ messages in thread From: David Hildenbrand @ 2025-08-28 8:59 UTC (permalink / raw) To: Hugh Dickins, Will Deacon Cc: linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang On 28.08.25 10:47, Hugh Dickins wrote: > On Sun, 24 Aug 2025, Hugh Dickins wrote: >> On Mon, 18 Aug 2025, Will Deacon wrote: >>> On Mon, Aug 18, 2025 at 02:31:42PM +0100, Will Deacon wrote: >>>> On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote: >>>>> I think replace the folio_test_mlocked(folio) part of it by >>>>> (folio_test_mlocked(folio) && !folio_test_unevictable(folio)). >>>>> That should reduce the extra calls to a much more reasonable >>>>> number, while still solving your issue. >>>> >>>> Alas, I fear that the folio may be unevictable by this point (which >>>> seems to coincide with the readahead fault adding it to the LRU above) >>>> but I can try it out. >>> >>> I gave this a spin but I still see failures with this change. >> >> Many thanks, Will, for the precisely relevant traces (in which, >> by the way, mapcount=0 really means _mapcount=0 hence mapcount=1). >> >> Yes, those do indeed illustrate a case which my suggested >> (folio_test_mlocked(folio) && !folio_test_unevictable(folio)) >> failed to cover. Very helpful to have an example of that. >> >> And many thanks, David, for your reminder of commit 33dfe9204f29 >> ("mm/gup: clear the LRU flag of a page before adding to LRU batch"). >> >> Yes, I strongly agree with your suggestion that the mlock batch >> be brought into line with its change to the ordinary LRU batches, >> and agree that doing so will be likely to solve Will's issue >> (and similar cases elsewhere, without needing to modify them). >> >> Now I just have to cool my head and get back down into those >> mlock batches. I am fearful that making a change there to suit >> this case will turn out later to break another case (and I just >> won't have time to redevelop as thorough a grasp of the races as >> I had back then). But if we're lucky, applying that "one batch >> at a time" rule will actually make it all more comprehensible. >> >> (I so wish we had spare room in struct page to keep the address >> of that one batch entry, or the CPU to which that one batch >> belongs: then, although that wouldn't eliminate all uses of >> lru_add_drain_all(), it would allow us to efficiently extract >> a target page from its LRU batch without a remote drain.) >> >> I have not yet begun to write such a patch, and I'm not yet sure >> that it's even feasible: this mail sent to get the polite thank >> yous out of my mind, to help clear it for getting down to work. > > It took several days in search of the least bad compromise, but > in the end I concluded the opposite of what we'd intended above. > > There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8 > ("mm/munlock: mlock_page() munlock_page() batch by pagevec") > and Ge Yang's 6.11 33dfe9204f29 > ("mm/gup: clear the LRU flag of a page before adding to LRU batch"). > > It turns out that the mm/swap.c folio batches (apart from lru_add) > are all for best-effort, doesn't matter if it's missed, operations; > whereas mlock and munlock are more serious. 
Probably mlock could > be (not very satisfactorily) converted, but then munlock? Because > of failed folio_test_clear_lru()s, it would be far too likely to > err on either side, munlocking too soon or too late. > > I've concluded that one or the other has to go. If we're having > a beauty contest, there's no doubt that 33dfe9204f29 is much nicer > than 2fbb0c10d1e8 (which is itself far from perfect). But functionally, > I'm afraid that removing the mlock/munlock batching will show up as a > perceptible regression in realistic workloadsg; and on consideration, > I've found no real justification for the LRU flag clearing change. Just to understand what you are saying: are you saying that we will go back to having a folio being part of multiple LRU caches? :/ If so, I really really hope that we can find another way and not go back to that old handling. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 20+ messages in thread
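To make the distinction above concrete, the two batching disciplines under discussion can be sketched roughly as follows. This is a condensed illustration rather than the upstream code: both function names are invented for the comparison, and process_full_batch() only stands in for the real flush path (folio_batch_move_lru() and friends).

#include <linux/mm.h>
#include <linux/pagevec.h>

/* Stand-in for the real flush path; here it only drops the batch's references. */
static void process_full_batch(struct folio_batch *fbatch)
{
	folio_batch_release(fbatch);
}

/*
 * Discipline introduced by 33dfe9204f29 for the ordinary LRU batches:
 * only add the folio if its LRU flag can be cleared first, so it sits
 * on at most one per-CPU batch at a time.
 */
static void batch_add_single_home(struct folio_batch *fbatch,
				  struct folio *folio)
{
	if (!folio_test_clear_lru(folio))
		return;		/* already claimed by some other batch */
	folio_get(folio);
	if (!folio_batch_add(fbatch, folio))
		process_full_batch(fbatch);
}

/*
 * Pre-33dfe9204f29 discipline, and what the mlock/munlock batches do:
 * leave the LRU flag alone and just take a reference, so one folio may
 * be referenced by several batches at once.
 */
static void batch_add_multi_home(struct folio_batch *fbatch,
				 struct folio *folio)
{
	folio_get(folio);
	if (!folio_batch_add(fbatch, folio))
		process_full_batch(fbatch);
}

The disagreement here is about which of these rules survives: the first makes "not on the LRU implies at most one batch reference" hold for the ordinary batches, but, as explained above, it could not be made to work for mlock/munlock without risking munlocking too soon or too late.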
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-28 8:59 ` David Hildenbrand @ 2025-08-28 16:12 ` Hugh Dickins 2025-08-28 20:38 ` David Hildenbrand 0 siblings, 1 reply; 20+ messages in thread From: Hugh Dickins @ 2025-08-28 16:12 UTC (permalink / raw) To: David Hildenbrand Cc: Hugh Dickins, Will Deacon, linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang On Thu, 28 Aug 2025, David Hildenbrand wrote: > On 28.08.25 10:47, Hugh Dickins wrote: ... > > It took several days in search of the least bad compromise, but > > in the end I concluded the opposite of what we'd intended above. > > > > There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8 > > ("mm/munlock: mlock_page() munlock_page() batch by pagevec") > > and Ge Yang's 6.11 33dfe9204f29 > > ("mm/gup: clear the LRU flag of a page before adding to LRU batch"). > > > > It turns out that the mm/swap.c folio batches (apart from lru_add) > > are all for best-effort, doesn't matter if it's missed, operations; > > whereas mlock and munlock are more serious. Probably mlock could > > be (not very satisfactorily) converted, but then munlock? Because > > of failed folio_test_clear_lru()s, it would be far too likely to > > err on either side, munlocking too soon or too late. > > > > I've concluded that one or the other has to go. If we're having > > a beauty contest, there's no doubt that 33dfe9204f29 is much nicer > > than 2fbb0c10d1e8 (which is itself far from perfect). But functionally, > > I'm afraid that removing the mlock/munlock batching will show up as a > > perceptible regression in realistic workloadsg; and on consideration, > > I've found no real justification for the LRU flag clearing change. > > Just to understand what you are saying: are you saying that we will go back to > having a folio being part of multiple LRU caches? Yes. Well, if you count the mlock/munlock batches in as "LRU caches", then that has been so all along. > :/ If so, I really rally > hope that we can find another way and not go back to that old handling. For what reason? It sounded like a nice "invariant" to keep in mind, but it's a limitation, and what purpose was it actually serving? If it's the "spare room in struct page to keep the address of that one batch entry ... efficiently extract ..." that I was dreaming of: that has to be a future thing, when perhaps memdescs will allow an extensible structure to be attached, and extending it for an mlocked folio (to hold the mlock_count instead of squeezing it into lru.prev) would not need mlock/munlock batching at all (I guess: far from uppermost in my mind!), and including a field for "efficiently extract" from LRU batch would be nice. But, memdescs or not, there will always be pressure to keep the common struct as small as possible, so I don't know if we would actually go that way. But I suspect that was not your reason at all: please illuminate. Hugh ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-28 16:12 ` Hugh Dickins @ 2025-08-28 20:38 ` David Hildenbrand 2025-08-29 1:58 ` Hugh Dickins 0 siblings, 1 reply; 20+ messages in thread From: David Hildenbrand @ 2025-08-28 20:38 UTC (permalink / raw) To: Hugh Dickins Cc: Will Deacon, linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang On 28.08.25 18:12, Hugh Dickins wrote: > On Thu, 28 Aug 2025, David Hildenbrand wrote: >> On 28.08.25 10:47, Hugh Dickins wrote: > ... >>> It took several days in search of the least bad compromise, but >>> in the end I concluded the opposite of what we'd intended above. >>> >>> There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8 >>> ("mm/munlock: mlock_page() munlock_page() batch by pagevec") >>> and Ge Yang's 6.11 33dfe9204f29 >>> ("mm/gup: clear the LRU flag of a page before adding to LRU batch"). >>> >>> It turns out that the mm/swap.c folio batches (apart from lru_add) >>> are all for best-effort, doesn't matter if it's missed, operations; >>> whereas mlock and munlock are more serious. Probably mlock could >>> be (not very satisfactorily) converted, but then munlock? Because >>> of failed folio_test_clear_lru()s, it would be far too likely to >>> err on either side, munlocking too soon or too late. >>> >>> I've concluded that one or the other has to go. If we're having >>> a beauty contest, there's no doubt that 33dfe9204f29 is much nicer >>> than 2fbb0c10d1e8 (which is itself far from perfect). But functionally, >>> I'm afraid that removing the mlock/munlock batching will show up as a >>> perceptible regression in realistic workloadsg; and on consideration, >>> I've found no real justification for the LRU flag clearing change. >> >> Just to understand what you are saying: are you saying that we will go back to >> having a folio being part of multiple LRU caches? > > Yes. Well, if you count the mlock/munlock batches in as "LRU caches", > then that has been so all along. Yes ... > >> :/ If so, I really rally >> hope that we can find another way and not go back to that old handling. > > For what reason? It sounded like a nice "invariant" to keep in mind, > but it's a limitation, and what purpose was it actually serving? I liked the semantics that if !lru, there could be at most one reference from the LRU caches. That is, if there are two references, you don't even have to bother with flushing anything. > > If it's the "spare room in struct page to keep the address of that one > batch entry ... efficiently extract ..." that I was dreaming of: that > has to be a future thing, when perhaps memdescs will allow an extensible > structure to be attached, and extending it for an mlocked folio (to hold > the mlock_count instead of squeezing it into lru.prev) would not need > mlock/munlock batching at all (I guess: far from uppermost in my mind!), > and including a field for "efficiently extract" from LRU batch would be > nice. > > But, memdescs or not, there will always be pressure to keep the common > struct as small as possible, so I don't know if we would actually go > that way. > > But I suspect that was not your reason at all: please illuminate. You are very right :) Regarding the issue at hand: There were discussions at LSF/MM about also putting (some) large folios onto the LRU caches. 
In that context, GUP could take multiple references on the same folio, and a simple folio_expected_ref_count() + 1 would no longer do the trick. I thought about this today, and likely it could be handled by scanning the page array for same folios etc. A bit nasty when wanting to cover all corner cases (folio pages must not be consecutive in the passed array ) ... Apart from that issue, I liked the idea of a "single entry in the cache" for other reasons: it gives clear semantics. We cannot end up in a scenario where someone performs OPX and later someone OPY on a folio, but the way the lru caches are flushed we might end up processing OPX after OPY -- this should be able to happen in case of local or remote flushes IIRC. Anyhow, I quickly scanned your code. The folio_expected_ref_count() should solve the issue for now. It's quite unfortunate that any raised reference will make us now flush all remote LRU caches. Maybe we just want to limit it to !folio_test_large(), because flushing there really doesn't give us any chance in succeeding right now? Not sure if it makes any difference in practice, though, we'll likely be flushing remote LRU caches now more frequently either way. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 20+ messages in thread
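The concern about a bare folio_expected_ref_count() + 1 test, once a single GUP call can hold several pins on one large folio, could in principle be met by scanning the pinned array as suggested. A hypothetical helper, purely for illustration (count_pins_of_folio() does not exist in the kernel), might look like:

#include <linux/mm.h>

/*
 * Hypothetical: count how many entries of a pinned page array belong to
 * @folio.  Pages of the same folio need not be adjacent in the array,
 * hence the full scan; doing this per folio is O(n^2), which is the
 * nastiness alluded to above.
 */
static long count_pins_of_folio(struct page **pages, unsigned long npages,
				struct folio *folio)
{
	unsigned long i;
	long npins = 0;

	for (i = 0; i < npages; i++)
		if (page_folio(pages[i]) == folio)
			npins++;

	return npins;
}

The drain condition in collect_longterm_unpinnable_folios() would then compare folio_ref_count(folio) against folio_expected_ref_count(folio) + count_pins_of_folio(...) rather than + 1.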
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-28 20:38 ` David Hildenbrand @ 2025-08-29 1:58 ` Hugh Dickins 2025-08-29 8:56 ` David Hildenbrand 0 siblings, 1 reply; 20+ messages in thread From: Hugh Dickins @ 2025-08-29 1:58 UTC (permalink / raw) To: David Hildenbrand Cc: Hugh Dickins, Will Deacon, linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang On Thu, 28 Aug 2025, David Hildenbrand wrote: > On 28.08.25 18:12, Hugh Dickins wrote: > > On Thu, 28 Aug 2025, David Hildenbrand wrote: > >> On 28.08.25 10:47, Hugh Dickins wrote: > > ... > >>> It took several days in search of the least bad compromise, but > >>> in the end I concluded the opposite of what we'd intended above. > >>> > >>> There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8 > >>> ("mm/munlock: mlock_page() munlock_page() batch by pagevec") > >>> and Ge Yang's 6.11 33dfe9204f29 > >>> ("mm/gup: clear the LRU flag of a page before adding to LRU batch"). > >>> > >>> It turns out that the mm/swap.c folio batches (apart from lru_add) > >>> are all for best-effort, doesn't matter if it's missed, operations; > >>> whereas mlock and munlock are more serious. Probably mlock could > >>> be (not very satisfactorily) converted, but then munlock? Because > >>> of failed folio_test_clear_lru()s, it would be far too likely to > >>> err on either side, munlocking too soon or too late. > >>> > >>> I've concluded that one or the other has to go. If we're having > >>> a beauty contest, there's no doubt that 33dfe9204f29 is much nicer > >>> than 2fbb0c10d1e8 (which is itself far from perfect). But functionally, > >>> I'm afraid that removing the mlock/munlock batching will show up as a > >>> perceptible regression in realistic workloadsg; and on consideration, > >>> I've found no real justification for the LRU flag clearing change. > >> > >> Just to understand what you are saying: are you saying that we will go back > >> to > >> having a folio being part of multiple LRU caches? > > > > Yes. Well, if you count the mlock/munlock batches in as "LRU caches", > > then that has been so all along. > > Yes ... > > > > >> :/ If so, I really rally > >> hope that we can find another way and not go back to that old handling. > > > > For what reason? It sounded like a nice "invariant" to keep in mind, > > but it's a limitation, and what purpose was it actually serving? > > I liked the semantics that if !lru, there could be at most one reference from > the LRU caches. > > That is, if there are two references, you don't even have to bother with > flushing anything. If that assumption is being put into practice anywhere (not that I know of), then it's currently wrong and needs currecting. It would be nice and simple if it worked, but I couldn't get it to work with mlock/munlock batching, so it seemed better to give up on the pretence. And one of the points of using a pagevec used to be precisely that a page could exist on more than one at a time (unlike threading via lru). > > > > > If it's the "spare room in struct page to keep the address of that one > > batch entry ... efficiently extract ..." 
that I was dreaming of: that > > has to be a future thing, when perhaps memdescs will allow an extensible > > structure to be attached, and extending it for an mlocked folio (to hold > > the mlock_count instead of squeezing it into lru.prev) would not need > > mlock/munlock batching at all (I guess: far from uppermost in my mind!), > > and including a field for "efficiently extract" from LRU batch would be > > nice. > > > > But, memdescs or not, there will always be pressure to keep the common > > struct as small as possible, so I don't know if we would actually go > > that way. > > > > But I suspect that was not your reason at all: please illuminate. > > You are very right :) OK, thanks, I'll stop reading further now :) > > Regarding the issue at hand: > > There were discussions at LSF/MM about also putting (some) large folios onto > the LRU caches. > > In that context, GUP could take multiple references on the same folio, and a > simple folio_expected_ref_count() + 1 would no longer do the trick. > > I thought about this today, and likely it could be handled by scanning the > page array for same folios etc. A bit nasty when wanting to cover all corner > cases (folio pages must not be consecutive in the passed array ) ... I haven't thought about that problem at all (except when working around a similar issue in mm/mempolicy.c's folio queueing), but can sympathize. It had irritated me to notice how the flush-immediately code for 512-page folios now extends to 2-page folios: you've enlightened me why that remains so, I hadn't thought of the implications. Perhaps even more reason to allow for multiple references on the pagevecs/batches? > > Apart from that issue, I liked the idea of a "single entry in the cache" for > other reasons: it gives clear semantics. We cannot end up in a scenario where > someone performs OPX and later someone OPY on a folio, but the way the lru > caches are flushed we might end up processing OPX after OPY -- this should be > able to happen in case of local or remote flushes IIRC. It's been that way for many years, that's how they are. > > Anyhow, I quickly scanned your code. The folio_expected_ref_count() should > solve the issue for now. It's quite unfortunate that any raised reference will > make us now flush all remote LRU caches. There will be more false positives (drains which drain nothing relevant) by relying on ref_count rather than test_lru, yes. But certainly there were already false positives by relying on test_lru. And I expect the preparatory local lru_add_drain() to cut out a lot of true positives. > > Maybe we just want to limit it to !folio_test_large(), because flushing there > really doesn't give us any chance in succeeding right now? Not sure if it > makes any difference in practice, though, we'll likely be flushing remote LRU > caches now more frequently either way. Ah, good idea. with or without my changes. Maybe material for a separate patch. I wonder if we would do better to add a folio_is_batchable() and use that consistently in all of the places which are refusing to batch when folio_test_large() - I wonder if a !folio_test_large() here will get missed if there's a change there. Thanks, Hugh ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-29 1:58 ` Hugh Dickins @ 2025-08-29 8:56 ` David Hildenbrand 0 siblings, 0 replies; 20+ messages in thread From: David Hildenbrand @ 2025-08-29 8:56 UTC (permalink / raw) To: Hugh Dickins Cc: Will Deacon, linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang >>> >>>> :/ If so, I really rally >>>> hope that we can find another way and not go back to that old handling. >>> >>> For what reason? It sounded like a nice "invariant" to keep in mind, >>> but it's a limitation, and what purpose was it actually serving? >> >> I liked the semantics that if !lru, there could be at most one reference from >> the LRU caches. >> >> That is, if there are two references, you don't even have to bother with >> flushing anything. > > If that assumption is being put into practice anywhere (not that I know of), > then it's currently wrong and needs currecting. We do make use of that property in wp_can_reuse_anon_folio(), to just fail fast -- that's what Linus argued for back then: give up fast and just create a page copy. Essentially hidden in the "folio_ref_count(folio) > 3" and "!folio_test_lru(folio)" code path. mlock caches would not be considered right now. Now, in contrast to page pinning, that code is pure best effort, and we don't care that much about always getting it right (e.g., no remote draining, no draining if there is a clear indication that it might help). > > It would be nice and simple if it worked, but I couldn't get it to work > with mlock/munlock batching, so it seemed better to give up on the > pretence. Thanks for trying! [...] >> Regarding the issue at hand: >> >> There were discussions at LSF/MM about also putting (some) large folios onto >> the LRU caches. >> >> In that context, GUP could take multiple references on the same folio, and a >> simple folio_expected_ref_count() + 1 would no longer do the trick. >> >> I thought about this today, and likely it could be handled by scanning the >> page array for same folios etc. A bit nasty when wanting to cover all corner >> cases (folio pages must not be consecutive in the passed array ) ... > > I haven't thought about that problem at all (except when working around > a similar issue in mm/mempolicy.c's folio queueing), but can sympathize. > > It had irritated me to notice how the flush-immediately code for 512-page > folios now extends to 2-page folios: you've enlightened me why that remains > so, I hadn't thought of the implications. Perhaps even more reason to > allow for multiple references on the pagevecs/batches? Not sure TBH if multiple references from pagevecs/batches really play a role here :) > >> >> Apart from that issue, I liked the idea of a "single entry in the cache" for >> other reasons: it gives clear semantics. We cannot end up in a scenario where >> someone performs OPX and later someone OPY on a folio, but the way the lru >> caches are flushed we might end up processing OPX after OPY -- this should be >> able to happen in case of local or remote flushes IIRC. > > It's been that way for many years, that's how they are. Yeah, but I enjoyed clearer semantics: like a folio_activate() actually resulting in an activation, not getting swallowed by a folio_deactivate() stranded somewhere. >> >> Maybe we just want to limit it to !folio_test_large(), because flushing there >> really doesn't give us any chance in succeeding right now? 
Not sure if it >> makes any difference in practice, though, we'll likely be flushing remote LRU >> caches now more frequently either way. > > Ah, good idea. with or without my changes. Maybe material for a separate > patch. I wonder if we would do better to add a folio_is_batchable() and > use that consistently in all of the places which are refusing to batch > when folio_test_large() - I wonder if a !folio_test_large() here will > get missed if there's a change there. You read my mind. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 20+ messages in thread
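The folio_is_batchable() helper being agreed on here does not exist at this point; a minimal sketch of the proposal, under the current policy that only small folios go onto the per-CPU batches, would be:

#include <linux/mm.h>

/*
 * Hypothetical helper, as floated above: a single predicate for "may be
 * put on a per-CPU folio batch", so that call sites which today
 * open-code !folio_test_large() cannot drift apart if the policy for
 * large folios ever changes.
 */
static inline bool folio_is_batchable(struct folio *folio)
{
	return !folio_test_large(folio);
}

A gup.c caller could then gate the remote lru_add_drain_all() on folio_is_batchable(folio) as well, which is the !folio_test_large() limit suggested above.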
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-28 8:47 ` Hugh Dickins 2025-08-28 8:59 ` David Hildenbrand @ 2025-08-29 11:57 ` Will Deacon 2025-08-29 13:21 ` Will Deacon 2025-08-29 15:46 ` Hugh Dickins 1 sibling, 2 replies; 20+ messages in thread From: Will Deacon @ 2025-08-29 11:57 UTC (permalink / raw) To: Hugh Dickins Cc: David Hildenbrand, linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang Hi Hugh, On Thu, Aug 28, 2025 at 01:47:14AM -0700, Hugh Dickins wrote: > On Sun, 24 Aug 2025, Hugh Dickins wrote: > > On Mon, 18 Aug 2025, Will Deacon wrote: > > > On Mon, Aug 18, 2025 at 02:31:42PM +0100, Will Deacon wrote: > > > > On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote: > > > > > I think replace the folio_test_mlocked(folio) part of it by > > > > > (folio_test_mlocked(folio) && !folio_test_unevictable(folio)). > > > > > That should reduce the extra calls to a much more reasonable > > > > > number, while still solving your issue. > > > > > > > > Alas, I fear that the folio may be unevictable by this point (which > > > > seems to coincide with the readahead fault adding it to the LRU above) > > > > but I can try it out. > > > > > > I gave this a spin but I still see failures with this change. > > > > Many thanks, Will, for the precisely relevant traces (in which, > > by the way, mapcount=0 really means _mapcount=0 hence mapcount=1). > > > > Yes, those do indeed illustrate a case which my suggested > > (folio_test_mlocked(folio) && !folio_test_unevictable(folio)) > > failed to cover. Very helpful to have an example of that. > > > > And many thanks, David, for your reminder of commit 33dfe9204f29 > > ("mm/gup: clear the LRU flag of a page before adding to LRU batch"). > > > > Yes, I strongly agree with your suggestion that the mlock batch > > be brought into line with its change to the ordinary LRU batches, > > and agree that doing so will be likely to solve Will's issue > > (and similar cases elsewhere, without needing to modify them). > > > > Now I just have to cool my head and get back down into those > > mlock batches. I am fearful that making a change there to suit > > this case will turn out later to break another case (and I just > > won't have time to redevelop as thorough a grasp of the races as > > I had back then). But if we're lucky, applying that "one batch > > at a time" rule will actually make it all more comprehensible. > > > > (I so wish we had spare room in struct page to keep the address > > of that one batch entry, or the CPU to which that one batch > > belongs: then, although that wouldn't eliminate all uses of > > lru_add_drain_all(), it would allow us to efficiently extract > > a target page from its LRU batch without a remote drain.) > > > > I have not yet begun to write such a patch, and I'm not yet sure > > that it's even feasible: this mail sent to get the polite thank > > yous out of my mind, to help clear it for getting down to work. > > It took several days in search of the least bad compromise, but > in the end I concluded the opposite of what we'd intended above. > > There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8 > ("mm/munlock: mlock_page() munlock_page() batch by pagevec") > and Ge Yang's 6.11 33dfe9204f29 > ("mm/gup: clear the LRU flag of a page before adding to LRU batch"). 
That's actually pretty good news, as I was initially worried that we'd have to backport a fix all the way back to 6.1. From the above, the only LTS affected is 6.12.y. > It turns out that the mm/swap.c folio batches (apart from lru_add) > are all for best-effort, doesn't matter if it's missed, operations; > whereas mlock and munlock are more serious. Probably mlock could > be (not very satisfactorily) converted, but then munlock? Because > of failed folio_test_clear_lru()s, it would be far too likely to > err on either side, munlocking too soon or too late. > > I've concluded that one or the other has to go. If we're having > a beauty contest, there's no doubt that 33dfe9204f29 is much nicer > than 2fbb0c10d1e8 (which is itself far from perfect). But functionally, > I'm afraid that removing the mlock/munlock batching will show up as a > perceptible regression in realistic workloadsg; and on consideration, > I've found no real justification for the LRU flag clearing change. > > Unless I'm mistaken, collect_longterm_unpinnable_folios() should > never have been relying on folio_test_lru(), and should simply be > checking for expected ref_count instead. > > Will, please give the portmanteau patch (combination of four) > below a try: reversion of 33dfe9204f29 and a later MGLRU fixup, > corrected test in collect...(), preparatory lru_add_drain() there. > > I hope you won't be proving me wrong again, and I can move on to > writing up those four patches (and adding probably three more that > make sense in such a series, but should not affect your testing). > > I've tested enough to know that it's not harmful, but am hoping > to take advantage of your superior testing, particularly in the > GUP pin area. But if you're uneasy with the combination, and would > prefer to check just the minimum, then ignore the reversions and try > just the mm/gup.c part of it - that will probably be good enough for > you even without the reversions. Thanks, I'll try to test the whole lot. I was geographically separated from my testing device yesterday but I should be able to give it a spin later today. I'm _supposed_ to be writing my KVM Forum slides for next week, so this offers a perfect opportunity to procrastinate. > Patch is against 6.17-rc3; but if you'd prefer the patch against 6.12 > (or an intervening release), I already did the backport so please just > ask. We've got 6.15 working well at the moment, so I'll backport your diff to that. One question on the diff below: > Thanks! > > mm/gup.c | 5 ++++- > mm/swap.c | 50 ++++++++++++++++++++++++++------------------------ > mm/vmscan.c | 2 +- > 3 files changed, 31 insertions(+), 26 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index adffe663594d..9f7c87f504a9 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2291,6 +2291,8 @@ static unsigned long collect_longterm_unpinnable_folios( > struct folio *folio; > long i = 0; > > + lru_add_drain(); > + > for (folio = pofs_get_folio(pofs, i); folio; > folio = pofs_next_folio(folio, pofs, &i)) { > > @@ -2307,7 +2309,8 @@ static unsigned long collect_longterm_unpinnable_folios( > continue; > } > > - if (!folio_test_lru(folio) && drain_allow) { > + if (drain_allow && folio_ref_count(folio) != > + folio_expected_ref_count(folio) + 1) { > lru_add_drain_all(); How does this synchronise with the folio being added to the mlock batch on another CPU? 
need_mlock_drain(), which is what I think lru_add_drain_all() ends up using to figure out which CPU batches to process, just looks at the 'nr' field in the batch and I can't see anything in mlock_folio() to ensure any ordering between adding the folio to the batch and incrementing its refcount. Then again, my hack to use folio_test_mlocked() would have a similar issue because the flag is set (albeit with barrier semantics) before adding the folio to the batch, meaning the drain could miss the folio. I guess there's some higher-level synchronisation making this all work, but it would be good to understand that as I can't see that collect_longterm_unpinnable_folios() can rely on much other than the pin. Will ^ permalink raw reply [flat|nested] 20+ messages in thread
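For reference on the ordering being questioned, mlock_folio() in mm/mlock.c is roughly shaped as below. This is a condensed sketch from memory, with the statistics accounting elided; the helper names (mlock_fbatch, mlock_lru(), mlock_folio_batch(), lru_cache_disabled()) are as recalled from recent kernels, so treat the details as approximate and check the source. The relevant point is that the reference is taken and the batch entry published under only a per-CPU local lock, with nothing ordering either store against a remote CPU reading folio_ref_count() or the batch's fill level.

void mlock_folio(struct folio *folio)
{
	struct folio_batch *fbatch;

	local_lock(&mlock_fbatch.lock);
	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);

	if (!folio_test_set_mlocked(folio)) {
		/* NR_MLOCK and UNEVICTABLE_PGMLOCKED accounting elided */
	}

	folio_get(folio);		/* reference held by the batch entry */
	if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
	    folio_test_large(folio) || lru_cache_disabled())
		mlock_folio_batch(fbatch);	/* flush this CPU's batch now */

	local_unlock(&mlock_fbatch.lock);
}

As the reply further down confirms, there is no stronger synchronisation to find: the collection code can only observe the elevated refcount after the fact and fall back to lru_add_drain_all().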
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-29 11:57 ` Will Deacon @ 2025-08-29 13:21 ` Will Deacon 2025-08-29 16:04 ` Hugh Dickins 2025-08-29 15:46 ` Hugh Dickins 1 sibling, 1 reply; 20+ messages in thread From: Will Deacon @ 2025-08-29 13:21 UTC (permalink / raw) To: Hugh Dickins Cc: David Hildenbrand, linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang On Fri, Aug 29, 2025 at 12:57:43PM +0100, Will Deacon wrote: > On Thu, Aug 28, 2025 at 01:47:14AM -0700, Hugh Dickins wrote: > > Patch is against 6.17-rc3; but if you'd prefer the patch against 6.12 > > (or an intervening release), I already did the backport so please just > > ask. > > We've got 6.15 working well at the moment, so I'll backport your diff > to that. Notwithstanding my question about the synchronisation, I cherry-picked 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation") to my 6.15-based Android tree and applied your diff on top. With that, I've not managed to reproduce the original failure and haven't observed any migration failures on the GUP path. Cheers, Will ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-29 13:21 ` Will Deacon @ 2025-08-29 16:04 ` Hugh Dickins 0 siblings, 0 replies; 20+ messages in thread From: Hugh Dickins @ 2025-08-29 16:04 UTC (permalink / raw) To: Will Deacon Cc: Hugh Dickins, David Hildenbrand, linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang On Fri, 29 Aug 2025, Will Deacon wrote: > On Fri, Aug 29, 2025 at 12:57:43PM +0100, Will Deacon wrote: > > On Thu, Aug 28, 2025 at 01:47:14AM -0700, Hugh Dickins wrote: > > > Patch is against 6.17-rc3; but if you'd prefer the patch against 6.12 > > > (or an intervening release), I already did the backport so please just > > > ask. > > > > We've got 6.15 working well at the moment, so I'll backport your diff > > to that. > > Notwithstanding my question about the synchronisation, I cherry-picked > 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count > calculation") to my 6.15-based Android tree and applied your diff on top. Yes, I cherry-picked exactly that into my 6.12. Not a big deal, but a word of advance warning: the first patch in my series will (I believe) be a fix to that folio_expected_ref_count(), to allow for PG_private_2, which implies +1 on refcount (I've not yet researched whether it's +1 or +2 if PG_private and PG_private_2 are both set - comments I've seen imply +1 but I need to check). I thought of that when doing the cherry-pick because I thought that PG_private_2 had already been killed off by now in the latest tree: but was surprised to find that it is still there in 6.17-rc. > > With that, I've not managed to reproduce the original failure and > haven't observed any migration failures on the GUP path. Great, many thanks to you, Will. Hugh ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration 2025-08-29 11:57 ` Will Deacon 2025-08-29 13:21 ` Will Deacon @ 2025-08-29 15:46 ` Hugh Dickins 1 sibling, 0 replies; 20+ messages in thread From: Hugh Dickins @ 2025-08-29 15:46 UTC (permalink / raw) To: Will Deacon Cc: Hugh Dickins, David Hildenbrand, linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang On Fri, 29 Aug 2025, Will Deacon wrote: > On Thu, Aug 28, 2025 at 01:47:14AM -0700, Hugh Dickins wrote: ... > > > > It took several days in search of the least bad compromise, but > > in the end I concluded the opposite of what we'd intended above. > > > > There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8 > > ("mm/munlock: mlock_page() munlock_page() batch by pagevec") > > and Ge Yang's 6.11 33dfe9204f29 > > ("mm/gup: clear the LRU flag of a page before adding to LRU batch"). > > That's actually pretty good news, as I was initially worried that we'd > have to backport a fix all the way back to 6.1. From the above, the only > LTS affected is 6.12.y. I'm not so sure of that. I think the 6.11 change tickled a particular sequence that showed up in your testing, but the !folio_test_lru() was never an adequate test for whether lru_add_drain_all() might help. I can't try to estimate the probabilities, you'll have to make your own decision, whether it's worth going back to change a release which did not (I presume) show problems in real life. ... > > Unless I'm mistaken, collect_longterm_unpinnable_folios() should > > never have been relying on folio_test_lru(), and should simply be > > checking for expected ref_count instead. > > > > Will, please give the portmanteau patch (combination of four) > > below a try: reversion of 33dfe9204f29 and a later MGLRU fixup, > > corrected test in collect...(), preparatory lru_add_drain() there. > > > > I hope you won't be proving me wrong again, and I can move on to > > writing up those four patches (and adding probably three more that > > make sense in such a series, but should not affect your testing). > > > > I've tested enough to know that it's not harmful, but am hoping > > to take advantage of your superior testing, particularly in the > > GUP pin area. But if you're uneasy with the combination, and would > > prefer to check just the minimum, then ignore the reversions and try > > just the mm/gup.c part of it - that will probably be good enough for > > you even without the reversions. > > Thanks, I'll try to test the whole lot. I was geographically separated > from my testing device yesterday but I should be able to give it a spin > later today. I'm _supposed_ to be writing my KVM Forum slides for next > week, so this offers a perfect opportunity to procrastinate. Well understood :) And you've already reported on the testing, thanks. > > > Patch is against 6.17-rc3; but if you'd prefer the patch against 6.12 > > (or an intervening release), I already did the backport so please just > > ask. > > We've got 6.15 working well at the moment, so I'll backport your diff > to that. > > One question on the diff below: > > > Thanks! 
> > > > mm/gup.c | 5 ++++- > > mm/swap.c | 50 ++++++++++++++++++++++++++------------------------ > > mm/vmscan.c | 2 +- > > 3 files changed, 31 insertions(+), 26 deletions(-) > > > > diff --git a/mm/gup.c b/mm/gup.c > > index adffe663594d..9f7c87f504a9 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -2291,6 +2291,8 @@ static unsigned long collect_longterm_unpinnable_folios( > > struct folio *folio; > > long i = 0; > > > > + lru_add_drain(); > > + > > for (folio = pofs_get_folio(pofs, i); folio; > > folio = pofs_next_folio(folio, pofs, &i)) { > > > > @@ -2307,7 +2309,8 @@ static unsigned long collect_longterm_unpinnable_folios( > > continue; > > } > > > > - if (!folio_test_lru(folio) && drain_allow) { > > + if (drain_allow && folio_ref_count(folio) != > > + folio_expected_ref_count(folio) + 1) { > > lru_add_drain_all(); > > How does this synchronise with the folio being added to the mlock batch > on another CPU? > > need_mlock_drain(), which is what I think lru_add_drain_all() ends up > using to figure out which CPU batches to process, just looks at the > 'nr' field in the batch and I can't see anything in mlock_folio() to > ensure any ordering between adding the folio to the batch and > incrementing its refcount. > > Then again, my hack to use folio_test_mlocked() would have a similar > issue because the flag is set (albeit with barrier semantics) before > adding the folio to the batch, meaning the drain could miss the folio. > > I guess there's some higher-level synchronisation making this all work, > but it would be good to understand that as I can't see that > collect_longterm_unpinnable_folios() can rely on much other than the pin. No such strict synchronization: you've been misled if people have told you that this pinning migration stuff is deterministically successful: it's best effort - or will others on the Cc disagree? Just as there's no synchronization between the calculation inside folio_expected_ref_count() and the reading of folio's refcount. It wouldn't make sense for this unpinnable collection to anguish over such synchronization, when a moment later the migration is liable to fail (on occasion) for other transient reasons. All ending up reported as -ENOMEM apparently? that looks unhelpful. There is a heavy hammer called lru_cache_disable() in mm/swap.c, stopping the batching and doing its own lru_add_drain_all(): that is used by CMA and a few others, but not by this pinning (and I do not want to argue that it should be). Hugh ^ permalink raw reply [flat|nested] 20+ messages in thread
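The "heavy hammer" referred to here is the lru_cache_disable()/lru_cache_enable() pair in mm/swap.c, which performs its own drain of every CPU and keeps folios off the batches until re-enabled. A caller wanting deterministic draining would bracket its work roughly as follows; this is a sketch only, migrate_unpinnable_folios() is a hypothetical stand-in rather than a real function, and, as stated above, the longterm-pin path deliberately does not do this.

#include <linux/list.h>
#include <linux/swap.h>

static int migrate_unpinnable_folios(struct list_head *movable); /* hypothetical */

static int migrate_with_lru_caches_quiesced(struct list_head *movable)
{
	int ret;

	lru_cache_disable();	/* drains all CPUs and keeps batching disabled */
	ret = migrate_unpinnable_folios(movable);
	lru_cache_enable();

	return ret;
}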
end of thread, other threads:[~2025-08-29 16:04 UTC | newest]

Thread overview: 20+ messages:
2025-08-15 10:18 [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration Will Deacon
2025-08-16  1:03 ` John Hubbard
2025-08-16  4:33 ` Hugh Dickins
2025-08-18 13:38 ` Will Deacon
2025-08-16  4:14 ` Hugh Dickins
2025-08-16  8:15 ` David Hildenbrand
2025-08-18 13:31 ` Will Deacon
2025-08-18 14:31 ` Will Deacon
2025-08-25  1:25 ` Hugh Dickins
2025-08-25 16:04 ` David Hildenbrand
2025-08-28  8:47 ` Hugh Dickins
2025-08-28  8:59 ` David Hildenbrand
2025-08-28 16:12 ` Hugh Dickins
2025-08-28 20:38 ` David Hildenbrand
2025-08-29  1:58 ` Hugh Dickins
2025-08-29  8:56 ` David Hildenbrand
2025-08-29 11:57 ` Will Deacon
2025-08-29 13:21 ` Will Deacon
2025-08-29 16:04 ` Hugh Dickins
2025-08-29 15:46 ` Hugh Dickins