* [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
@ 2025-08-15 10:18 Will Deacon
2025-08-16 1:03 ` John Hubbard
2025-08-16 4:14 ` Hugh Dickins
0 siblings, 2 replies; 20+ messages in thread
From: Will Deacon @ 2025-08-15 10:18 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Will Deacon, Hugh Dickins, Keir Fraser,
Jason Gunthorpe, David Hildenbrand, John Hubbard, Frederick Mayle,
Andrew Morton, Peter Xu
When taking a longterm GUP pin via pin_user_pages(),
__gup_longterm_locked() tries to migrate target folios that should not
be longterm pinned, for example because they reside in a CMA region or
movable zone. This is done by first pinning all of the target folios
anyway, collecting all of the longterm-unpinnable target folios into a
list, dropping the pins that were just taken and finally handing the
list off to migrate_pages() for the actual migration.
It is critically important that no unexpected references are held on the
folios being migrated, otherwise the migration will fail and
pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
relatively easy to observe migration failures when running pKVM (which
uses pin_user_pages() on crosvm's virtual address space to resolve
stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
this results in the VM terminating prematurely.
In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
mapping of guest memory prior to the pinning. Subsequently, when
pin_user_pages() walks the page-table, the relevant 'pte' is not
present and so the faulting logic allocates a new folio, mlocks it
with mlock_folio() and maps it in the page-table.
Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
batch by pagevec"), mlock/munlock operations on a folio (formerly page),
are deferred. For example, mlock_folio() takes an additional reference
on the target folio before placing it into a per-cpu 'folio_batch' for
later processing by mlock_folio_batch(), which drops the refcount once
the operation is complete. Processing of the batches is coupled with
the LRU batch logic and can be forcefully drained with
lru_add_drain_all() but as long as a folio remains unprocessed on the
batch, its refcount will be elevated.
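For illustration only, here is a deliberately simplified, userspace-style
sketch of that pattern (the names mirror the kernel's, but this is not
mm/mlock.c and it ignores locking and the real per-CPU machinery):

        #include <stdbool.h>

        #define FBATCH_SIZE 15                  /* cf. the kernel's folio_batch */

        struct folio { int refcount; bool mlocked; };

        struct folio_batch {
                unsigned int nr;
                struct folio *folios[FBATCH_SIZE];
        };

        static struct folio_batch mlock_fbatch; /* per-CPU in the kernel */

        /* Deferred work: perform the mlock and drop the batch's reference. */
        static void mlock_folio_batch(struct folio_batch *fbatch)
        {
                for (unsigned int i = 0; i < fbatch->nr; i++) {
                        fbatch->folios[i]->mlocked = true;
                        fbatch->folios[i]->refcount--;
                }
                fbatch->nr = 0;
        }

        /* Batched entry point: take a reference and park the folio for later. */
        static void mlock_folio(struct folio *folio)
        {
                folio->refcount++;      /* stays elevated until a drain */
                mlock_fbatch.folios[mlock_fbatch.nr++] = folio;
                if (mlock_fbatch.nr == FBATCH_SIZE)
                        mlock_folio_batch(&mlock_fbatch);
        }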
This deferred batching therefore interacts poorly with the pKVM pinning
scenario as we can find ourselves in a situation where the migration
code fails to migrate a folio due to the elevated refcount from the
pending mlock operation.
Extend the existing LRU draining logic in
collect_longterm_unpinnable_folios() so that unpinnable mlocked folios
on the LRU also trigger a drain.
Cc: Hugh Dickins <hughd@google.com>
Cc: Keir Fraser <keirf@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Frederick Mayle <fmayle@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Fixes: 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
Signed-off-by: Will Deacon <will@kernel.org>
---
This has been quite unpleasant to debug and, as I'm not intimately
familiar with the mm internals, I've tried to include all the relevant
details in the commit message in case there's a preferred alternative
way of solving the problem or there's a flaw in my logic.
mm/gup.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index adffe663594d..656835890f05 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
continue;
}
- if (!folio_test_lru(folio) && drain_allow) {
+ if (drain_allow &&
+ (!folio_test_lru(folio) || folio_test_mlocked(folio))) {
lru_add_drain_all();
drain_allow = false;
}
--
2.51.0.rc1.167.g924127e9c0-goog
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-15 10:18 [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration Will Deacon
@ 2025-08-16 1:03 ` John Hubbard
2025-08-16 4:33 ` Hugh Dickins
2025-08-18 13:38 ` Will Deacon
2025-08-16 4:14 ` Hugh Dickins
1 sibling, 2 replies; 20+ messages in thread
From: John Hubbard @ 2025-08-16 1:03 UTC (permalink / raw)
To: Will Deacon, linux-mm
Cc: linux-kernel, Hugh Dickins, Keir Fraser, Jason Gunthorpe,
David Hildenbrand, Frederick Mayle, Andrew Morton, Peter Xu
On 8/15/25 3:18 AM, Will Deacon wrote:
> When taking a longterm GUP pin via pin_user_pages(),
> __gup_longterm_locked() tries to migrate target folios that should not
> be longterm pinned, for example because they reside in a CMA region or
> movable zone. This is done by first pinning all of the target folios
> anyway, collecting all of the longterm-unpinnable target folios into a
> list, dropping the pins that were just taken and finally handing the
> list off to migrate_pages() for the actual migration.
>
> It is critically important that no unexpected references are held on the
> folios being migrated, otherwise the migration will fail and
> pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
> relatively easy to observe migration failures when running pKVM (which
> uses pin_user_pages() on crosvm's virtual address space to resolve
> stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
> this results in the VM terminating prematurely.
>
> In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
> mapping of guest memory prior to the pinning. Subsequently, when
> pin_user_pages() walks the page-table, the relevant 'pte' is not
> present and so the faulting logic allocates a new folio, mlocks it
> with mlock_folio() and maps it in the page-table.
>
> Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
> batch by pagevec"), mlock/munlock operations on a folio (formerly page),
> are deferred. For example, mlock_folio() takes an additional reference
> on the target folio before placing it into a per-cpu 'folio_batch' for
> later processing by mlock_folio_batch(), which drops the refcount once
> the operation is complete. Processing of the batches is coupled with
> the LRU batch logic and can be forcefully drained with
> lru_add_drain_all() but as long as a folio remains unprocessed on the
> batch, its refcount will be elevated.
>
> This deferred batching therefore interacts poorly with the pKVM pinning
I would go even a little broader (more general), and claim that this
deferred batching interacts poorly with gup FOLL_LONGTERM when trying
to pin folios in CMA or ZONE_MOVABLE, in fact.
More on this below.
> scenario as we can find ourselves in a situation where the migration
> code fails to migrate a folio due to the elevated refcount from the
> pending mlock operation.
>
> Extend the existing LRU draining logic in
> collect_longterm_unpinnable_folios() so that unpinnable mlocked folios
> on the LRU also trigger a drain.
>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Keir Fraser <keirf@google.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Frederick Mayle <fmayle@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Peter Xu <peterx@redhat.com>
> Fixes: 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>
> This has been quite unpleasant to debug and, as I'm not intimately
I'll bet! It's not even pleasant to *read* about it! haha. Sorry you had
to suffer through this.
> familiar with the mm internals, I've tried to include all the relevant
> details in the commit message in case there's a preferred alternative
> way of solving the problem or there's a flaw in my logic.
>
> mm/gup.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index adffe663594d..656835890f05 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> continue;
> }
>
> - if (!folio_test_lru(folio) && drain_allow) {
> + if (drain_allow &&
> + (!folio_test_lru(folio) || folio_test_mlocked(folio))) {
That should work, yes.
Alternatively, after thinking about this a bit today, it seems to me that the
mlock batching is a little too bold, given the presence of gup/pup. And so I'm
tempted to fix the problem closer to the root cause, like this (below).
But maybe this is actually *less* wise than what you have proposed...
I'd like to hear other mm folks' opinion on this approach:
diff --git a/mm/mlock.c b/mm/mlock.c
index a1d93ad33c6d..edecdd32996e 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -278,7 +278,15 @@ void mlock_new_folio(struct folio *folio)
folio_get(folio);
if (!folio_batch_add(fbatch, mlock_new(folio)) ||
- folio_test_large(folio) || lru_cache_disabled())
+ folio_test_large(folio) || lru_cache_disabled() ||
+ /*
+ * If this is being called as part of a gup FOLL_LONGTERM operation in
+ * CMA/MOVABLE zones with MLOCK_ONFAULT active, then the newly faulted
+ * in folio will need to immediately migrate to a pinnable zone.
+ * Allowing the mlock operation to batch would break the ability to
+ * migrate the folio. Instead, force immediate processing.
+ */
+ (current->flags & PF_MEMALLOC_PIN))
mlock_folio_batch(fbatch);
local_unlock(&mlock_fbatch.lock);
}
> lru_add_drain_all();
> drain_allow = false;
> }
thanks,
--
John Hubbard
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-15 10:18 [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration Will Deacon
2025-08-16 1:03 ` John Hubbard
@ 2025-08-16 4:14 ` Hugh Dickins
2025-08-16 8:15 ` David Hildenbrand
2025-08-18 13:31 ` Will Deacon
1 sibling, 2 replies; 20+ messages in thread
From: Hugh Dickins @ 2025-08-16 4:14 UTC (permalink / raw)
To: Will Deacon
Cc: linux-mm, linux-kernel, Hugh Dickins, Keir Fraser,
Jason Gunthorpe, David Hildenbrand, John Hubbard, Frederick Mayle,
Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka
On Fri, 15 Aug 2025, Will Deacon wrote:
> When taking a longterm GUP pin via pin_user_pages(),
> __gup_longterm_locked() tries to migrate target folios that should not
> be longterm pinned, for example because they reside in a CMA region or
> movable zone. This is done by first pinning all of the target folios
> anyway, collecting all of the longterm-unpinnable target folios into a
> list, dropping the pins that were just taken and finally handing the
> list off to migrate_pages() for the actual migration.
>
> It is critically important that no unexpected references are held on the
> folios being migrated, otherwise the migration will fail and
> pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
> relatively easy to observe migration failures when running pKVM (which
> uses pin_user_pages() on crosvm's virtual address space to resolve
> stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
> this results in the VM terminating prematurely.
>
> In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
> mapping of guest memory prior to the pinning. Subsequently, when
> pin_user_pages() walks the page-table, the relevant 'pte' is not
> present and so the faulting logic allocates a new folio, mlocks it
> with mlock_folio() and maps it in the page-table.
>
> Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
> batch by pagevec"), mlock/munlock operations on a folio (formerly page),
> are deferred. For example, mlock_folio() takes an additional reference
> on the target folio before placing it into a per-cpu 'folio_batch' for
> later processing by mlock_folio_batch(), which drops the refcount once
> the operation is complete. Processing of the batches is coupled with
> the LRU batch logic and can be forcefully drained with
> lru_add_drain_all() but as long as a folio remains unprocessed on the
> batch, its refcount will be elevated.
>
> This deferred batching therefore interacts poorly with the pKVM pinning
> scenario as we can find ourselves in a situation where the migration
> code fails to migrate a folio due to the elevated refcount from the
> pending mlock operation.
Thanks for the very full description, Will, that helped me a lot
(I know very little of the GUP pinning end).
But one thing would help me to understand better: are the areas being
pinned anonymous or shmem or file memory (or COWed shmem or file)?
From "the faulting logic allocates a new folio" I first assumed
anonymous, but later came to think "mlocks it with mlock_folio()"
implies they are shmem or file folios (which, yes, can also be
allocated by fault).
IIRC anonymous and COW faults would go the mlock_new_folio() way,
where the folio goes on to the mlock folio batch without having yet
reached LRU: those should be dealt with by the existing
!folio_test_lru() check.
>
> Extend the existing LRU draining logic in
> collect_longterm_unpinnable_folios() so that unpinnable mlocked folios
> on the LRU also trigger a drain.
>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Keir Fraser <keirf@google.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Frederick Mayle <fmayle@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Peter Xu <peterx@redhat.com>
> Fixes: 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>
> This has been quite unpleasant to debug and, as I'm not intimately
> familiar with the mm internals, I've tried to include all the relevant
> details in the commit message in case there's a preferred alternative
> way of solving the problem or there's a flaw in my logic.
>
> mm/gup.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index adffe663594d..656835890f05 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> continue;
> }
>
> - if (!folio_test_lru(folio) && drain_allow) {
> + if (drain_allow &&
> + (!folio_test_lru(folio) || folio_test_mlocked(folio))) {
> lru_add_drain_all();
> drain_allow = false;
> }
Hmm. That is going to call lru_add_drain_all() whenever any of the
pages in the list is mlocked, and lru_add_drain_all() is a function
we much prefer to avoid calling (it's much better than the old days
when it could involve every CPU IPIing every other CPU at the same
time; but it's still raising doubts to this day, and best avoided).
(Not as bad as I first thought: those unpinnably-placed mlocked
folios will get migrated, not appearing again in repeat runs.)
I think replace the folio_test_mlocked(folio) part of it by
(folio_test_mlocked(folio) && !folio_test_unevictable(folio)).
That should reduce the extra calls to a much more reasonable
number, while still solving your issue.
But in addition, please add an unconditional lru_add_drain()
(the local CPU one, not the inter-CPU _all) at the head of
collect_longterm_unpinnable_folios(). My guess is that that
would eliminate 90% of the calls to the lru_add_drain_all() below:
not quite enough to satisfy you, but enough to be a good improvement.
I realize that there has been a recent move to cut down even on
unjustified lru_add_drain()s; but an lru_add_drain() to avoid an
lru_add_drain_all() is a good trade.
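To be concrete, the combination I'm suggesting would look roughly like
this in collect_longterm_unpinnable_folios() (an untested sketch, not a
patch I'm posting):

        lru_add_drain();        /* local CPU only, before walking the folios */
        ...
        if (drain_allow &&
            (!folio_test_lru(folio) ||
             (folio_test_mlocked(folio) &&
              !folio_test_unevictable(folio)))) {
                lru_add_drain_all();
                drain_allow = false;
        }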
(Vlastimil, yes, I've Cc'ed you because this reminds me of my
"Agreed" in that "Realtime threads" thread two or three weeks
ago: I haven't rethought it through again, and will probably
still agree with your "should be rare", but answering this mail
forces me to realize that I was thinking there of the folio being
mlocked before it reaches LRU, forgetting this case of the folio
already on LRU being mlocked.)
Hugh
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-16 1:03 ` John Hubbard
@ 2025-08-16 4:33 ` Hugh Dickins
2025-08-18 13:38 ` Will Deacon
1 sibling, 0 replies; 20+ messages in thread
From: Hugh Dickins @ 2025-08-16 4:33 UTC (permalink / raw)
To: John Hubbard
Cc: Will Deacon, linux-mm, linux-kernel, Hugh Dickins, Keir Fraser,
Jason Gunthorpe, David Hildenbrand, Frederick Mayle,
Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka
On Fri, 15 Aug 2025, John Hubbard wrote:
> On 8/15/25 3:18 AM, Will Deacon wrote:
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index adffe663594d..656835890f05 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> > continue;
> > }
> >
> > - if (!folio_test_lru(folio) && drain_allow) {
> > + if (drain_allow &&
> > + (!folio_test_lru(folio) || folio_test_mlocked(folio))) {
>
> That should work, yes.
>
> Alternatively, after thinking about this a bit today, it seems to me that the
> mlock batching is a little too bold, given the presence of gup/pup. And so I'm
> tempted to fix the problem closer to the root cause, like this (below).
>
> But maybe this is actually *less* wise than what you have proposed...
>
> I'd like to hear other mm folks' opinion on this approach:
>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index a1d93ad33c6d..edecdd32996e 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -278,7 +278,15 @@ void mlock_new_folio(struct folio *folio)
>
> folio_get(folio);
> if (!folio_batch_add(fbatch, mlock_new(folio)) ||
> - folio_test_large(folio) || lru_cache_disabled())
> + folio_test_large(folio) || lru_cache_disabled() ||
> + /*
> + * If this is being called as part of a gup FOLL_LONGTERM operation in
> + * CMA/MOVABLE zones with MLOCK_ONFAULT active, then the newly faulted
> + * in folio will need to immediately migrate to a pinnable zone.
> + * Allowing the mlock operation to batch would break the ability to
> + * migrate the folio. Instead, force immediate processing.
> + */
> + (current->flags & PF_MEMALLOC_PIN))
> mlock_folio_batch(fbatch);
> local_unlock(&mlock_fbatch.lock);
> }
It's certainly worth considering this approach: it is consistent with
the lru_cache_disabled() approach (but I'm not a great fan of the
lru_cache_disabled() approach, often wonder how much damage it does).
But I think you've placed this in the wrong function: mlock_new_folio()
should already be satisfactorily handled, it's mlock_folio() that's
the problematic one.
I didn't know of PF_MEMALLOC_PIN at all: as you say,
let's hear other opinions.
Hugh
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-16 4:14 ` Hugh Dickins
@ 2025-08-16 8:15 ` David Hildenbrand
2025-08-18 13:31 ` Will Deacon
1 sibling, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-08-16 8:15 UTC (permalink / raw)
To: Hugh Dickins, Will Deacon
Cc: linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe,
John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu,
Rik van Riel, Vlastimil Babka
On 16.08.25 06:14, Hugh Dickins wrote:
> On Fri, 15 Aug 2025, Will Deacon wrote:
>
>> When taking a longterm GUP pin via pin_user_pages(),
>> __gup_longterm_locked() tries to migrate target folios that should not
>> be longterm pinned, for example because they reside in a CMA region or
>> movable zone. This is done by first pinning all of the target folios
>> anyway, collecting all of the longterm-unpinnable target folios into a
>> list, dropping the pins that were just taken and finally handing the
>> list off to migrate_pages() for the actual migration.
>>
>> It is critically important that no unexpected references are held on the
>> folios being migrated, otherwise the migration will fail and
>> pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
>> relatively easy to observe migration failures when running pKVM (which
>> uses pin_user_pages() on crosvm's virtual address space to resolve
>> stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
>> this results in the VM terminating prematurely.
>>
>> In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
>> mapping of guest memory prior to the pinning. Subsequently, when
>> pin_user_pages() walks the page-table, the relevant 'pte' is not
>> present and so the faulting logic allocates a new folio, mlocks it
>> with mlock_folio() and maps it in the page-table.
>>
>> Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
>> batch by pagevec"), mlock/munlock operations on a folio (formerly page),
>> are deferred. For example, mlock_folio() takes an additional reference
>> on the target folio before placing it into a per-cpu 'folio_batch' for
>> later processing by mlock_folio_batch(), which drops the refcount once
>> the operation is complete. Processing of the batches is coupled with
>> the LRU batch logic and can be forcefully drained with
>> lru_add_drain_all() but as long as a folio remains unprocessed on the
>> batch, its refcount will be elevated.
>>
>> This deferred batching therefore interacts poorly with the pKVM pinning
>> scenario as we can find ourselves in a situation where the migration
>> code fails to migrate a folio due to the elevated refcount from the
>> pending mlock operation.
>
> Thanks for the very full description, Will, that helped me a lot
> (I know very little of the GUP pinning end).
>
> But one thing would help me to understand better: are the areas being
> pinned anonymous or shmem or file memory (or COWed shmem or file)?
>
> From "the faulting logic allocates a new folio" I first assumed
> anonymous, but later came to think "mlocks it with mlock_folio()"
> implies they are shmem or file folios (which, yes, can also be
> allocated by fault).
>
> IIRC anonymous and COW faults would go the mlock_new_folio() way,
> where the folio goes on to the mlock folio batch without having yet
> reached LRU: those should be dealt with by the existing
> !folio_test_lru() check.
Very right. Indeed, I don't remember easily running into that in
wp_can_reuse_anon_folio(), where we also drain the LRU if we spot a !lru
folio.
Having to enlighten all such code to check for
if (!folio_test_lru(folio) || folio_test_mlocked(folio))
is a bit suboptimal, but we could add a helper for that. (we might end
up draining for actually mlocked folios though :( )
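For example (the helper name here is made up, purely for illustration):

        /*
         * Might this folio still be sitting on a per-CPU LRU or mlock
         * batch, holding an extra reference?
         */
        static inline bool folio_maybe_lru_batched(struct folio *folio)
        {
                return !folio_test_lru(folio) || folio_test_mlocked(folio);
        }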
We recently changed this in commit 33dfe9204f29 ("mm/gup: clear the LRU
flag of a page before adding to LRU batch") so that a folio can really
only be on an LRU batch once.
So if we spot !folio_test_lru(folio) there might be at most one
reference from LRU batches.
So naturally I wonder if we could somehow make it work that a folio is
either on the ordinary LRU batch or on the mlock batch, but not on both.
And that the folio LRU flag would be sufficient to check.
That would mean that we would clear the LRU flag when adding a folio to
the mlock batch. But not sure how feasible that is (e.g., folio already
isolated).
--
Cheers
David / dhildenb
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-16 4:14 ` Hugh Dickins
2025-08-16 8:15 ` David Hildenbrand
@ 2025-08-18 13:31 ` Will Deacon
2025-08-18 14:31 ` Will Deacon
1 sibling, 1 reply; 20+ messages in thread
From: Will Deacon @ 2025-08-18 13:31 UTC (permalink / raw)
To: Hugh Dickins
Cc: linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe,
David Hildenbrand, John Hubbard, Frederick Mayle, Andrew Morton,
Peter Xu, Rik van Riel, Vlastimil Babka
Hi Hugh,
On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote:
> On Fri, 15 Aug 2025, Will Deacon wrote:
>
> > When taking a longterm GUP pin via pin_user_pages(),
> > __gup_longterm_locked() tries to migrate target folios that should not
> > be longterm pinned, for example because they reside in a CMA region or
> > movable zone. This is done by first pinning all of the target folios
> > anyway, collecting all of the longterm-unpinnable target folios into a
> > list, dropping the pins that were just taken and finally handing the
> > list off to migrate_pages() for the actual migration.
> >
> > It is critically important that no unexpected references are held on the
> > folios being migrated, otherwise the migration will fail and
> > pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
> > relatively easy to observe migration failures when running pKVM (which
> > uses pin_user_pages() on crosvm's virtual address space to resolve
> > stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
> > this results in the VM terminating prematurely.
> >
> > In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
> > mapping of guest memory prior to the pinning. Subsequently, when
> > pin_user_pages() walks the page-table, the relevant 'pte' is not
> > present and so the faulting logic allocates a new folio, mlocks it
> > with mlock_folio() and maps it in the page-table.
> >
> > Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
> > batch by pagevec"), mlock/munlock operations on a folio (formerly page),
> > are deferred. For example, mlock_folio() takes an additional reference
> > on the target folio before placing it into a per-cpu 'folio_batch' for
> > later processing by mlock_folio_batch(), which drops the refcount once
> > the operation is complete. Processing of the batches is coupled with
> > the LRU batch logic and can be forcefully drained with
> > lru_add_drain_all() but as long as a folio remains unprocessed on the
> > batch, its refcount will be elevated.
> >
> > This deferred batching therefore interacts poorly with the pKVM pinning
> > scenario as we can find ourselves in a situation where the migration
> > code fails to migrate a folio due to the elevated refcount from the
> > pending mlock operation.
>
> Thanks for the very full description, Will, that helped me a lot
> (I know very little of the GUP pinning end).
>
> But one thing would help me to understand better: are the areas being
> pinned anonymous or shmem or file memory (or COWed shmem or file)?
crosvm is using memfd_create() so I think that means it's anonymous and
shmem? I'm not entirely sure what the right terminology is for a memfd.
> From "the faulting logic allocates a new folio" I first assumed
> anonymous, but later came to think "mlocks it with mlock_folio()"
> implies they are shmem or file folios (which, yes, can also be
> allocated by fault).
>
> IIRC anonymous and COW faults would go the mlock_new_folio() way,
> where the folio goes on to the mlock folio batch without having yet
> reached LRU: those should be dealt with by the existing
> !folio_test_lru() check.
Most of my analysis has been constructed from backtraces when we've
managed to catch the problem. Thankfully, I've saved most of them, so
I went back to have a look and you're absolutely right -- it's _not_
the pin_user_pages() which allocates the page in this case, so apologies
for getting that wrong in the commit message. I've clearly been reading
too many trace logs!
The page is allocated and mlocked just before the pin thanks to a user
page fault:
crosvm_VmRunApp-6493 [007] ...2. 144.767288: page_ref_mod: pfn=0x9f83a8 flags=locked|uptodate|swapbacked|mlocked count=4 mapcount=0 mapping=00000000f0e9c5fd mt=4 val=1
crosvm_VmRunApp-6493 [007] ...2. 144.767288: <stack trace>
=> trace_event_raw_event_page_ref_mod_template
=> __page_ref_mod
=> mlock_folio
=> folio_add_file_rmap_ptes
=> set_pte_range
=> finish_fault
=> do_pte_missing
=> handle_mm_fault
=> do_page_fault
=> do_translation_fault
=> do_mem_abort
=> el0_da
=> el0t_64_sync_handler
=> el0t_64_sync
and the pin_user_pages() comes in a little later (on a different CPU):
crosvm_vcpu0-6499 [003] ...1. 144.786828: page_ref_mod: pfn=0x9f83a8 flags=uptodate|dirty|lru|swapbacked|unevictable|mlocked count=1027 mapcount=0 mapping=00000000f0e9c5fd mt=4 val=1024
crosvm_vcpu0-6499 [003] ...1. 144.786832: <stack trace>
=> trace_event_raw_event_page_ref_mod_template
=> __page_ref_mod
=> try_grab_folio
=> follow_page_pte
=> __get_user_pages
=> __gup_longterm_locked
=> pin_user_pages
=> __pkvm_pin_user_pages
=> pkvm_mem_abort
=> pkvm_mem_abort_prefault
=> kvm_handle_guest_abort
=> handle_exit
=> kvm_arch_vcpu_ioctl_run
=> kvm_vcpu_ioctl
=> __arm64_sys_ioctl
=> invoke_syscall
=> el0_svc_common
=> do_el0_svc
=> el0_svc
=> el0t_64_sync_handler
=> el0t_64_sync
Between the two there's an interesting filemap fault triggering
readahead and that's what adds the folio to the LRU:
crosvm_VmRunApp-6493 [007] ...1. 144.767358: page_ref_mod_and_test: pfn=0x9f83a8 flags=uptodate|dirty|lru|swapbacked|unevictable|mlocked count=3 mapcount=0 mapping=00000000f0e9c5fd mt=4 val=-1 ret=0
crosvm_VmRunApp-6493 [007] ...1. 144.767359: <stack trace>
=> trace_event_raw_event_page_ref_mod_and_test_template
=> __page_ref_mod_and_test
=> folios_put_refs
=> folio_batch_move_lru
=> __folio_batch_add_and_move
=> folio_add_lru
=> filemap_add_folio
=> page_cache_ra_unbounded
=> page_cache_ra_order
=> do_sync_mmap_readahead
=> filemap_fault
=> __do_fault
=> do_pte_missing
=> handle_mm_fault
=> do_page_fault
=> do_translation_fault
=> do_mem_abort
=> el0_ia
=> el0t_64_sync_handler
=> el0t_64_sync
> > diff --git a/mm/gup.c b/mm/gup.c
> > index adffe663594d..656835890f05 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> > continue;
> > }
> >
> > - if (!folio_test_lru(folio) && drain_allow) {
> > + if (drain_allow &&
> > + (!folio_test_lru(folio) || folio_test_mlocked(folio))) {
> > lru_add_drain_all();
> > drain_allow = false;
> > }
>
> Hmm. That is going to call lru_add_drain_all() whenever any of the
> pages in the list is mlocked, and lru_add_drain_all() is a function
> we much prefer to avoid calling (it's much better than the old days
> when it could involve every CPU IPIing every other CPU at the same
> time; but it's still raising doubts to this day, and best avoided).
>
> (Not as bad as I first thought: those unpinnably-placed mlocked
> folios will get migrated, not appearing again in repeat runs.)
>
> I think replace the folio_test_mlocked(folio) part of it by
> (folio_test_mlocked(folio) && !folio_test_unevictable(folio)).
> That should reduce the extra calls to a much more reasonable
> number, while still solving your issue.
Alas, I fear that the folio may be unevictable by this point (which
seems to coincide with the readahead fault adding it to the LRU above)
but I can try it out.
> But in addition, please add an unconditional lru_add_drain()
> (the local CPU one, not the inter-CPU _all) at the head of
> collect_longterm_unpinnable_folios(). My guess is that that
> would eliminate 90% of the calls to the lru_add_drain_all() below:
> not quite enough to satisfy you, but enough to be a good improvement.
Sure, that's straightforward enough. We do already see an
mlock_drain_local() from try_to_migrate_one() but that happens after all
the unpinnable folios have been collected so it doesn't help us here.
Will
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-16 1:03 ` John Hubbard
2025-08-16 4:33 ` Hugh Dickins
@ 2025-08-18 13:38 ` Will Deacon
1 sibling, 0 replies; 20+ messages in thread
From: Will Deacon @ 2025-08-18 13:38 UTC (permalink / raw)
To: John Hubbard
Cc: linux-mm, linux-kernel, Hugh Dickins, Keir Fraser,
Jason Gunthorpe, David Hildenbrand, Frederick Mayle,
Andrew Morton, Peter Xu
On Fri, Aug 15, 2025 at 06:03:17PM -0700, John Hubbard wrote:
> On 8/15/25 3:18 AM, Will Deacon wrote:
> > When taking a longterm GUP pin via pin_user_pages(),
> > __gup_longterm_locked() tries to migrate target folios that should not
> > be longterm pinned, for example because they reside in a CMA region or
> > movable zone. This is done by first pinning all of the target folios
> > anyway, collecting all of the longterm-unpinnable target folios into a
> > list, dropping the pins that were just taken and finally handing the
> > list off to migrate_pages() for the actual migration.
> >
> > It is critically important that no unexpected references are held on the
> > folios being migrated, otherwise the migration will fail and
> > pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
> > relatively easy to observe migration failures when running pKVM (which
> > uses pin_user_pages() on crosvm's virtual address space to resolve
> > stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
> > this results in the VM terminating prematurely.
> >
> > In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
> > mapping of guest memory prior to the pinning. Subsequently, when
> > pin_user_pages() walks the page-table, the relevant 'pte' is not
> > present and so the faulting logic allocates a new folio, mlocks it
> > with mlock_folio() and maps it in the page-table.
> >
> > Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
> > batch by pagevec"), mlock/munlock operations on a folio (formerly page),
> > are deferred. For example, mlock_folio() takes an additional reference
> > on the target folio before placing it into a per-cpu 'folio_batch' for
> > later processing by mlock_folio_batch(), which drops the refcount once
> > the operation is complete. Processing of the batches is coupled with
> > the LRU batch logic and can be forcefully drained with
> > lru_add_drain_all() but as long as a folio remains unprocessed on the
> > batch, its refcount will be elevated.
> >
> > This deferred batching therefore interacts poorly with the pKVM pinning
>
> I would go even a little broader (more general), and claim that this
> deferred batching interacts poorly with gup FOLL_LONGTERM when trying
> to pin folios in CMA or ZONE_MOVABLE, in fact.
That's much better, thanks.
> > diff --git a/mm/gup.c b/mm/gup.c
> > index adffe663594d..656835890f05 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> > continue;
> > }
> >
> > - if (!folio_test_lru(folio) && drain_allow) {
> > + if (drain_allow &&
> > + (!folio_test_lru(folio) || folio_test_mlocked(folio))) {
>
> That should work, yes.
>
> Alternatively, after thinking about this a bit today, it seems to me that the
> mlock batching is a little too bold, given the presence of gup/pup. And so I'm
> tempted to fix the problem closer to the root cause, like this (below).
>
> But maybe this is actually *less* wise than what you have proposed...
>
> I'd like to hear other mm folks' opinion on this approach:
>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index a1d93ad33c6d..edecdd32996e 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -278,7 +278,15 @@ void mlock_new_folio(struct folio *folio)
>
> folio_get(folio);
> if (!folio_batch_add(fbatch, mlock_new(folio)) ||
> - folio_test_large(folio) || lru_cache_disabled())
> + folio_test_large(folio) || lru_cache_disabled() ||
> + /*
> + * If this is being called as part of a gup FOLL_LONGTERM operation in
> + * CMA/MOVABLE zones with MLOCK_ONFAULT active, then the newly faulted
> + * in folio will need to immediately migrate to a pinnable zone.
> + * Allowing the mlock operation to batch would break the ability to
> + * migrate the folio. Instead, force immediate processing.
> + */
> + (current->flags & PF_MEMALLOC_PIN))
> mlock_folio_batch(fbatch);
> local_unlock(&mlock_fbatch.lock);
> }
So after Hugh's eagle eyes spotted mlock_folio() in my description, it
turns out that the mlock happens on the user page fault path rather than
during the pin itself. I think that means that checking for
PF_MEMALLOC_PIN isn't going to work, as the pinning comes later. Hrm.
I posted some stacktraces in my reply to Hugh that might help (and boy
do I have plenty more of those).
Will
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-18 13:31 ` Will Deacon
@ 2025-08-18 14:31 ` Will Deacon
2025-08-25 1:25 ` Hugh Dickins
0 siblings, 1 reply; 20+ messages in thread
From: Will Deacon @ 2025-08-18 14:31 UTC (permalink / raw)
To: Hugh Dickins
Cc: linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe,
David Hildenbrand, John Hubbard, Frederick Mayle, Andrew Morton,
Peter Xu, Rik van Riel, Vlastimil Babka
On Mon, Aug 18, 2025 at 02:31:42PM +0100, Will Deacon wrote:
> On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote:
> > I think replace the folio_test_mlocked(folio) part of it by
> > (folio_test_mlocked(folio) && !folio_test_unevictable(folio)).
> > That should reduce the extra calls to a much more reasonable
> > number, while still solving your issue.
>
> Alas, I fear that the folio may be unevictable by this point (which
> seems to coincide with the readahead fault adding it to the LRU above)
> but I can try it out.
I gave this a spin but I still see failures with this change.
Will
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-18 14:31 ` Will Deacon
@ 2025-08-25 1:25 ` Hugh Dickins
2025-08-25 16:04 ` David Hildenbrand
2025-08-28 8:47 ` Hugh Dickins
0 siblings, 2 replies; 20+ messages in thread
From: Hugh Dickins @ 2025-08-25 1:25 UTC (permalink / raw)
To: Will Deacon, David Hildenbrand
Cc: Hugh Dickins, linux-mm, linux-kernel, Keir Fraser,
Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton,
Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang
On Mon, 18 Aug 2025, Will Deacon wrote:
> On Mon, Aug 18, 2025 at 02:31:42PM +0100, Will Deacon wrote:
> > On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote:
> > > I think replace the folio_test_mlocked(folio) part of it by
> > > (folio_test_mlocked(folio) && !folio_test_unevictable(folio)).
> > > That should reduce the extra calls to a much more reasonable
> > > number, while still solving your issue.
> >
> > Alas, I fear that the folio may be unevictable by this point (which
> > seems to coincide with the readahead fault adding it to the LRU above)
> > but I can try it out.
>
> I gave this a spin but I still see failures with this change.
Many thanks, Will, for the precisely relevant traces (in which,
by the way, mapcount=0 really means _mapcount=0 hence mapcount=1).
Yes, those do indeed illustrate a case which my suggested
(folio_test_mlocked(folio) && !folio_test_unevictable(folio))
failed to cover. Very helpful to have an example of that.
And many thanks, David, for your reminder of commit 33dfe9204f29
("mm/gup: clear the LRU flag of a page before adding to LRU batch").
Yes, I strongly agree with your suggestion that the mlock batch
be brought into line with its change to the ordinary LRU batches,
and agree that doing so will be likely to solve Will's issue
(and similar cases elsewhere, without needing to modify them).
Now I just have to cool my head and get back down into those
mlock batches. I am fearful that making a change there to suit
this case will turn out later to break another case (and I just
won't have time to redevelop as thorough a grasp of the races as
I had back then). But if we're lucky, applying that "one batch
at a time" rule will actually make it all more comprehensible.
(I so wish we had spare room in struct page to keep the address
of that one batch entry, or the CPU to which that one batch
belongs: then, although that wouldn't eliminate all uses of
lru_add_drain_all(), it would allow us to efficiently extract
a target page from its LRU batch without a remote drain.)
I have not yet begun to write such a patch, and I'm not yet sure
that it's even feasible: this mail sent to get the polite thank
yous out of my mind, to help clear it for getting down to work.
Hugh
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-25 1:25 ` Hugh Dickins
@ 2025-08-25 16:04 ` David Hildenbrand
2025-08-28 8:47 ` Hugh Dickins
1 sibling, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-08-25 16:04 UTC (permalink / raw)
To: Hugh Dickins, Will Deacon
Cc: linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe,
John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu,
Rik van Riel, Vlastimil Babka, Ge Yang
On 25.08.25 03:25, Hugh Dickins wrote:
> On Mon, 18 Aug 2025, Will Deacon wrote:
>> On Mon, Aug 18, 2025 at 02:31:42PM +0100, Will Deacon wrote:
>>> On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote:
>>>> I think replace the folio_test_mlocked(folio) part of it by
>>>> (folio_test_mlocked(folio) && !folio_test_unevictable(folio)).
>>>> That should reduce the extra calls to a much more reasonable
>>>> number, while still solving your issue.
>>>
>>> Alas, I fear that the folio may be unevictable by this point (which
>>> seems to coincide with the readahead fault adding it to the LRU above)
>>> but I can try it out.
>>
>> I gave this a spin but I still see failures with this change.
>
> Many thanks, Will, for the precisely relevant traces (in which,
> by the way, mapcount=0 really means _mapcount=0 hence mapcount=1).
>
> Yes, those do indeed illustrate a case which my suggested
> (folio_test_mlocked(folio) && !folio_test_unevictable(folio))
> failed to cover. Very helpful to have an example of that.
>
> And many thanks, David, for your reminder of commit 33dfe9204f29
> ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
>
> Yes, I strongly agree with your suggestion that the mlock batch
> be brought into line with its change to the ordinary LRU batches,
> and agree that doing so will be likely to solve Will's issue
> (and similar cases elsewhere, without needing to modify them).
>
> Now I just have to cool my head and get back down into those
> mlock batches. I am fearful that making a change there to suit
> this case will turn out later to break another case (and I just
> won't have time to redevelop as thorough a grasp of the races as
> I had back then). But if we're lucky, applying that "one batch
> at a time" rule will actually make it all more comprehensible.
>
> (I so wish we had spare room in struct page to keep the address
> of that one batch entry, or the CPU to which that one batch
> belongs: then, although that wouldn't eliminate all uses of
> lru_add_drain_all(), it would allow us to efficiently extract
> a target page from its LRU batch without a remote drain.)
I like the idea of identifying what exactly to drain, especially
regarding remote LRU draining.
With separately allocated folios we later might have that space, but it
could mean growing the folio size, so it depends on other factors (and
also how to store that information).
For now, I don't think we have any space to store this ... briefly
thought about using folio->lru for that purpose, but the whole reason
for batching is to not mess with folio->lru modifications but instead to
... defer/batch them :)
--
Cheers
David / dhildenb
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-25 1:25 ` Hugh Dickins
2025-08-25 16:04 ` David Hildenbrand
@ 2025-08-28 8:47 ` Hugh Dickins
2025-08-28 8:59 ` David Hildenbrand
2025-08-29 11:57 ` Will Deacon
1 sibling, 2 replies; 20+ messages in thread
From: Hugh Dickins @ 2025-08-28 8:47 UTC (permalink / raw)
To: Will Deacon
Cc: Hugh Dickins, David Hildenbrand, linux-mm, linux-kernel,
Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle,
Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang
On Sun, 24 Aug 2025, Hugh Dickins wrote:
> On Mon, 18 Aug 2025, Will Deacon wrote:
> > On Mon, Aug 18, 2025 at 02:31:42PM +0100, Will Deacon wrote:
> > > On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote:
> > > > I think replace the folio_test_mlocked(folio) part of it by
> > > > (folio_test_mlocked(folio) && !folio_test_unevictable(folio)).
> > > > That should reduce the extra calls to a much more reasonable
> > > > number, while still solving your issue.
> > >
> > > Alas, I fear that the folio may be unevictable by this point (which
> > > seems to coincide with the readahead fault adding it to the LRU above)
> > > but I can try it out.
> >
> > I gave this a spin but I still see failures with this change.
>
> Many thanks, Will, for the precisely relevant traces (in which,
> by the way, mapcount=0 really means _mapcount=0 hence mapcount=1).
>
> Yes, those do indeed illustrate a case which my suggested
> (folio_test_mlocked(folio) && !folio_test_unevictable(folio))
> failed to cover. Very helpful to have an example of that.
>
> And many thanks, David, for your reminder of commit 33dfe9204f29
> ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
>
> Yes, I strongly agree with your suggestion that the mlock batch
> be brought into line with its change to the ordinary LRU batches,
> and agree that doing so will be likely to solve Will's issue
> (and similar cases elsewhere, without needing to modify them).
>
> Now I just have to cool my head and get back down into those
> mlock batches. I am fearful that making a change there to suit
> this case will turn out later to break another case (and I just
> won't have time to redevelop as thorough a grasp of the races as
> I had back then). But if we're lucky, applying that "one batch
> at a time" rule will actually make it all more comprehensible.
>
> (I so wish we had spare room in struct page to keep the address
> of that one batch entry, or the CPU to which that one batch
> belongs: then, although that wouldn't eliminate all uses of
> lru_add_drain_all(), it would allow us to efficiently extract
> a target page from its LRU batch without a remote drain.)
>
> I have not yet begun to write such a patch, and I'm not yet sure
> that it's even feasible: this mail sent to get the polite thank
> yous out of my mind, to help clear it for getting down to work.
It took several days in search of the least bad compromise, but
in the end I concluded the opposite of what we'd intended above.
There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8
("mm/munlock: mlock_page() munlock_page() batch by pagevec")
and Ge Yang's 6.11 33dfe9204f29
("mm/gup: clear the LRU flag of a page before adding to LRU batch").
It turns out that the mm/swap.c folio batches (apart from lru_add)
are all for best-effort, doesn't matter if it's missed, operations;
whereas mlock and munlock are more serious. Probably mlock could
be (not very satisfactorily) converted, but then munlock? Because
of failed folio_test_clear_lru()s, it would be far too likely to
err on either side, munlocking too soon or too late.
I've concluded that one or the other has to go. If we're having
a beauty contest, there's no doubt that 33dfe9204f29 is much nicer
than 2fbb0c10d1e8 (which is itself far from perfect). But functionally,
I'm afraid that removing the mlock/munlock batching will show up as a
perceptible regression in realistic workloads; and on consideration,
I've found no real justification for the LRU flag clearing change.
Unless I'm mistaken, collect_longterm_unpinnable_folios() should
never have been relying on folio_test_lru(), and should simply be
checking for expected ref_count instead.
Will, please give the portmanteau patch (combination of four)
below a try: reversion of 33dfe9204f29 and a later MGLRU fixup,
corrected test in collect...(), preparatory lru_add_drain() there.
I hope you won't be proving me wrong again, and I can move on to
writing up those four patches (and adding probably three more that
make sense in such a series, but should not affect your testing).
I've tested enough to know that it's not harmful, but am hoping
to take advantage of your superior testing, particularly in the
GUP pin area. But if you're uneasy with the combination, and would
prefer to check just the minimum, then ignore the reversions and try
just the mm/gup.c part of it - that will probably be good enough for
you even without the reversions.
Patch is against 6.17-rc3; but if you'd prefer the patch against 6.12
(or an intervening release), I already did the backport so please just
ask.
Thanks!
mm/gup.c | 5 ++++-
mm/swap.c | 50 ++++++++++++++++++++++++++------------------------
mm/vmscan.c | 2 +-
3 files changed, 31 insertions(+), 26 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index adffe663594d..9f7c87f504a9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2291,6 +2291,8 @@ static unsigned long collect_longterm_unpinnable_folios(
struct folio *folio;
long i = 0;
+ lru_add_drain();
+
for (folio = pofs_get_folio(pofs, i); folio;
folio = pofs_next_folio(folio, pofs, &i)) {
@@ -2307,7 +2309,8 @@ static unsigned long collect_longterm_unpinnable_folios(
continue;
}
- if (!folio_test_lru(folio) && drain_allow) {
+ if (drain_allow && folio_ref_count(folio) !=
+ folio_expected_ref_count(folio) + 1) {
lru_add_drain_all();
drain_allow = false;
}
diff --git a/mm/swap.c b/mm/swap.c
index 3632dd061beb..8ef3dea20e39 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -164,6 +164,10 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
for (i = 0; i < folio_batch_count(fbatch); i++) {
struct folio *folio = fbatch->folios[i];
+ /* block memcg migration while the folio moves between lru */
+ if (move_fn != lru_add && !folio_test_clear_lru(folio))
+ continue;
+
folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
move_fn(lruvec, folio);
@@ -176,14 +180,10 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
}
static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
- struct folio *folio, move_fn_t move_fn,
- bool on_lru, bool disable_irq)
+ struct folio *folio, move_fn_t move_fn, bool disable_irq)
{
unsigned long flags;
- if (on_lru && !folio_test_clear_lru(folio))
- return;
-
folio_get(folio);
if (disable_irq)
@@ -191,8 +191,8 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
else
local_lock(&cpu_fbatches.lock);
- if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || folio_test_large(folio) ||
- lru_cache_disabled())
+ if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
+ folio_test_large(folio) || lru_cache_disabled())
folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
if (disable_irq)
@@ -201,13 +201,13 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
local_unlock(&cpu_fbatches.lock);
}
-#define folio_batch_add_and_move(folio, op, on_lru) \
- __folio_batch_add_and_move( \
- &cpu_fbatches.op, \
- folio, \
- op, \
- on_lru, \
- offsetof(struct cpu_fbatches, op) >= offsetof(struct cpu_fbatches, lock_irq) \
+#define folio_batch_add_and_move(folio, op) \
+ __folio_batch_add_and_move( \
+ &cpu_fbatches.op, \
+ folio, \
+ op, \
+ offsetof(struct cpu_fbatches, op) >= \
+ offsetof(struct cpu_fbatches, lock_irq) \
)
static void lru_move_tail(struct lruvec *lruvec, struct folio *folio)
@@ -231,10 +231,10 @@ static void lru_move_tail(struct lruvec *lruvec, struct folio *folio)
void folio_rotate_reclaimable(struct folio *folio)
{
if (folio_test_locked(folio) || folio_test_dirty(folio) ||
- folio_test_unevictable(folio))
+ folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
- folio_batch_add_and_move(folio, lru_move_tail, true);
+ folio_batch_add_and_move(folio, lru_move_tail);
}
void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
@@ -328,10 +328,11 @@ static void folio_activate_drain(int cpu)
void folio_activate(struct folio *folio)
{
- if (folio_test_active(folio) || folio_test_unevictable(folio))
+ if (folio_test_active(folio) || folio_test_unevictable(folio) ||
+ !folio_test_lru(folio))
return;
- folio_batch_add_and_move(folio, lru_activate, true);
+ folio_batch_add_and_move(folio, lru_activate);
}
#else
@@ -507,7 +508,7 @@ void folio_add_lru(struct folio *folio)
lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
folio_set_active(folio);
- folio_batch_add_and_move(folio, lru_add, false);
+ folio_batch_add_and_move(folio, lru_add);
}
EXPORT_SYMBOL(folio_add_lru);
@@ -685,13 +686,13 @@ void lru_add_drain_cpu(int cpu)
void deactivate_file_folio(struct folio *folio)
{
/* Deactivating an unevictable folio will not accelerate reclaim */
- if (folio_test_unevictable(folio))
+ if (folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
if (lru_gen_enabled() && lru_gen_clear_refs(folio))
return;
- folio_batch_add_and_move(folio, lru_deactivate_file, true);
+ folio_batch_add_and_move(folio, lru_deactivate_file);
}
/*
@@ -704,13 +705,13 @@ void deactivate_file_folio(struct folio *folio)
*/
void folio_deactivate(struct folio *folio)
{
- if (folio_test_unevictable(folio))
+ if (folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
if (lru_gen_enabled() ? lru_gen_clear_refs(folio) : !folio_test_active(folio))
return;
- folio_batch_add_and_move(folio, lru_deactivate, true);
+ folio_batch_add_and_move(folio, lru_deactivate);
}
/**
@@ -723,10 +724,11 @@ void folio_deactivate(struct folio *folio)
void folio_mark_lazyfree(struct folio *folio)
{
if (!folio_test_anon(folio) || !folio_test_swapbacked(folio) ||
+ !folio_test_lru(folio) ||
folio_test_swapcache(folio) || folio_test_unevictable(folio))
return;
- folio_batch_add_and_move(folio, lru_lazyfree, true);
+ folio_batch_add_and_move(folio, lru_lazyfree);
}
void lru_add_drain(void)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a48aec8bfd92..674999999cd0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4507,7 +4507,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
}
/* ineligible */
- if (!folio_test_lru(folio) || zone > sc->reclaim_idx) {
+ if (zone > sc->reclaim_idx) {
gen = folio_inc_gen(lruvec, folio, false);
list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-28 8:47 ` Hugh Dickins
@ 2025-08-28 8:59 ` David Hildenbrand
2025-08-28 16:12 ` Hugh Dickins
2025-08-29 11:57 ` Will Deacon
1 sibling, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2025-08-28 8:59 UTC (permalink / raw)
To: Hugh Dickins, Will Deacon
Cc: linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe,
John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu,
Rik van Riel, Vlastimil Babka, Ge Yang
On 28.08.25 10:47, Hugh Dickins wrote:
> On Sun, 24 Aug 2025, Hugh Dickins wrote:
>> On Mon, 18 Aug 2025, Will Deacon wrote:
>>> On Mon, Aug 18, 2025 at 02:31:42PM +0100, Will Deacon wrote:
>>>> On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote:
>>>>> I think replace the folio_test_mlocked(folio) part of it by
>>>>> (folio_test_mlocked(folio) && !folio_test_unevictable(folio)).
>>>>> That should reduce the extra calls to a much more reasonable
>>>>> number, while still solving your issue.
>>>>
>>>> Alas, I fear that the folio may be unevictable by this point (which
>>>> seems to coincide with the readahead fault adding it to the LRU above)
>>>> but I can try it out.
>>>
>>> I gave this a spin but I still see failures with this change.
>>
>> Many thanks, Will, for the precisely relevant traces (in which,
>> by the way, mapcount=0 really means _mapcount=0 hence mapcount=1).
>>
>> Yes, those do indeed illustrate a case which my suggested
>> (folio_test_mlocked(folio) && !folio_test_unevictable(folio))
>> failed to cover. Very helpful to have an example of that.
>>
>> And many thanks, David, for your reminder of commit 33dfe9204f29
>> ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
>>
>> Yes, I strongly agree with your suggestion that the mlock batch
>> be brought into line with its change to the ordinary LRU batches,
>> and agree that doing so will be likely to solve Will's issue
>> (and similar cases elsewhere, without needing to modify them).
>>
>> Now I just have to cool my head and get back down into those
>> mlock batches. I am fearful that making a change there to suit
>> this case will turn out later to break another case (and I just
>> won't have time to redevelop as thorough a grasp of the races as
>> I had back then). But if we're lucky, applying that "one batch
>> at a time" rule will actually make it all more comprehensible.
>>
>> (I so wish we had spare room in struct page to keep the address
>> of that one batch entry, or the CPU to which that one batch
>> belongs: then, although that wouldn't eliminate all uses of
>> lru_add_drain_all(), it would allow us to efficiently extract
>> a target page from its LRU batch without a remote drain.)
>>
>> I have not yet begun to write such a patch, and I'm not yet sure
>> that it's even feasible: this mail sent to get the polite thank
>> yous out of my mind, to help clear it for getting down to work.
>
> It took several days in search of the least bad compromise, but
> in the end I concluded the opposite of what we'd intended above.
>
> There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8
> ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
> and Ge Yang's 6.11 33dfe9204f29
> ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
>
> It turns out that the mm/swap.c folio batches (apart from lru_add)
> are all for best-effort, doesn't matter if it's missed, operations;
> whereas mlock and munlock are more serious. Probably mlock could
> be (not very satisfactorily) converted, but then munlock? Because
> of failed folio_test_clear_lru()s, it would be far too likely to
> err on either side, munlocking too soon or too late.
>
> I've concluded that one or the other has to go. If we're having
> a beauty contest, there's no doubt that 33dfe9204f29 is much nicer
> than 2fbb0c10d1e8 (which is itself far from perfect). But functionally,
> I'm afraid that removing the mlock/munlock batching will show up as a
> perceptible regression in realistic workloads; and on consideration,
> I've found no real justification for the LRU flag clearing change.
Just to understand what you are saying: are you saying that we will go
back to having a folio being part of multiple LRU caches? :/ If so, I
really, really hope that we can find another way and not go back to that
old handling.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-28 8:59 ` David Hildenbrand
@ 2025-08-28 16:12 ` Hugh Dickins
2025-08-28 20:38 ` David Hildenbrand
0 siblings, 1 reply; 20+ messages in thread
From: Hugh Dickins @ 2025-08-28 16:12 UTC (permalink / raw)
To: David Hildenbrand
Cc: Hugh Dickins, Will Deacon, linux-mm, linux-kernel, Keir Fraser,
Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton,
Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang
On Thu, 28 Aug 2025, David Hildenbrand wrote:
> On 28.08.25 10:47, Hugh Dickins wrote:
...
> > It took several days in search of the least bad compromise, but
> > in the end I concluded the opposite of what we'd intended above.
> >
> > There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8
> > ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
> > and Ge Yang's 6.11 33dfe9204f29
> > ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
> >
> > It turns out that the mm/swap.c folio batches (apart from lru_add)
> > are all for best-effort, doesn't matter if it's missed, operations;
> > whereas mlock and munlock are more serious. Probably mlock could
> > be (not very satisfactorily) converted, but then munlock? Because
> > of failed folio_test_clear_lru()s, it would be far too likely to
> > err on either side, munlocking too soon or too late.
> >
> > I've concluded that one or the other has to go. If we're having
> > a beauty contest, there's no doubt that 33dfe9204f29 is much nicer
> > than 2fbb0c10d1e8 (which is itself far from perfect). But functionally,
> > I'm afraid that removing the mlock/munlock batching will show up as a
> > perceptible regression in realistic workloads; and on consideration,
> > I've found no real justification for the LRU flag clearing change.
>
> Just to understand what you are saying: are you saying that we will go back to
> having a folio being part of multiple LRU caches?
Yes. Well, if you count the mlock/munlock batches in as "LRU caches",
then that has been so all along.
> :/ If so, I really, really
> hope that we can find another way and not go back to that old handling.
For what reason? It sounded like a nice "invariant" to keep in mind,
but it's a limitation, and what purpose was it actually serving?
If it's the "spare room in struct page to keep the address of that one
batch entry ... efficiently extract ..." that I was dreaming of: that
has to be a future thing, when perhaps memdescs will allow an extensible
structure to be attached, and extending it for an mlocked folio (to hold
the mlock_count instead of squeezing it into lru.prev) would not need
mlock/munlock batching at all (I guess: far from uppermost in my mind!),
and including a field for "efficiently extract" from LRU batch would be
nice.
But, memdescs or not, there will always be pressure to keep the common
struct as small as possible, so I don't know if we would actually go
that way.
But I suspect that was not your reason at all: please illuminate.
Hugh
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-28 16:12 ` Hugh Dickins
@ 2025-08-28 20:38 ` David Hildenbrand
2025-08-29 1:58 ` Hugh Dickins
0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2025-08-28 20:38 UTC (permalink / raw)
To: Hugh Dickins
Cc: Will Deacon, linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe,
John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu,
Rik van Riel, Vlastimil Babka, Ge Yang
On 28.08.25 18:12, Hugh Dickins wrote:
> On Thu, 28 Aug 2025, David Hildenbrand wrote:
>> On 28.08.25 10:47, Hugh Dickins wrote:
> ...
>>> It took several days in search of the least bad compromise, but
>>> in the end I concluded the opposite of what we'd intended above.
>>>
>>> There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8
>>> ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
>>> and Ge Yang's 6.11 33dfe9204f29
>>> ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
>>>
>>> It turns out that the mm/swap.c folio batches (apart from lru_add)
>>> are all for best-effort, doesn't matter if it's missed, operations;
>>> whereas mlock and munlock are more serious. Probably mlock could
>>> be (not very satisfactorily) converted, but then munlock? Because
>>> of failed folio_test_clear_lru()s, it would be far too likely to
>>> err on either side, munlocking too soon or too late.
>>>
>>> I've concluded that one or the other has to go. If we're having
>>> a beauty contest, there's no doubt that 33dfe9204f29 is much nicer
>>> than 2fbb0c10d1e8 (which is itself far from perfect). But functionally,
>>> I'm afraid that removing the mlock/munlock batching will show up as a
> >>> perceptible regression in realistic workloads; and on consideration,
>>> I've found no real justification for the LRU flag clearing change.
>>
>> Just to understand what you are saying: are you saying that we will go back to
>> having a folio being part of multiple LRU caches?
>
> Yes. Well, if you count the mlock/munlock batches in as "LRU caches",
> then that has been so all along.
Yes ...
>
> >> :/ If so, I really, really
>> hope that we can find another way and not go back to that old handling.
>
> For what reason? It sounded like a nice "invariant" to keep in mind,
> but it's a limitation, and what purpose was it actually serving?
I liked the semantics that if !lru, there could be at most one reference
from the LRU caches.
That is, if there are two references, you don't even have to bother with
flushing anything.
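
Something like this is what I have in mind, as a rough sketch (the helper
name and the pin accounting are made up; only the folio_*() calls are real
kernel helpers):

	/*
	 * Sketch only: under the "a folio sits in at most one LRU cache
	 * when !lru" rule, a drain can remove at most one unexplained
	 * reference, so with two or more extra references there is no
	 * point flushing anything at all.
	 */
	static bool drain_could_explain_refs(struct folio *folio, int our_pins)
	{
		int unexplained = folio_ref_count(folio) -
				  folio_expected_ref_count(folio) - our_pins;

		return !folio_test_lru(folio) && unexplained == 1;
	}
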
>
> If it's the "spare room in struct page to keep the address of that one
> batch entry ... efficiently extract ..." that I was dreaming of: that
> has to be a future thing, when perhaps memdescs will allow an extensible
> structure to be attached, and extending it for an mlocked folio (to hold
> the mlock_count instead of squeezing it into lru.prev) would not need
> mlock/munlock batching at all (I guess: far from uppermost in my mind!),
> and including a field for "efficiently extract" from LRU batch would be
> nice.
>
> But, memdescs or not, there will always be pressure to keep the common
> struct as small as possible, so I don't know if we would actually go
> that way.
>
> But I suspect that was not your reason at all: please illuminate.
You are very right :)
Regarding the issue at hand:
There were discussions at LSF/MM about also putting (some) large folios
onto the LRU caches.
In that context, GUP could take multiple references on the same folio,
and a simple folio_expected_ref_count() + 1 would no longer do the trick.
I thought about this today, and likely it could be handled by scanning
the page array for same folios etc. A bit nasty when wanting to cover
all corner cases (folio pages must not be consecutive in the passed
array) ...
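
Roughly along these lines, as a sketch only (the helper is hypothetical,
not proposed code; page_folio() is the real helper):

	/*
	 * Illustrative only: count how many entries of the pinned page
	 * array belong to the given folio, so an expected-refcount check
	 * could allow for multiple GUP pins on one large folio. Pages of
	 * a folio need not be adjacent in the array, hence the full scan.
	 */
	static long gup_pins_on_folio(struct page **pages,
				      unsigned long nr_pages,
				      struct folio *folio)
	{
		unsigned long i;
		long pins = 0;

		for (i = 0; i < nr_pages; i++)
			if (page_folio(pages[i]) == folio)
				pins++;

		return pins;
	}
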
Apart from that issue, I liked the idea of a "single entry in the cache"
for other reasons: it gives clear semantics. We cannot end up in a
scenario where someone performs OPX and later someone OPY on a folio,
but the way the lru caches are flushed we might end up processing OPX
after OPY -- this should be able to happen in case of local or remote
flushes IIRC.
Anyhow, I quickly scanned your code. The folio_expected_ref_count()
should solve the issue for now. It's quite unfortunate that any raised
reference will make us now flush all remote LRU caches.
Maybe we just want to limit it to !folio_test_large(), because flushing
there really doesn't give us any chance in succeeding right now? Not
sure if it makes any difference in practice, though, we'll likely be
flushing remote LRU caches now more frequently either way.
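
I.e. something like the following sketch (hypothetical helper wrapped
around the gup.c hunk being discussed):

	/*
	 * Sketch only: consider a remote drain only when the folio is
	 * small enough that the LRU caches could actually be holding the
	 * extra reference; the caches refuse to batch large folios today.
	 */
	static bool drain_might_help(struct folio *folio)
	{
		if (folio_test_large(folio))
			return false;

		return folio_ref_count(folio) !=
		       folio_expected_ref_count(folio) + 1;
	}
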
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-28 20:38 ` David Hildenbrand
@ 2025-08-29 1:58 ` Hugh Dickins
2025-08-29 8:56 ` David Hildenbrand
0 siblings, 1 reply; 20+ messages in thread
From: Hugh Dickins @ 2025-08-29 1:58 UTC (permalink / raw)
To: David Hildenbrand
Cc: Hugh Dickins, Will Deacon, linux-mm, linux-kernel, Keir Fraser,
Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton,
Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang
On Thu, 28 Aug 2025, David Hildenbrand wrote:
> On 28.08.25 18:12, Hugh Dickins wrote:
> > On Thu, 28 Aug 2025, David Hildenbrand wrote:
> >> On 28.08.25 10:47, Hugh Dickins wrote:
> > ...
> >>> It took several days in search of the least bad compromise, but
> >>> in the end I concluded the opposite of what we'd intended above.
> >>>
> >>> There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8
> >>> ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
> >>> and Ge Yang's 6.11 33dfe9204f29
> >>> ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
> >>>
> >>> It turns out that the mm/swap.c folio batches (apart from lru_add)
> >>> are all for best-effort, doesn't matter if it's missed, operations;
> >>> whereas mlock and munlock are more serious. Probably mlock could
> >>> be (not very satisfactorily) converted, but then munlock? Because
> >>> of failed folio_test_clear_lru()s, it would be far too likely to
> >>> err on either side, munlocking too soon or too late.
> >>>
> >>> I've concluded that one or the other has to go. If we're having
> >>> a beauty contest, there's no doubt that 33dfe9204f29 is much nicer
> >>> than 2fbb0c10d1e8 (which is itself far from perfect). But functionally,
> >>> I'm afraid that removing the mlock/munlock batching will show up as a
> > > perceptible regression in realistic workloads; and on consideration,
> >>> I've found no real justification for the LRU flag clearing change.
> >>
> >> Just to understand what you are saying: are you saying that we will go back
> >> to
> >> having a folio being part of multiple LRU caches?
> >
> > Yes. Well, if you count the mlock/munlock batches in as "LRU caches",
> > then that has been so all along.
>
> Yes ...
>
> >
> >> :/ If so, I really, really
> >> hope that we can find another way and not go back to that old handling.
> >
> > For what reason? It sounded like a nice "invariant" to keep in mind,
> > but it's a limitation, and what purpose was it actually serving?
>
> I liked the semantics that if !lru, there could be at most one reference from
> the LRU caches.
>
> That is, if there are two references, you don't even have to bother with
> flushing anything.
If that assumption is being put into practice anywhere (not that I know of),
then it's currently wrong and needs correcting.
It would be nice and simple if it worked, but I couldn't get it to work
with mlock/munlock batching, so it seemed better to give up on the
pretence.
And one of the points of using a pagevec used to be precisely that a
page could exist on more than one at a time (unlike threading via lru).
>
> >
> > If it's the "spare room in struct page to keep the address of that one
> > batch entry ... efficiently extract ..." that I was dreaming of: that
> > has to be a future thing, when perhaps memdescs will allow an extensible
> > structure to be attached, and extending it for an mlocked folio (to hold
> > the mlock_count instead of squeezing it into lru.prev) would not need
> > mlock/munlock batching at all (I guess: far from uppermost in my mind!),
> > and including a field for "efficiently extract" from LRU batch would be
> > nice.
> >
> > But, memdescs or not, there will always be pressure to keep the common
> > struct as small as possible, so I don't know if we would actually go
> > that way.
> >
> > But I suspect that was not your reason at all: please illuminate.
>
> You are very right :)
OK, thanks, I'll stop reading further now :)
>
> Regarding the issue at hand:
>
> There were discussions at LSF/MM about also putting (some) large folios onto
> the LRU caches.
>
> In that context, GUP could take multiple references on the same folio, and a
> simple folio_expected_ref_count() + 1 would no longer do the trick.
>
> I thought about this today, and likely it could be handled by scanning the
> page array for same folios etc. A bit nasty when wanting to cover all corner
> cases (folio pages must not be consecutive in the passed array ) ...
I haven't thought about that problem at all (except when working around
a similar issue in mm/mempolicy.c's folio queueing), but can sympathize.
It had irritated me to notice how the flush-immediately code for 512-page
folios now extends to 2-page folios: you've enlightened me why that remains
so, I hadn't thought of the implications. Perhaps even more reason to
allow for multiple references on the pagevecs/batches?
>
> Apart from that issue, I liked the idea of a "single entry in the cache" for
> other reasons: it gives clear semantics. We cannot end up in a scenario where
> someone performs OPX and later someone OPY on a folio, but the way the lru
> caches are flushed we might end up processing OPX after OPY -- this should be
> able to happen in case of local or remote flushes IIRC.
It's been that way for many years, that's how they are.
>
> Anyhow, I quickly scanned your code. The folio_expected_ref_count() should
> solve the issue for now. It's quite unfortunate that any raised reference will
> make us now flush all remote LRU caches.
There will be more false positives (drains which drain nothing relevant)
by relying on ref_count rather than test_lru, yes. But certainly there
were already false positives by relying on test_lru. And I expect the
preparatory local lru_add_drain() to cut out a lot of true positives.
>
> Maybe we just want to limit it to !folio_test_large(), because flushing there
> really doesn't give us any chance in succeeding right now? Not sure if it
> makes any difference in practice, though, we'll likely be flushing remote LRU
> caches now more frequently either way.
Ah, good idea, with or without my changes. Maybe material for a separate
patch. I wonder if we would do better to add a folio_is_batchable() and
use that consistently in all of the places which are refusing to batch
when folio_test_large() - I wonder if a !folio_test_large() here will
get missed if there's a change there.
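
Something as dumb as this sketch, presumably (name and placement still to
be decided, nothing like it exists yet):

	/*
	 * Sketch of the folio_is_batchable() idea: one predicate for
	 * every site that currently open-codes the refusal to batch
	 * large folios, so any future change of policy lands in
	 * exactly one place.
	 */
	static inline bool folio_is_batchable(struct folio *folio)
	{
		return !folio_test_large(folio);
	}
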
Thanks,
Hugh
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-29 1:58 ` Hugh Dickins
@ 2025-08-29 8:56 ` David Hildenbrand
0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-08-29 8:56 UTC (permalink / raw)
To: Hugh Dickins
Cc: Will Deacon, linux-mm, linux-kernel, Keir Fraser, Jason Gunthorpe,
John Hubbard, Frederick Mayle, Andrew Morton, Peter Xu,
Rik van Riel, Vlastimil Babka, Ge Yang
>>>
>>>> :/ If so, I really, really
>>>> hope that we can find another way and not go back to that old handling.
>>>
>>> For what reason? It sounded like a nice "invariant" to keep in mind,
>>> but it's a limitation, and what purpose was it actually serving?
>>
>> I liked the semantics that if !lru, there could be at most one reference from
>> the LRU caches.
>>
>> That is, if there are two references, you don't even have to bother with
>> flushing anything.
>
> If that assumption is being put into practice anywhere (not that I know of),
> then it's currently wrong and needs correcting.
We do make use of that property in wp_can_reuse_anon_folio(), to just
fail fast -- that's what Linus argued for back then: give up fast and
just create a page copy.
Essentially hidden in the "folio_ref_count(folio) > 3" and
"!folio_test_lru(folio)" code path. mlock caches would not be considered
right now.
Now, in contrast to page pinning, that code is pure best effort, and we
don't care that much about always getting it right (e.g., no remote
draining, no draining if there is a clear indication that it might help).
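
Paraphrasing from memory, the shape is roughly this (a sketch, not a
verbatim copy of mm/memory.c):

	/*
	 * Paraphrased fast-fail shape: too many references and we give
	 * up immediately and copy; a local drain is only attempted when
	 * the folio is off the LRU, on the theory that a single local
	 * cache entry may explain the extra reference.
	 */
	static bool reuse_fast_path(struct folio *folio)
	{
		if (folio_ref_count(folio) > 3)
			return false;		/* give up fast, just copy */

		if (!folio_test_lru(folio))
			lru_add_drain();	/* local, best-effort only */

		/* only our mapping (and the swapcache, if any) left? */
		return folio_ref_count(folio) ==
		       1 + folio_test_swapcache(folio);
	}
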
>
> It would be nice and simple if it worked, but I couldn't get it to work
> with mlock/munlock batching, so it seemed better to give up on the
> pretence.
Thanks for trying!
[...]
>> Regarding the issue at hand:
>>
>> There were discussions at LSF/MM about also putting (some) large folios onto
>> the LRU caches.
>>
>> In that context, GUP could take multiple references on the same folio, and a
>> simple folio_expected_ref_count() + 1 would no longer do the trick.
>>
>> I thought about this today, and likely it could be handled by scanning the
>> page array for same folios etc. A bit nasty when wanting to cover all corner
>> cases (folio pages must not be consecutive in the passed array ) ...
>
> I haven't thought about that problem at all (except when working around
> a similar issue in mm/mempolicy.c's folio queueing), but can sympathize.
>
> It had irritated me to notice how the flush-immediately code for 512-page
> folios now extends to 2-page folios: you've enlightened me why that remains
> so, I hadn't thought of the implications. Perhaps even more reason to
> allow for multiple references on the pagevecs/batches?
Not sure TBH if multiple references from pagevecs/batches really play a
role here :)
>
>>
>> Apart from that issue, I liked the idea of a "single entry in the cache" for
>> other reasons: it gives clear semantics. We cannot end up in a scenario where
>> someone performs OPX and later someone OPY on a folio, but the way the lru
>> caches are flushed we might end up processing OPX after OPY -- this should be
>> able to happen in case of local or remote flushes IIRC.
>
> It's been that way for many years, that's how they are.
Yeah, but I enjoyed clearer semantics: like a folio_activate() actually
resulting in an activation, not getting swallowed by a
folio_deactivate() stranded somewhere.
>>
>> Maybe we just want to limit it to !folio_test_large(), because flushing there
>> really doesn't give us any chance in succeeding right now? Not sure if it
>> makes any difference in practice, though, we'll likely be flushing remote LRU
>> caches now more frequently either way.
>
> Ah, good idea, with or without my changes. Maybe material for a separate
> patch. I wonder if we would do better to add a folio_is_batchable() and
> use that consistently in all of the places which are refusing to batch
> when folio_test_large() - I wonder if a !folio_test_large() here will
> get missed if there's a change there.
You read my mind.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-28 8:47 ` Hugh Dickins
2025-08-28 8:59 ` David Hildenbrand
@ 2025-08-29 11:57 ` Will Deacon
2025-08-29 13:21 ` Will Deacon
2025-08-29 15:46 ` Hugh Dickins
1 sibling, 2 replies; 20+ messages in thread
From: Will Deacon @ 2025-08-29 11:57 UTC (permalink / raw)
To: Hugh Dickins
Cc: David Hildenbrand, linux-mm, linux-kernel, Keir Fraser,
Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton,
Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang
Hi Hugh,
On Thu, Aug 28, 2025 at 01:47:14AM -0700, Hugh Dickins wrote:
> On Sun, 24 Aug 2025, Hugh Dickins wrote:
> > On Mon, 18 Aug 2025, Will Deacon wrote:
> > > On Mon, Aug 18, 2025 at 02:31:42PM +0100, Will Deacon wrote:
> > > > On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote:
> > > > > I think replace the folio_test_mlocked(folio) part of it by
> > > > > (folio_test_mlocked(folio) && !folio_test_unevictable(folio)).
> > > > > That should reduce the extra calls to a much more reasonable
> > > > > number, while still solving your issue.
> > > >
> > > > Alas, I fear that the folio may be unevictable by this point (which
> > > > seems to coincide with the readahead fault adding it to the LRU above)
> > > > but I can try it out.
> > >
> > > I gave this a spin but I still see failures with this change.
> >
> > Many thanks, Will, for the precisely relevant traces (in which,
> > by the way, mapcount=0 really means _mapcount=0 hence mapcount=1).
> >
> > Yes, those do indeed illustrate a case which my suggested
> > (folio_test_mlocked(folio) && !folio_test_unevictable(folio))
> > failed to cover. Very helpful to have an example of that.
> >
> > And many thanks, David, for your reminder of commit 33dfe9204f29
> > ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
> >
> > Yes, I strongly agree with your suggestion that the mlock batch
> > be brought into line with its change to the ordinary LRU batches,
> > and agree that doing so will be likely to solve Will's issue
> > (and similar cases elsewhere, without needing to modify them).
> >
> > Now I just have to cool my head and get back down into those
> > mlock batches. I am fearful that making a change there to suit
> > this case will turn out later to break another case (and I just
> > won't have time to redevelop as thorough a grasp of the races as
> > I had back then). But if we're lucky, applying that "one batch
> > at a time" rule will actually make it all more comprehensible.
> >
> > (I so wish we had spare room in struct page to keep the address
> > of that one batch entry, or the CPU to which that one batch
> > belongs: then, although that wouldn't eliminate all uses of
> > lru_add_drain_all(), it would allow us to efficiently extract
> > a target page from its LRU batch without a remote drain.)
> >
> > I have not yet begun to write such a patch, and I'm not yet sure
> > that it's even feasible: this mail sent to get the polite thank
> > yous out of my mind, to help clear it for getting down to work.
>
> It took several days in search of the least bad compromise, but
> in the end I concluded the opposite of what we'd intended above.
>
> There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8
> ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
> and Ge Yang's 6.11 33dfe9204f29
> ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
That's actually pretty good news, as I was initially worried that we'd
have to backport a fix all the way back to 6.1. From the above, the only
LTS affected is 6.12.y.
> It turns out that the mm/swap.c folio batches (apart from lru_add)
> are all for best-effort, doesn't matter if it's missed, operations;
> whereas mlock and munlock are more serious. Probably mlock could
> be (not very satisfactorily) converted, but then munlock? Because
> of failed folio_test_clear_lru()s, it would be far too likely to
> err on either side, munlocking too soon or too late.
>
> I've concluded that one or the other has to go. If we're having
> a beauty contest, there's no doubt that 33dfe9204f29 is much nicer
> than 2fbb0c10d1e8 (which is itself far from perfect). But functionally,
> I'm afraid that removing the mlock/munlock batching will show up as a
> perceptible regression in realistic workloads; and on consideration,
> I've found no real justification for the LRU flag clearing change.
>
> Unless I'm mistaken, collect_longterm_unpinnable_folios() should
> never have been relying on folio_test_lru(), and should simply be
> checking for expected ref_count instead.
>
> Will, please give the portmanteau patch (combination of four)
> below a try: reversion of 33dfe9204f29 and a later MGLRU fixup,
> corrected test in collect...(), preparatory lru_add_drain() there.
>
> I hope you won't be proving me wrong again, and I can move on to
> writing up those four patches (and adding probably three more that
> make sense in such a series, but should not affect your testing).
>
> I've tested enough to know that it's not harmful, but am hoping
> to take advantage of your superior testing, particularly in the
> GUP pin area. But if you're uneasy with the combination, and would
> prefer to check just the minimum, then ignore the reversions and try
> just the mm/gup.c part of it - that will probably be good enough for
> you even without the reversions.
Thanks, I'll try to test the whole lot. I was geographically separated
from my testing device yesterday but I should be able to give it a spin
later today. I'm _supposed_ to be writing my KVM Forum slides for next
week, so this offers a perfect opportunity to procrastinate.
> Patch is against 6.17-rc3; but if you'd prefer the patch against 6.12
> (or an intervening release), I already did the backport so please just
> ask.
We've got 6.15 working well at the moment, so I'll backport your diff
to that.
One question on the diff below:
> Thanks!
>
> mm/gup.c | 5 ++++-
> mm/swap.c | 50 ++++++++++++++++++++++++++------------------------
> mm/vmscan.c | 2 +-
> 3 files changed, 31 insertions(+), 26 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index adffe663594d..9f7c87f504a9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2291,6 +2291,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> struct folio *folio;
> long i = 0;
>
> + lru_add_drain();
> +
> for (folio = pofs_get_folio(pofs, i); folio;
> folio = pofs_next_folio(folio, pofs, &i)) {
>
> @@ -2307,7 +2309,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> continue;
> }
>
> - if (!folio_test_lru(folio) && drain_allow) {
> + if (drain_allow && folio_ref_count(folio) !=
> + folio_expected_ref_count(folio) + 1) {
> lru_add_drain_all();
How does this synchronise with the folio being added to the mlock batch
on another CPU?
need_mlock_drain(), which is what I think lru_add_drain_all() ends up
using to figure out which CPU batches to process, just looks at the
'nr' field in the batch and I can't see anything in mlock_folio() to
ensure any ordering between adding the folio to the batch and
incrementing its refcount.
Then again, my hack to use folio_test_mlocked() would have a similar
issue because the flag is set (albeit with barrier semantics) before
adding the folio to the batch, meaning the drain could miss the folio.
I guess there's some higher-level synchronisation making this all work,
but it would be good to understand that as I can't see that
collect_longterm_unpinnable_folios() can rely on much other than the pin.
Will
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-29 11:57 ` Will Deacon
@ 2025-08-29 13:21 ` Will Deacon
2025-08-29 16:04 ` Hugh Dickins
2025-08-29 15:46 ` Hugh Dickins
1 sibling, 1 reply; 20+ messages in thread
From: Will Deacon @ 2025-08-29 13:21 UTC (permalink / raw)
To: Hugh Dickins
Cc: David Hildenbrand, linux-mm, linux-kernel, Keir Fraser,
Jason Gunthorpe, John Hubbard, Frederick Mayle, Andrew Morton,
Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang
On Fri, Aug 29, 2025 at 12:57:43PM +0100, Will Deacon wrote:
> On Thu, Aug 28, 2025 at 01:47:14AM -0700, Hugh Dickins wrote:
> > Patch is against 6.17-rc3; but if you'd prefer the patch against 6.12
> > (or an intervening release), I already did the backport so please just
> > ask.
>
> We've got 6.15 working well at the moment, so I'll backport your diff
> to that.
Notwithstanding my question about the synchronisation, I cherry-picked
86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count
calculation") to my 6.15-based Android tree and applied your diff on top.
With that, I've not managed to reproduce the original failure and
haven't observed any migration failures on the GUP path.
Cheers,
Will
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-29 11:57 ` Will Deacon
2025-08-29 13:21 ` Will Deacon
@ 2025-08-29 15:46 ` Hugh Dickins
1 sibling, 0 replies; 20+ messages in thread
From: Hugh Dickins @ 2025-08-29 15:46 UTC (permalink / raw)
To: Will Deacon
Cc: Hugh Dickins, David Hildenbrand, linux-mm, linux-kernel,
Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle,
Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang
On Fri, 29 Aug 2025, Will Deacon wrote:
> On Thu, Aug 28, 2025 at 01:47:14AM -0700, Hugh Dickins wrote:
...
> >
> > It took several days in search of the least bad compromise, but
> > in the end I concluded the opposite of what we'd intended above.
> >
> > There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8
> > ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
> > and Ge Yang's 6.11 33dfe9204f29
> > ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
>
> That's actually pretty good news, as I was initially worried that we'd
> have to backport a fix all the way back to 6.1. From the above, the only
> LTS affected is 6.12.y.
I'm not so sure of that. I think the 6.11 change tickled a particular
sequence that showed up in your testing, but the !folio_test_lru()
was never an adequate test for whether lru_add_drain_all() might help.
I can't try to estimate the probabilities, you'll have to make your
own decision, whether it's worth going back to change a release which
did not (I presume) show problems in real life.
...
> > Unless I'm mistaken, collect_longterm_unpinnable_folios() should
> > never have been relying on folio_test_lru(), and should simply be
> > checking for expected ref_count instead.
> >
> > Will, please give the portmanteau patch (combination of four)
> > below a try: reversion of 33dfe9204f29 and a later MGLRU fixup,
> > corrected test in collect...(), preparatory lru_add_drain() there.
> >
> > I hope you won't be proving me wrong again, and I can move on to
> > writing up those four patches (and adding probably three more that
> > make sense in such a series, but should not affect your testing).
> >
> > I've tested enough to know that it's not harmful, but am hoping
> > to take advantage of your superior testing, particularly in the
> > GUP pin area. But if you're uneasy with the combination, and would
> > prefer to check just the minimum, then ignore the reversions and try
> > just the mm/gup.c part of it - that will probably be good enough for
> > you even without the reversions.
>
> Thanks, I'll try to test the whole lot. I was geographically separated
> from my testing device yesterday but I should be able to give it a spin
> later today. I'm _supposed_ to be writing my KVM Forum slides for next
> week, so this offers a perfect opportunity to procrastinate.
Well understood :) And you've already reported on the testing, thanks.
>
> > Patch is against 6.17-rc3; but if you'd prefer the patch against 6.12
> > (or an intervening release), I already did the backport so please just
> > ask.
>
> We've got 6.15 working well at the moment, so I'll backport your diff
> to that.
>
> One question on the diff below:
>
> > Thanks!
> >
> > mm/gup.c | 5 ++++-
> > mm/swap.c | 50 ++++++++++++++++++++++++++------------------------
> > mm/vmscan.c | 2 +-
> > 3 files changed, 31 insertions(+), 26 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index adffe663594d..9f7c87f504a9 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2291,6 +2291,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> > struct folio *folio;
> > long i = 0;
> >
> > + lru_add_drain();
> > +
> > for (folio = pofs_get_folio(pofs, i); folio;
> > folio = pofs_next_folio(folio, pofs, &i)) {
> >
> > @@ -2307,7 +2309,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> > continue;
> > }
> >
> > - if (!folio_test_lru(folio) && drain_allow) {
> > + if (drain_allow && folio_ref_count(folio) !=
> > + folio_expected_ref_count(folio) + 1) {
> > lru_add_drain_all();
>
> How does this synchronise with the folio being added to the mlock batch
> on another CPU?
>
> need_mlock_drain(), which is what I think lru_add_drain_all() ends up
> using to figure out which CPU batches to process, just looks at the
> 'nr' field in the batch and I can't see anything in mlock_folio() to
> ensure any ordering between adding the folio to the batch and
> incrementing its refcount.
>
> Then again, my hack to use folio_test_mlocked() would have a similar
> issue because the flag is set (albeit with barrier semantics) before
> adding the folio to the batch, meaning the drain could miss the folio.
>
> I guess there's some higher-level synchronisation making this all work,
> but it would be good to understand that as I can't see that
> collect_longterm_unpinnable_folios() can rely on much other than the pin.
No such strict synchronization: you've been misled if people have told
you that this pinning migration stuff is deterministically successful:
it's best effort - or will others on the Cc disagree?
Just as there's no synchronization between the calculation inside
folio_expected_ref_count() and the reading of folio's refcount.
It wouldn't make sense for this unpinnable collection to anguish over
such synchronization, when a moment later the migration is liable to
fail (on occasion) for other transient reasons. All ending up reported
as -ENOMEM, apparently? That looks unhelpful.
There is a heavy hammer called lru_cache_disable() in mm/swap.c,
stopping the batching and doing its own lru_add_drain_all(): that is
used by CMA and a few others, but not by this pinning (and I do not
want to argue that it should be).
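
For comparison, the pattern that hammer imposes on a caller looks roughly
like this (sketch only; do_the_migration() is a stand-in, not a real
function, while lru_cache_disable()/lru_cache_enable() are the real API):

	/*
	 * Batching is switched off and all CPUs drained for the whole
	 * duration of the operation, at the cost of hurting everyone
	 * else on the system.
	 */
	static int migrate_with_batching_disabled(struct list_head *folio_list)
	{
		int ret;

		lru_cache_disable();	/* drains all CPUs, keeps batching off */
		ret = do_the_migration(folio_list);
		lru_cache_enable();

		return ret;
	}
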
Hugh
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
2025-08-29 13:21 ` Will Deacon
@ 2025-08-29 16:04 ` Hugh Dickins
0 siblings, 0 replies; 20+ messages in thread
From: Hugh Dickins @ 2025-08-29 16:04 UTC (permalink / raw)
To: Will Deacon
Cc: Hugh Dickins, David Hildenbrand, linux-mm, linux-kernel,
Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle,
Andrew Morton, Peter Xu, Rik van Riel, Vlastimil Babka, Ge Yang
On Fri, 29 Aug 2025, Will Deacon wrote:
> On Fri, Aug 29, 2025 at 12:57:43PM +0100, Will Deacon wrote:
> > On Thu, Aug 28, 2025 at 01:47:14AM -0700, Hugh Dickins wrote:
> > > Patch is against 6.17-rc3; but if you'd prefer the patch against 6.12
> > > (or an intervening release), I already did the backport so please just
> > > ask.
> >
> > We've got 6.15 working well at the moment, so I'll backport your diff
> > to that.
>
> Notwithstanding my question about the synchronisation, I cherry-picked
> 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count
> calculation") to my 6.15-based Android tree and applied your diff on top.
Yes, I cherry-picked exactly that into my 6.12. Not a big deal,
but a word of advance warning: the first patch in my series will
(I believe) be a fix to that folio_expected_ref_count(), to allow
for PG_private_2, which implies +1 on refcount (I've not yet
researched whether it's +1 or +2 if PG_private and PG_private_2
are both set - comments I've seen imply +1 but I need to check).
I thought of that when doing the cherry-pick because I thought that
PG_private_2 had already been killed off by now in the latest tree:
but was surprised to find that it is still there in 6.17-rc.
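
If it does turn out to be a flat +1, the fixup might be as simple as this
sketch (assuming +1; whether PG_private and PG_private_2 together mean +1
or +2 is exactly what still needs checking):

	/*
	 * Sketch only: adjust the expected count for PG_private_2,
	 * assuming it contributes a single extra reference.
	 */
	static int folio_expected_ref_count_adjusted(struct folio *folio)
	{
		int refs = folio_expected_ref_count(folio);

		if (folio_test_private_2(folio))
			refs += 1;

		return refs;
	}
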
>
> With that, I've not managed to reproduce the original failure and
> haven't observed any migration failures on the GUP path.
Great, many thanks to you, Will.
Hugh
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread
Thread overview: 20+ messages
2025-08-15 10:18 [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration Will Deacon
2025-08-16 1:03 ` John Hubbard
2025-08-16 4:33 ` Hugh Dickins
2025-08-18 13:38 ` Will Deacon
2025-08-16 4:14 ` Hugh Dickins
2025-08-16 8:15 ` David Hildenbrand
2025-08-18 13:31 ` Will Deacon
2025-08-18 14:31 ` Will Deacon
2025-08-25 1:25 ` Hugh Dickins
2025-08-25 16:04 ` David Hildenbrand
2025-08-28 8:47 ` Hugh Dickins
2025-08-28 8:59 ` David Hildenbrand
2025-08-28 16:12 ` Hugh Dickins
2025-08-28 20:38 ` David Hildenbrand
2025-08-29 1:58 ` Hugh Dickins
2025-08-29 8:56 ` David Hildenbrand
2025-08-29 11:57 ` Will Deacon
2025-08-29 13:21 ` Will Deacon
2025-08-29 16:04 ` Hugh Dickins
2025-08-29 15:46 ` Hugh Dickins