From: Hugh Dickins <hughd@google.com>
To: Will Deacon <will@kernel.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Hugh Dickins <hughd@google.com>, Keir Fraser <keirf@google.com>,
Jason Gunthorpe <jgg@ziepe.ca>,
David Hildenbrand <david@redhat.com>,
John Hubbard <jhubbard@nvidia.com>,
Frederick Mayle <fmayle@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Peter Xu <peterx@redhat.com>, Rik van Riel <riel@surriel.com>,
Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
Date: Fri, 15 Aug 2025 21:14:48 -0700 (PDT) [thread overview]
Message-ID: <c5bac539-fd8a-4db7-c21c-cd3e457eee91@google.com> (raw)
In-Reply-To: <20250815101858.24352-1-will@kernel.org>
On Fri, 15 Aug 2025, Will Deacon wrote:
> When taking a longterm GUP pin via pin_user_pages(),
> __gup_longterm_locked() tries to migrate target folios that should not
> be longterm pinned, for example because they reside in a CMA region or
> movable zone. This is done by first pinning all of the target folios
> anyway, collecting all of the longterm-unpinnable target folios into a
> list, dropping the pins that were just taken and finally handing the
> list off to migrate_pages() for the actual migration.
>
> It is critically important that no unexpected references are held on the
> folios being migrated, otherwise the migration will fail and
> pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
> relatively easy to observe migration failures when running pKVM (which
> uses pin_user_pages() on crosvm's virtual address space to resolve
> stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
> this results in the VM terminating prematurely.
>
> In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
> mapping of guest memory prior to the pinning. Subsequently, when
> pin_user_pages() walks the page-table, the relevant 'pte' is not
> present and so the faulting logic allocates a new folio, mlocks it
> with mlock_folio() and maps it in the page-table.
>
> Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
> batch by pagevec"), mlock/munlock operations on a folio (formerly page),
> are deferred. For example, mlock_folio() takes an additional reference
> on the target folio before placing it into a per-cpu 'folio_batch' for
> later processing by mlock_folio_batch(), which drops the refcount once
> the operation is complete. Processing of the batches is coupled with
> the LRU batch logic and can be forcefully drained with
> lru_add_drain_all() but as long as a folio remains unprocessed on the
> batch, its refcount will be elevated.
>
> This deferred batching therefore interacts poorly with the pKVM pinning
> scenario as we can find ourselves in a situation where the migration
> code fails to migrate a folio due to the elevated refcount from the
> pending mlock operation.
Thanks for the very full description, Will, that helped me a lot
(I know very little of the GUP pinning end).
But one thing would help me to understand better: are the areas being
pinned anonymous or shmem or file memory (or COWed shmem or file)?
From "the faulting logic allocates a new folio" I first assumed
anonymous, but later came to think "mlocks it with mlock_folio()"
implies they are shmem or file folios (which, yes, can also be
allocated by fault).
IIRC anonymous and COW faults would go the mlock_new_folio() way,
where the folio goes on to the mlock folio batch without having yet
reached LRU: those should be dealt with by the existing
!folio_test_lru() check.
>
> Extend the existing LRU draining logic in
> collect_longterm_unpinnable_folios() so that unpinnable mlocked folios
> on the LRU also trigger a drain.
>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Keir Fraser <keirf@google.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Frederick Mayle <fmayle@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Peter Xu <peterx@redhat.com>
> Fixes: 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>
> This has been quite unpleasant to debug and, as I'm not intimately
> familiar with the mm internals, I've tried to include all the relevant
> details in the commit message in case there's a preferred alternative
> way of solving the problem or there's a flaw in my logic.
>
> mm/gup.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index adffe663594d..656835890f05 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> continue;
> }
>
> - if (!folio_test_lru(folio) && drain_allow) {
> + if (drain_allow &&
> + (!folio_test_lru(folio) || folio_test_mlocked(folio))) {
> lru_add_drain_all();
> drain_allow = false;
> }
Hmm. That is going to call lru_add_drain_all() whenever any of the
pages in the list is mlocked, and lru_add_drain_all() is a function
we much prefer to avoid calling (it's much better than the old days
when it could involve every CPU IPIing every other CPU at the same
time; but it's still raising doubts to this day, and best avoided).
(Not as bad as I first thought: those unpinnably-placed mlocked
folios will get migrated, not appearing again in repeat runs.)
I think replace the folio_test_mlocked(folio) part of it by
(folio_test_mlocked(folio) && !folio_test_unevictable(folio)).
That should reduce the extra calls to a much more reasonable
number, while still solving your issue.
But in addition, please add an unconditional lru_add_drain()
(the local CPU one, not the inter-CPU _all) at the head of
collect_longterm_unpinnable_folios(). My guess is that that
would eliminate 90% of the calls to the lru_add_drain_all() below:
not quite enough to satisfy you, but enough to be a good improvement.
I realize that there has been a recent move to cut down even on
unjustified lru_add_drain()s; but an lru_add_drain() to avoid an
lru_add_drain_all() is a good trade.
(Vlastimil, yes, I've Cc'ed you because this reminds me of my
"Agreed" in that "Realtime threads" thread two or three weeks
ago: I haven't rethought it through again, and will probably
still agree with your "should be rare", but answering this mail
forces me to realize that I was thinking there of the folio being
mlocked before it reaches LRU, forgetting this case of the folio
already on LRU being mlocked.)
Hugh
next prev parent reply other threads:[~2025-08-16 4:15 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-15 10:18 [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration Will Deacon
2025-08-16 1:03 ` John Hubbard
2025-08-16 4:33 ` Hugh Dickins
2025-08-18 13:38 ` Will Deacon
2025-08-16 4:14 ` Hugh Dickins [this message]
2025-08-16 8:15 ` David Hildenbrand
2025-08-18 13:31 ` Will Deacon
2025-08-18 14:31 ` Will Deacon
2025-08-25 1:25 ` Hugh Dickins
2025-08-25 16:04 ` David Hildenbrand
2025-08-28 8:47 ` Hugh Dickins
2025-08-28 8:59 ` David Hildenbrand
2025-08-28 16:12 ` Hugh Dickins
2025-08-28 20:38 ` David Hildenbrand
2025-08-29 1:58 ` Hugh Dickins
2025-08-29 8:56 ` David Hildenbrand
2025-08-29 11:57 ` Will Deacon
2025-08-29 13:21 ` Will Deacon
2025-08-29 16:04 ` Hugh Dickins
2025-08-29 15:46 ` Hugh Dickins
2025-09-09 11:39 ` Will Deacon
2025-09-09 11:50 ` David Hildenbrand
2025-09-10 0:24 ` John Hubbard
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=c5bac539-fd8a-4db7-c21c-cd3e457eee91@google.com \
--to=hughd@google.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=fmayle@google.com \
--cc=jgg@ziepe.ca \
--cc=jhubbard@nvidia.com \
--cc=keirf@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=peterx@redhat.com \
--cc=riel@surriel.com \
--cc=vbabka@suse.cz \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.