From: David Hildenbrand <david@redhat.com>
To: Hugh Dickins <hughd@google.com>, Will Deacon <will@kernel.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Keir Fraser <keirf@google.com>, Jason Gunthorpe <jgg@ziepe.ca>,
	John Hubbard <jhubbard@nvidia.com>,
	Frederick Mayle <fmayle@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Xu <peterx@redhat.com>, Rik van Riel <riel@surriel.com>,
	Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
Date: Sat, 16 Aug 2025 10:15:59 +0200
Message-ID: <92b16e68-0083-48aa-b281-5c235fd91bfe@redhat.com>
In-Reply-To: <c5bac539-fd8a-4db7-c21c-cd3e457eee91@google.com>

On 16.08.25 06:14, Hugh Dickins wrote:
> On Fri, 15 Aug 2025, Will Deacon wrote:
> 
>> When taking a longterm GUP pin via pin_user_pages(),
>> __gup_longterm_locked() tries to migrate target folios that should not
>> be longterm pinned, for example because they reside in a CMA region or
>> movable zone. This is done by first pinning all of the target folios
>> anyway, collecting all of the longterm-unpinnable target folios into a
>> list, dropping the pins that were just taken and finally handing the
>> list off to migrate_pages() for the actual migration.
>>
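For reference, that sequence looks roughly like this (a simplified
sketch from memory, arguments elided; the real code lives in
__gup_longterm_locked() and its helpers in mm/gup.c):

	/* sketch only: pin first, sort out unpinnable folios afterwards */
	nr_pinned = __get_user_pages_locked(...);
	/* gather folios in CMA regions or ZONE_MOVABLE ... */
	collect_longterm_unpinnable_folios(&movable_folio_list, ...);
	/* ... then drop the pins just taken on them and migrate them; any
	 * unexpected extra reference will make the migration fail */
	migrate_longterm_unpinnable_folios(&movable_folio_list, ...);
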
>> It is critically important that no unexpected references are held on the
>> folios being migrated, otherwise the migration will fail and
>> pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
>> relatively easy to observe migration failures when running pKVM (which
>> uses pin_user_pages() on crosvm's virtual address space to resolve
>> stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
>> this results in the VM terminating prematurely.
>>
>> In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
>> mapping of guest memory prior to the pinning. Subsequently, when
>> pin_user_pages() walks the page-table, the relevant 'pte' is not
>> present and so the faulting logic allocates a new folio, mlocks it
>> with mlock_folio() and maps it in the page-table.
>>
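For illustration, the userspace side of that looks something like the
following (a minimal stand-in for what crosvm does; anonymous memory is
used here only to keep the example short, and which mlock path the
later fault takes depends on the backing, as discussed below):

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <stdio.h>

	int main(void)
	{
		size_t len = 16 * 4096;
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;
		/* lock pages only as they are faulted in, not upfront */
		if (mlock2(buf, len, MLOCK_ONFAULT))
			perror("mlock2");
		/* first touch: the fault allocates a new folio and mlocks it */
		buf[0] = 1;
		return 0;
	}
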
>> Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
>> batch by pagevec"), mlock/munlock operations on a folio (formerly page)
>> are deferred. For example, mlock_folio() takes an additional reference
>> on the target folio before placing it into a per-cpu 'folio_batch' for
>> later processing by mlock_folio_batch(), which drops the refcount once
>> the operation is complete. Processing of the batches is coupled with
>> the LRU batch logic and can be forcefully drained with
>> lru_add_drain_all() but as long as a folio remains unprocessed on the
>> batch, its refcount will be elevated.
>>
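Condensed, the deferral pattern is (from memory, locking and statistics
omitted; see mlock_folio() and mlock_folio_batch() in mm/mlock.c):

	void mlock_folio(struct folio *folio)
	{
		struct folio_batch *fbatch = this_cpu_fbatch();	/* simplified */

		folio_get(folio);		/* the elevated reference */
		if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
		    folio_test_large(folio) || lru_cache_disabled())
			mlock_folio_batch(fbatch);	/* drains, drops refs */
	}
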
>> This deferred batching therefore interacts poorly with the pKVM pinning
>> scenario as we can find ourselves in a situation where the migration
>> code fails to migrate a folio due to the elevated refcount from the
>> pending mlock operation.
> 
> Thanks for the very full description, Will, that helped me a lot
> (I know very little of the GUP pinning end).
> 
> But one thing would help me to understand better: are the areas being
> pinned anonymous or shmem or file memory (or COWed shmem or file)?
> 
>  From "the faulting logic allocates a new folio" I first assumed
> anonymous, but later came to think "mlocks it with mlock_folio()"
> implies they are shmem or file folios (which, yes, can also be
> allocated by fault).
> 
> IIRC anonymous and COW faults would go the mlock_new_folio() way,
> where the folio goes on to the mlock folio batch without having yet
> reached LRU: those should be dealt with by the existing
> !folio_test_lru() check.

Very right. That matches my experience: I don't remember easily running 
into this in wp_can_reuse_anon_folio(), where we also drain the LRU when 
we spot a !lru folio.
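
(For reference, the check I mean looks roughly like this, quoted from
memory:)

	/* mm/memory.c, wp_can_reuse_anon_folio(), approximately */
	if (!folio_test_lru(folio))
		/* a pending LRU (or mlock) batch may hold a reference */
		lru_add_drain();
	if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio))
		return false;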

Having to enlighten all such code to check for

	if (!folio_test_lru(folio) || folio_test_mlocked(folio))

is a bit suboptimal, but we could add a helper for that. (We might end 
up draining for folios that are actually mlocked, though. :( )
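
Something like this, say (helper name invented here):

	/* hypothetical helper, not in the tree */
	static inline bool folio_maybe_on_pcpu_batch(struct folio *folio)
	{
		/* off-LRU or mlocked: a per-CPU batch may hold a reference */
		return !folio_test_lru(folio) || folio_test_mlocked(folio);
	}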


Commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before 
adding to LRU batch") recently changed things so that a folio can only 
sit on a single LRU batch at a time.

So if we spot !folio_test_lru(folio), there can be at most one 
reference held by LRU batches.
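
That is, since that commit the add-to-batch path effectively does
(simplified):

	if (!folio_test_clear_lru(folio))
		return;		/* already on some batch, or isolated */
	folio_get(folio);
	folio_batch_add(fbatch, folio);	/* the flag comes back on drain */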


So naturally I wonder whether we could somehow make it work that a 
folio is either on an ordinary LRU batch or on the mlock batch, but 
never on both, and that checking the folio LRU flag alone would be 
sufficient.

That would mean clearing the LRU flag when adding a folio to the mlock 
batch. I am not sure how feasible that is, though (e.g., when the folio 
is already isolated).
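
Just to make the idea concrete (completely untested, hand-waving over
the fallback cases):

	/* hypothetical: the mlock batch claims the LRU flag, too */
	void mlock_folio(struct folio *folio)
	{
		if (!folio_test_clear_lru(folio))
			return;	/* isolated or already batched: fall back? */
		folio_get(folio);
		if (!folio_batch_add(fbatch, mlock_lru(folio)))	/* per-CPU batch */
			mlock_folio_batch(fbatch);
	}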

-- 
Cheers

David / dhildenb


Thread overview: 20+ messages
2025-08-15 10:18 [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration Will Deacon
2025-08-16  1:03 ` John Hubbard
2025-08-16  4:33   ` Hugh Dickins
2025-08-18 13:38   ` Will Deacon
2025-08-16  4:14 ` Hugh Dickins
2025-08-16  8:15   ` David Hildenbrand [this message]
2025-08-18 13:31   ` Will Deacon
2025-08-18 14:31     ` Will Deacon
2025-08-25  1:25       ` Hugh Dickins
2025-08-25 16:04         ` David Hildenbrand
2025-08-28  8:47         ` Hugh Dickins
2025-08-28  8:59           ` David Hildenbrand
2025-08-28 16:12             ` Hugh Dickins
2025-08-28 20:38               ` David Hildenbrand
2025-08-29  1:58                 ` Hugh Dickins
2025-08-29  8:56                   ` David Hildenbrand
2025-08-29 11:57           ` Will Deacon
2025-08-29 13:21             ` Will Deacon
2025-08-29 16:04               ` Hugh Dickins
2025-08-29 15:46             ` Hugh Dickins
