From: David Hildenbrand <david@redhat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Harry Yoo <harry.yoo@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	Pedro Falcato <pfalcato@suse.de>, Rik van Riel <riel@surriel.com>,
	Zi Yan <ziy@nvidia.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Nico Pache <npache@redhat.com>,
	Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
	Jakub Matena <matenajakub@gmail.com>,
	Wei Yang <richard.weiyang@gmail.com>,
	Barry Song <baohua@kernel.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
Date: Tue, 17 Jun 2025 13:49:21 +0200
Message-ID: <7beef290-e68b-4599-aedb-994c2e9fa237@redhat.com>
In-Reply-To: <7f22dec0-680b-4e3d-9aab-cd516dda8ed7@lucifer.local>

On 17.06.25 13:24, Lorenzo Stoakes wrote:
> On Tue, Jun 17, 2025 at 08:15:52PM +0900, Harry Yoo wrote:
>> On Mon, Jun 09, 2025 at 02:26:35PM +0100, Lorenzo Stoakes wrote:
>>> When mremap() moves a mapping around in memory, it goes to great lengths to
>>> avoid having to walk page tables as this is expensive and
>>> time-consuming.
>>>
>>> Rather, if the VMA was faulted (that is vma->anon_vma != NULL), the virtual
>>> page offset stored in the VMA at vma->vm_pgoff will remain the same, as
>>> well as all the folio indexes pointed at the associated anon_vma object.
>>>
>>> This means the VMA and page tables can simply be moved and this effects the
>>> change (and if we can move page tables at a higher page table level, this
>>> is even faster).
>>>
>>> While this is efficient, it does lead to big problems with VMA merging - in
>>> essence it causes faulted anonymous VMAs to not be mergeable under many
>>> circumstances once moved.
>>>
>>> This is limiting and leads to both a proliferation of unreclaimable,
>>> unmovable kernel metadata (VMAs, anon_vma's, anon_vma_chain's) and has an
>>> impact on further use of mremap(), which has a requirement that the VMA
>>> moved (which can also be a partial range within a VMA) may span only a
>>> single VMA.
>>>
>>> This makes the mergeability or not of VMAs in effect a uAPI concern.
>>>
>>> In some use cases, users may wish to accept the overhead of actually going
>>> to the trouble of updating VMAs and folios to effect mremap() moves. Let's
>>> provide them with the choice.
>>>
>>> This patch adds a new MREMAP_RELOCATE_ANON flag to do just that, which
>>> attempts to perform such an operation. If it is unable to do so, it cleanly
>>> falls back to the usual method.
>>>
>>> It carefully takes the rmap locks such that at no time will a racing rmap
>>> user encounter incorrect or missing VMAs.
>>>
>>> It is also designed to interact cleanly with the existing mremap() error
>>> fallback mechanism (inverting the remap should the page table move fail).
>>>
>>> Also, if we could merge cleanly without such a change, we do so, avoiding
>>> the overhead of the operation if it is not required.
>>>
>>> In the instance that no merge may occur when the move is performed, we
>>> still perform the folio and VMA updates to ensure that future mremap() or
>>> mprotect() calls will result in merges.
>>>
>>> In this implementation, we simply give up if we encounter large folios. A
>>> subsequent commit will extend the functionality to allow for these cases.
>>>
>>> We restrict this flag to purely anonymous memory only.
>>>
>>> We separate out the vma_had_uncowed_parents() helper function for checking
>>> in should_relocate_anon() and introduce a new function
>>> vma_maybe_has_shared_anon_folios() which combines a check against this and
>>> any forked child anon_vma's.
>>>
>>> We carefully check for pinned folios in case a caller who holds a pin might
>>> make assumptions about index, mapping fields which we are about to
>>> manipulate.
>>>
>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> ---
>>>   include/linux/rmap.h             |   4 +
>>>   include/uapi/linux/mman.h        |   1 +
>>>   mm/internal.h                    |   1 +
>>>   mm/mremap.c                      | 403 +++++++++++++++++++++++++++++--
>>>   mm/vma.c                         |  77 ++++--
>>>   mm/vma.h                         |  36 ++-
>>>   tools/testing/vma/vma.c          |   5 +-
>>>   tools/testing/vma/vma_internal.h |  38 +++
>>>   8 files changed, 520 insertions(+), 45 deletions(-)
>>
>> [...snip...]
>>
>>> @@ -754,6 +797,209 @@ static unsigned long pmc_progress(struct pagetable_move_control *pmc)
>>>   	return old_addr < orig_old_addr ? 0 : old_addr - orig_old_addr;
>>>   }
>>>
>>> +/*
>>> + * If the folio mapped at the specified pte entry can have its index and mapping
>>> + * relocated, then do so.
>>> + *
>>> + * Returns the number of pages we have traversed, or 0 if the operation failed.
>>> + */
>>> +static unsigned long relocate_anon_pte(struct pagetable_move_control *pmc,
>>> +		struct pte_state *state, bool undo)
>>> +{
>>> +	struct folio *folio;
>>> +	struct vm_area_struct *old, *new;
>>> +	pgoff_t new_index;
>>> +	pte_t pte;
>>> +	unsigned long ret = 1;
>>> +	unsigned long old_addr = state->old_addr;
>>> +	unsigned long new_addr = state->new_addr;
>>> +
>>> +	old = pmc->old;
>>> +	new = pmc->new;
>>> +
>>> +	pte = ptep_get(state->ptep);
>>> +
>>> +	/* Ensure we have truly got an anon folio. */
>>> +	folio = vm_normal_folio(old, old_addr, pte);
>>> +	if (!folio)
>>> +		return ret;
>>> +
>>> +	folio_lock(folio);
>>> +
>>> +	/* No-op. */
>>> +	if (!folio_test_anon(folio) || folio_test_ksm(folio))
>>> +		goto out;
>>
>> I think the kernel should not observe any KSM pages during mremap
>> because it breaks KSM pages in prep_move_vma()?

Ah, that's the magic bit, thanks!

> 
> Right, nor should we observe !anon pages here since we already checked for
> that...
> 
> This is belt + braces. Maybe we should replace with VM_WARN_ON_ONCE()'s...?

Sure. Anything you can throw out probably reduces the overhead :)
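
To make the suggestion concrete, here is a rough userspace model of the
check in relocate_anon_pte() turned into warn-once assertions. It is only
a sketch: struct model_folio, MODEL_VM_WARN_ON_ONCE(), model_warn_count
and model_relocate() are hypothetical stand-ins invented for illustration,
not kernel code and not the actual patch. The macro models how
VM_WARN_ON_ONCE() behaves with CONFIG_DEBUG_VM=y; as far as I know, without
that option the real macro generates no code, which is where the overhead
saving would come from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for the folio state being tested. */
struct model_folio {
	bool anon;	/* models the result of folio_test_anon() */
	bool ksm;	/* models the result of folio_test_ksm() */
};

/* Counts how many warnings fired, so the behaviour can be observed. */
static int model_warn_count;

/*
 * Models VM_WARN_ON_ONCE() in a debug build: warn the first time the
 * condition is true at this call site, then stay quiet afterwards.
 */
#define MODEL_VM_WARN_ON_ONCE(cond)				\
	do {							\
		static bool warned;				\
		if ((cond) && !warned) {			\
			warned = true;				\
			model_warn_count++;			\
			fprintf(stderr, "WARN: %s\n", #cond);	\
		}						\
	} while (0)

/* Returns the number of pages traversed, like relocate_anon_pte(). */
static unsigned long model_relocate(const struct model_folio *folio)
{
	assert(folio != NULL);

	/* Earlier checks are assumed to have excluded both cases. */
	MODEL_VM_WARN_ON_ONCE(!folio->anon);
	MODEL_VM_WARN_ON_ONCE(folio->ksm);

	/* The index and mapping update would follow here. */
	return 1;
}
```

Unlike the "goto out" in the quoted hunk, the model carries on after
warning, so the assertions document an expectation rather than act as a
runtime filter.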

-- 
Cheers,

David / dhildenb



