From: David Hildenbrand <david@redhat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Dev Jain <dev.jain@arm.com>,
akpm@linux-foundation.org, Liam.Howlett@oracle.com,
vbabka@suse.cz, jannh@google.com, pfalcato@suse.de,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
peterx@redhat.com, ryan.roberts@arm.com, mingo@kernel.org,
libang.li@antgroup.com, maobibo@loongson.cn,
zhengqi.arch@bytedance.com, baohua@kernel.org,
anshuman.khandual@arm.com, willy@infradead.org,
ioworker0@gmail.com, yang@os.amperecomputing.com,
baolin.wang@linux.alibaba.com, ziy@nvidia.com, hughd@google.com
Subject: Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
Date: Mon, 16 Jun 2025 18:13:04 +0200 [thread overview]
Message-ID: <d7c54bbf-ed99-43c7-99a7-a9be70071b4a@redhat.com> (raw)
In-Reply-To: <bc25dd02-6ace-45e6-9d3b-50f9c06aef98@lucifer.local>
[...]
>>> }
>>> - set_pte_at(mm, new_addr, new_ptep, pte);
>>> + set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>>
>>
>> What I dislike is that some paths work on a single PTE, and we implicitly have to know
>> that they don't apply for !pte_present.
>
> I hate any kind of implicit knowledge like this.
>
>>
>> Like
>> if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
>
> I also despise [with words I cannot use on-list] how uffd is implemented.
>
> It's _nothing but_ ad-hoc stuff like this spawned all around the place.
>
> It's hateful.
>
>>
>> Will not get batched yet. And that is hidden inside the pte_marker_uffd_wp check ...
>>
>> Should we properly separate both paths (present vs. !present), and while at it, do
>> some more cleanups? I'm thinking of the following on top (only compile-tested)
>
> I'd like to see that, but I think maybe better as a follow up series?
I'd do the split as a cleanup patch upfront.
Then add the batching for the pte_present() case.
>
> On the other hand, this does improve this quite a bit. Could also be another
> patch in the series.
>
>>
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 18b215521adae..b88abf02b34e0 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -155,21 +155,6 @@ static void drop_rmap_locks(struct vm_area_struct *vma)
>> i_mmap_unlock_write(vma->vm_file->f_mapping);
>> }
>> -static pte_t move_soft_dirty_pte(pte_t pte)
>> -{
>> - /*
>> - * Set soft dirty bit so we can notice
>> - * in userspace the ptes were moved.
>> - */
>> -#ifdef CONFIG_MEM_SOFT_DIRTY
>> - if (pte_present(pte))
>> - pte = pte_mksoft_dirty(pte);
>> - else if (is_swap_pte(pte))
>> - pte = pte_swp_mksoft_dirty(pte);
>> -#endif
>> - return pte;
>> -}
>> -
>> static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>> pte_t *ptep, pte_t pte, int max_nr)
>> {
>> @@ -260,7 +245,6 @@ static int move_ptes(struct pagetable_move_control *pmc,
>> VM_WARN_ON_ONCE(!pte_none(*new_ptep));
>> nr_ptes = 1;
>> - max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>> old_pte = ptep_get(old_ptep);
>> if (pte_none(old_pte))
>> continue;
>> @@ -277,24 +261,34 @@ static int move_ptes(struct pagetable_move_control *pmc,
>> * flushed.
>> */
>> if (pte_present(old_pte)) {
>> + max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>> nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>> old_pte, max_nr_ptes);
>> force_flush = true;
>> - }
>> - pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
>> - pte = move_pte(pte, old_addr, new_addr);
>> - pte = move_soft_dirty_pte(pte);
>> -
>> - if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
>> - pte_clear(mm, new_addr, new_ptep);
>> - else {
>> - if (need_clear_uffd_wp) {
>> - if (pte_present(pte))
>> - pte = pte_clear_uffd_wp(pte);
>> - else if (is_swap_pte(pte))
>> +
>> + pte = get_and_clear_full_ptes(mm, old_addr, old_ptep,
>> + nr_ptes, 0);
>> + /*
>> + * Moving present PTEs requires special care on some
>> + * archs.
>> + */
>> + pte = move_pte(pte, old_addr, new_addr);
>
> I guess we're good with only doing this in pte_present() case because the only
> arch that implements this, sparc, does a present check anyway.
Yes, we could then remove the call from pte_present(), see below.
>
>> + /* make userspace aware that this pte moved. */
>> + pte = pte_mksoft_dirty(pte);
>> + if (need_clear_uffd_wp)
>> + pte = pte_clear_uffd_wp(pte);
>> + set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>> + } else if (need_clear_uffd_wp && pte_marker_uffd_wp(pte)) {
>> + pte_clear(mm, old_addr, old_ptep);
>
> Same comment as below re: pte_clear().
>
> I see you've dropped pte_clear(mm, new_addr, new_ptep) which I guess is
> purposefully?
>
> I do think that it is pointless yes.
Yeah, that's what I raised below.
>
>> + } else {
>> + pte_clear(mm, old_addr, old_ptep);
>
> I guess this is intended to replace ptep_get_and_clear_full_ptes() above in the
> single PTE case... no? Is this sufficient?
>
> In the original code we'd always do ptep_get_and_clear().
>
> I think the key difference is page_table_check_pte_clear().
>
> I notice, hilariously, that there is a ptep_clear() that _does_ call this. What
> a mess.
ptep_get_and_clear() is only relevant when something is present and
could change concurrently (especially A/D bits managed by HW).
We already obtained the pte, it's not present, and now just want to
clear it.
>
>
>> + if (is_swap_pte(pte)) {
>> + if (need_clear_uffd_wp)
>> pte = pte_swp_clear_uffd_wp(pte);
>> + /* make userspace aware that this pte moved. */
>> + pte = pte_swp_mksoft_dirty(pte);
>> }
>> - set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>> + set_pte_at(mm, new_addr, new_ptep, pte);
>> }
>> }
>>
>>
>> Note that I don't know why we had the existing
>>
>> - if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
>> - pte_clear(mm, new_addr, new_ptep);
>>
>>
>> I thought we would always expect that the destination pte is already pte_none() ?
>
> I think this is because we already did the move_pte() call in the original code
> before checking this:
Oh, maybe that's why.
>
> pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
> pte = move_pte(pte, old_addr, new_addr);
> pte = move_soft_dirty_pte(pte);
>
> if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
> pte_clear(mm, new_addr, new_ptep);
>
> But maybe it's because there was a presumption move_pte() would like you know,
> move a PTE entry? Which it doesn't, it - only on SPARC - does a hook to flush
> the dcache.
I wish we could remove move_pte() completely. Or just notify about the
move of a present pte ... because it doesn't ever modify the pte val.
So renaming it to "arch_notify_move_pte()" or sth. like that that is a
void function might even be better.
... I think we could go one step further if we already have the folio:
we could call it arch_notify_move_folio_pte(), and simplify the sparc
implementation ...
Anyhow, the move_pte() cleanup can be done separately. Splitting the
present from !present case here should probably be done as a cleanup
upfront.
--
Cheers,
David / dhildenb
next prev parent reply other threads:[~2025-06-16 16:13 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-06-10 3:50 [PATCH v4 0/2] Optimize mremap() for large folios Dev Jain
2025-06-10 3:50 ` [PATCH v4 1/2] mm: Call pointers to ptes as ptep Dev Jain
2025-06-11 13:23 ` David Hildenbrand
2025-06-11 13:25 ` Dev Jain
2025-06-11 13:29 ` Lorenzo Stoakes
2025-06-11 13:31 ` David Hildenbrand
2025-06-12 12:05 ` Pedro Falcato
2025-06-10 3:50 ` [PATCH v4 2/2] mm: Optimize mremap() by PTE batching Dev Jain
2025-06-10 7:03 ` Barry Song
2025-06-10 7:44 ` Dev Jain
2025-06-10 8:11 ` Barry Song
2025-06-16 21:27 ` Ryan Roberts
2025-06-10 8:37 ` Barry Song
2025-06-10 13:18 ` Lorenzo Stoakes
2025-06-11 14:00 ` David Hildenbrand
2025-06-13 4:24 ` Dev Jain
2025-06-17 8:02 ` David Hildenbrand
2025-06-13 12:32 ` Lorenzo Stoakes
2025-06-16 16:13 ` David Hildenbrand [this message]
2025-06-12 12:13 ` Pedro Falcato
2025-06-10 12:11 ` [PATCH v4 0/2] Optimize mremap() for large folios Lorenzo Stoakes
2025-06-10 12:33 ` Dev Jain
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d7c54bbf-ed99-43c7-99a7-a9be70071b4a@redhat.com \
--to=david@redhat.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=anshuman.khandual@arm.com \
--cc=baohua@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=dev.jain@arm.com \
--cc=hughd@google.com \
--cc=ioworker0@gmail.com \
--cc=jannh@google.com \
--cc=libang.li@antgroup.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=maobibo@loongson.cn \
--cc=mingo@kernel.org \
--cc=peterx@redhat.com \
--cc=pfalcato@suse.de \
--cc=ryan.roberts@arm.com \
--cc=vbabka@suse.cz \
--cc=willy@infradead.org \
--cc=yang@os.amperecomputing.com \
--cc=zhengqi.arch@bytedance.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).