From: Vernon Yang <vernon2gm@gmail.com>
To: David Hildenbrand <david@redhat.com>
Cc: akpm@linux-foundation.org, lorenzo.stoakes@oracle.com,
ziy@nvidia.com, baolin.wang@linux.alibaba.com,
Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
dev.jain@arm.com, baohua@kernel.org, glider@google.com,
elver@google.com, dvyukov@google.com, vbabka@suse.cz,
rppt@kernel.org, surenb@google.com, mhocko@suse.com,
muchun.song@linux.dev, osalvador@suse.de, shuah@kernel.org,
richardcochran@gmail.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 6/7] mm: memory: add mTHP support for wp
Date: Fri, 15 Aug 2025 23:30:55 +0800
Message-ID: <aJ9RLmuEj4U7JN_7@vernon-pc>
In-Reply-To: <b607985d-d319-4b47-9365-0595f7d87f28@redhat.com>

On Thu, Aug 14, 2025 at 02:57:34PM +0200, David Hildenbrand wrote:
> On 14.08.25 13:38, Vernon Yang wrote:
> > Currently pagefaults on anonymous pages support mthp, and hardware
> > features (such as arm64 contpte) can be used to store multiple ptes in
> > one TLB entry, reducing the probability of TLB misses. However, when the
> > process is forked and the cow is triggered again, the above optimization
> > effect is lost, and only 4KB is requested once at a time.
> >
> > Therefore, make pagefault write-protect copy support mthp to maintain the
> > optimization effect of TLB and improve the efficiency of cow pagefault.
> >
> > vm-scalability usemem shows a great improvement,
> > test using: usemem -n 32 --prealloc --prefault 249062617
> > (result unit is KB/s, bigger is better)
> >
> > | size | w/o patch | w/ patch | delta |
> > |-------------|-----------|-----------|---------|
> > | baseline 4K | 723041.63 | 717643.21 | -0.75% |
> > | mthp 16K | 732871.14 | 799513.18 | +9.09% |
> > | mthp 32K | 746060.91 | 836261.83 | +12.09% |
> > | mthp 64K | 747333.18 | 855570.43 | +14.48% |
> >
> > Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
> > ---
> > include/linux/huge_mm.h | 3 +
> > mm/memory.c | 174 ++++++++++++++++++++++++++++++++++++----
> > 2 files changed, 163 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 2f190c90192d..d1ebbe0636fb 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -132,6 +132,9 @@ enum mthp_stat_item {
> > MTHP_STAT_SHMEM_ALLOC,
> > MTHP_STAT_SHMEM_FALLBACK,
> > MTHP_STAT_SHMEM_FALLBACK_CHARGE,
> > + MTHP_STAT_WP_FAULT_ALLOC,
> > + MTHP_STAT_WP_FAULT_FALLBACK,
> > + MTHP_STAT_WP_FAULT_FALLBACK_CHARGE,
> > MTHP_STAT_SPLIT,
> > MTHP_STAT_SPLIT_FAILED,
> > MTHP_STAT_SPLIT_DEFERRED,
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8dd869b0cfc1..ea84c49cc975 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3344,6 +3344,21 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
> > return ret;
> > }
> > +static inline int __wp_folio_copy_user(struct folio *dst, struct folio *src,
> > + unsigned int offset,
> > + struct vm_fault *vmf)
> > +{
> > + struct vm_area_struct *vma = vmf->vma;
> > + void __user *uaddr;
> > +
> > + if (likely(src))
> > + return copy_user_large_folio(dst, src, offset, vmf->address, vma);
> > +
> > + uaddr = (void __user *)ALIGN_DOWN(vmf->address, folio_size(dst));
> > +
> > + return copy_folio_from_user(dst, uaddr, 0);
> > +}
> > +
> > static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
> > {
> > struct file *vm_file = vma->vm_file;
> > @@ -3527,6 +3542,119 @@ vm_fault_t __vmf_anon_prepare(struct vm_fault *vmf)
> > return ret;
> > }
> > +static inline unsigned long thp_wp_suitable_orders(struct folio *old_folio,
> > + unsigned long orders)
> > +{
> > + int order, max_order;
> > +
> > + max_order = folio_order(old_folio);
> > + order = highest_order(orders);
> > +
> > + /*
> > + * Since need to copy content from the old folio to the new folio, the
> > + * maximum size of the new folio will not exceed the old folio size,
> > + * so filter the inappropriate order.
> > + */
> > + while (orders) {
> > + if (order <= max_order)
> > + break;
> > + order = next_order(&orders, order);
> > + }
> > +
> > + return orders;
> > +}
> > +
> > +static bool pte_range_readonly(pte_t *pte, int nr_pages)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr_pages; i++) {
> > + if (pte_write(ptep_get_lockless(pte + i)))
> > + return false;
> > + }
> > +
> > + return true;
> > +}
> > +
> > +static struct folio *alloc_wp_folio(struct vm_fault *vmf, bool pfn_is_zero)
> > +{
> > + struct vm_area_struct *vma = vmf->vma;
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > + unsigned long orders;
> > + struct folio *folio;
> > + unsigned long addr;
> > + pte_t *pte;
> > + gfp_t gfp;
> > + int order;
> > +
> > + /*
> > + * If uffd is active for the vma we need per-page fault fidelity to
> > + * maintain the uffd semantics.
> > + */
> > + if (unlikely(userfaultfd_armed(vma)))
> > + goto fallback;
> > +
> > + if (pfn_is_zero || !vmf->page)
> > + goto fallback;
> > +
> > + /*
> > + * Get a list of all the (large) orders below folio_order() that are enabled
> > + * for this vma. Then filter out the orders that can't be allocated over
> > + * the faulting address and still be fully contained in the vma.
> > + */
> > + orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> > + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> > + orders = thp_wp_suitable_orders(page_folio(vmf->page), orders);
> > +
> > + if (!orders)
> > + goto fallback;
> > +
> > + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> > + if (!pte)
> > + return ERR_PTR(-EAGAIN);
> > +
> > + /*
> > + * Find the highest order where the aligned range is completely readonly.
> > + * Note that all remaining orders will be completely readonly.
> > + */
> > + order = highest_order(orders);
> > + while (orders) {
> > + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > + if (pte_range_readonly(pte + pte_index(addr), 1 << order))
> > + break;
> > + order = next_order(&orders, order);
> > + }
> > +
> > + pte_unmap(pte);
> > +
> > + if (!orders)
> > + goto fallback;
> > +
> > + /* Try allocating the highest of the remaining orders. */
> > + gfp = vma_thp_gfp_mask(vma);
> > + while (orders) {
> > + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > + folio = vma_alloc_folio(gfp, order, vma, addr);
> > + if (folio) {
> > + if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
> > + count_mthp_stat(order, MTHP_STAT_WP_FAULT_FALLBACK_CHARGE);
> > + folio_put(folio);
> > + goto next;
> > + }
> > + folio_throttle_swaprate(folio, gfp);
> > + return folio;
> > + }
>
> I might be missing something, but besides the PAE issue I think there are
> more issues lurking here:
>
> * Are you scanning outside of the current VMA, and some PTEs might
> actually belong to a !writable VMA?
thp_vma_suitable_order() ensures that the aligned range does not extend
beyond the current VMA, so all scanned PTEs belong to the current writable VMA.
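To illustrate the containment check referred to here, a minimal userspace
sketch follows. It is a model only, not the kernel implementation: the
struct, the function name and the 4 KiB page size are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12UL /* assumption: 4 KiB base pages */

/* Hypothetical stand-in for the [start, end) bounds of a VMA. */
struct sketch_vma {
	unsigned long start;
	unsigned long end;
};

/*
 * Return true if the naturally aligned block of (1 << order) pages that
 * contains addr lies entirely inside the VMA.
 */
static bool sketch_order_fits_vma(const struct sketch_vma *vma,
				  unsigned long addr, unsigned int order)
{
	unsigned long size, base;

	assert(order < 20); /* keep the shift well defined in this model */

	size = 1UL << (SKETCH_PAGE_SHIFT + order);
	base = addr & ~(size - 1); /* align the address down to the block */

	return base >= vma->start && base + size <= vma->end;
}
```

For a VMA spanning 0x10000-0x30000 and a fault at 0x14000, order 4 (64K)
fits, while order 6 (256K) is rejected because the aligned block would
start below the VMA.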
> * Are you assuming that the R/O PTE range is actually mapping all-pages
> from the same large folio?
Yes. Is there a potential problem with this assumption? Maybe I'm missing
something.
>
> I am not sure if you are assuming some natural alignment of the old folio.
> Due to mremap() that must not be the case.
Here it is assumed that the virtual address is aligned to the old folio
size. mremap() would break that assumption, right?
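A small userspace model of why the alignment assumption can fail, as I
understand the concern. The function and parameter names are hypothetical,
and 4 KiB base pages are assumed; this is a sketch, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12UL /* assumption: 4 KiB base pages */

/*
 * folio_va is the virtual address at which page 0 of the old folio is
 * mapped (hypothetical parameter; after mremap() it need not be aligned
 * to the folio size). fault_va must fall inside the folio's mapping.
 *
 * Return true if the naturally aligned block of (1 << order) pages around
 * fault_va starts exactly where the folio is mapped, i.e. the aligned
 * virtual range covers exactly the old folio.
 */
static bool sketch_block_matches_folio(unsigned long folio_va,
				       unsigned long fault_va,
				       unsigned int order)
{
	unsigned long size = 1UL << (SKETCH_PAGE_SHIFT + order);

	assert(fault_va >= folio_va && fault_va - folio_va < size);

	return (fault_va & ~(size - 1)) == folio_va;
}
```

With a 64K folio mapped at 0x200000, a fault at 0x203000 yields a matching
block. If the same folio is moved to 0x204000, a fault at 0x207000 aligns
down to 0x200000, so the aligned range no longer starts at the folio's
first page.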
>
> Which stresses my point: khugepaged might be the better place to re-collapse
> where reasonable, avoiding further complexity in our CoW handling.
>
> --
> Cheers
>
> David / dhildenb
>